Detecting Failures in HPC Storage Nodes

Anas A. Hadi, Fadi F. Fouz

Future High-Performance Computing (HPC) systems are expected to include more components than current HPC systems. This increase in component count decreases the reliability of the system as a whole. Recovering from failure is one of the hardest problems facing future HPC systems. This research considers failures of storage nodes, which are the least reliable hardware components because of their mechanical parts. Ensemble learning is proposed as the algorithm for predicting failures in these nodes. According to our evaluation, acceptable prediction performance can be obtained with a sufficient lead-time window.