Proceedings of the National Academy of Sciences of Belarus. Physical-technical series

Reliability assessment of cluster supercomputer configuration

https://doi.org/10.29235/1561-8358-2019-64-3-347-358

Abstract

Reliability indicators are studied using the example of the “SKIF-GEO” cluster supercomputer configuration (hereafter, the cluster), developed within the framework of the scientific and technical program “SKIF-Nedra” (2015–2018, a program of the Union State of Russia and Belarus). The cluster is a stationary supercomputer configuration designed to run resource-intensive applications in data processing centers (DPC). The computing platforms and other cluster modules are housed in a single 19″ rack with a height of 42U. The theoretical peak performance of the cluster is 100 Tflop/s. The basic architectural principles implemented in the cluster, its composition, and its structural-functional scheme are given.
Methodological support for calculating the reliability of the cluster, based on the authors' previous studies, is proposed. Drawing on these studies, the structural scheme of reliability (SSR) of the cluster is substantiated; it consists of two parts: the cluster core and the combination of computing facilities (nodes) (CCF). The cluster core comprises the component parts (CP) of the cluster whose failure reduces cluster performance to zero. The CCF comprises the CP whose failures reduce cluster performance. The choice of the main reliability indicators for the cluster core and the CCF is justified, and formulas for calculating these indicators are given. The consequences of failures of cluster components are analyzed. Based on this analysis, the SSR of the cluster core is determined, from which a formula for calculating the cluster core reliability indicators is derived. A mathematical model of reliability (state graph) of the cluster CCF is proposed, from which formulas for the mean time to failure and the mean time between failures of the CCF are derived.
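As an illustration only, the core part of such a scheme can be sketched as a series structure, since any core component failure brings performance to zero. The sketch below assumes independent components with constant (exponential) failure rates; the abstract does not state the authors' actual formulas, and the function names and the rates in the usage lines are hypothetical.

```python
from math import exp


def core_series_indicators(failure_rates):
    """Return (total failure rate, mean time to failure) of a series structure.

    Assumption: components fail independently with constant rates (1/hour),
    and the failure of any single component fails the whole core.
    """
    if not failure_rates or any(rate <= 0 for rate in failure_rates):
        raise ValueError("failure rates must be a non-empty list of positive numbers")
    total_rate = sum(failure_rates)   # rates add up in a series structure
    mttf = 1.0 / total_rate           # mean of the exponential time to first failure
    return total_rate, mttf


def core_survival_probability(failure_rates, hours):
    """Probability that the series structure runs `hours` without failure."""
    total_rate, _ = core_series_indicators(failure_rates)
    return exp(-total_rate * hours)


# Hypothetical rates for three core components (not taken from the paper).
example_rates = [1e-5, 2e-5, 1e-5]
total_rate, mttf = core_series_indicators(example_rates)
print(f"total failure rate: {total_rate:.2e} 1/h, MTTF: {mttf:.0f} h")
print(f"P(no failure in 1000 h): {core_survival_probability(example_rates, 1000.0):.4f}")
```

Under these assumptions the core can be no more reliable than its least reliable component, which is why a series core is usually kept as small as possible.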
The reliability of those cluster CP for which no trustworthy reliability data are available is estimated from the SSR of these CP. The reliability of the cluster as a whole is assessed by calculating reliability indicators from reference data on component reliability and from operating data on supercomputers of the “SKIF” family. Using this assessment and the derived formulas, the cluster reliability indicators were calculated for two options: with and without a reserve of computing nodes.
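The effect of a node reserve can be illustrated with a simple state graph. This is a hypothetical birth-death model, not the authors' model. In it, the state is the number of failed nodes, and the CCF counts as failed once fewer than the required number of nodes remain. It assumes that idle reserve nodes do not fail, that failure and repair rates are constant, and that one node is repaired at a time. All names and numbers are illustrative.

```python
def ccf_mean_time_to_failure(n_required, n_reserve, node_failure_rate, repair_rate):
    """Mean time from 'all nodes healthy' to 'fewer than n_required nodes available'.

    States 0..n_reserve count failed nodes; state n_reserve + 1 is absorbing.
    Assumptions (not from the source): n_required nodes are always active while
    spares remain, idle spares do not fail, and a single repair is in progress
    at any time. Rates are per hour.
    """
    if n_required <= 0 or n_reserve < 0:
        raise ValueError("n_required must be positive and n_reserve non-negative")
    if node_failure_rate <= 0 or repair_rate < 0:
        raise ValueError("rates must be positive (repair rate may be zero)")

    up_rate = n_required * node_failure_rate  # transition i -> i + 1
    mttf = 0.0
    step_time = 0.0  # expected time to move from state i - 1 to state i
    for failed in range(n_reserve + 1):
        down_rate = repair_rate if failed > 0 else 0.0  # transition i -> i - 1
        # First-passage time i -> i + 1 in a birth-death chain.
        step_time = 1.0 / up_rate + (down_rate / up_rate) * step_time
        mttf += step_time
    return mttf


# Hypothetical figures: 10 required nodes, rate 1e-4 1/h per node, repair 0.1 1/h.
without_reserve = ccf_mean_time_to_failure(10, 0, 1e-4, 0.1)
with_one_spare = ccf_mean_time_to_failure(10, 1, 1e-4, 0.1)
print(f"no reserve: {without_reserve:.0f} h, one reserve node: {with_one_spare:.0f} h")
```

With these illustrative numbers the mean time to failure is about 1000 h without a reserve and about 102,000 h with one reserve node. This shows how a small reserve combined with fast repair can raise the indicator by orders of magnitude under the stated assumptions.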
High values of cluster reliability indicators were achieved due to the architectural and structural solutions adopted in the process of its development, aimed at increasing its survivability.

About the Authors

L. I. Kulbak
United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Belarus

Leonid I. Kulbak – Ph. D. (Engineering), Associate Professor, Leading Researcher

6, Surganov Str., 220012, Minsk



O. P. Tchij
United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Belarus

Oleg P. Tchij – Ph. D. (Physics and Mathematics), Head of the Laboratory of High-Performance Systems

6, Surganov Str., 220012, Minsk



N. N. Paramonov
United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Belarus

Nikolaj N. Paramonov – Ph. D. (Engineering), Associate Professor, Leading Researcher

6, Surganov Str., 220012, Minsk



A. G. Rymarchuk
United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Belarus

Aleksandr G. Rymarchuk – Chief Designer of the project

6, Surganov Str., 220012, Minsk



T. S. Martinovich
United Institute of Informatics Problems of the National Academy of Sciences of Belarus
Belarus

Tatyana S. Martinovich – Researcher

6, Surganov Str., 220012, Minsk




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1561-8358 (Print)
ISSN 2524-244X (Online)