“Govorun” supercomputer, an actively developing platform for scientific computing
News, 22 August 2023
Based on last year’s results, the team of authors of the series of papers “Hyperconverged “Govorun” supercomputer for the implementation of the JINR scientific programme”, namely Dmitry Belyakov, Alexey Vorontsov, Egor Druzhinin, Maxim Zuev, Vladimir Korenkov, Yuri Migal, Andrey Moshkin, Dmitry Podgainy, Tatyana Strizh, and Oksana Streltsova, was awarded the JINR First Prize of 2022 for the best papers in the field of applied physics research. Dmitry Podgainy, Head of the Sector of Heterogeneous Computing and Quantum Information at MLIT JINR, talked about the work performed.
The supercomputer named after N.N. Govorun at the JINR Meshcheryakov Laboratory of Information Technologies is a powerful computing machine that makes it possible to perform a multitude of computations simultaneously and to simulate various physics processes. It is primarily used to simulate heavy-ion collisions within the MPD experiment at the NICA accelerator complex. A testbed for quantum computing is deployed on the supercomputer, theoretical research on exotic and superheavy nuclei is conducted, experimental data from the Laboratory of Radiation Biology is processed, and other applied tasks are carried out. For the “Govorun” supercomputer, a flexible architecture and direct liquid cooling technologies were used for the first time in the world, a hierarchical data processing and storage system was implemented, and, thanks to annual upgrades, the range of tasks it can solve is continuously expanding.
The supercomputer was created at the JINR Meshcheryakov Laboratory of Information Technologies in 2018, building on the experience gained during the operation of the HybriLIT heterogeneous cluster, which is part of the MLIT Multifunctional Information and Computing Complex. The creation of this unique machine is an essential technological achievement of great importance for the implementation of the JINR scientific programme and for international cooperation.
By 2018, the Institute had an urgent need for its own supercomputer. The steady growth in the number of users and the expansion of the range of tasks called for the development and implementation of new technologies. At that time, JINR scientists employed the capabilities of the HybriLIT cluster, as well as the resources of partner organizations’ supercomputers.
Before the supercomputer, which is now part of the HybriLIT platform, was created, the computing cluster had demonstrated its relevance for tasks in lattice quantum chromodynamics (QCD), radiation biology, applied research, etc. However, QCD computations, for example, are among the most resource-intensive studies at the JINR Bogoliubov Laboratory of Theoretical Physics, and the computing power previously available for them was insufficient. Unique results in this area were obtained on the “Govorun” supercomputer.
At present, the resources of the “Govorun” supercomputer are used by all the Laboratories of the Institute within 25 themes of the JINR Topical Plan. 323 people are involved in computations, of whom 262 are JINR staff members and the rest are representatives of the Member States. The computing power of the “Govorun” supercomputer is planned to be enlarged annually, since the number of users and tasks to be solved keeps growing.
“In addition to QCD, in terms of computing power, the “Govorun” supercomputer is becoming one of the world leaders in modelling the dynamics of the electron shells of superheavy nuclei at the Flerov Laboratory of Nuclear Reactions. Our supercomputer is one of the major computing resources for this task on a global scale,” Dmitry Podgainy said. Theoretical research on exotic nuclei is also conducted on the “Govorun” supercomputer by specialists from the Frank Laboratory of Neutron Physics. The results obtained using its resources from the moment it was put into operation in July 2018 to September 2022 are reflected in 204 scientific papers, two of which were published in the Nature Physics journal.
The architecture of the “Govorun” supercomputer allows it to be used not only for computations, but also as a research testbed for elaborating hardware, software, and IT solutions. Its resources were integrated into a unified heterogeneous environment based on the DIRAC platform for the NICA project, which made it possible to implement the programme of mass data simulation runs within the MPD experiment. It is noteworthy that some simulation tasks for MPD can only be performed on the “Govorun” supercomputer.
Clouds of the JINR Member States’ organizations integrated into a distributed information and computing environment based on the DIRAC platform
In 2022, data generation using the Monte Carlo method and MPD event reconstruction became the first joint task solved within the National Research Computer Network of Russia (NIKS). In addition to the “Govorun” supercomputer, the network infrastructure combines the supercomputers of the Interdepartmental Supercomputer Centre of the Russian Academy of Sciences and Peter the Great St. Petersburg Polytechnic University. The first experiment on the use of the unified infrastructure was successfully completed: in total, 3,000 tasks were launched, and 3 million events were generated and reconstructed. Dmitry Podgainy highlighted that the “Govorun” supercomputer would continue to be used only for JINR tasks and that the joining of partner computing centres is regarded as an enhancement of the JINR supercomputer’s capabilities.
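The mass production itself is managed by the MPD collaboration’s workflow services; purely as an illustration of how a single job can be described and submitted through the DIRAC Python API, a minimal sketch is given below. The script name, arguments, and destination site tag are placeholders, not the actual production settings.

```python
# Minimal sketch of submitting one job through the DIRAC Python API
# (illustrative only; the executable, arguments and site tag below are placeholders).
from DIRAC.Core.Base.Script import Script
Script.parseCommandLine()  # initialize the DIRAC client configuration

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("mpd_mc_test")                                          # hypothetical job name
job.setExecutable("run_simulation.sh", arguments="--events 1000")   # placeholder script
job.setInputSandbox(["run_simulation.sh"])
job.setOutputSandbox(["std.out", "std.err"])
# job.setDestination("DIRAC.GOVORUN.ru")  # site tags are installation-specific (assumed name)

result = Dirac().submitJob(job)
print(result)  # S_OK structure with the job ID on success
```

In a unified environment of this kind, DIRAC dispatches such jobs to whichever integrated resource matches the job requirements.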
On the “Govorun” supercomputer, the efficiency of modelling the dynamics of heavy-ion collisions was qualitatively increased, computations of the radiation safety of JINR experimental facilities were carried out, and the efficiency of solving applied tasks was enhanced. The technologies implemented on the supercomputer enabled the development of the ML/DL/HPC ecosystem, which provides opportunities not only for machine and deep learning tasks, but also for the convenient organization of calculations and the analysis of results. Examples of such solutions are the information and computing system developed for a joint project with BLTP to investigate theoretical models of Josephson junctions and the information system for a joint project with LRB for processing, analyzing, and visualizing data from radiobiological studies. Representatives of scientific organizations of the Republic of Serbia, a JINR Associate Member, take an active part in this LRB-MLIT project.
The “Govorun” supercomputer underwent three stages of modernization, within which its architecture was upgraded, and new components were introduced. The modernization was performed primarily in the interests of the NICA MPD collaboration. “At present, we have a completed architectural solution, which we will scale in the future, namely, enlarge its computing power, the data storage and processing system,” Dmitry Podgainy noted.
Today, the “Govorun” supercomputer is a high-performance scalable system. Its current configuration comprises computing modules containing GPU (graphics processing unit) and CPU (central processing unit) components, as well as a hierarchical data processing and storage system. The total peak performance reaches 1.7 PFlops for double-precision calculations (3.4 PFlops for single-precision calculations), and the hierarchical data processing and storage system provides a read/write speed of 300 GB/s.
“With the latest modernization, we added a new component, 32 hyperconverged nodes with a large amount of RAM, which made it possible not only to enhance the performance of the “Govorun” supercomputer, but also to solve tasks that were previously impossible, as well as to introduce the advanced DAOS storage technology,” the scientist commented. It has become possible to solve tasks that require a large amount of RAM per computing core, primarily for the NICA megascience project. The new nodes are also employed in quantum computing simulators, as well as in the joint project of FLNR and MLIT to investigate the electron shells of superheavy elements.
The DAOS (Distributed Asynchronous Object Storage) technology, which has demonstrated its promise for deep and machine learning tasks and the operation of quantum simulators, is essential for processing a large volume of heterogeneous data and is applied on the “Govorun” supercomputer as a layer of so-called very hot data.
A hierarchical data processing and storage system with a software-defined architecture was implemented on the “Govorun” supercomputer. According to the speed of data access, the system is divided into layers: very hot data (the most demanded data, which currently requires the fastest access), hot data, and warm data. Each layer can be used both independently and together with the others. The fastest memory layer is limited in size, while tasks that do not require very high speed are solved using the middle layer. Finally, there are cases when data needs to be stored for a very long time. The coldest storage to which the “Govorun” supercomputer is connected is a tape robot, which writes and retrieves information extremely slowly, but enables its storage for a long time; the manufacturer provides a forty-year guarantee. For the high-speed data processing and storage system, the “Govorun” supercomputer received the prestigious Russian DC Awards 2020 in the “Best IT Solution for Data Centres” nomination.
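Purely to illustrate the tiering idea described above (this is not the actual software-defined storage logic of the “Govorun” supercomputer), the following sketch routes a dataset to a layer depending on how often it is accessed and how long it must be kept; the mount points and thresholds are invented for the example.

```python
# Toy illustration of hierarchical (tiered) data placement; all names, mount
# points and thresholds are invented, only the very hot (DAOS) and cold (tape)
# layers are named in the article.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    mount_point: str  # hypothetical mount points

TIERS = [
    Tier("very hot (DAOS)",   "/mnt/daos"),
    Tier("hot",               "/mnt/hot"),
    Tier("warm",              "/mnt/warm"),
    Tier("cold (tape robot)", "/mnt/tape"),
]

def choose_tier(accesses_per_day: float, retention_years: float) -> Tier:
    """Toy routing rule: long-lived archival data goes to tape, the most
    frequently accessed data goes to the fastest (and smallest) layer."""
    if retention_years > 10:
        return TIERS[3]
    if accesses_per_day > 100:
        return TIERS[0]
    if accesses_per_day > 10:
        return TIERS[1]
    return TIERS[2]

print(choose_tier(accesses_per_day=500, retention_years=1).name)    # very hot (DAOS)
print(choose_tier(accesses_per_day=0.01, retention_years=40).name)  # cold (tape robot)
```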
When creating the “Govorun” supercomputer, two technologies were applied for the first time in the world.
The direct liquid cooling technology of CJSC “RSC Technologies”, a company with a number of innovative developments of its own, was chosen for the CPU component of the supercomputer. Thanks to the introduction of these technologies, the “Govorun” supercomputer achieved a record density of compute nodes per rack (153 nodes versus 25 nodes for air cooling), and operation in the “hot water” cooling mode made it possible to use free cooling all year round (24x7x365). In addition to high energy efficiency, this approach made it possible to significantly simplify the infrastructure of the supercomputer centre: the cooling system of the “Govorun” supercomputer was built using only dry cooling towers, which cool the liquid with ambient air. As a result, less than 6% of the total electricity consumed by the “Govorun” supercomputer is spent on cooling, which is an outstanding result for the HPC industry. It is the world’s first system with 100% liquid cooling.
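A rough back-of-the-envelope reading of the quoted figure, assuming for simplicity that cooling is the only non-IT consumer of electricity (an approximation for illustration, not measured facility data):

```python
# Back-of-the-envelope estimate implied by "less than 6% of electricity goes to cooling".
# Assumption (not from the article): cooling is the only non-IT load, so
# PUE ~ total power / IT power = 1 / (1 - cooling share).
cooling_share = 0.06                           # upper bound quoted in the text
pue_upper_bound = 1.0 / (1.0 - cooling_share)
print(f"Implied PUE upper bound: {pue_upper_bound:.3f}")  # ~1.064
```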
Another distinctive feature of the “Govorun” supercomputer is the hyperconverged (flexible) architecture of its compute nodes, a technology that was also created and implemented for supercomputers for the first time worldwide.
“Supercomputers, as a rule, are tailored to one type of task with a rigid architecture. For example, the weather forecast is calculated using hydrodynamic equations. Roshydromet’s supercomputer surpasses all the resources that MLIT has; however, it solves only one task around the clock. For the forecast to be more accurate, the supercomputer must have more and more resources. At the same time, if you want to compute tasks for MPD on this machine, you will not be able to, or you will get something completely different, since its computing architecture is not reconfigurable,” Dmitry Podgainy remarked. He explained that the “Govorun” supercomputer is reconfigured programmatically at very high speed for various types of user tasks, i.e., there is no need to physically change compute nodes.
Hyperconvergence allows computing resources and data storage elements to be orchestrated, and computing systems to be created on demand, with the help of the RSC BasIS software. Here “orchestration” means the logical disaggregation of a compute node into separate components, such as computing elements and data storage elements, with their subsequent integration into the required configuration. Computing elements (CPU cores and graphics accelerators) and data storage elements (SSDs) form independent sets of resources (pools). Due to orchestration, the user can allocate for their task the required number and type of compute nodes (including the required number of graphics accelerators) and the required volume and type of data storage, as well as automatically configure the software. After the task is completed, the compute nodes and storage elements are returned to their pools and are ready for the next use.
The hyperconvergence feature not only increases the efficiency of solving user tasks of various types, but also enhances the confidentiality of work with data and helps to avoid system errors that occur when resources allocated to different user tasks overlap.
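To make the pool-and-orchestration idea concrete, here is a deliberately simplified sketch; it does not reproduce the RSC BasIS interfaces, and the pool sizes are invented numbers rather than the actual “Govorun” configuration.

```python
# Illustrative sketch of "resource pools + on-demand composition" only;
# the real orchestration on "Govorun" is performed by the RSC BasIS software,
# whose interfaces are not reproduced here. Pool capacities are invented.
class Pool:
    def __init__(self, name: str, capacity: int):
        self.name, self.free = name, capacity

    def take(self, n: int) -> int:
        if n > self.free:
            raise RuntimeError(f"not enough {self.name}: requested {n}, free {self.free}")
        self.free -= n
        return n

    def give_back(self, n: int) -> None:
        self.free += n

cpu_cores = Pool("CPU cores", 4096)
gpus      = Pool("GPUs", 40)
nvme_tb   = Pool("NVMe storage, TB", 300)

# "Orchestration": assemble a task-specific configuration from the pools...
job_cfg = {
    "cores": cpu_cores.take(512),
    "gpus": gpus.take(8),
    "storage_tb": nvme_tb.take(20),
}
print("allocated:", job_cfg)

# ...and return the elements to their pools once the task is finished.
cpu_cores.give_back(job_cfg["cores"])
gpus.give_back(job_cfg["gpus"])
nvme_tb.give_back(job_cfg["storage_tb"])
```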
List of publications:
- A. Baginyan, A. Balandin, N. Balashov, A. Dolbilov, A. Gavrish, A. Golunov, N. Gromova, I. Kashunin, V. Korenkov, N. Kutovskiy, V. Mitsyn, I. Pelevanyuk, D. Podgainy, O. Streltsova, T. Strizh, V. Trofimov, A. Vorontsov, N. Voytishin, and M. Zuev, “Current Status of the MICC: an Overview” // CEUR Workshop Proceedings, 2021, Vol. 3041, pp. 1-8.
- Gh. Adam, M. Bashashin, D. Belyakov, M. Kirakosyan, M. Matveev, D. Podgainy, T. Sapozhnikova, O. Streltsova, Sh. Torosyan, M. Vala, L. Valova, A. Vorontsov, T. Zaikina, E. Zemlyanaya, and M. Zuev, “IT-ecosystem of the HybriLIT heterogeneous platform for high-performance computing and training of IT-specialists” // CEUR Workshop Proceedings, 2018, Vol. 2267, pp. 638-644.
- D.V. Podgainy, D.V. Belyakov, A.V. Nechaevsky, O.I. Streltsova, A.V. Vorontsov, and M.I. Zuev, “IT Solutions for JINR Tasks on the “GOVORUN” Supercomputer” // CEUR Workshop Proceedings, 2021, Vol. 3041, pp. 612-618.
- E.A. Druzhinin, A.B. Shmelev, A.A. Moskovsky, V.V. Mironov, and A. Semin, “Server Level Liquid Cooling: Do Higher System Temperatures Improve Energy Efficiency?” // Supercomputing Frontiers and Innovations, 2016, Vol. 3, No. 1, pp. 67-73, DOI: 10.14529/jsfi160104.
- E. Druzhinin, A. Shmelev, A. Moskovsky, Yu. Migal, V. Mironov, and A. Semin, “High temperature coolant demonstrated for a computational cluster” // Proc. of the 2016 International Conference on High Performance Computing & Simulation (HPCS), DOI: 10.1109/HPCSim.2016.7568418.
- D. Belyakov, A. Nechaevskiy, I. Pelevanyuk, D. Podgainy, A. Stadnik, O. Streltsova, A. Vorontsov, and M. Zuev, ““Govorun” Supercomputer for JINR Tasks” // CEUR Workshop Proceedings, 2020, Vol. 2772, pp. 1-12.
- V. Korenkov, A. Dolbilov, V. Mitsyn, I. Kashunin, N. Kutovskiy, D. Podgainy, O. Streltsova, T. Strizh, V. Trofimov, and P. Zrelov, “The JINR distributed computing environment” // EPJ Web of Conferences, 2019, Vol. 214, p. 03009, DOI: 10.1051/epjconf/201921403009.
- V.V. Korenkov, “Trends and Prospects of the Development of Distributed Computing and Big Data Analytics to Support Megascience Projects” // Yadernaya Fizika, 2020, Vol. 83, No. 6, pp. 534-538 (in Russian).
- D.V. Belyakov, A.G. Dolbilov, A.A. Moshkin, I.S. Pelevanyuk, D.V. Podgainy, O.V. Rogachevsky, O.I. Streltsova, and M.I. Zuev, “Using the “Govorun” Supercomputer for the NICA Megaproject” // CEUR Workshop Proceedings, 2018, Vol. 2507, pp. 316-320.
- N. Kutovskiy, V. Mitsyn, A. Moshkin, I. Pelevanyuk, D. Podgainy, O. Rogachevsky, B. Shchinov, V. Trofimov, and A. Tsaregorodtsev, “Integration of Distributed Heterogeneous Computing Resources for the MPD Experiment with DIRAC Interware” // Physics of Particles and Nuclei, 2021, Vol. 52 (4), pp. 835-841, DOI: 10.1134/S1063779621040419.
- A.A. Moshkin, I.S. Pelevanyuk, D.V. Podgainy, O.V. Rogachevsky, O.I. Streltsova, and M.I. Zuev, “Approaches, services, and monitoring in a distributed heterogeneous computing environment for the MPD experiment” // Russian Supercomputing Days: Proceedings of the International Conference, 2021, pp. 4-11, DOI: 10.29003/m2454.RussianSCDays2021.
- Yu.A. Butenko, M.I. Zuev, M. Ćosić, A.V. Nechaevskiy, D.V. Podgainy, I.R. Rahmonov, A.V. Stadnik, and O.I. Streltsova, “ML/DL/HPC Ecosystem of the HybriLIT Platform (MLIT JINR): New Opportunities for Applied Research”, 2022 (in Russian).
- I.A. Kolesnikova, A.V. Nechaevskiy, D.V. Podgainy, A.V. Stadnik, A.I. Streltsov, and O.I. Streltsova, “Information System for Radiobiological Studies” // CEUR Workshop Proceedings, 2020, Vol. 2743, pp. 1-6.