Contribution to high-performance simulation and highly scalable numerical schemes

Guillaume Latu

Fri, May 18th 2018, 14:00-19:00

Salle des conférences, IRMA, 7 rue René-Descartes, Strasbourg

Defense Notice

Guillaume LATU will publicly defend his Habilitation à Diriger des Recherches, entitled

**Contribution to high-performance simulation and highly scalable numerical schemes**

on Friday, May 18, 2018 at 14:00

Salle des conférences, IRMA, 7 rue René-Descartes, Strasbourg

Jury

Raymond Namyst, Professor, Université de Bordeaux

Frédéric Desprez, Research Director, INRIA, Grenoble

Rémi Abgrall, Professor, University of Zurich, Switzerland

Eric Sonnendrücker, Professor, Max-Planck Institut für Plasmaphysik, Munich, Germany

Jean Roman, Professor, INP (Enseirb-Matmeca) & INRIA, Bordeaux

Stéphane Genaud, Professor, ENSIIE, Strasbourg

Abstract

Numerous scientific domains express a need for high-performance computing (HPC), which has intensified in recent decades. At the same time, the size of available supercomputers has grown steadily. Parallel simulations make it possible to perform experiments numerically, without carrying out full-scale real-world experiments whose costs can be prohibitive. My contributions concern the improvement of computational methods from the point of view of parallel algorithms, as well as the improvement of numerical schemes in several simulation codes.

Although my scientific work is not limited to contributions to the GYSELA simulation code, part of it relates to this application. The GYSELA code solves the gyrokinetic Vlasov equation in a five-dimensional space, coupled to a Poisson solver and some additional operators. While in 2006 a reduced version of the code used only 128 cores, several algorithmic improvements made it possible to run on 8000 cores in 2010 and on 459000 cores in 2013. Some of the largest supercomputers in Europe have been used to conduct these numerical experiments. Thanks to very good scalability and portability, everyday GYSELA runs use from 8000 to 32000 cores. However, it was found that when the number of cores was doubled for a given case, the memory footprint was far from halved, as it ideally should be. As a consequence, many very large physical cases were impossible to run because the memory was exhausted. By introducing more sophisticated algorithms, this bottleneck was removed and the memory scalability was significantly improved. Recently, work has been carried out to adapt the code to the next generations of machines; some of the key components are vectorization, avoiding synchronizations induced by the management of parallelism, and overlapping communications with computations.
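The memory scalability issue described above can be illustrated with a simple model. The sketch below is not taken from GYSELA: it assumes that the footprint of each process is the sum of a replicated part, which does not shrink when cores are added, and a distributed part, which is divided evenly among processes. The function names and the sizes used (1 GB replicated, 1024 GB distributed) are hypothetical and chosen only for illustration.

```python
def memory_per_process(n_procs, replicated_gb, distributed_gb):
    """Per-process footprint under the assumed model.

    The replicated part is paid in full by every process; only the
    distributed part shrinks as 1 / n_procs.
    """
    return replicated_gb + distributed_gb / n_procs


def doubling_ratio(n_procs, replicated_gb, distributed_gb):
    """Footprint at 2 * n_procs divided by footprint at n_procs.

    A ratio of 0.5 means perfect memory scalability; a ratio close to
    1.0 means that doubling the core count no longer helps.
    """
    after = memory_per_process(2 * n_procs, replicated_gb, distributed_gb)
    before = memory_per_process(n_procs, replicated_gb, distributed_gb)
    return after / before


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    for procs in (128, 1024, 8192):
        ratio = doubling_ratio(procs, replicated_gb=1.0, distributed_gb=1024.0)
        print(f"{procs:5d} -> {2 * procs:5d} processes: ratio = {ratio:.3f}")
```

Under these assumptions the ratio moves from about 0.56 at 128 processes toward 1.0 at several thousand processes, because the replicated part ends up dominating. In this model, shrinking the replicated part is what restores a ratio close to 0.5.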

Along with the efforts to achieve good parallelization, it is worthwhile to improve the numerical methods in order to increase the precision and the realism of the simulations. Indeed, parallel algorithms and numerical schemes are tightly coupled. Thus, a specific operator splitting method in the Vlasov solver and an improvement of the initial equilibrium function make it possible to better preserve certain mathematical invariants. This contribution also helped improve the precision and the robustness of the code. A series of theoretical studies have established that the alignment of the main physical structures along the magnetic field lines can be used to reduce the number of mesh points needed in the direction parallel to the field lines. I devised a new numerical method with *aligned interpolation* for GYSELA. This approach saves a large number of mesh points and thus reduces the cost of simulations. I also obtained a much better model of the poloidal plane, which improves the realism of the simulations by removing a boundary condition near the *r=0* location.

Over time, accelerator devices have seen increasing success in the HPC field. Part of my research was devoted to designing algorithms for clusters of such computing devices. A parallel solution for the petroleum industry (Reverse Time Migration method) was developed on a cluster of GPUs. The memory access patterns and the management of both CPU-GPU and MPI communications were the main bottlenecks to tackle there. In addition, the development of very fine-grained algorithms was important to achieve good performance. I also carried out optimization work on some of the GYSELA computation kernels on the Intel KNC and KNL many-core processors (also called Xeon Phi). A major problem here is to vectorize adequately, because this is an essential condition for shortening execution times. Some memory-bound and compute-bound kernels have shown good performance compared to more conventional computing devices, but this was not straightforward to achieve. Again, memory access patterns and cache-friendliness represent a real challenge, much more so than for a standard processor. Auto-tuning techniques were also helpful in addressing some of the issues related to performance portability and to sensitivity both to architectural features and to application-dependent parameters.

One of the constant problems facing the parallel application designer is to find solutions that increase efficiency, portability and code readability at the same time. The complexity of hardware, of scientific applications and of numerical schemes, together with the difficulty of choosing a programming model, all contribute to this multi-faceted problem. Nevertheless, possible avenues are discussed to overcome these obstacles and to soon run large applications on the upcoming exascale machines.

Contact : vi214773