• 1 Keldysh Institute of Applied Mathematics, Miusskaya sq. 4, Moscow, Russia


Combining a rigorous CFD method with comprehensive algorithmic performance optimization makes extremely large-scale CFD simulations feasible: either larger systems can be simulated, or systems that are already simulated can be resolved in more detail. The CFD code has to show both efficient single-node performance and excellent parallel scaling. Record-breaking single-node performance has previously been achieved by applying the LRnLA algorithm and exploiting many-core parallelism as well as vectorization. In the current work, the algorithm is extended to many-node parallelism. The algorithm is characterized by a high degree of parallelism and a small number of inter-node communication events, and it can be concisely described and programmed on the basis of the previously implemented single-node solution, which is a rare feature among algorithms with temporal blocking in all four dimensions (three spatial and one temporal).



