Parallelizing FLOW-3D using OpenMP
This article describes the approach taken in FLOW-3D to parallelization on shared-memory computers using the OpenMP programming paradigm. These developments will be released in FLOW-3D version 10.0 in 2011.
OpenMP typically works at the DO-loop level, achieving fine-grained parallelization. It provides a fork-and-join model in which the program begins execution as a single process, or thread (Fig. 1). The thread executes sequentially until a parallel directive is encountered. The single master thread then creates multiple threads, and all threads execute the statements until the end of the parallel region. The data used by the threads resides in shared memory and is, therefore, readily available to every thread (Fig. 2). Care must be taken, however, to prevent conflicts, such as when one thread tries to read from a location in memory while another thread is updating it.
Figure 1: Fork and join model in an OpenMP program.
Figure 2: A schematic of the shared-memory architecture.
The main advantages of this approach are:
- it allows for incremental parallelization;
- it is relatively easy to maintain a good load balance between threads;
- the implementation is relatively simple and easy to maintain.
The main limitations are:
- memory is limited to that available on a single computer node;
- parallel performance is limited by the number of cores on the node, which is not easily expandable;
- all parts of the code must be parallelized to achieve scaling on a growing number of cores. According to Amdahl’s Law (Fig. 3), for example, leaving just 10% of the code serial limits the speedup to 10x, no matter how many cores are used.
OpenMP enables cross-platform shared-memory parallel programming and is supported by all current Fortran and C/C++ compilers.
Figure 3: Amdahl’s law illustrating the impact of serialization on speedup.
OpenMP in FLOW-3D
OpenMP has been in use in the FLOW-3D solver since 2002. Despite its flexibility, it was a challenge to introduce parallelization into a code that had been developed and optimized for over twenty years for serial execution. Initially, only the most commonly used and computationally intensive regions, such as pressure iterations and VOF advection, were parallelized. The introduction of the Generalized Minimum Residual (GMRES) implicit pressure solver in version 9.0 opened the door to much more efficient parallel performance since this method had been developed specifically for shared-memory parallelization.
This was sufficient for the two- and four-core computers available at the time. With the rapid development of multi-core architectures, however, the need to extend the shared-memory parallelization in FLOW-3D became apparent: scaling had to be pushed to computers with eight or more cores.
Several advances have been made for the upcoming version 10.0 to address these demands. Many models added in the last few years have been fully parallelized, such as:
- General Moving Objects (GMO)
- Split and Unsplit Lagrangian VOF models
- Air entrainment
- Successive Under-Relaxation (SUR) and ADI implicit viscous solvers
- Adiabatic bubble
- Evaporation/condensation model
- Porous media
The previously parallelized sections have been revisited to recalibrate the code for the recently introduced Unstructured Memory Allocation (UMA).
Figure 4 shows the scaling to twelve cores for three simulations: the high-speed impact of a steel sphere on water (GMO impact), the sloshing of liquid fuel in a cylindrical tank (fuel slosh), and metal solidification in a sand mold (solidification).
Figure 4: Scaling to twelve cores for three simulations.
The simulations were run on three different computers (one for simulations using 2, 3 and 4 cores, another for 6 and 8 cores, and the third for 12 cores), which may explain the greater variation in the scaling for 6 and 8 cores.
All three cases show reasonable scaling up to twelve cores, with a good indication that scaling continues beyond twelve cores.
It is also clear that the efficiency of the parallel performance strongly depends on the selection of physical and numerical models. The properties of the computational domain are also a factor (please see the Hints and Tips article Rules for a Parallel Universe - Parallel Processing for best practices). Ideally, the domain should contain a large three-dimensional mesh (> 100,000 active cells) in which all cells require the same amount of work. Empty and blocked cells generally take less CPU time to process and skew the load balance. For example, mold-filling simulations may exhibit limited scaling because the domain is initially empty.
FLOW-3D has built-in timers to report the amount of time (in wall-clock seconds) spent in various parts of the solver. If activated, the timer report is written at the end of the solver summary file, hd3out. This can be useful for identifying the best- and worst-performing models and adjusting the settings accordingly for better performance.
Finally, the ability to change the number of threads at runtime has also been added to version 10.0, allowing the user to adjust the number of cores used by the solver without the need to stop or even pause it. This feature can be useful when, for example, resources need to be temporarily freed for another task.