Shared-Memory Computing: Rules for a Parallel Universe - Parallel Processing
This is a general discussion on the best way to use (or not to use) the shared-memory parallel (SMP) version of FLOW-3D.
Multi-processor, multi-core computers running Windows or Linux are now standard, allowing for efficient multitasking and parallelism. The standard FLOW-3D installation includes both serial and parallel solver capabilities. It is, therefore, important to know how best to utilize the parallel capabilities of your hardware and software.
To run the code in parallel the user needs only to obtain a proper license file. When a license file with a parallel solver token is available on a multi-processor or multi-core computer, the user can choose whether to run the serial or parallel version of the code using the Runtime menu under Preference. The number of threads for parallel execution can also be selected in this menu, with the maximum number of threads defined by the total number of processors available on the system.
FLOW-3D has been programmed for parallel execution on shared memory workstations using OpenMP programming. There are limits to parallelization using this route, and it is important to understand how these limits affect users wishing to speed up their simulations by upgrading to the parallel version.
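One common way to reason about such limits is Amdahl's law, which bounds the speedup when a fixed fraction of the run time stays serial. The sketch below is only an illustration under that assumption: the function name `amdahl_speedup` and the 10% serial fraction are hypothetical and are not measured FLOW-3D figures. The law also ignores the parallel overhead discussed in this article, so real speedups will be lower.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the run time is serial.

    Amdahl's law: S(n) = 1 / (s + (1 - s) / n)
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


if __name__ == "__main__":
    serial_fraction = 0.10  # hypothetical value, chosen only for illustration
    for threads in (1, 2, 4, 8, 16):
        speedup = amdahl_speedup(serial_fraction, threads)
        print(f"{threads:2d} threads: speedup <= {speedup:.2f}")
```

With a serial fraction of 10%, the bound can never exceed 10 regardless of the thread count, which is why reducing serial work such as frequent data output matters as much as adding processors.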
FLOW-3D Solver Components Parallelized for SMP Execution
FLOW-3D Solver Components Executed in Serial
What Affects the Performance of the Parallel Code?
In addition to parts of the code being executed serially, there are other factors that can affect the performance of the parallel code:
- Data output. As mentioned above, output data edits to the flsgrf file are carried out in serial. Moreover, Input/Output is typically much slower than the numerical processing. As a result, frequent spatial data edits (to Restart and Selected data catalogs) will generally degrade the parallel performance.
- Pressure solvers. Even though all pressure solvers have been parallelized, the GMRES pressure solver has superior parallel performance compared to the SOR and ADI solvers. The general explanation for this is that the SOR and ADI solvers were developed before parallelism became available and, therefore, have been optimized mainly for serial performance. The GMRES solver was originally developed with parallelism in mind and scales linearly for up to eight processors (the most we have tested at this point). This is not to say that the GMRES solver is unsuitable for serial execution; it has excellent convergence properties for a wide range of flow problems.
- Parallel overhead. There is a certain level of overhead associated with execution of simulations in parallel mode. This overhead includes both memory and CPU usage.
The multiple threads used by FLOW-3D all contend for the same memory bandwidth, effectively reducing the memory bandwidth available to each thread. Also, additional memory is required to store copies of the so-called private variables for each thread.
The parallel CPU overhead is associated with scheduling and distributing the workload between different threads. This is done automatically during solver execution based on the size of the mesh and the numerical task at hand.
As a result of the parallel overhead, we do not usually recommend using more than four processors. The performance starts leveling off at that point, and the cost of additional processors may not be justified.
- Mesh size. The total number of cells in the mesh and the number of cells in each direction affect the performance in several ways. An ideal mesh is a large grid (> 100,000 cells) with the same number of cells in each direction. A large number of cells ensures that the parallel overhead is relatively small compared to the CPU time spent on executing the actual code. To optimize cache performance, only the y- and z-directions are parallelized; the x-direction is always executed serially. Therefore, it is important to have a sufficient number of cells in the y- and z-directions to allow for parallelism. It is also better to have one large block than several smaller ones because each block is parallelized individually. If you are planning to use 5+ million cells, you will definitely need a 64-bit system with at least 4 GB of memory, although using the single precision version of the solver may just get you by on a 32-bit system. Even that may not be sufficient if the number of cells exceeds 7 million.
- Sparse domain. Even in an “ideal” mesh, if most of the cells are blocked or empty, the parallelization will not be very efficient since only a small subset of cells requires actual processor work in this case, limiting the opportunities for an efficient parallel execution of the DO loops in the code.
- Running other programs. When running the solver, it is critical not to have any other CPU-intensive programs, even word processing or email programs, running at the same time. The allocation of work between the processors in FLOW-3D does not take into account that the processors may be busy with something else. An application that takes half of one processor's time on a four-processor system may cause FLOW-3D to run more than 50% slower, since the three processors running only FLOW-3D will spend half of their time waiting for the one running both applications.
- Small/slow cache/memory. Finally, it is important that the memory (RAM) is fast and the cache is large. When the threads are accessing and competing for data through the common memory pipeline, the data traffic in memory can become very intense. The frequency of cache refreshing can be reduced by having a larger L2 cache. The time needed to load data into the cache from RAM is reduced by using RAM with high bandwidth and low latency. Since parallel threads share cache, it is important to pay attention to cache size on a per-core basis.
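The arithmetic behind the "Running other programs" item can be sketched with a simple model. It assumes that every thread receives an equal share of the work and that all threads wait for the slowest one before continuing, which is a simplification and not a description of the actual FLOW-3D scheduler. The function names and the CPU-share values are hypothetical.

```python
def static_split_slowdown(cpu_shares):
    """Run-time multiplier when work is divided equally among threads.

    `cpu_shares[i]` is the fraction of core i's time left for the solver
    (1.0 means the core is otherwise idle). All threads wait for the
    slowest one, so the core with the smallest share sets the pace.
    """
    if not cpu_shares or any(not 0.0 < s <= 1.0 for s in cpu_shares):
        raise ValueError("each share must be in the range (0, 1]")
    return 1.0 / min(cpu_shares)


def ideal_rebalanced_slowdown(cpu_shares):
    """Run-time multiplier if work could be shifted perfectly to free cores."""
    if not cpu_shares or any(not 0.0 < s <= 1.0 for s in cpu_shares):
        raise ValueError("each share must be in the range (0, 1]")
    return len(cpu_shares) / sum(cpu_shares)


if __name__ == "__main__":
    # Four cores; another application takes half of the last one.
    shares = [1.0, 1.0, 1.0, 0.5]
    print("equal split:     ", static_split_slowdown(shares))
    print("ideal rebalance: ", round(ideal_rebalanced_slowdown(shares), 3))
```

In this model, losing half of one core out of four doubles the run time under an equal split, whereas a perfectly rebalanced run would take only about 14% longer. The gap between the two numbers is the time the other three cores spend waiting.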