FLOW-3D/MP 4.0 - Performance Improvements
This Development Note highlights the parallel performance improvements on distributed memory systems made in FLOW-3D/MP version 4.0, which is scheduled for release later this year.
The development effort to improve the efficiency of the parallel calculations consists of three components:
- reduced amount of data sent via MPI communication
- reduced number of synchronization points
- improved computational load balancing using the automatic domain decomposition tool (ADT)
Reduced amount of MPI communication
Flow through a dam, with color showing the distance traveled by water from the start of the simulation.
Currently, FLOW-3D/MP uses mesh blocks to divide the computational domain for parallel calculations. The blocks are then distributed among the MPI ranks on a cluster. Communication between ranks consists of interpolating the solution variables between adjacent mesh blocks and sending the data from one rank to another over the cluster interconnect.
The communication procedure has been optimized by consolidating the parts of the mesh that take part in the data exchange between mesh blocks, together with the associated algorithms, into a single optimized procedure. Instead of sending the raw solution data to the receiving rank for interpolation, the sending rank now performs the interpolation first and sends only the result. This reduces the amount of data transferred by up to a factor of eight. The transferred data includes only active mesh cells; for example, cells blocked by the geometry are excluded. Redundant data communication has also been eliminated.
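The effect of interpolating on the sending side can be illustrated with a minimal Python sketch. It is not FLOW-3D/MP code: it assumes trilinear interpolation from eight donor cells per target cell (one plausible reading of the "up to eightfold" figure), and the `targets` layout and all function names are hypothetical.

```python
def trilinear(corners, fx, fy, fz):
    """Interpolate from 8 corner values; corners[i][j][k] with i, j, k in {0, 1}."""
    value = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                weight = ((fx if i else 1 - fx)
                          * (fy if j else 1 - fy)
                          * (fz if k else 1 - fz))
                value += weight * corners[i][j][k]
    return value


def pack_interpolated(targets):
    """Sender-side packing: one interpolated value per ACTIVE target cell."""
    return [trilinear(t["corners"], *t["frac"]) for t in targets if t["active"]]


def raw_value_count(targets):
    """Values needed if the receiver interpolated: 8 donor values per target
    cell, assuming no donor cells are shared between targets (worst case)."""
    return 8 * len(targets)


# Hypothetical exchange region with three target cells, one blocked by geometry.
uniform = [[[2.0, 2.0], [2.0, 2.0]], [[2.0, 2.0], [2.0, 2.0]]]
ramp_x = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
targets = [
    {"active": True, "corners": uniform, "frac": (0.5, 0.5, 0.5)},
    {"active": False, "corners": uniform, "frac": (0.5, 0.5, 0.5)},
    {"active": True, "corners": ramp_x, "frac": (0.25, 0.5, 0.5)},
]

packed = pack_interpolated(targets)
print(len(packed), "values sent instead of", raw_value_count(targets))
```

In this toy case the sender transmits two values where a receiver-side interpolation would have needed 24. The factor of eight per cell comes from the interpolation stencil; skipping the blocked cell reduces the message further.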
Removal of unnecessary synchronization
There are several instances in the execution of the parallel solver where all MPI ranks must pause and wait to exchange a small amount of data with each other before proceeding. These events are called synchronization points.
FLOW-3D/MP can process more than one mesh block on an MPI rank, and the domain decomposition considers the combined size of all mesh blocks processed by each rank. Previously, some of the synchronization points were placed on a per-block basis. When one rank had a small mesh block followed by a large one, all ranks would synchronize after the small block and then again after the large block was processed. This could add significant idle time on other ranks whose mesh blocks were sized differently. In some cases, the time needed to process the blocks was doubled.
The synchronization procedure has been reorganized by making each MPI rank process all its blocks before synchronizing with other ranks, greatly reducing the wait time on each rank.
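A small timing model shows why per-block synchronization is costly. This is a sketch only: the function names and the block times are made up for illustration, and the model ignores communication cost.

```python
def time_sync_per_block(block_times):
    """Wall time when all ranks synchronize after every block.

    block_times[r][b] is the time rank r needs for its b-th block.
    Each step lasts as long as the slowest rank in that step.
    """
    n_steps = max(len(times) for times in block_times)
    total = 0.0
    for b in range(n_steps):
        total += max(times[b] if b < len(times) else 0.0
                     for times in block_times)
    return total


def time_sync_per_rank(block_times):
    """Wall time when each rank processes all its blocks, then synchronizes."""
    return max(sum(times) for times in block_times)


# Two ranks with the same total work, but blocks in opposite size order.
block_times = [
    [1.0, 5.0],  # rank 0: small block, then large block
    [5.0, 1.0],  # rank 1: large block, then small block
]

per_block = time_sync_per_block(block_times)  # 5.0 + 5.0 = 10.0
per_rank = time_sync_per_rank(block_times)    # max(6.0, 6.0) = 6.0
print(per_block, per_rank)
```

Both ranks carry six time units of work, yet synchronizing after each block stretches the step to ten units because each rank waits for the other's large block. Synchronizing once per rank removes that idle time.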
Better load balance using ADT
ADT decomposes the domain for balanced processor
loads in this dam simulation example.
The Automatic Decomposition Tool (ADT) uses the number of active cells to distribute the computational load between ranks. A beta version of ADT was provided with FLOW-3D/MP version 3.2. ADT allows the user to decompose the domain quickly and efficiently, instead of relying on a tedious and often inaccurate manual procedure.
In FLOW-3D/MP version 4.0, ADT has been incorporated into the graphical user interface, further simplifying the setup of simulations for parallel execution. The ADT algorithm has also been improved to count only the real (interior) cells, rather than including the mesh boundary cells. This improvement primarily affects cases that use more than eight ranks. Most eight-rank simulations divide the domain once in each coordinate direction, converting a single mesh block into eight blocks. This type of division gives each new block a plane of boundary cells. However, some eight-rank cases and all cases with more than eight ranks require more than one division in a given coordinate direction. Blocks that are completely surrounded by other blocks were then left without boundary cells that could be counted during the division, which made the counts inconsistent between blocks. The ADT algorithm was modified to divide the domain based on interior cells only, which results in a better load balance for most problems.
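The idea of balancing by active cell count, rather than by mesh extent, can be sketched in one dimension. This is not the ADT algorithm, whose details are not given here; the function name and the per-plane counts below are hypothetical.

```python
def split_by_active_cells(active_per_plane, n_parts):
    """Split a 1D run of mesh planes into n_parts contiguous slabs.

    active_per_plane[i] is the number of active interior cells in plane i.
    A cut is placed where the running total first reaches k / n_parts of
    the overall total. Returns a list of (start, stop) index pairs.
    A single very heavy plane can produce an empty slab in this simple form.
    """
    total = sum(active_per_plane)
    cuts = [0]
    running = 0
    next_part = 1
    for i, count in enumerate(active_per_plane):
        running += count
        while next_part < n_parts and running >= total * next_part / n_parts:
            cuts.append(i + 1)
            next_part += 1
    cuts.append(len(active_per_plane))
    return list(zip(cuts[:-1], cuts[1:]))


# Twelve planes; the first three are mostly open, the rest mostly blocked.
active_per_plane = [120, 120, 120, 40, 40, 40, 40, 40, 40, 40, 40, 40]

slabs = split_by_active_cells(active_per_plane, 3)
loads = [sum(active_per_plane[a:b]) for a, b in slabs]
naive_loads = [sum(active_per_plane[a:a + 4]) for a in (0, 4, 8)]
print(slabs, loads, naive_loads)
```

For this made-up mesh, three slabs of four planes each would carry 400, 160, and 160 active cells, while cutting by active-cell count gives slabs of 2, 4, and 6 planes with 240 cells each.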
Overall performance improvement
As with most incremental improvements in efficiency, the overall performance improvement in version 4.0 varies from case to case. It is most significant for simulations that were greatly affected by the issues in version 3.2 described in this note. An important result of the improvements is that parallel scaling has been extended from 16 to 64 ranks for some cases. Scaling is also more consistent and is achieved for a wider range of simulations. Some of the improvements are more pronounced at a higher number of ranks.
Figure 1: Speedup improvement for a simulation of flow through a dam, which was performing poorly with FLOW-3D/MP version 3.2.
Fig. 1 shows the performance improvement for a simulation of water flowing through a dam, a case that was affected by all of the performance problems described here. Scaling with version 3.2 was quite poor in this case, so the improvement in version 4.0 is especially pronounced.