High Performance Computing (HPC)

The High Performance Computing (HPC) version of FLOW-3D is the distributed-memory version of the solver, designed to run on high-performance computer clusters and allowing engineers to tackle problems with very large computational domains or long simulation runtimes. It uses a hybrid MPI-OpenMP methodology to parallelize, and consequently speed up, calculations on multiple CPU cores across the compute nodes of a cluster. The simulation domain is decomposed into multiple sub-domains, which are distributed across the compute nodes of the cluster, dividing the computational work between them. The solution on the different sub-domains is synchronized by exchanging data between nodes through a Message Passing Interface (MPI) library. Within each sub-domain, OpenMP threads are spawned to further parallelize the computation. This combination of MPI and OpenMP parallelization enhances solver performance, significantly reducing runtimes for large simulations.
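To illustrate the hybrid approach, the sketch below shows the generic MPI-OpenMP pattern in C: each MPI rank owns one sub-domain, OpenMP threads parallelize the loop over that sub-domain's cells, and an MPI reduction synchronizes the partial results across ranks. This is a minimal, self-contained example of the parallelization pattern, not FLOW-3D solver code; the cell count and the per-cell "update" are placeholders.

    /* Minimal sketch of a hybrid MPI-OpenMP update loop (illustrative only;
     * the cell count and the work done per cell are placeholders). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Request thread support so OpenMP threads can coexist with MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each MPI rank owns one sub-domain of the decomposed grid. */
        const int ncells = 1000000 / nranks;
        double local_sum = 0.0, global_sum = 0.0;

        /* OpenMP threads parallelize the work within the sub-domain. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < ncells; ++i)
            local_sum += 1.0;              /* stand-in for a cell update */

        /* MPI synchronizes the partial results between sub-domains. */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("cells updated: %.0f on %d ranks x %d threads\n",
                   global_sum, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

Such a program is typically built with an MPI compiler wrapper (e.g., mpicc -fopenmp) and launched with mpirun, with the number of ranks and threads per rank chosen to match the cluster's nodes and cores.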

Why use HPC?

Current HPC hardware consists of multi-core, multi-CPU nodes (ccNUMA shared memory) connected over a fast network infrastructure such as InfiniBand. With their advantages of better computational performance and efficiency, lower power consumption, reduced cost, and greater flexibility, multi-core clusters have become pervasive in scientific computing.

Multi-core clusters allow users to increase grid resolution, improving solution accuracy and resolving more features of the flow. The HPC version of FLOW-3D has been designed and optimized to exploit the best features of such clusters, providing significantly reduced runtimes while retaining solution accuracy. Finally, the memory limitations of stand-alone workstations can be overcome by the distributed-memory approach of HPC.

What kind of performance can I expect?

While the actual performance of HPC varies between simulations, the solver has shown scaling up to 600 cores for a range of applications including metal casting, water and environmental, microfluidics, and aerospace. Details and performance plots for several cases are presented on the benchmarks page. For an ideal case, i.e., a fully fluid-filled computational domain, the HPC version has shown scaling up to 1,200 cores.

How do I use the HPC version?

The HPC version is typically installed and run on a compute cluster, which can be a stand-alone cluster or part of a supercomputing facility. The graphical user interface provided with the installation allows the user to easily set up and run simulations. For large-scale clusters where simulations are run through a job scheduler such as PBS, Torque, or SGE, users have access to a job submission utility that is highly configurable and scheduler-independent.

For users with limited hardware resources or outdated CPUs, or with large simulations and/or parametric studies to run, the HPC version is also available on the cloud. FLOW-3D CLOUD is a cloud computing service that lets users expand their available hardware resources to thousands of CPU cores without having to acquire and maintain a cluster.

What’s in the latest HPC version?

All physical models are compatible with the hybrid MPI-OpenMP methodology. Computational load balance is a critical aspect of the HPC solver and greatly affects its performance. Load balancing can be categorized as static (before the simulation starts) or dynamic (during the simulation). For static load balancing, an Automatic Decomposition Tool is provided that subdivides the computational domain into multiple sub-domains (MPI domains), distributing the active cells as evenly as possible between them. This minimizes synchronization time between the sub-domains and enhances performance.
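The sketch below illustrates the idea behind such a static decomposition under a simplifying assumption: the domain is cut along a single axis into contiguous slabs with roughly equal totals of active cells. It is an illustration of the concept only, not the Automatic Decomposition Tool; the function name and the per-plane cell counts are hypothetical.

    /* Illustrative 1-D static decomposition: split a column of per-plane
     * active-cell counts into nsub contiguous slabs with roughly equal
     * totals. A simplified sketch of the idea, not the actual tool. */
    #include <stdio.h>

    static void decompose_1d(const int *active, int nplanes, int nsub, int *cut)
    {
        long total = 0;
        for (int k = 0; k < nplanes; ++k)
            total += active[k];

        long target = total / nsub, accum = 0;
        int s = 0;
        for (int k = 0; k < nplanes && s < nsub - 1; ++k) {
            accum += active[k];
            if (accum >= (s + 1) * target)  /* close slab once its share is met */
                cut[s++] = k + 1;           /* cut[s] = first plane of next slab */
        }
        while (s < nsub - 1)
            cut[s++] = nplanes;             /* degenerate slabs if planes run out */
    }

    int main(void)
    {
        /* Hypothetical active-cell counts per grid plane along one axis. */
        int active[8] = {10, 40, 80, 80, 80, 40, 10, 10};
        int cut[3];                         /* boundaries for 4 sub-domains */

        decompose_1d(active, 8, 4, cut);
        printf("cuts after planes: %d %d %d\n", cut[0], cut[1], cut[2]);
        return 0;
    }

For the counts above, the cuts fall after planes 3, 4, and 5, giving slab totals of 130, 80, 80, and 60 active cells: not perfectly even, but far better balanced than cutting the axis into four slabs of equal length.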

For dynamic load balancing, the dynamic thread balancing feature can be used to adjust the number of OpenMP threads over the course of the simulation. For one-fluid, free-surface simulations, this feature has yielded performance gains of up to 20%.
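The following sketch shows one way such an adjustment could look: each rank periodically reports a cost measure (for example, its active-cell count or the wall time of its last cycle), and the threads of a shared per-node pool are reassigned in proportion to that cost. The rebalancing rule, the function name, and the assumption of a shared thread pool are simplifications for illustration, not FLOW-3D's actual algorithm.

    /* Sketch of the dynamic-balancing idea: periodically give each rank a
     * number of OpenMP threads proportional to its share of the work.
     * Hypothetical helper, not FLOW-3D's implementation. */
    #include <mpi.h>
    #include <omp.h>

    void rebalance_threads(double my_cost, MPI_Comm node_comm, int node_threads)
    {
        double sum_cost = 0.0;

        /* Share the per-rank cost measure among the ranks on this node. */
        MPI_Allreduce(&my_cost, &sum_cost, 1, MPI_DOUBLE, MPI_SUM, node_comm);
        if (sum_cost <= 0.0)
            return;

        /* Assign this rank a proportional slice of the node's thread pool. */
        int nthreads = (int)(node_threads * my_cost / sum_cost + 0.5);
        if (nthreads < 1)
            nthreads = 1;
        omp_set_num_threads(nthreads);
    }

Called every few cycles with, say, the rank's active-cell count as my_cost and a node-local communicator as node_comm, this gives more threads to ranks whose sub-domains have filled with fluid, so lightly loaded ranks do not leave cores idle while waiting for them.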