Running OpenFOAM in parallel is straightforward once you understand the decomposition workflow. The mesh and fields are split across processor directories, the solver runs with MPI, and results are reassembled with reconstructPar. This guide covers the full parallel workflow, decomposition method selection, and the most common errors.
The full workflow runs blockMesh or snappyHexMesh (serial) → decomposePar → mpirun → reconstructPar:
# Full parallel workflow
blockMesh
decomposePar
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1
reconstructPar
The system/decomposeParDict file controls how the mesh is split. The most important setting is numberOfSubdomains, which must match the -np argument to mpirun.
// system/decomposeParDict
numberOfSubdomains 8;
method scotch;
scotchCoeffs
{
// processorWeights (1 1 1 1 1 1 1 1); // optional: load balancing
}
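Because numberOfSubdomains and -np must agree, some run scripts read the value straight from the dict rather than hard-coding it twice. A minimal sketch, assuming a standard dict layout; the helper name np_from_dict is hypothetical, and OpenFOAM's foamDictionary utility is the more robust way to query dict entries:

```shell
# Hypothetical helper: extract numberOfSubdomains so -np cannot drift
# out of sync with the dict. Plain sed keeps the snippet dependency-free.
np_from_dict() {
    sed -n 's/^numberOfSubdomains[[:space:]]*\([0-9][0-9]*\);.*/\1/p' "$1"
}

# Throwaway dict written only so this snippet runs standalone; in a real
# case, point the helper at system/decomposeParDict instead.
dict=$(mktemp)
printf 'numberOfSubdomains 8;\nmethod scotch;\n' > "$dict"
np_from_dict "$dict"
```

In a run script this becomes mpirun -np "$(np_from_dict system/decomposeParDict)" simpleFoam -parallel.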
Scotch uses graph partitioning to minimise the number of processor-to-processor faces (inter-processor communication). It produces the most balanced decomposition automatically with no geometry input needed. Use scotch unless you have a specific reason not to.
The metis method is similar to scotch: graph-based partitioning with good load balancing. It requires the METIS library to be compiled into OpenFOAM. On some HPC systems metis can be faster than scotch for very large meshes (>50M cells). The configuration is analogous:
method metis;
metisCoeffs
{
processorWeights (1 1 1 1 1 1 1 1); // one weight per subdomain
}
The simple method divides the domain into a regular Cartesian grid of subdomains. It is only appropriate for structured blockMesh cases where the geometry is a simple box: fast to set up, but it produces poor load balancing for complex geometries.
method simple;
simpleCoeffs
{
n (4 2 1); // 4x2x1 = 8 subdomains
delta 0.001;
}
The hierarchical method is like simple but applies the subdivision in a specified order (e.g. x first, then y, then z). It is useful for pipe or channel flows, where decomposition in the streamwise direction reduces inter-processor communication.
method hierarchical;
hierarchicalCoeffs
{
n (4 2 1);
delta 0.001;
order xyz;
}
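For both simple and hierarchical, decomposePar aborts unless the product of the n vector equals numberOfSubdomains. A quick sanity check, sketched in shell with the values from the example above:

```shell
# The n (4 2 1) split must multiply out to numberOfSubdomains (8 here),
# or decomposePar refuses to run.
nx=4; ny=2; nz=1
subdomains=8
echo "$((nx * ny * nz)) subdomains from n ($nx $ny $nz)"
[ $((nx * ny * nz)) -eq "$subdomains" ] && echo "matches numberOfSubdomains $subdomains"
```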
# Standard mpirun command — number of processes must match numberOfSubdomains
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1
# On a cluster with hostfile
mpirun -np 32 --hostfile hosts.txt simpleFoam -parallel > log.simpleFoam 2>&1
# With OpenMPI on SLURM clusters
srun --mpi=pmix -n 32 simpleFoam -parallel > log.simpleFoam 2>&1
Always redirect stdout to a log file with > log.SOLVER 2>&1. The solver log contains residuals and timing information needed for diagnosis and post-processing.
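Those logged residuals can be pulled out with standard text tools for quick convergence monitoring. A sketch: the printf lines fabricate a minimal log excerpt so the snippet runs standalone; in a real case you would grep your actual log.simpleFoam (OpenFOAM also ships the foamLog script for exactly this job):

```shell
# Fabricated excerpt standing in for a real solver log.
log=$(mktemp)
printf '%s\n' \
  'smoothSolver:  Solving for Ux, Initial residual = 0.1, Final residual = 0.004, No Iterations 3' \
  'smoothSolver:  Solving for Ux, Initial residual = 0.02, Final residual = 0.0008, No Iterations 3' \
  > "$log"

# Pull the initial residual for Ux from each logged iteration.
grep 'Solving for Ux' "$log" | sed 's/.*Initial residual = \([^,]*\),.*/\1/'
```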
On clusters using SLURM with InfiniBand, use srun instead of mpirun for better MPI process placement:
# SLURM job script excerpt
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=16
# Load OpenFOAM environment
source /opt/openfoam11/etc/bashrc
decomposePar
srun --mpi=pmix rhoPimpleFoam -parallel > log.rhoPimpleFoam 2>&1
reconstructPar -latestTime
The --mpi=pmix flag tells SLURM to use the PMIx process management interface, which is compatible with most OpenMPI and MPICH builds on modern HPC clusters. On older clusters, use --mpi=pmi2 instead.
# Reconstruct all time steps (can be slow for many time steps)
reconstructPar
# Reconstruct latest time only (fastest for checking current state)
reconstructPar -latestTime
# Reconstruct specific time range
reconstructPar -time '0.5:1.0'
# Reconstruct specific fields only
reconstructPar -fields '(U p)'
Reconstruction reads from all processorN/ directories and writes merged field data to the top-level time directories. It does not delete the processor directories — you can re-run reconstruction after the fact.
For very large cases with many time steps, the disk I/O during reconstruction can be the bottleneck. Use -fields to reconstruct only the fields you need for post-processing. Keep the processor directories until post-processing is complete — they are the authoritative copy of the parallel results.
If you launch mpirun -np 8 but the case was decomposed into 4 subdomains, OpenFOAM aborts with:
// Error: number of processor directories != MPI processes
FOAM FATAL ERROR: number of processor directories = 4 is not equal to the number of processors = 8
Fix: re-run decomposePar with numberOfSubdomains 8 matching your -np 8. If processor directories already exist, remove them first: rm -rf processor*
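The fix can be sketched as a shell sequence; the sed edit assumes numberOfSubdomains sits on its own line, and foamDictionary -entry numberOfSubdomains -set 8 is the native alternative. A throwaway dict stands in for system/decomposeParDict so the edit can be demonstrated standalone:

```shell
# Throwaway dict standing in for system/decomposeParDict, so the sed
# edit below can be shown in isolation.
dict=$(mktemp)
printf 'numberOfSubdomains 4;\nmethod scotch;\n' > "$dict"

# Rewrite the subdomain count to match the intended -np value (8).
sed -i 's/^numberOfSubdomains.*/numberOfSubdomains 8;/' "$dict"
grep numberOfSubdomains "$dict"

# In the real case, finish with:
#   rm -rf processor*
#   decomposePar
#   mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1
```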
If snappyHexMesh was run in parallel (itself decomposed), the mesh lives in the processor directories. Reconstruct the mesh first and then decompose again, or use decomposePar -copyZero, which copies the 0/ directory to each processor as-is instead of decomposing the fields:
# Decompose and copy 0/ directory to each processor
decomposePar -copyZero
# checkMesh on parallel decomposed mesh
mpirun -np 8 checkMesh -parallel > log.checkMesh 2>&1
Before decomposing, running renumberMesh reorders cell labels to improve cache locality and can reduce solve time by 10–20% on large meshes:
# Reorder mesh cells for better cache performance
renumberMesh -overwrite
# Then decompose and run as normal
decomposePar
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1
Run renumberMesh after the mesh is complete (after snappyHexMesh) and before decomposePar. The -overwrite flag writes renumbered data back into the constant/polyMesh directory in place.
The optimal number of parallel processes depends on the mesh size, the interconnect speed, and the computational cost per cell. A practical guideline:
For a 2-million cell mesh on a single workstation with 16 cores, using 8–16 cores is reasonable (125,000–250,000 cells per core). On a cluster with fast InfiniBand, you can scale to 32 cores (62,500 cells per core) while maintaining good efficiency.
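The arithmetic behind those numbers is worth making explicit; a sketch for the 2-million cell example:

```shell
# Cells-per-core for a 2M-cell mesh at different core counts; the
# 50,000-200,000 cells/core band is the usual efficiency sweet spot.
cells=2000000
for cores in 8 16 32; do
    echo "$cores cores -> $((cells / cores)) cells/core"
done
```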
Check load balance after decomposing with scotch by examining how evenly cells are distributed:
# Check load balance after decomposePar
decomposePar -noFields 2>&1 | grep "Number of cells"
Each "Number of cells" line reports the cell count assigned to one processor. Scotch typically lands within 5% of perfect balance. If the imbalance exceeds 20%, the geometry may have isolated regions; try a different numberOfSubdomains, or compensate with processorWeights in scotchCoeffs.
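That imbalance can be quantified with a one-liner over decomposePar's output. A sketch: the printf fabricates a two-processor excerpt in decomposePar's "Number of cells" format so the snippet runs standalone; in a real case, pipe the actual decomposePar log through the awk instead:

```shell
# Fabricated two-processor excerpt in decomposePar's output format.
log=$(mktemp)
printf '%s\n' \
  '    Number of cells = 250312' \
  '    Number of cells = 249688' \
  > "$log"

# Report max, min, and relative imbalance across processors.
awk '/Number of cells/ {
         n = $NF + 0
         if (max == "" || n > max) max = n
         if (min == "" || n < min) min = n
     }
     END { printf "max=%d min=%d imbalance=%.1f%%\n", max, min, 100*(max-min)/min }' "$log"
```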
For large meshes (10M+ cells), snappyHexMesh can be run in parallel to reduce meshing time significantly:
# Step 1: Create background blockMesh
blockMesh
# Step 2: Decompose the background mesh
decomposePar -copyZero
# Step 3: Run snappyHexMesh in parallel
mpirun -np 8 snappyHexMesh -parallel > log.snappyHexMesh 2>&1
# Step 4: Reconstruct the mesh (not the fields)
reconstructParMesh -constant
# Step 5: Remove processor directories from the mesh step
rm -rf processor*
# Step 6: Re-decompose for the solver run
decomposePar -copyZero
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1
Use reconstructParMesh (not reconstructPar) to merge the mesh from processor directories into constant/polyMesh. The -constant flag means it merges only the constant (mesh) data, not the time directory fields.
Use scotch for almost all cases. It uses graph partitioning to minimise processor-to-processor communication faces and automatically balances the load with no geometry input required. Use metis as an alternative on very large meshes (50M+ cells) if scotch is slow. Use simple or hierarchical only for structured blockMesh cases with simple box geometries.
Target 50,000 to 200,000 cells per core. Below 50,000 cells per core, MPI communication overhead dominates and efficiency drops. For a 1-million cell mesh, 8 to 20 cores is typically optimal. On HPC clusters with fast InfiniBand, you can go lower (20,000–50,000 cells per core) while maintaining acceptable parallel efficiency.
The -np argument to mpirun does not match the numberOfSubdomains in decomposeParDict, or processor directories from a previous decomposition exist. Fix: delete existing processor directories with rm -rf processor*, update numberOfSubdomains in decomposeParDict, re-run decomposePar, then relaunch mpirun with the matching -np value.
No. ParaView can read decomposed cases directly using Case Type "Decomposed Case". However, many post-processing tools require reconstructed data. Use reconstructPar -latestTime for quick monitoring during the run and a full reconstructPar at the end for complete post-processing analysis.
renumberMesh reorders cell labels to improve cache locality, reducing solve time by 10–20% on large meshes. Run it with renumberMesh -overwrite after mesh generation (blockMesh or snappyHexMesh) and before decomposePar. It writes the reordered data back to constant/polyMesh in place.
Decompose the background blockMesh with decomposePar -copyZero, then run mpirun -np N snappyHexMesh -parallel. After completion, use reconstructParMesh -constant (not reconstructPar) to merge the mesh from processor directories into constant/polyMesh. Then remove the processor directories and re-run decomposePar for the solver run.