HPC Guide

OpenFOAM Parallel Running: decomposePar and MPI Setup

Running OpenFOAM in parallel is straightforward once you understand the decomposition workflow. The mesh and fields are split across processor directories, the solver runs with MPI, and results are reassembled with reconstructPar. This guide covers the full parallel workflow, decomposition method selection, and the most common errors.

By CFDpilot · Updated April 2026

1. The parallel workflow overview

  1. Generate mesh with blockMesh or snappyHexMesh (serial)
  2. Decompose the case with decomposePar
  3. Run the solver with mpirun
  4. Reconstruct results with reconstructPar
  5. Post-process with ParaView (on reconstructed or decomposed data)
# Full parallel workflow
blockMesh
decomposePar
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1
reconstructPar

2. decomposeParDict: the key settings

The system/decomposeParDict file controls how the mesh is split. The most important setting is numberOfSubdomains, which must match the -np argument to mpirun.

// system/decomposeParDict
numberOfSubdomains  8;

method              scotch;

scotchCoeffs
{
    // processorWeights (1 1 1 1 1 1 1 1);  // optional: load balancing
}

3. Decomposition methods: scotch, metis, simple, hierarchical

scotch (recommended default)

Scotch uses graph partitioning to minimise the number of processor-to-processor faces (inter-processor communication). It produces the most balanced decomposition automatically with no geometry input needed. Use scotch unless you have a specific reason not to.

metis

Similar to scotch — graph-based partitioning with good load balancing. Requires the METIS library to be compiled into OpenFOAM. On some HPC systems, metis may be faster than scotch for very large meshes (>50M cells). The configuration is identical:

method  metis;
metisCoeffs
{
    processorWeights    (1 1 1 1 1 1 1 1);  // one weight per subdomain
}

simple (structured meshes only)

Divides the domain into a regular Cartesian grid of subdomains. Only appropriate for structured blockMesh cases where the geometry is a simple box. Fast to set up but produces poor load balancing for complex geometries.

method  simple;
simpleCoeffs
{
    n           (4 2 1);  // 4x2x1 = 8 subdomains
    delta       0.001;
}

hierarchical

Like simple but applies the subdivision in a specified order (e.g. x first, then y, then z). Useful for pipe flows or channel flows where decomposition in the streamwise direction reduces inter-processor communication.

method  hierarchical;
hierarchicalCoeffs
{
    n           (4 2 1);
    delta       0.001;
    order       xyz;
}

4. Running with mpirun

# Standard mpirun command — number of processes must match numberOfSubdomains
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1

# On a cluster with hostfile
mpirun -np 32 --hostfile hosts.txt simpleFoam -parallel > log.simpleFoam 2>&1

# With OpenMPI on SLURM clusters
srun --mpi=pmix -n 32 simpleFoam -parallel > log.simpleFoam 2>&1

Always redirect stdout to a log file with > log.SOLVER 2>&1. The solver log contains residuals and timing information needed for diagnosis and post-processing.
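
For a quick check while the solver is running, the residual and timing lines can be pulled straight out of that log with standard text tools. A minimal sketch, assuming the default solver output format and the log.simpleFoam name used above:

# Watch the latest initial residuals for Ux
grep "Solving for Ux" log.simpleFoam | tail -n 5

# Extract the per-iteration timing lines
grep "ExecutionTime" log.simpleFoam | tail -n 5

OpenFOAM also ships the foamLog script, which splits a solver log into per-quantity files under logs/ for plotting.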

On clusters using SLURM with InfiniBand, use srun instead of mpirun for better MPI process placement:

# SLURM job script excerpt
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=16

# Load OpenFOAM environment
source /opt/openfoam11/etc/bashrc

decomposePar
srun --mpi=pmix rhoPimpleFoam -parallel > log.rhoPimpleFoam 2>&1
reconstructPar -latestTime

The --mpi=pmix flag tells SLURM to use the PMIx process management interface, which is compatible with most OpenMPI and MPICH builds on modern HPC clusters. On older clusters, use --mpi=pmi2 instead.
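
If you are not sure which process-management interfaces your cluster's SLURM build supports, you can list them directly:

# List the MPI plugin types available to srun on this cluster
srun --mpi=list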

5. Reconstructing parallel results

# Reconstruct all time steps (can be slow for many time steps)
reconstructPar

# Reconstruct latest time only (fastest for checking current state)
reconstructPar -latestTime

# Reconstruct specific time range
reconstructPar -time '0.5:1.0'

# Reconstruct specific fields only
reconstructPar -fields '(U p)'

Reconstruction reads from all processorN/ directories and writes merged field data to the top-level time directories. It does not delete the processor directories — you can re-run reconstruction after the fact.

For very large cases with many time steps, the disk I/O during reconstruction can be the bottleneck. Use -fields to reconstruct only the fields you need for post-processing. Keep the processor directories until post-processing is complete — they are the authoritative copy of the parallel results.
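
The reconstructPar flags combine, so a quick monitoring pass during a long run can reconstruct just the latest time and just the fields of interest (U and p here are simply the usual incompressible set):

# Reconstruct only U and p at the latest written time
reconstructPar -latestTime -fields '(U p)'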

6. Common errors and fixes

Processor directory count mismatch

If you run mpirun -np 8 but the case was decomposed into 4 subdomains, OpenFOAM aborts with:

// Error: number of processor directories != MPI processes
FOAM FATAL ERROR: number of processor directories = 4 is not equal to the number of processors = 8

Fix: re-run decomposePar with numberOfSubdomains 8 matching your -np 8. If processor directories already exist, remove them first: rm -rf processor*
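
A minimal fix sequence is sketched below; the foamDictionary call is just one way to change the entry, and editing system/decomposeParDict by hand works equally well:

# Remove the stale decomposition, update numberOfSubdomains, re-decompose
rm -rf processor*
foamDictionary -entry numberOfSubdomains -set 8 system/decomposeParDict
decomposePar
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1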

Decomposition not matching mesh after snappyHexMesh

If snappyHexMesh was run in parallel, the mesh already lives in the processor directories rather than in constant/polyMesh. Reconstruct the mesh first with reconstructParMesh -constant (see section 9), then re-run decomposePar, using -copyZero so the unchanged 0/ directory is copied to each processor rather than decomposed field by field:

# Decompose and copy 0/ directory to each processor
decomposePar -copyZero

Running checkMesh on a decomposed case

# checkMesh on parallel decomposed mesh
mpirun -np 8 checkMesh -parallel > log.checkMesh 2>&1

7. renumberMesh for performance

Running renumberMesh before decomposing reorders cell labels to improve cache locality and can reduce solve time by 10–20% on large meshes:

# Reorder mesh cells for better cache performance
renumberMesh -overwrite

# Then decompose and run as normal
decomposePar
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1

Run renumberMesh after the mesh is complete (after snappyHexMesh) and before decomposePar. The -overwrite flag writes renumbered data back into the constant/polyMesh directory in place.

8. How many cores to use: cells per core guideline

The optimal number of parallel processes depends on the mesh size, the interconnect speed, and the computational cost per cell. A practical guideline is to target roughly 50,000–200,000 cells per core; below about 50,000 cells per core, MPI communication overhead starts to dominate and parallel efficiency drops.

For a 2-million cell mesh on a single workstation with 16 cores, using 8–16 cores is reasonable (125,000–250,000 cells per core). On a cluster with fast InfiniBand, you can scale to 32 cores (62,500 cells per core) while maintaining good efficiency.
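
A quick back-of-the-envelope check is to divide the cell count reported by checkMesh by the number of ranks you plan to use. A small sketch, assuming a serial checkMesh run and 8 ranks:

# Estimate cells per core before decomposing (8 ranks assumed)
checkMesh > log.checkMesh 2>&1
CELLS=$(grep -m1 "cells:" log.checkMesh | awk '{print $2}')
echo "Cells per core on 8 ranks: $((CELLS / 8))"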

Check the scotch load balance by capturing the decomposePar output when you decompose the case; it lists the number of cells assigned to each processor:

# Capture the decomposition report and check the per-processor cell counts
decomposePar > log.decomposePar 2>&1
grep "Number of cells" log.decomposePar

Scotch typically achieves within 5% of perfect balance. If the imbalance exceeds 20%, the geometry may have isolated regions; try increasing numberOfSubdomains or use scotchCoeffs with processorWeights to compensate.
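
The weights set each rank's relative share of cells. A sketch with illustrative values that give the last two ranks twice the share of the others, for example because they sit on a faster node:

// system/decomposeParDict: biased scotch weights (illustrative values)
numberOfSubdomains  8;

method              scotch;

scotchCoeffs
{
    // larger weight = proportionally more cells assigned to that rank
    processorWeights (1 1 1 1 1 1 2 2);
}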

9. Running snappyHexMesh in parallel

For large meshes (10M+ cells), snappyHexMesh can be run in parallel to reduce meshing time significantly:

# Step 1: Create background blockMesh
blockMesh

# Step 2: Decompose the background mesh
decomposePar -copyZero

# Step 3: Run snappyHexMesh in parallel
mpirun -np 8 snappyHexMesh -parallel > log.snappyHexMesh 2>&1

# Step 4: Reconstruct the mesh (not the fields)
reconstructParMesh -constant

# Step 5: Remove processor directories from the mesh step
rm -rf processor*

# Step 6: Re-decompose for the solver run
decomposePar -copyZero
mpirun -np 8 simpleFoam -parallel > log.simpleFoam 2>&1

Use reconstructParMesh (not reconstructPar) to merge the mesh from processor directories into constant/polyMesh. The -constant flag means it merges only the constant (mesh) data, not the time directory fields.


Frequently Asked Questions

What decomposition method should I use — scotch, metis, or simple?

Use scotch for almost all cases. It uses graph partitioning to minimise processor-to-processor communication faces and automatically balances the load with no geometry input required. Use metis as an alternative on very large meshes (50M+ cells) if scotch is slow. Use simple or hierarchical only for structured blockMesh cases with simple box geometries.

How many cores should I use for an OpenFOAM parallel simulation?

Target 50,000 to 200,000 cells per core. Below 50,000 cells per core, MPI communication overhead dominates and efficiency drops. For a 1-million cell mesh, 8 to 20 cores is typically optimal. On HPC clusters with fast InfiniBand, you can go lower (20,000–50,000 cells per core) while maintaining acceptable parallel efficiency.

Why does OpenFOAM throw 'number of processor directories != MPI processes'?

The -np argument to mpirun does not match the numberOfSubdomains in decomposeParDict, or processor directories from a previous decomposition exist. Fix: delete existing processor directories with rm -rf processor*, update numberOfSubdomains in decomposeParDict, re-run decomposePar, then relaunch mpirun with the matching -np value.

Do I need to run reconstructPar after every parallel simulation?

No. ParaView can read decomposed cases directly using Case Type "Decomposed Case". However, many post-processing tools require reconstructed data. Use reconstructPar -latestTime for quick monitoring during the run and a full reconstructPar at the end for complete post-processing analysis.
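
A common way to open the decomposed case is an empty .foam stub file in the case directory, which ParaView's built-in OpenFOAM reader recognises (the stub file name itself is arbitrary):

# Open the decomposed case in ParaView without reconstructing
touch case.foam
paraview case.foam &
# then set Case Type to "Decomposed Case" in the reader's Properties panel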

What is renumberMesh and when should I run it?

renumberMesh reorders cell labels to improve cache locality, reducing solve time by 10–20% on large meshes. Run it with renumberMesh -overwrite after mesh generation (blockMesh or snappyHexMesh) and before decomposePar. It writes the reordered data back to constant/polyMesh in place.

How do I run snappyHexMesh in parallel?

Decompose the background blockMesh with decomposePar -copyZero, then run mpirun -np N snappyHexMesh -parallel. After completion, use reconstructParMesh -constant (not reconstructPar) to merge the mesh from processor directories into constant/polyMesh. Then remove the processor directories and re-run decomposePar for the solver run.