9. Parallelization

1. Parallelization using OpenMP

Note

We are using the following OpenMP directives:

  • #pragma omp parallel for : “The omp parallel directive explicitly instructs the compiler to parallelize the chosen block of code”[1].

  • #pragma omp parallel for reduction( reduction-identifier:list ) : “Performs a reduction on each data variable in list using the specified reduction-identifier”[1].

  • #pragma omp declare simd : “The omp declare simd directive is applied to a function to create one or more versions for being called in a SIMD loop”[2].

  • #pragma omp simd : “The omp simd directive is applied to a loop to indicate that multiple iterations of the loop can be executed concurrently by using SIMD instructions”[3].
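
As a minimal, self-contained illustration (not project code; all names are made up), the following sketch combines these directives: a parallel loop with a max reduction, and a declare simd function called from a simd loop.

/// Sketch (illustration only, not project code): omp_directives.cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// declare simd: the compiler additionally emits a vectorized version
#pragma omp declare simd
float square( float i_h )
{
    return i_h * i_h;
}

int main()
{
    std::vector< float > l_h( 1024, 1.0f );
    float l_hMax = 0.0f;

    // parallel for with a max reduction across all threads
#pragma omp parallel for reduction( max: l_hMax )
    for( std::ptrdiff_t l_i = 0; l_i < (std::ptrdiff_t)l_h.size(); l_i++ )
    {
        l_hMax = std::max( l_hMax, l_h[l_i] );
    }

    // simd: multiple iterations are executed at once with vector instructions
#pragma omp simd
    for( std::size_t l_i = 0; l_i < l_h.size(); l_i++ )
    {
        l_h[l_i] = square( l_h[l_i] );
    }
    return 0;
}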

In the main.cpp file we use a max reduction to determine the maximum wave height l_hMax.

/// File: main.cpp
[ ... ]
// set up solver
#pragma omp parallel for reduction(max: l_hMax)
for( tsunami_lab::t_idx l_cy = 0; l_cy < l_ny; l_cy++ )
{
    tsunami_lab::t_real l_y = l_cy * cellSize;

    for( tsunami_lab::t_idx l_cx = 0; l_cx < l_nx; l_cx++ )
    {
        tsunami_lab::t_real l_x = l_cx * cellSize;

        // get initial values of the setup
        tsunami_lab::t_real l_h = l_setup->getHeight( l_x,
                                                      l_y );
        l_hMax = std::max( l_h, l_hMax );
    [ ... ]
    }
}

In NetCdf.cpp we parallelized the averageSeveral function to increase the performance when using coarse output.

/// File: NetCdf.cpp
void tsunami_lab::io::NetCdf::averageSeveral( [ ... ] )
{
    [ ... ]
#pragma omp parallel for
    for( t_idx y = 0; y < m_ny; y++ )
    {
        t_idx singleY = y * m_k;
#pragma omp simd
        for( t_idx x = 0; x < m_nx; x++ )
        {
            [ ... ]
        }
    }
}

We applied the omp declare simd directive to our functions in FWave.cpp so that the compiler creates one or more SIMD versions of them for calls from within SIMD loops. These methods are called in WavePropagation2d.cpp.

/// File: FWave.cpp
#pragma omp declare simd
void tsunami_lab::solvers::FWave::computeEigenvalues( [ ... ] )
{
    [ ... ]
}

#pragma omp declare simd
void tsunami_lab::solvers::FWave::computeDeltaFlux( [ ... ] )
{
    [ ... ]
}

#pragma omp declare simd
void tsunami_lab::solvers::FWave::computeEigencoefficients( [ ... ] )
{
    [ ... ]
}

#pragma omp declare simd
void tsunami_lab::solvers::FWave::computeBathymetryEffects( [ ... ] )
{
    [ ... ]
}

// net update without bathymetry
#pragma omp declare simd
void tsunami_lab::solvers::FWave::netUpdates( [ ... ] )
{
    [ ... ]
    computeEigenvalues( i_hL, i_hR, l_uL, l_uR, eigenvalue1, eigenvalue2 );
    [ ... ]
    computeDeltaFlux( i_hL, i_hR, l_uL, l_uR, i_huL, i_huR, deltaFlux );
    [ ... ]
    computeEigencoefficients( eigenvalue1, eigenvalue2, deltaFlux, eigencoefficient1, eigencoefficient2 );
    [ ... ]
}

// net update with bathymetry
#pragma omp declare simd
void tsunami_lab::solvers::FWave::netUpdates( [ ... ] )
{
    [ ... ]
    computeEigenvalues( i_hL, i_hR, l_uL, l_uR, eigenvalue1, eigenvalue2 );
    [ ... ]
    computeDeltaFlux( i_hL, i_hR, l_uL, l_uR, i_huL, i_huR, deltaFlux );
    [ ... ]
    computeBathymetryEffects( i_hL, i_hR, i_bL, i_bR, bathymetry );
    [ ... ]
    computeEigencoefficients( eigenvalue1, eigenvalue2, bathymetryDeltaFlux, eigencoefficient1, eigencoefficient2 );
    [ ... ]
}

However, most of the parallelization takes place in WavePropagation2d.cpp.

/// File: WavePropagation2d.cpp
    // init new cell quantities
#pragma omp parallel for
    for( t_idx l_ce = 0; l_ce < totalCells; l_ce++ )
    {
        [ ... ]
    }
    [ ... ]
#pragma omp parallel for
        for( t_idx i = 0; i < m_yCells + 1; i++ )
        {
            // iterates along the row
#pragma omp simd
            for( t_idx j = 0; j < m_xCells + 1; j++ )
            {
                [ ... ]
            }
        }
    }
    else
    {
        [ ... ]
        // iterates over the rows
#pragma omp parallel for
        for( t_idx i = 0; i < m_yCells + 1; i++ )
        {
            // iterates along the row
#pragma omp simd
            for( t_idx j = 0; j < m_xCells + 1; j++ )
            {
                [ ... ]
            }
        }
    }
    [ ... ]
    // copy the calculated cell quantities
#pragma omp parallel for
    for( t_idx l_ce = 0; l_ce < totalCells; l_ce++ )
    {
        [ ... ]
    }

    // only possible for f-wave solver
    if( hasBathymetry )
    {
        // iterates over the x direction
#pragma omp parallel for
        for( t_idx i = 1; i < full_xCells; i += ITERATIONS_CACHE )
        {
            // iterate over the rows i.e. y-coordinates
            for( t_idx j = 0; j < m_yCells + 1; j++ )
            {
                // iterations for more efficient cache usage
#pragma omp simd
                for( t_idx k = 0; k < ITERATIONS_CACHE; k++ )
                {
                    [ ... ]
                }
            }
        }

        // iterate over the rows i.e. y-coordinates
        for( t_idx j = 0; j < m_yCells + 1; j++ )
        {
            // remaining iterations for more efficient cache usage
#pragma omp simd
            for( t_idx k = 0; k < remaining_xCells; k++ )
            {
                [ ... ]
            }
        }
    }
    else
    {
        [ ... ]
        // iterates over the x direction
#pragma omp parallel for
        for( t_idx i = 1; i < full_xCells; i += ITERATIONS_CACHE )
        {
            // iterate over the rows i.e. y-coordinates
            for( t_idx j = 1; j < m_yCells + 1; j++ )
            {
                // iterations for more efficient cache usage
#pragma omp simd
                for( t_idx k = 0; k < ITERATIONS_CACHE; k++ )
                {
                    [ ... ]
                }
            }
        }

        // iterate over the rows i.e. y-coordinates
        for( t_idx j = 1; j < m_yCells + 1; j++ )
        {
            // remaining iterations for more efficient cache usage
#pragma omp simd
            for( t_idx k = 0; k < remaining_xCells; k++ )
            {
                [ ... ]
            }
        }
    }
}

void tsunami_lab::patches::WavePropagation2d::setGhostOutflow()
{
    [ ... ]
#pragma omp parallel for
    for( t_idx i = 1; i < m_yCells + 1; i++ )
    {
        [ ... ]
    }

#pragma omp parallel for
    for( size_t i = 0; i < stride; i++ )
    {
        [ ... ]
    }
}

#pragma omp declare simd
tsunami_lab::patches::WavePropagation2d::Reflection tsunami_lab::patches::WavePropagation2d::calculateReflection( [ ... ] )
{
    [ ... ]
}

#pragma omp declare simd
tsunami_lab::patches::WavePropagation2d::Reflection tsunami_lab::patches::WavePropagation2d::calculateReflection( [ ... ] )
{
    [ ... ]
}

const tsunami_lab::t_real* tsunami_lab::patches::WavePropagation2d::getTotalHeight()
{
    if( isDirtyTotalHeight )
    {
#pragma omp parallel for
        for( t_idx i = 1; i < m_yCells + 1; i++ )
        {
#pragma omp simd
            for( t_idx j = 1; j < m_xCells + 1; j++ )
            {
                [ ... ]
            }
        }
    }
    [ ... ]
}

void tsunami_lab::patches::WavePropagation2d::updateWaterHeight()
{
    if( !hasBathymetry )
    {
        return;
    }

#pragma omp parallel for
    for( t_idx i = 1; i < m_yCells + 1; i++ )
    {
        [ ... ]
    }
}

2. Comparison of the serial and the parallelized solver

Without parallelization

Start executing 'simulation 2700 1500 -B -w 60 -t 13000 -c 5':
#####################################################
###                  Tsunami Lab                  ###
###                                               ###
### https://scalable.uni-jena.de                  ###
### https://rivinhd.github.io/Tsunami-Simulation/ ###
#####################################################
Checking for Checkpoints: File IO is disabled!
Simulation is set to 2D
Bathymetry is Enabled
Set Solver: FWave
Activated Reflection on None side
Output format is set to netCDF
Writing the X-/Y-Axis in format meters
Simulation Time is set to 13000 seconds
Writing to the disk every 60 seconds of simulation time
Checkpointing every 5 minutes
runtime configuration
  number of cells in x-direction:       2700
  number of cells in y-direction:       1500
  cell size:                            1000
  number of cells combined to one cell: 1
Max speed 307.668
entering time loop
finished time loop
freeing memory
The Simulation took 1 h 28 min 28 sec to finish.
Time per iteration: 597 milliseconds.
Time per cell:      147 nanoseconds.
finished, exiting

With parallelization

Start executing 'OMP_NUM_THREADS=72 ./simulation 2700 1500 -B -w 60 -t 13000 -c 5':
finished writing to 'solutions/simulation/solution.nc'. Use ncdump to view its contents.
The Simulation took 0 h 3 min 39 sec to finish.
Time per iteration: 24 milliseconds.
Time per cell:      6 nanoseconds.
finished, exiting

You can see that the run time is around 1 h 28 min (5308 sec) without parallelization and with file I/O disabled, and just 3 min 39 sec (219 sec) with parallelization and file I/O enabled. This results in a speedup of \(S_{72} = \frac{5308 \text{ sec}}{219 \text{ sec}} \approx 24.237\).
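For reference, this corresponds to a parallel efficiency of \(E_{72} = \frac{S_{72}}{72} \approx 0.34\) on 72 threads; this is only a rough estimate, since the two runs differ in whether file I/O is enabled.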

3. Parallelization of the outer vs. the inner loops

The outer loop can easily be parallelized, as there are no dependencies between the individual cells: an X sweep is parallelized in the Y direction and a Y sweep in the X direction, so the parallelized direction is always independent of the sweep direction. This is particularly advantageous for the X sweep, because each thread works on contiguous rows and thus exploits data locality.

When the inner loop is parallelized, dependencies arise when updating the cells, because the current and the next iteration access the same cell, as sketched below. Critical sections [4] must then be introduced, which significantly slows down the calculation, as threads have to wait for each other.
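A minimal sketch of this dependency (hypothetical names and stand-in values, not the project code): the net update of edge j writes to cell j and to cell j+1, so iterations j and j+1 of the inner loop touch the same cell.

/// Sketch (illustration only, not project code): inner-loop dependency
#include <cstddef>
#include <vector>

int main()
{
    const std::ptrdiff_t l_nx = 8, l_ny = 4;
    std::vector< float > l_h( l_ny * ( l_nx + 1 ), 1.0f );

    // outer loop: the rows are fully independent of each other
#pragma omp parallel for
    for( std::ptrdiff_t l_i = 0; l_i < l_ny; l_i++ )
    {
        // parallelizing this inner loop instead would race on the write to
        // l_h[.. + l_j + 1] and require a critical section [4]
        for( std::ptrdiff_t l_j = 0; l_j < l_nx; l_j++ )
        {
            float l_netUpdateL = 0.1f, l_netUpdateR = -0.1f; // stand-in values
            l_h[l_i * ( l_nx + 1 ) + l_j]     -= l_netUpdateL; // written by iteration l_j
            l_h[l_i * ( l_nx + 1 ) + l_j + 1] -= l_netUpdateR; // and by iteration l_j + 1
        }
    }
    return 0;
}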

In conclusion, parallelizing the outer loop is the better choice: no dependencies arise, data locality is exploited, and deadlocks are ruled out entirely. Parallelizing only the outer loop is also sufficient, since there are far more cells in the x and y directions than there are processors. Parallelizing both loops would reintroduce the cell dependency and thus slow down the computation.

4. Different scheduling and pinning strategies

All threads

Start executing 'OMP_NUM_THREADS=72 ./simulation 2700 1500 -B -w 60 -t 13000 -c 5':
finished writing to 'solutions/simulation/solution.nc'. Use ncdump to view its contents.
The Simulation took 0 h 3 min 39 sec to finish.
Time per iteration: 24 milliseconds.
Time per cell:      6 nanoseconds.
finished, exiting

Every core without their second thread

Start executing 'OMP_NUM_THREADS=36 OMP_PLACES={0}:36:1 ./simulation 2700 1500 -B -w 60 -t 13000 -c 5':
finished writing to 'solutions/simulation/solution.nc'. Use ncdump to view its contents.
The Simulation took 0 h 2 min 45 sec to finish.
Time per iteration: 18 milliseconds.
Time per cell:      4 nanoseconds.
finished, exiting

Only half of the cores with their second thread

Start executing 'OMP_NUM_THREADS=36 OMP_PLACES={0}:36:2 ./simulation 2700 1500 -B -w 60 -t 13000 -c 5':
finished writing to 'solutions/simulation/solution.nc'. Use ncdump to view its contents.
The Simulation took 0 h 3 min 24 sec to finish.
Time per iteration: 22 milliseconds.
Time per cell:      5 nanoseconds.
finished, exiting

Only half of the cores without their second thread

Start executing 'OMP_NUM_THREADS=18 OMP_PLACES={0}:18:1 ./simulation 2700 1500 -B -w 60 -t 13000 -c 5':
finished writing to 'solutions/simulation/solution.nc'. Use ncdump to view its contents.
The Simulation took 0 h 4 min 9 sec to finish.
Time per iteration: 28 milliseconds.
Time per cell:      6 nanoseconds.
finished, exiting

You can see that the second strategy is the best in terms of runtime. It also makes sense that the simulation time increases the fewer cores and threads are used. The reason why the second run is faster than the first is not easy to pin down. One explanation is that the two hardware threads of a core share its execution units, so switching between them causes overhead; since the simulation needs far more computing power than it spends loading data, using both threads per core degrades the runtime. It is unlikely, but possible, that one hardware thread evicts the cached data of the other. Communication only takes place when writing to the file system. This could explain why the third run is faster than the first, which uses one more socket: during the first run, some communication has to cross the NUMA boundary when writing to a file. In the fourth configuration, only half of the cores are used, which reduces the parallel performance and increases the runtime.
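For reference, the OMP_PLACES interval notation used above reads as place:count:stride, so {0}:36:1 describes 36 places on hardware threads 0 through 35, while {0}:36:2 uses every second hardware thread. To verify the actual pinning of a run, an OpenMP 5.0 runtime (e.g. a recent libgomp) can print the affinity of every thread at startup via the standard OMP_DISPLAY_AFFINITY variable, for example:

OMP_DISPLAY_AFFINITY=true OMP_NUM_THREADS=36 OMP_PLACES={0}:36:1 ./simulation 2700 1500 -B -w 60 -t 13000 -c 5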

Contribution

All team members contributed equally to the tasks.