7. Checkpointing and Coarse Output
7.1. Checkpointing
1. Extended netCDF writer
The writer for a checkpoint must be extended by the following variables:
commandLine
, writeCount
and in addition to simplify things by hMax
.
The variable commandLine
stores the user input that is to be reapplied from a checkpoint when the simulation is started.
writeCount
stores the number of writes made to the solution file.
hMax
stores the hMax value that was calculated at the start of the simulation. This is needed to calculate the same time steps and the same time scaling.
First, the constructor must be changed to implement the functionality for the checkpoint. This involves defining further variables and setting data for all precalculated variables:
/// File: NetCdf.cpp
/// Header: NetCdf.h
/// Test: NetCdf.test.cpp
tsunami_lab::io::NetCdf::NetCdf( std::string filePath,
t_idx l_nx,
t_idx l_ny,
t_idx l_k,
t_real l_scaleX,
t_real l_scaleY,
t_idx l_stride,
const t_real* bathymetry,
bool useSpherical,
bool useMomenta,
const char* commandLine,
t_real hMax )
: isReadMode( false ), isCheckpoint( commandLine[0] != '\0' ), commandLine( commandLine )
{
[...]
if( isCheckpoint )
{
l_err = nc_def_var( m_ncId,
"commandLine",
NC_CHAR,
1,
&strDimID,
&commandLineID );
checkNcErr( l_err, "commandLine" );
l_err = nc_def_var( m_ncId,
"writeCount",
NC_INT,
1,
&m_dimTimeId,
&m_writeCountId );
checkNcErr( l_err, "writeCount" );
l_err = nc_def_var( m_ncId,
"hMax",
NC_FLOAT,
1,
&m_dimTimeId,
&m_hMaxID );
checkNcErr( l_err, "hMax" );
}
[...]
if( isCheckpoint )
{
start[0] = 0;
count[0] = std::strlen( commandLine ) + 1;
l_err = nc_put_vara( m_ncId,
commandLineID,
start,
count,
commandLine );
checkNcErr( l_err, "putCommandLine" );
index[0] = 0;
l_err = nc_put_var1_float( m_ncId,
m_hMaxID,
index,
&hMax );
checkNcErr( l_err, "putHMax" );
}
}
The write functions must also be updated.
It now takes an additional argument writeCount
and is intended to write all changing variables.
It always synchronizes with the file system with ny_sync
at every write call, so that the data is not damaged in the event of a crash.
/// File: NetCdf.cpp
void tsunami_lab::io::NetCdf::_write( const t_real simulationTime,
const t_real* totalHeight,
const t_real* momentumX,
const t_real* momentumY,
const t_idx nx,
const t_idx ny,
const t_idx stride,
const t_idx writeCount )
{
[...]
if( m_writeCountId >= 0 )
{
indexNC[0] = 0;
unsigned long long ullWriteCount = static_cast<unsigned long long>( writeCount );
l_err = nc_put_var1_ulonglong( m_ncId,
m_writeCountId,
indexNC,
&ullWriteCount );
checkNcErr( l_err, "putWriteCount" );
}
[...]
}
Both the constructor and the write function are combined in a constructor call for a checkpoint creation.
/// File: NetCdf.cpp
tsunami_lab::io::NetCdf::NetCdf( std::string filePath,
t_idx l_nx,
t_idx l_ny,
t_real l_scaleX,
t_real l_scaleY,
t_idx l_stride,
const t_real* bathymetry,
const char* commandLine,
t_real hMax,
const t_real* totalHeight,
const t_real* momentumX,
const t_real* momentumY,
t_real simulationTime,
const t_idx writeCount )
: NetCdf( filePath, l_nx, l_ny, 1, l_scaleX, l_scaleY, l_stride, bathymetry, false, true, commandLine, hMax )
{
_write( simulationTime, totalHeight, momentumX, momentumY, l_nx, l_ny, l_stride, writeCount );
}
2. Setup Checkpoint
The first step is to read the checkpoint netCdf file with the required variables to start the simulation.
For this purpose, an array with variables and an array with the same size for the read data are created.
Both definitions of variables
and data
can be found in CheckPoint.h.
/// File: CheckPoint.cpp
/// Header: CheckPoint.h
/// Test: CheckPoint.test.cpp
const char* tsunami_lab::setups::Checkpoint::variables[8]{ "totalHeight", "bathymetry", "momentumX", "momentumY", "time", "commandLine", "writeCount", "hMax" };
tsunami_lab::setups::Checkpoint::Checkpoint( const char* filepath,
t_idx& writeCount,
t_real& simulationTime,
t_real& hMax,
std::vector<char*>& argv )
{
// reading the checkpoint
tsunami_lab::io::NetCdf reader = tsunami_lab::io::NetCdf();
reader.read( filepath, variables, data );
[...]
Then the length and type of the read data are checked to confirm correct usage. In addition, the commandLine variable is broken down into C-like arguments that are stored in a vector to mimic the user input.
/// FILE: CheckPoint.cpp
[...]
// check totalHeight, bathymetry, momentumX, momentumY
if( !( data[0].length == data[1].length
&& data[0].length == data[2].length
&& data[0].length == data[3].length ) )
{
std::cerr << red << "ERROR: Size is not equal! The size of totalHeight, bathymetry, momentumX or momentumY should be the same. Aborting!" << reset << std::endl;
exit( EXIT_FAILURE );
}
if( !( data[0].type == tsunami_lab::io::NetCdf::FLOAT
&& data[1].type == tsunami_lab::io::NetCdf::FLOAT
&& data[2].type == tsunami_lab::io::NetCdf::FLOAT
&& data[3].type == tsunami_lab::io::NetCdf::FLOAT ) )
{
std::cerr << red << "ERROR: Not of type float! The type of totalHeight, bathymetry, momentumX or momentumY should be float. Aborting!" << reset << std::endl;
exit( EXIT_FAILURE );
}
[...]
// check and convert commandLine to C like argument list
if( data[5].length < 1 )
{
std::cerr << red << "ERROR: Could not read checkpoint because there are no values to read. Aborting!" << reset << std::endl;
exit( EXIT_FAILURE );
}
else if( data[5].type != tsunami_lab::io::NetCdf::CHAR )
{
std::cerr << red << "ERROR: Could not read checkpoint because the type is wrong. Aborting!" << reset << std::endl;
exit( EXIT_FAILURE );
}
else
{
char* text = static_cast<char*>( data[5].array );
size_t wordStart = 0;
size_t wordLength = 0;
for( size_t i = 0; i < data[5].length; i++ )
{
if( text[i] == ' ' )
{
text[i] = '\0';
if( wordLength == 0 ) // skip double spaces
{
wordStart++;
continue;
}
argv.push_back( &text[wordStart] );
wordStart = i + 1;
wordLength = 0;
continue;
}
wordLength++;
}
argv.push_back( &text[wordStart] );
}
}
In order to pass the data to patches::WavePropagation
without unnecessary conversions, the get functions expect index instead of coordinates.
To pass the index instead of the coordinates, the size of a cell in main.cpp is set to one.
/// File: CheckPoint.cpp
tsunami_lab::t_real tsunami_lab::setups::Checkpoint::getBathymetry( t_real indexX,
t_real indexY ) const
{
t_idx index = indexY * data[1].stride + indexX;
return static_cast<float*>( data[1].array )[index];
}
3. Test of checkpointing
When starting the solver, the following line is displayed before the simulation settings are printed:
Checking for Checkpoints: No checkpoint found!
When a checkpoint is created, the following lines are displayed in the console:
writing to 'solutions/new_checkpoint'
finished writing to 'solutions/new_checkpoint'. Use ncdump to view its contents.
In the event of a crash and a restart of the simulation from the checkpoint, the first line changes to the following. The settings are also printed out again for the user to check.
Checking for Checkpoints: Loading checkpoint!
4. Auto Loading a Checkpoint
The main program checks whether a checkpoint already exists next to the solution. If this is the case, the checkpoint is loaded and argv is overwritten so that any user input is discarded. Otherwise, the simulation is started normally with user-defined inputs. When a checkpoint is written, the old checkpoint is retained and is only deleted after the new checkpoint has been written. This ensures the stability of the checkpoint creation process.
/// File: main.cpp
std::string checkpointPath = SOLUTION_FOLDER + "/checkpoint.nc";
std::vector<char*> parsedArgv;
if( fs::exists( checkpointPath ) )
{
std::cout << green << "Loading checkpoint!" << reset << std::endl;
l_setup = new tsunami_lab::setups::Checkpoint( checkpointPath.c_str(),
l_writeCount,
l_simTime,
checkpointHMax,
parsedArgv );
i_argc = parsedArgv.size();
i_argv = parsedArgv.data();
useCheckpoint = true;
}
else
{
std::cout << green << "No checkpoint found!" << reset << std::endl;
}
7.2 Coarse Output
1. Modified output writer
To modify the output recorder so that it averages several neighbouring cells of the calculation grid into one cell of
the output file, we introduce the helper function averageSeveral
. The idea is to add the values of the l_k many
high-resolution cells in x and y direction and then average them by dividing the sum by l_k2 at the end. The whole thing
can then be done analogue for the height and both moments. (l_k is set in the constructor)
/// File: NetCdf.cpp
/// Header: NetCdf.h
/// Test: NetCdf.test.cpp
void tsunami_lab::io::NetCdf::averageSeveral( const t_real simulationTime,
const t_real* totalHeight,
const t_real* momentumX,
const t_real* momentumY )
{
t_idx l_size = m_nx * m_ny;
t_idx l_index = 0;
t_idx l_k2 = m_k * m_k;
[ ... ]
l_size is the new size of the arrays with the reduced number of cells
l_index is the current calculated index of the new array
l_k2 is the number of cells that are joined together
For the implementation, we iterate over our matrix of cells (taking the stride into account, of course). The two outer loops iterate roughly over the matrix. This means that these two loops represent cell blocks of size \(l_k \cdot l_k\). Once in x and once in y direction.
The inner two loops then iterate together over the elements in x and y direction of the cell block. The values of the individual cells of a cell block are then added together inside.
[ ... ]
for( t_idx y = 0; y < m_singleCellny; y += m_k )
{
for( t_idx x = 0; x < m_singleCellnx; x += m_k )
{
for( t_idx i_y = y; i_y < y + m_k; i_y++ )
{
for( t_idx i_x = x; i_x < x + m_k; i_x++ )
{
l_avgHeight += totalHeight[( i_y * m_singleCellStride ) + i_x];
l_avgMomentumX += momentumX[( i_y * m_singleCellStride ) + i_x];
l_avgMomentumY += momentumY[( i_y * m_singleCellStride ) + i_x];
}
[ ... ]
As you can see, we have the variables m_nx, m_ny and m_stride
. Similarly, we have
m_singleCellnx, m_singleCellny and m_singleCellStride
. This is because we have to distinguish between the number a
of high-resolution cells and the number of cells after we have made the summarization. The variables with the prefix
singleCell are the values for the original numbers and those without the prefix for the summarised numbers.
The average values must then be divided by the number of cells in a block (l_k2) and added to the corresponding array. It is important to zero the average values again and increase the index afterwards.
[ ... ]
for( t_idx i_x = x; i_x < x + m_k; i_x++ )
{
l_avgHeight += totalHeight[( i_y * m_singleCellStride ) + i_x];
l_avgMomentumX += momentumX[( i_y * m_singleCellStride ) + i_x];
l_avgMomentumY += momentumY[( i_y * m_singleCellStride ) + i_x];
}
}
// write combined cell to arrays
l_totalHeight[l_index] = l_avgHeight / l_k2;
l_momentumX[l_index] = l_avgMomentumX / l_k2;
l_momentumY[l_index] = l_avgMomentumY / l_k2;
l_index++;
// reset average values
l_avgHeight = 0;
l_avgMomentumX = 0;
l_avgMomentumY = 0;
}
}
[ ... ]
After we have also exited the fourth loop, we call the internal _write
function with our reduced arrays and pass them.
Of course, dont forget to delete the arrays.
[ ... ]
_write( simulationTime, l_totalHeight, l_momentumX, l_momentumY, m_nx, m_ny, m_stride, 0 );
delete[] l_totalHeight;
delete[] l_momentumX;
delete[] l_momentumY;
}
2. Tohoku earthquake and tsunami event
Cell size: 50m
Required cells in x-direction: \(\frac{2700000}{50}=54000\)
Required cells in y-direction: \(\frac{2700000}{50}=30000\)
Note
We have selected an l_k of 5 so that a total of 25 cells are merged into one in the visualisation. Accordingly, the cell size in the video is 250 metres.
The video is the visualisation of the first 10 seconds of our tsunami simulation. This period is of course too short to recognise changes in the tsunami. The reason for the short video is the simulation time. We needed just over 8 hours on LeChuck to calculate 10 seconds.
3. Bonus
To make the differences more visible, we simulated Tohoku with a cell size of \(1000\,m\) and visualised the whole thing with an \(l\_k \text{ of } 1, 5 \text{ and } 25\).
Required cells in x-direction: \(\frac{2700000}{1000}=2700\)
Required cells in y-direction: \(\frac{2700000}{1000}=1500\)
l_k = 1
l_k = 5
l_k = 25
Contribution
All team members contributed equally to the tasks.