From 8bda38e4c2b1410d981b56613f2f6bbccb44d6c0 Mon Sep 17 00:00:00 2001
From: Michael Zingale
Date: Tue, 15 Jul 2025 10:45:56 -0400
Subject: [PATCH] update the NERSC job script docs

this explains some of the options better and adds a note about
write_plotfile_with_checkpoint
---
 sphinx_docs/source/nersc-workflow.rst | 36 ++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/sphinx_docs/source/nersc-workflow.rst b/sphinx_docs/source/nersc-workflow.rst
index fe5f641..71cb60c 100644
--- a/sphinx_docs/source/nersc-workflow.rst
+++ b/sphinx_docs/source/nersc-workflow.rst
@@ -10,17 +10,28 @@ Perlmutter
 GPU jobs
 ^^^^^^^^
 
-Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100 GPUs -- therefore it is best to use
-4 MPI tasks per node.
+Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100
+GPUs---therefore it is best to use 4 MPI tasks per node.
 
-.. note::
+.. important::
 
    you need to load the same modules used to compile the executable
    in your submission script, otherwise, it will fail at runtime because
    it can't find the CUDA libraries.
 
-Below is an example that runs on 16 nodes with 4 GPUs per node, and also
-includes the restart logic to allow for job chaining.
+Below is an example that runs on 16 nodes with 4 GPUs per node. It also
+does the following:
+
+* Includes logic for automatically restarting from the last checkpoint file
+  (useful for job-chaining). This is done via the ``find_chk_file`` function.
+
+* Installs a signal handler to create a ``dump_and_stop`` file shortly before
+  the queue window ends. This ensures that we get a checkpoint at the very
+  end of the queue window.
+
+* Can post to Slack using the :download:`slack_job_start.py
+  <../../job_scripts/perlmutter/slack_job_start.py>` script---this
+  requires a webhook to be installed (in a file ``~/.slack.webhook``).
 
 .. literalinclude:: ../../job_scripts/perlmutter/perlmutter.submit
    :language: sh
@@ -29,7 +40,12 @@ includes the restart logic to allow for job chaining.
 
    With large reaction networks, you may get GPU out-of-memory errors
    during the first burner call. If this happens, you can add
-   ``amrex.the_arena_init_size=0`` after ``${restartString}`` in the srun call
+
+   ::
+
+      amrex.the_arena_init_size=0
+
+   after ``${restartString}`` in the srun call
    so AMReX doesn't reserve 3/4 of the GPU memory for the device arena.
 
 .. note::
@@ -39,6 +55,14 @@ includes the restart logic to allow for job chaining.
    warning signal and the end of the allocation by adjusting the
    ``#SBATCH --signal=B:URG@`` line at the top of the script.
 
+   Also, by default, AMReX will output a plotfile at the same time as a checkpoint file,
+   which means you'll get one from the ``dump_and_stop``, which may not be at the same
+   time intervals as your ``amr.plot_per``. To suppress this, set:
+
+   ::
+
+      amr.write_plotfile_with_checkpoint = 0
+
 
 CPU jobs
 ^^^^^^^^
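
The checkpoint-restart logic described in the new text can be sketched as
follows. This is not the verbatim function from
``job_scripts/perlmutter/perlmutter.submit`` (the ``literalinclude`` pulls
in the real script); it is a minimal, illustrative version, and the
completeness check on ``Header`` is one simple heuristic:

::

   # sketch of find_chk_file: return the newest complete checkpoint.
   # AMReX checkpoint directories are chk followed by 5, 6, or 7 digits,
   # so scanning the patterns shortest-first visits them in step order.
   function find_chk_file {
       local chk_file=""
       for pattern in "chk?????" "chk??????" "chk???????"; do
           for f in $(ls -d ${pattern} 2> /dev/null | sort); do
               # take a checkpoint only if its Header was written, i.e.,
               # the dump was not cut off mid-write
               if [ -f "${f}/Header" ]; then
                   chk_file=${f}
               fi
           done
       done
       echo "${chk_file}"
   }

   chk_file=$(find_chk_file)
   restartString=""
   if [ -n "${chk_file}" ]; then
       # this is what gets appended to the srun command line
       restartString="amr.restart=${chk_file}"
   fi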
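
Likewise, the signal-handler piece can be sketched like this (the
120-second margin and the executable name ``./my_exe`` are placeholders;
the margin is adjusted via the ``--signal`` line, as the note above
explains):

::

   #SBATCH --signal=B:URG@120

   # when SIGURG arrives, ask the application to write a final
   # checkpoint: it watches for a dump_and_stop file and checkpoints
   # and exits cleanly when that file appears
   function sig_handler {
       touch dump_and_stop
       # block until the srun job step finishes writing the checkpoint
       wait
   }

   # the B: in --signal delivers the signal to the batch shell, so trap it here
   trap 'sig_handler' URG

   # launch the job step in the background so the shell can run the trap;
   # 4 MPI tasks per node, one per A100
   srun -n $((SLURM_NNODES * 4)) ./my_exe inputs ${restartString} &
   wait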