Add derecho-gpu machine config #1
base: current-0.082
Quick update. New The fix involved removing the compilers file and adding the
I suspect that this xmlchange doesn't make sense.
Returning to this after a break:
Some useful tips for future:
During development I tested with the following:
Checked how to run resubmit jobs - I believe Jack ran 6 monthly jobs?
Trying again:
Further observations
Approach
Timings
Single node run
65 timesteps in 40 mins using 64 processors on a single gpudev node.
Debug
Attempting 2. we are not running on a GPU - cannot find id 0 when count is 0 error. Attempting to change
Debug 2
qstat -x -f 2449906 | sed -n '/exec_vnode/,/Resource_List/p' | tr '+' '\n' | sed 's/[()]//g'
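The filter stages in the `Debug 2` command are easier to follow on captured text. The sketch below wraps the same `sed`/`tr` stages in a function and feeds it a made-up two-node sample. The helper name `split_exec_vnode`, the node names, and the resource values are all illustrative assumptions, not output from job 2449906.

```shell
# Same filter stages as the qstat command above, reading from stdin.
# split_exec_vnode is a hypothetical helper name.
split_exec_vnode() {
    sed -n '/exec_vnode/,/Resource_List/p' | tr '+' '\n' | sed 's/[()]//g'
}

# Made-up sample resembling the relevant part of `qstat -x -f` output.
# Node names and resource counts are placeholders, not real job data.
sample='    exec_vnode = (nodeA:ncpus=64:ngpus=1)+(nodeB:ncpus=64:ngpus=1)
    Resource_List.select = 2'

# Prints one vnode chunk per line with the parentheses stripped.
printf '%s\n' "$sample" | split_exec_vnode
```

Each `+`-separated chunk ends up on its own line, which should make it quicker to see whether an `ngpus` resource was actually assigned on every node.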
From https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/#job-scripts
I'd missed a few additional steps:
Also noticed that ./xmlquery NGPUS_PER_NODE was set to 0
TomMelt
left a comment
Thanks for this @ma595.
I haven't finished reviewing / testing yet. Just adding this so we can keep track for now.
- we should tag this commit and update the documentation in https://github.com/DataWaveProject/CAM/blob/nonlocal-gws-global/README.md to modify Externals.cfg
TomMelt
left a comment
Thanks @ma595 this actually worked for me with no issues whatsoever.
Here are the steps I took. I will write this up in the documentation on CAM so users can properly set up GPU runs when using the UNet model.
Steps to reproduce
manually git clone ccs_config into CAM
cd CAM
rm -rf ccs_config/
git clone https://github.com/DataWaveProject/ccs_config_cesm.git ccs_config
create new case
cime/scripts/create_newcase --case $HOME/cam-unet --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho-gpu
modify xml vars:
./xmlchange JOB_WALLCLOCK_TIME=01:00:00
./xmlchange JOB_QUEUE=main
./xmlchange NTASKS=64
./xmlchange MAX_CPUTASKS_PER_GPU_NODE=64
./xmlchange NGPUS_PER_NODE=1
./xmlchange MAX_GPUS_PER_NODE=1
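The `xmlchange` calls above can be grouped so they are applied in one go. This is only a sketch: `apply_gpu_settings`, `GPU_SETTINGS`, and the `XMLCHANGE` override are hypothetical names introduced here, and the values are copied from the steps above.

```shell
# GPU-related settings from the steps above, one NAME=VALUE per word.
GPU_SETTINGS="JOB_WALLCLOCK_TIME=01:00:00 JOB_QUEUE=main NTASKS=64 MAX_CPUTASKS_PER_GPU_NODE=64 NGPUS_PER_NODE=1 MAX_GPUS_PER_NODE=1"

# apply_gpu_settings is a hypothetical helper. It calls ./xmlchange by
# default; set XMLCHANGE=echo for a dry run that only prints each setting.
apply_gpu_settings() {
    for setting in $GPU_SETTINGS; do
        # Stop at the first failing change so a bad value is not missed.
        "${XMLCHANGE:-./xmlchange}" "$setting" || return 1
    done
}
```

Run it from inside the case directory (for example `cd $HOME/cam-unet && apply_gpu_settings`), then check `./xmlquery NGPUS_PER_NODE` to confirm the value is no longer 0.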
I think we should merge this so that it can be tagged. Ready to include in CAM.
TomMelt
left a comment
Ah sorry, I just realised that this probably breaks the normal run. I was too keen. I will first test what happens when we run a cpu-only run 😬
TomMelt
left a comment
sorry for the back and forth. I just tested on derecho and it looks like it correctly picks the config based on the machine config specified in the cli:
For example, without GPU
cime/scripts/create_newcase --case "${CASEDIR}" --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho
with GPU
cime/scripts/create_newcase --case "${CASEDIR}" --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine derecho-gpu
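Since the two commands differ only in the `--machine` value, a small wrapper can make the choice explicit. This is a sketch under that assumption; `newcase_cmd` is a hypothetical helper that prints the command rather than running it, and the other arguments are copied from the commands above.

```shell
# newcase_cmd is a hypothetical helper: it prints (does not run) the
# create_newcase command for a CPU or GPU case.
newcase_cmd() {
    mode=$1      # "gpu" or "cpu"
    casedir=$2   # path of the new case directory
    case "$mode" in
        gpu) machine=derecho-gpu ;;
        cpu) machine=derecho ;;
        *)   echo "usage: newcase_cmd gpu|cpu CASEDIR" >&2; return 1 ;;
    esac
    printf '%s\n' "cime/scripts/create_newcase --case $casedir --compset FMTHIST --res ne30pg3_ne30pg3_mg17 --project USTN0009 --machine $machine"
}

# Print the GPU variant for inspection before running it by hand.
newcase_cmd gpu "$HOME/cam-unet"
```

Printing instead of executing keeps the sketch safe to try outside of derecho; the printed line can be pasted into a shell inside the CAM checkout.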
This adds the derecho-gpu machine to the ccs-config/machines external, which is a tagged release of https://github.com/ESMCI/ccs_config_cesm corresponding to https://github.com/ESMCI/ccs_config_cesm/releases/tag/ccs_config_cesm0.0.82
This modifies:
- config_batch.xml
- config_machine.xml
- config_compilers.xml
- Hardcoded paths to NETCDF_PATH and PNETCDF_PATH.
The trick was to put the derecho-gpu machine configuration above the derecho machine. This was because of the NODENAME_REGEX in derecho, which autodetects the machine that is being run on. I suspect that commenting out this line might suffice.
It's quite clear that there's a bit of a mismatch between the GPU configuration *.xml files provided (from Will Chapman) and the CIME build system we're using. More investigation is required. I copied only the relevant content from the files and the configuration for derecho's GPUs seems to build, with a few caveats:
Even after loading the following module set:
NETCDF could not be found. This is because in the jobs dir:
cmake_macros/CNL.cmake requires NETCDF_DIR.
This is fixed by:
export NETCDF_DIR=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.25/oneapi/2023.0.0/wzol
I also hardcoded some paths for debugging purposes.
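Because a missing `NETCDF_DIR` only shows up once the build fails, a quick check before building may save a cycle. The sketch below is an assumption about how such a check could look; `check_netcdf_dir` is a hypothetical helper and is not part of CIME.

```shell
# check_netcdf_dir is a hypothetical pre-build check: it fails early when
# NETCDF_DIR is unset or does not point at an existing directory.
check_netcdf_dir() {
    if [ -z "${NETCDF_DIR:-}" ]; then
        echo "NETCDF_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$NETCDF_DIR" ]; then
        echo "NETCDF_DIR does not exist: $NETCDF_DIR" >&2
        return 1
    fi
    # Echo the value so it ends up in the terminal or job log.
    echo "NETCDF_DIR=$NETCDF_DIR"
}
```

Calling `check_netcdf_dir` after the `export` above and before building the case should surface a wrong or stale path immediately.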
Before merging:
- config_compilers.xml is used? This might be the way to fix the below.
Test as follows: