Incompatibilities with Slurm on Virgo #490

@bsobol

Hi,

I've been experimenting with DDS (I run the up-to-date master branch of DDS) on Virgo and encountered two issues:

  • When my custom batch script configuration contained #SBATCH --mem-per-cpu, I got the following error: srun: fatal: cpus_per_task set by two different environment variables SLURM_CPUS_PER_TASK=2 != SLURM_TRES_PER_TASK=cpu:1
    This can be fixed by explicitly passing --cpus-per-task in the srun invocation, and I believe it is connected with the change in Slurm behavior described here: https://docs.icer.msu.edu/2023-05-04_LabNotebook_srun_threading_changes/
  • The second one I don't understand: the command passed to srun via bash -c was not being executed. Dumping the script into a file and passing that file to bash somehow fixes it.
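
A minimal sketch of the first workaround, assuming the batch script may also run in allocations where SLURM_CPUS_PER_TASK is not set. The helper name cpus_per_task_arg is hypothetical and not part of DDS; it only emits the flag when the variable is non-empty, so srun never receives --cpus-per-task without a value.

```shell
# Hypothetical helper (not part of DDS): print an explicit --cpus-per-task
# argument for srun only when Slurm exported SLURM_CPUS_PER_TASK.
cpus_per_task_arg() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        # Prints e.g. --cpus-per-task=2
        printf '%s=%s\n' '--cpus-per-task' "$SLURM_CPUS_PER_TASK"
    fi
    # Prints nothing when the variable is unset or empty.
}

# Intended use inside the batch script (left unquoted on purpose so that
# an empty result expands to no argument at all):
#   srun $(cpus_per_task_arg) --no-kill --kill-on-bad-exit=0 ... &
```

Whether this guard is needed depends on how the job is submitted; with #SBATCH --cpus-per-task always present, passing the variable directly should behave the same.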

Here's the diff with the changes I made in the job.slurm.in file:

@@ -16,8 +16,10 @@
 # continue waiting for child processes by any means
 trap -- '' SIGINT SIGTERM
 
+echo 'trap  '"'"'kill $PID && wait'"'"'  SIGINT SIGTERM; eval JOB_WRK_DIR=%DDS_AGENT_ROOT_WRK_DIR%/${SLURM_JOB_NAME}_${SLURM_JOBID}_${SLURMD_NODENAME}; mkdir -p $JOB_WRK_DIR; cd $JOB_WRK_DIR; cp %DDS_SCOUT% $JOB_WRK_DIR/; ./DDSWorker.sh & PID=$!;  wait'  > srun_script.sh
+
 # execute DDS Scout
-srun --no-kill --kill-on-bad-exit=0 --output=slurm-%j-%N.out /usr/bin/env bash -c 'trap  '"'"'kill $PID && wait'"'"'  SIGINT SIGTERM; eval JOB_WRK_DIR=%DDS_AGENT_ROOT_WRK_DIR%/${SLURM_JOB_NAME}_${SLURM_JOBID}_${SLURMD_NODENAME}; mkdir -p $JOB_WRK_DIR; cd $JOB_WRK_DIR; cp %DDS_SCOUT% $JOB_WRK_DIR/; ./DDSWorker.sh & PID=$!;  wait' &
+srun --cpus-per-task $SLURM_CPUS_PER_TASK --no-kill --kill-on-bad-exit=0 --output=slurm-%j-%N.out /usr/bin/env bash 'srun_script.sh' &
 
 wait
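
The nested '"'"' quoting needed to echo the script into a file is easy to get wrong. As a sketch of an alternative, assuming the same script body as in the diff, a quoted here-document writes the text verbatim, so $PID, the ${SLURM_...} variables, and the %DDS_...% template placeholders all reach the file unexpanded. The function name write_srun_script and its path parameter are hypothetical.

```shell
# Hypothetical helper (not part of DDS): write the per-task script to the
# path given as $1. The quoted delimiter ('EOF') disables all expansion,
# so the body lands in the file exactly as typed.
write_srun_script() {
    cat > "$1" <<'EOF'
trap 'kill $PID && wait' SIGINT SIGTERM
eval JOB_WRK_DIR=%DDS_AGENT_ROOT_WRK_DIR%/${SLURM_JOB_NAME}_${SLURM_JOBID}_${SLURMD_NODENAME}
mkdir -p $JOB_WRK_DIR
cd $JOB_WRK_DIR
cp %DDS_SCOUT% $JOB_WRK_DIR/
./DDSWorker.sh & PID=$!
wait
EOF
}

# Intended use inside the batch script, before the srun line:
#   write_srun_script srun_script.sh
```

The commands are the same as in the one-line version, with newlines in place of the semicolons.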

BTW, is there an intended way to reserve more than one CPU per slot for multithreaded tasks?

Regards,
Bartosz
