Skip to content

Conversation

@rbx
Copy link
Member

@rbx rbx commented Oct 21, 2025

Adds example for running the n-m topology on Slurm cluster with GSI vae25 using sbatch only.

@rbx rbx changed the title add pure vae25+fairm n-m example add pure vae25+fairmq n-m example Oct 21, 2025
@coderabbitai
Copy link

coderabbitai bot commented Oct 21, 2025

📝 Walkthrough

Walkthrough

Adds a comprehensive Slurm documentation guide and shell script for running the FairMQ n-m example on distributed clusters. The guide covers setup, prerequisites, topology configuration, troubleshooting, and monitoring. The script orchestrates a 1 synchronizer, 3 senders, and 4 receivers topology across allocated Slurm nodes.

Changes

Cohort / File(s) Summary
Slurm n-m Example Setup
examples/n-m/SLURM_README.md
New comprehensive guide documenting how to run the FairMQ n-m example on Slurm clusters, including access instructions for GSI vae25, prerequisites, quick start workflow, script customization, topology explanation, troubleshooting, and monitoring commands.
Slurm Orchestration Script
examples/n-m/fairmq-start-ex-n-m-slurm.sh
New Slurm-compatible bash script that configures job resources, sets up FairSoft binaries, and orchestrates a distributed n-m topology (1 synchronizer, 3 senders, 4 receivers) across allocated nodes using srun with PUB/SUB and PUSH/PULL channel configurations.

Sequence Diagram(s)

sequenceDiagram
    participant Slurm as Slurm Scheduler
    participant Sync as Synchronizer<br/>(Node 0)
    participant Recv as Receivers<br/>(Nodes 1-2)
    participant Send as Senders<br/>(Nodes 1-2)

    Slurm->>Sync: Launch (port 5555)
    Slurm->>Recv: Launch 4 receivers
    Recv->>Sync: Connect to PUB channel<br/>(receive state info)
    Sync->>Recv: Publish state<br/>(ready signal)
    Slurm->>Send: Launch 3 senders
    Send->>Recv: Connect to PUSH/PULL<br/>(send data)
    Recv->>Recv: Receive & process<br/>(in parallel)
    Send->>Send: Send data stream<br/>(in parallel)
    Note over Sync,Send: All components run<br/>until completion
    Sync->>Sync: Teardown
    Recv->>Recv: Teardown
    Send->>Send: Teardown
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

The additions consist of straightforward documentation and a standard Slurm orchestration script. The script follows conventional patterns for launching distributed processes without complex branching logic or intricate state management. Homogeneous review focus: documentation accuracy and script correctness for launching components in sequence.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "add pure vae25+fairmq n-m example" directly relates to the changeset, which adds two new files to support running the n-m FairMQ topology on a Slurm cluster specifically for the GSI vae25 system. The title clearly identifies the key components being added: vae25 (the specific cluster), FairMQ (the framework), and n-m example (the topology). While the word "pure" is somewhat unclear in intent, the overall title effectively communicates the primary change without being misleading or off-topic.
Description Check ✅ Passed The pull request description "Adds example for running the n-m topology on Slurm cluster with GSI vae25 using sbatch only" is clearly related to the changeset. It appropriately describes what is being added (an example for the n-m topology), where it runs (Slurm cluster with GSI vae25), and the specific approach (using sbatch only). The description provides meaningful context that aligns with both the SLURM_README.md guide and the accompanying shell script, and is neither vague nor off-topic.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
examples/n-m/fairmq-start-ex-n-m-slurm.sh (2)

29-29: Resolve shellcheck warnings for proper array handling and word splitting protection.

The script has multiple instances of unquoted array expansions and command substitutions that can cause word splitting if hostnames contain spaces or special characters. While unlikely on typical HPC clusters, these should be fixed to follow shell best practices.

Apply these corrections:

-# Get the list of allocated nodes
-NODELIST=($(scontrol show hostname $SLURM_NODELIST))
+# Get the list of allocated nodes
+mapfile -t NODELIST < <(scontrol show hostname $SLURM_NODELIST)
-# Assign nodes to devices
-SYNC_NODE=${NODELIST[0]}
-SENDER_NODES=(${NODELIST[1]} ${NODELIST[2]} ${NODELIST[3]})
-RECEIVER_NODES=(${NODELIST[4]} ${NODELIST[5]} ${NODELIST[6]} ${NODELIST[7]})
+# Assign nodes to devices
+SYNC_NODE="${NODELIST[0]}"
+SENDER_NODES=("${NODELIST[1]}" "${NODELIST[2]}" "${NODELIST[3]}")
+RECEIVER_NODES=("${NODELIST[4]}" "${NODELIST[5]}" "${NODELIST[6]}" "${NODELIST[7]}")
-echo "Sender nodes: ${SENDER_NODES[@]}"
-echo "Receiver nodes: ${RECEIVER_NODES[@]}"
+echo "Sender nodes: ${SENDER_NODES[*]}"
+echo "Receiver nodes: ${RECEIVER_NODES[*]}"

Also applies to: 33-34, 41-42


45-54: Quote array expansion in receiver address construction.

For consistency and defensive coding, quote the array variable expansion at line 50 and 52:

 # Build receiver addresses for senders to connect to
 RECEIVER_ADDRESSES=""
 for i in $(seq 0 $((NUM_RECEIVERS - 1))); do
     RECEIVER_PORT=$((RECEIVER_BASE_PORT + i))
     if [ $i -eq 0 ]; then
-        RECEIVER_ADDRESSES="address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}"
+        RECEIVER_ADDRESSES="address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}"
     else
-        RECEIVER_ADDRESSES="${RECEIVER_ADDRESSES},address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}"
+        RECEIVER_ADDRESSES="${RECEIVER_ADDRESSES},address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}"

Note: Within double-quoted strings, array subscript expansion is safe, so this is a very minor style suggestion.

examples/n-m/SLURM_README.md (3)

58-62: Fix grammar and markdown link formatting.

Two minor issues:

  1. Line 60: Use hyphenated adjective for clarity: "8-hour time limit" (not "8 hour")
  2. Line 62: Wrap the bare URL in markdown link syntax
-The script uses the `main` partition by default (8 hour time limit). Other available partitions: `debug` (30 min), `grid` (3 days), `long` (7 days), `high_mem`, `gpu`.
+The script uses the `main` partition by default (8-hour time limit). Other available partitions: `debug` (30 min), `grid` (3 days), `long` (7 days), `high_mem`, `gpu`.

-**Cluster Documentation:** https://hpc.gsi.de/virgo
+**Cluster Documentation:** [https://hpc.gsi.de/virgo](https://hpc.gsi.de/virgo)

15-15: Add language specifications to fenced code blocks.

Improve markdown rendering by specifying languages for code blocks:

 1. **Connect to the submit node:**
    ```bash
    ssh vae25.hpc.gsi.de

At line 169, wrap the ASCII diagram:

 ### Communication Pattern
 
-```
+```text
 Synchronizer (PUB)
     |
     | sync messages
     v
 Sender 1, 2, 3 (SUB -> PUSH)
     |
     | data distribution based on message ID
     v
 Receiver 1, 2, 3, 4 (PULL)
-```
+```

At line 205, specify the error message block:

 **Symptoms:**
-```
+```text
 slurmstepd: error: couldn't chdir to `/u/username/...': No such file or directory
-```
+```

Also applies to: 169-169, 205-205


210-213: Minor wordiness in troubleshooting solution.

Line 210 can be simplified:

-- The home directory is not accessible from compute nodes within the container
+- The home directory is inaccessible from compute nodes within the container

Otherwise, the troubleshooting section is comprehensive and addresses real issues users will encounter, particularly the critical point about lustre filesystem access in the vae25 environment.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dcea48f and 6ef7f49.

📒 Files selected for processing (2)
  • examples/n-m/SLURM_README.md (1 hunks)
  • examples/n-m/fairmq-start-ex-n-m-slurm.sh (1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/n-m/SLURM_README.md

[grammar] ~60-~60: Use a hyphen to join words.
Context: ... uses the main partition by default (8 hour time limit). Other available partit...

(QB_NEW_EN_HYPHEN)


[style] ~210-~210: Consider using “inaccessible” to avoid wordiness.
Context: ... Solution: - The home directory is not accessible from compute nodes within the container...

(NOT_ABLE_PREMIUM)

🪛 markdownlint-cli2 (0.18.1)
examples/n-m/SLURM_README.md

15-15: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


62-62: Bare URL used

(MD034, no-bare-urls)


169-169: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


205-205: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
examples/n-m/fairmq-start-ex-n-m-slurm.sh

[warning] 29-29: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)


[warning] 33-33: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 33-33: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 33-33: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[error] 41-41: Argument mixes string and array. Use * or separate argument.

(SC2145)


[error] 42-42: Argument mixes string and array. Use * or separate argument.

(SC2145)

🔇 Additional comments (4)
examples/n-m/fairmq-start-ex-n-m-slurm.sh (1)

100-106: Completion logic looks good.

The wait command correctly blocks until all background srun processes complete. The status messaging is clear and helpful for monitoring.

examples/n-m/SLURM_README.md (3)

64-94: Prerequisites and quick start are clear and well-structured.

The prerequisites are concise and the quick start workflow follows standard Slurm patterns. Instructions are easy to follow.


96-185: Customization and topology sections are comprehensive and accurate.

Clear guidance on modifying configuration parameters with proper emphasis on the critical constraint (updating node counts when changing topology). The topology explanation aligns well with the accompanying script.


231-255: Advanced and monitoring sections are well-structured.

The interactive mode guidance and monitoring command examples are practical and appropriately comprehensive for users debugging or inspecting jobs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant