-
Notifications
You must be signed in to change notification settings - Fork 34
add pure vae25+fairmq n-m example #535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdds a comprehensive Slurm documentation guide and shell script for running the FairMQ n-m example on distributed clusters. The guide covers setup, prerequisites, topology configuration, troubleshooting, and monitoring. The script orchestrates a 1 synchronizer, 3 senders, and 4 receivers topology across allocated Slurm nodes. Changes
Sequence Diagram(s)sequenceDiagram
participant Slurm as Slurm Scheduler
participant Sync as Synchronizer<br/>(Node 0)
participant Recv as Receivers<br/>(Nodes 1-2)
participant Send as Senders<br/>(Nodes 1-2)
Slurm->>Sync: Launch (port 5555)
Slurm->>Recv: Launch 4 receivers
Recv->>Sync: Connect to PUB channel<br/>(receive state info)
Sync->>Recv: Publish state<br/>(ready signal)
Slurm->>Send: Launch 3 senders
Send->>Recv: Connect to PUSH/PULL<br/>(send data)
Recv->>Recv: Receive & process<br/>(in parallel)
Send->>Send: Send data stream<br/>(in parallel)
Note over Sync,Send: All components run<br/>until completion
Sync->>Sync: Teardown
Recv->>Recv: Teardown
Send->>Send: Teardown
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes The additions consist of straightforward documentation and a standard Slurm orchestration script. The script follows conventional patterns for launching distributed processes without complex branching logic or intricate state management. Homogeneous review focus: documentation accuracy and script correctness for launching components in sequence. Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
examples/n-m/fairmq-start-ex-n-m-slurm.sh (2)
29-29: Resolve shellcheck warnings for proper array handling and word splitting protection.The script has multiple instances of unquoted array expansions and command substitutions that can cause word splitting if hostnames contain spaces or special characters. While unlikely on typical HPC clusters, these should be fixed to follow shell best practices.
Apply these corrections:
-# Get the list of allocated nodes -NODELIST=($(scontrol show hostname $SLURM_NODELIST)) +# Get the list of allocated nodes +mapfile -t NODELIST < <(scontrol show hostname $SLURM_NODELIST)-# Assign nodes to devices -SYNC_NODE=${NODELIST[0]} -SENDER_NODES=(${NODELIST[1]} ${NODELIST[2]} ${NODELIST[3]}) -RECEIVER_NODES=(${NODELIST[4]} ${NODELIST[5]} ${NODELIST[6]} ${NODELIST[7]}) +# Assign nodes to devices +SYNC_NODE="${NODELIST[0]}" +SENDER_NODES=("${NODELIST[1]}" "${NODELIST[2]}" "${NODELIST[3]}") +RECEIVER_NODES=("${NODELIST[4]}" "${NODELIST[5]}" "${NODELIST[6]}" "${NODELIST[7]}")-echo "Sender nodes: ${SENDER_NODES[@]}" -echo "Receiver nodes: ${RECEIVER_NODES[@]}" +echo "Sender nodes: ${SENDER_NODES[*]}" +echo "Receiver nodes: ${RECEIVER_NODES[*]}"Also applies to: 33-34, 41-42
45-54: Quote array expansion in receiver address construction.For consistency and defensive coding, quote the array variable expansion at line 50 and 52:
# Build receiver addresses for senders to connect to RECEIVER_ADDRESSES="" for i in $(seq 0 $((NUM_RECEIVERS - 1))); do RECEIVER_PORT=$((RECEIVER_BASE_PORT + i)) if [ $i -eq 0 ]; then - RECEIVER_ADDRESSES="address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}" + RECEIVER_ADDRESSES="address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}" else - RECEIVER_ADDRESSES="${RECEIVER_ADDRESSES},address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}" + RECEIVER_ADDRESSES="${RECEIVER_ADDRESSES},address=tcp://${RECEIVER_NODES[$i]}:${RECEIVER_PORT}"Note: Within double-quoted strings, array subscript expansion is safe, so this is a very minor style suggestion.
examples/n-m/SLURM_README.md (3)
58-62: Fix grammar and markdown link formatting.Two minor issues:
- Line 60: Use hyphenated adjective for clarity: "8-hour time limit" (not "8 hour")
- Line 62: Wrap the bare URL in markdown link syntax
-The script uses the `main` partition by default (8 hour time limit). Other available partitions: `debug` (30 min), `grid` (3 days), `long` (7 days), `high_mem`, `gpu`. +The script uses the `main` partition by default (8-hour time limit). Other available partitions: `debug` (30 min), `grid` (3 days), `long` (7 days), `high_mem`, `gpu`. -**Cluster Documentation:** https://hpc.gsi.de/virgo +**Cluster Documentation:** [https://hpc.gsi.de/virgo](https://hpc.gsi.de/virgo)
15-15: Add language specifications to fenced code blocks.Improve markdown rendering by specifying languages for code blocks:
1. **Connect to the submit node:** ```bash ssh vae25.hpc.gsi.deAt line 169, wrap the ASCII diagram:
### Communication Pattern -``` +```text Synchronizer (PUB) | | sync messages v Sender 1, 2, 3 (SUB -> PUSH) | | data distribution based on message ID v Receiver 1, 2, 3, 4 (PULL) -``` +```At line 205, specify the error message block:
**Symptoms:** -``` +```text slurmstepd: error: couldn't chdir to `/u/username/...': No such file or directory -``` +```Also applies to: 169-169, 205-205
210-213: Minor wordiness in troubleshooting solution.Line 210 can be simplified:
-- The home directory is not accessible from compute nodes within the container +- The home directory is inaccessible from compute nodes within the containerOtherwise, the troubleshooting section is comprehensive and addresses real issues users will encounter, particularly the critical point about lustre filesystem access in the vae25 environment.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/n-m/SLURM_README.md(1 hunks)examples/n-m/fairmq-start-ex-n-m-slurm.sh(1 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/n-m/SLURM_README.md
[grammar] ~60-~60: Use a hyphen to join words.
Context: ... uses the main partition by default (8 hour time limit). Other available partit...
(QB_NEW_EN_HYPHEN)
[style] ~210-~210: Consider using “inaccessible” to avoid wordiness.
Context: ... Solution: - The home directory is not accessible from compute nodes within the container...
(NOT_ABLE_PREMIUM)
🪛 markdownlint-cli2 (0.18.1)
examples/n-m/SLURM_README.md
15-15: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
62-62: Bare URL used
(MD034, no-bare-urls)
169-169: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
205-205: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Shellcheck (0.11.0)
examples/n-m/fairmq-start-ex-n-m-slurm.sh
[warning] 29-29: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
[warning] 33-33: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 33-33: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 33-33: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 34-34: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[error] 41-41: Argument mixes string and array. Use * or separate argument.
(SC2145)
[error] 42-42: Argument mixes string and array. Use * or separate argument.
(SC2145)
🔇 Additional comments (4)
examples/n-m/fairmq-start-ex-n-m-slurm.sh (1)
100-106: Completion logic looks good.The
waitcommand correctly blocks until all background srun processes complete. The status messaging is clear and helpful for monitoring.examples/n-m/SLURM_README.md (3)
64-94: Prerequisites and quick start are clear and well-structured.The prerequisites are concise and the quick start workflow follows standard Slurm patterns. Instructions are easy to follow.
96-185: Customization and topology sections are comprehensive and accurate.Clear guidance on modifying configuration parameters with proper emphasis on the critical constraint (updating node counts when changing topology). The topology explanation aligns well with the accompanying script.
231-255: Advanced and monitoring sections are well-structured.The interactive mode guidance and monitoring command examples are practical and appropriately comprehensive for users debugging or inspecting jobs.
Adds example for running the n-m topology on Slurm cluster with GSI vae25 using sbatch only.