Multi-Node Mode: run.sh
Distributes rollout and training across nodes using a shared filesystem for coordination.
Important
Multi-node mode requires a shared filesystem (NFS, Lustre, Azure Blob FUSE) accessible
at the same --log-path on all nodes.
Setup
--num-nodes <N>Number of nodes (default: 1). Node 0 is master, 1+ are workers.
--rank-weights <w1,w2,...>Required for multi-node. Comma-separated load weights (one per node). E.g.,
"2,1,1"gives node 0 50% of tasks, nodes 1-2 get 25% each.--masterDesignate this node as master (rank 0). Auto-detects IP and writes to shared filesystem.
--worker <rank>Designate this node as a worker with the given rank. Discovers master IP from shared filesystem.
Ranks can also be set via RANK or NODE_RANK environment variables.
Examples
# Complete RL loop (3 nodes)
# Master:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --master
# Workers:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 1
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 2
# Rollout only (3 nodes)
# Master:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase rollout --rollout-split train-only --num-nodes 3 --rank-weights "1,1,1" --master
# Workers:
bash scripts/run.sh ... --worker 1
bash scripts/run.sh ... --worker 2
Coordination Protocol
For --rl-phase both:
Phase |
Master (rank 0) |
Workers (rank 1+) |
|---|---|---|
0 |
Sync all nodes at iteration start |
Sync with all nodes |
1-2 |
Train rollout, signal complete |
Wait, then train rollout |
3-4 |
Aggregate train trajectories |
Signal files ready |
5-6 |
Decide eval, run if needed, aggregate |
Read decision, run if needed |
7-9 |
Stop vLLM, prepare training data |
Stop vLLM, wait for config |
10 |
Multi-node training |
Multi-node training |
11-12 |
Copy checkpoint, restart vLLM |
Wait, restart vLLM |
13 |
Final sync before next iteration |
Final sync |
--rl-phase rollout uses phases 0-6 only. --rl-phase update uses phases 7-13 only.
Fault Tolerance
The multi-node system uses strict barrier synchronization without automatic failure detection. If any node fails, all surviving nodes hang indefinitely at the next barrier.
Recovery: Kill all processes on all nodes, clean up incomplete files, and restart:
# On all nodes:
pkill -9 -f "run.sh|rollout.py"
# On shared storage:
rm /logs/train_trajectories/*iteration${N}_rank_*.checkpoint
rm /logs/multinode_flags/*
# Restart all nodes with the same commands
For partial data recovery, manually aggregate available rank checkpoint files using torch.load
and concatenate trajectories.