Multi-Node Mode: run.sh

Distributes rollout and training across nodes using a shared filesystem for coordination.

Important

Multi-node mode requires a shared filesystem (NFS, Lustre, Azure Blob FUSE) accessible at the same --log-path on all nodes.

Setup

--num-nodes <N>: Number of nodes (default: 1). Node 0 is master, 1+ are workers.
--rank-weights <w1,w2,...>: Required for multi-node. Comma-separated load weights (one per node). E.g., "2,1,1" gives node 0 50% of tasks, nodes 1-2 get 25% each.
--master: Designate this node as master (rank 0). Auto-detects IP and writes to shared filesystem.
--worker <rank>: Designate this node as a worker with the given rank. Discovers master IP from shared filesystem.

Ranks can also be set via RANK or NODE_RANK environment variables.

Examples

# Complete RL loop (3 nodes)
# Master:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --master
# Workers:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 1
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 2

# Rollout only (3 nodes)
# Master:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase rollout --rollout-split train-only --num-nodes 3 --rank-weights "1,1,1" --master
# Workers:
bash scripts/run.sh ... --worker 1
bash scripts/run.sh ... --worker 2

Coordination Protocol

For --rl-phase both:

Phase	Master (rank 0)	Workers (rank 1+)
0	Sync all nodes at iteration start	Sync with all nodes
1-2	Train rollout, signal complete	Wait, then train rollout
3-4	Aggregate train trajectories	Signal files ready
5-6	Decide eval, run if needed, aggregate	Read decision, run if needed
7-9	Stop vLLM, prepare training data	Stop vLLM, wait for config
10	Multi-node training	Multi-node training
11-12	Copy checkpoint, restart vLLM	Wait, restart vLLM
13	Final sync before next iteration	Final sync

--rl-phase rollout uses phases 0-6 only. --rl-phase update uses phases 7-13 only.

Fault Tolerance

The multi-node system uses strict barrier synchronization without automatic failure detection. If any node fails, all surviving nodes hang indefinitely at the next barrier.

Recovery: Kill all processes on all nodes, clean up incomplete files, and restart:

# On all nodes:
pkill -9 -f "run.sh|rollout.py"

# On shared storage:
rm /logs/train_trajectories/*iteration${N}_rank_*.checkpoint
rm /logs/multinode_flags/*

# Restart all nodes with the same commands

For partial data recovery, manually aggregate available rank checkpoint files using torch.load and concatenate trajectories.