Multi-Node Mode: run.sh

Distributes rollout and training across nodes using a shared filesystem for coordination.

Important

Multi-node mode requires a shared filesystem (NFS, Lustre, Azure Blob FUSE) accessible at the same --log-path on all nodes.

Setup

--num-nodes <N>

Number of nodes (default: 1). Node 0 is master, 1+ are workers.

--rank-weights <w1,w2,...>

Required for multi-node. Comma-separated load weights (one per node). E.g., "2,1,1" gives node 0 50% of tasks, nodes 1-2 get 25% each.

--master

Designate this node as master (rank 0). Auto-detects IP and writes to shared filesystem.

--worker <rank>

Designate this node as a worker with the given rank. Discovers master IP from shared filesystem.

Ranks can also be set via RANK or NODE_RANK environment variables.

Examples

# Complete RL loop (3 nodes)
# Master:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --master
# Workers:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 1
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 2

# Rollout only (3 nodes)
# Master:
bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase rollout --rollout-split train-only --num-nodes 3 --rank-weights "1,1,1" --master
# Workers:
bash scripts/run.sh ... --worker 1
bash scripts/run.sh ... --worker 2

Coordination Protocol

For --rl-phase both:

Phase

Master (rank 0)

Workers (rank 1+)

0

Sync all nodes at iteration start

Sync with all nodes

1-2

Train rollout, signal complete

Wait, then train rollout

3-4

Aggregate train trajectories

Signal files ready

5-6

Decide eval, run if needed, aggregate

Read decision, run if needed

7-9

Stop vLLM, prepare training data

Stop vLLM, wait for config

10

Multi-node training

Multi-node training

11-12

Copy checkpoint, restart vLLM

Wait, restart vLLM

13

Final sync before next iteration

Final sync

--rl-phase rollout uses phases 0-6 only. --rl-phase update uses phases 7-13 only.

Fault Tolerance

The multi-node system uses strict barrier synchronization without automatic failure detection. If any node fails, all surviving nodes hang indefinitely at the next barrier.

Recovery: Kill all processes on all nodes, clean up incomplete files, and restart:

# On all nodes:
pkill -9 -f "run.sh|rollout.py"

# On shared storage:
rm /logs/train_trajectories/*iteration${N}_rank_*.checkpoint
rm /logs/multinode_flags/*

# Restart all nodes with the same commands

For partial data recovery, manually aggregate available rank checkpoint files using torch.load and concatenate trajectories.