Multi-Node Mode: run.sh ================= Distributes rollout and training across nodes using a shared filesystem for coordination. .. important:: Multi-node mode **requires a shared filesystem** (NFS, Lustre, Azure Blob FUSE) accessible at the same ``--log-path`` on all nodes. Setup ----- ``--num-nodes `` Number of nodes (default: 1). Node 0 is master, 1+ are workers. ``--rank-weights `` Required for multi-node. Comma-separated load weights (one per node). E.g., ``"2,1,1"`` gives node 0 50% of tasks, nodes 1-2 get 25% each. ``--master`` Designate this node as master (rank 0). Auto-detects IP and writes to shared filesystem. ``--worker `` Designate this node as a worker with the given rank. Discovers master IP from shared filesystem. Ranks can also be set via ``RANK`` or ``NODE_RANK`` environment variables. Examples -------- .. code-block:: bash # Complete RL loop (3 nodes) # Master: bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --master # Workers: bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 1 bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase both --eval-interval 6 --num-nodes 3 --rank-weights "1,1,1" --worker 2 # Rollout only (3 nodes) # Master: bash scripts/run.sh --data-path /data/shared --log-path /data/exp1 --rl-phase rollout --rollout-split train-only --num-nodes 3 --rank-weights "1,1,1" --master # Workers: bash scripts/run.sh ... --worker 1 bash scripts/run.sh ... --worker 2 Coordination Protocol --------------------- For ``--rl-phase both``: .. list-table:: :header-rows: 1 :widths: 10 45 45 * - Phase - Master (rank 0) - Workers (rank 1+) * - 0 - Sync all nodes at iteration start - Sync with all nodes * - 1-2 - Train rollout, signal complete - Wait, then train rollout * - 3-4 - Aggregate train trajectories - Signal files ready * - 5-6 - Decide eval, run if needed, aggregate - Read decision, run if needed * - 7-9 - Stop vLLM, prepare training data - Stop vLLM, wait for config * - 10 - Multi-node training - Multi-node training * - 11-12 - Copy checkpoint, restart vLLM - Wait, restart vLLM * - 13 - Final sync before next iteration - Final sync ``--rl-phase rollout`` uses phases 0-6 only. ``--rl-phase update`` uses phases 7-13 only. Fault Tolerance --------------- The multi-node system uses **strict barrier synchronization** without automatic failure detection. If any node fails, **all surviving nodes hang indefinitely** at the next barrier. **Recovery:** Kill all processes on all nodes, clean up incomplete files, and restart: .. code-block:: bash # On all nodes: pkill -9 -f "run.sh|rollout.py" # On shared storage: rm /logs/train_trajectories/*iteration${N}_rank_*.checkpoint rm /logs/multinode_flags/* # Restart all nodes with the same commands For partial data recovery, manually aggregate available rank checkpoint files using ``torch.load`` and concatenate trajectories.