Update Script: update_prepare.py (Optional Read)
The scripts/update_prepare.py script prepares training data from collected trajectories
and generates LLaMA-Factory configuration for model fine-tuning.
Overview
This script:
Loads trajectories from the replay buffer
Converts successful trajectories to LLaMA-Factory ShareGPT format
Applies recency bias for sample weighting
Creates dataset files and training configuration
Sets up WandB logging for training runs
The script is invoked by run.sh during the update phase.
Entry Point
python scripts/update_prepare.py [hydra_overrides...]
The script uses the Hydra config name update by default. In practice, run.sh always overrides this with --config-name update_online to use scripts/config/main/update_online.yaml.
Note
The script uses Hydra for configuration. Its @hydra.main decorator defaults to config_name="update", but only update_online.yaml exists in the config directory. Running without --config-name update_online will fail. Always use run.sh or pass the flag explicitly:
# This will fail (no update.yaml exists):
python scripts/update_prepare.py
# This works:
python scripts/update_prepare.py --config-name update_online save_path=/data/exp1 data_path=/data/shared
Key Components
DataPreparationManager
The main class that handles data preparation:
data_manager = DataPreparationManager(
agent=agent,
save_path=config.save_path,
algorithm_config=config.algorithm_config
)
Key Methods:
prepare_all_data(): Samples and converts trajectories to training formatcreate_llamafactory_config(): Generates YAML config for LLaMA-Factory_convert_to_llamafactory_format(): Converts single samples to ShareGPT format
Data Preparation Pipeline
1. Load Trajectories
Trajectories are loaded from <save_path>/train_trajectories/:
train_trajectories = load_all_trajectories(
base_dir=config.save_path,
split='train',
last_n_iterations=4 # Memory optimization
)
2. Clean Trajectories
Invalid trajectories are filtered out:
Empty trajectories
Trajectories with fewer than 2 steps
Trajectories with None values in action/observation/response
3. Create Replay Buffer
The ReplayBuffer handles trajectory sampling:
replay_buffer = ReplayBuffer(
trajectories=train_trajectories,
agent=agent,
filter_successful_only=True,
filter_same_screenshot=True
)
4. Sample Training Data
Positive samples are extracted with recency bias — more recent iterations are sampled
more frequently. For example, with recency_bias_power=2 and iterations 0-3 available,
the sampling weights are proportional to (iteration + 1)^2:
Iteration 0: weight 1 → ~3% of samples
Iteration 1: weight 4 → ~13% of samples
Iteration 2: weight 9 → ~30% of samples
Iteration 3: weight 16 → ~53% of samples
positive_samples = replay_buffer.get_training_samples(
num_samples=positive_samples_to_train,
recency_bias_power=recency_bias_power
)
LLaMA-Factory Configuration
The script generates a complete training configuration:
Training Hyperparameters:
stage: sft
do_train: true
finetuning_type: full
mask_history: true # Only train on last turn
cutoff_len: 16384
per_device_train_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 1e-6
num_train_epochs: 2
Checkpoint Resume:
The script automatically detects and resumes from checkpoints:
Checks for
trainer_state.jsonand optimizer statesUpdates scheduler learning rate if config changed
Adjusts
max_stepsfor dataset size changes between iterations
WandB Integration:
report_to: wandb
run_name: webgym-<your-run-name>
Environment variables are written to wandb_env.sh for run.sh to source. For example:
# Contents of wandb_env.sh (generated by update_prepare.py):
export WANDB_PROJECT='rl'
export WANDB_ENTITY='your-entity-name'
export WANDB_RUN_NAME='webgym-your-run-name'
export WANDB_RESUME='allow'
export WANDB_RUN_ID='existing-run-id' # Only if resuming
Configuration
Key configuration options in update_online.yaml:
Log Config:
log_config:
run_name: 'webgym-<your-run-name>'
wandb_key_env_var: "WANDB_API_KEY"
entity_name: "<your-wandb-entity-name>"
Algorithm Config:
algorithm_config:
model_output_name: "model.pt"
positive_samples_to_train: 1800
recency_bias_power: 2
val_split_ratio: 0.05
# Training hyperparameters
cutoff_len: 16384
per_device_train_batch_size: 3
per_device_eval_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 1e-6
max_grad_norm: 1.0
weight_decay: 0.01
num_train_epochs: 2
warmup_steps: 30
lr_scheduler_type: "constant_with_warmup"
logging_steps: 1
bf16: True
# Evaluation
do_eval: false
eval_strategy: "epoch"
# Save settings
save_strategy: "steps"
save_steps: 999999
save_total_limit: 1
save_only_model: False
# Data loading
preprocessing_num_workers: 16
dataloader_num_workers: 2
dataloader_pin_memory: True
remove_unused_columns: False
min_token_length: 10
# Other
gradient_checkpointing: False
plot_loss: False
deepspeed_config_filename: "ds_config_b200_zero1.json"
report_to: "wandb"
Output
The script generates files in <save_path>/llamafactory_data/:
finetune_train.json: Training dataset in ShareGPT formatfinetune_val.json: Validation dataset (ifval_split_ratio> 0)dataset_info.json: Dataset registry for LLaMA-Factorytrain_config.yaml: Complete training configurationwandb_env.sh: WandB environment variables
Next Steps
After update_prepare.py completes, run.sh executes:
# Source WandB environment
source llamafactory_data/wandb_env.sh
# Run LLaMA-Factory training
llamafactory-cli train llamafactory_data/train_config.yaml
The trained model is first saved to <save_path>/checkpoints/, then the final checkpoint is copied to <save_path>/model.pt/. For example:
/data/exp1/
├── checkpoints/
│ └── model_20250115_143022/ # Timestamped checkpoint from LLaMA-Factory
│ ├── config.json
│ ├── model.safetensors
│ ├── trainer_state.json # Used for checkpoint resume detection
│ └── optimizer.pt
└── model.pt/ # Final copy used by vLLM for next iteration
├── config.json
└── model.safetensors