Update Script: update_prepare.py (Optional Read)
=================================
The ``scripts/update_prepare.py`` script prepares training data from collected trajectories
and generates LLaMA-Factory configuration for model fine-tuning.
Overview
--------
This script:
* Loads trajectories from the replay buffer
* Converts successful trajectories to LLaMA-Factory ShareGPT format
* Applies recency bias for sample weighting
* Creates dataset files and training configuration
* Sets up WandB logging for training runs
The script is invoked by ``run.sh`` during the update phase.
Entry Point
-----------
.. code-block:: bash
python scripts/update_prepare.py [hydra_overrides...]
The script uses the Hydra config name ``update`` by default. In practice, ``run.sh`` always overrides this with ``--config-name update_online`` to use ``scripts/config/main/update_online.yaml``.
.. note::
The script uses `Hydra `_ for configuration. Its ``@hydra.main`` decorator defaults to ``config_name="update"``, but only ``update_online.yaml`` exists in the config directory. Running without ``--config-name update_online`` will fail. Always use ``run.sh`` or pass the flag explicitly:
.. code-block:: bash
# This will fail (no update.yaml exists):
python scripts/update_prepare.py
# This works:
python scripts/update_prepare.py --config-name update_online save_path=/data/exp1 data_path=/data/shared
Key Components
--------------
DataPreparationManager
^^^^^^^^^^^^^^^^^^^^^^
The main class that handles data preparation:
.. code-block:: python
data_manager = DataPreparationManager(
agent=agent,
save_path=config.save_path,
algorithm_config=config.algorithm_config
)
**Key Methods:**
* ``prepare_all_data()``: Samples and converts trajectories to training format
* ``create_llamafactory_config()``: Generates YAML config for LLaMA-Factory
* ``_convert_to_llamafactory_format()``: Converts single samples to ShareGPT format
Data Preparation Pipeline
-------------------------
1. Load Trajectories
^^^^^^^^^^^^^^^^^^^^
Trajectories are loaded from ``/train_trajectories/``:
.. code-block:: python
train_trajectories = load_all_trajectories(
base_dir=config.save_path,
split='train',
last_n_iterations=4 # Memory optimization
)
2. Clean Trajectories
^^^^^^^^^^^^^^^^^^^^^
Invalid trajectories are filtered out:
* Empty trajectories
* Trajectories with fewer than 2 steps
* Trajectories with None values in action/observation/response
3. Create Replay Buffer
^^^^^^^^^^^^^^^^^^^^^^^
The ReplayBuffer handles trajectory sampling:
.. code-block:: python
replay_buffer = ReplayBuffer(
trajectories=train_trajectories,
agent=agent,
filter_successful_only=True,
filter_same_screenshot=True
)
4. Sample Training Data
^^^^^^^^^^^^^^^^^^^^^^^
Positive samples are extracted with recency bias — more recent iterations are sampled
more frequently. For example, with ``recency_bias_power=2`` and iterations 0-3 available,
the sampling weights are proportional to ``(iteration + 1)^2``:
.. code-block:: text
Iteration 0: weight 1 → ~3% of samples
Iteration 1: weight 4 → ~13% of samples
Iteration 2: weight 9 → ~30% of samples
Iteration 3: weight 16 → ~53% of samples
.. code-block:: python
positive_samples = replay_buffer.get_training_samples(
num_samples=positive_samples_to_train,
recency_bias_power=recency_bias_power
)
5. Convert to ShareGPT Format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each sample is converted to LLaMA-Factory's ShareGPT format:
.. code-block:: json
{
"conversations": [
{"from": "human", "value": "Task: ..."},
{"from": "gpt", "value": "Action: click(...)"}
],
"system": "You are a web agent...",
"images": ["/path/to/screenshot.png"]
}
LLaMA-Factory Configuration
---------------------------
The script generates a complete training configuration:
**Training Hyperparameters:**
.. code-block:: yaml
stage: sft
do_train: true
finetuning_type: full
mask_history: true # Only train on last turn
cutoff_len: 16384
per_device_train_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 1e-6
num_train_epochs: 2
**Checkpoint Resume:**
The script automatically detects and resumes from checkpoints:
* Checks for ``trainer_state.json`` and optimizer states
* Updates scheduler learning rate if config changed
* Adjusts ``max_steps`` for dataset size changes between iterations
**WandB Integration:**
.. code-block:: yaml
report_to: wandb
run_name: webgym-
Environment variables are written to ``wandb_env.sh`` for ``run.sh`` to source. For example:
.. code-block:: bash
# Contents of wandb_env.sh (generated by update_prepare.py):
export WANDB_PROJECT='rl'
export WANDB_ENTITY='your-entity-name'
export WANDB_RUN_NAME='webgym-your-run-name'
export WANDB_RESUME='allow'
export WANDB_RUN_ID='existing-run-id' # Only if resuming
Configuration
-------------
Key configuration options in ``update_online.yaml``:
**Log Config:**
.. code-block:: yaml
log_config:
run_name: 'webgym-'
wandb_key_env_var: "WANDB_API_KEY"
entity_name: ""
**Algorithm Config:**
.. code-block:: yaml
algorithm_config:
model_output_name: "model.pt"
positive_samples_to_train: 1800
recency_bias_power: 2
val_split_ratio: 0.05
# Training hyperparameters
cutoff_len: 16384
per_device_train_batch_size: 3
per_device_eval_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 1e-6
max_grad_norm: 1.0
weight_decay: 0.01
num_train_epochs: 2
warmup_steps: 30
lr_scheduler_type: "constant_with_warmup"
logging_steps: 1
bf16: True
# Evaluation
do_eval: false
eval_strategy: "epoch"
# Save settings
save_strategy: "steps"
save_steps: 999999
save_total_limit: 1
save_only_model: False
# Data loading
preprocessing_num_workers: 16
dataloader_num_workers: 2
dataloader_pin_memory: True
remove_unused_columns: False
min_token_length: 10
# Other
gradient_checkpointing: False
plot_loss: False
deepspeed_config_filename: "ds_config_b200_zero1.json"
report_to: "wandb"
Output
------
The script generates files in ``/llamafactory_data/``:
* ``finetune_train.json``: Training dataset in ShareGPT format
* ``finetune_val.json``: Validation dataset (if ``val_split_ratio`` > 0)
* ``dataset_info.json``: Dataset registry for LLaMA-Factory
* ``train_config.yaml``: Complete training configuration
* ``wandb_env.sh``: WandB environment variables
Next Steps
----------
After ``update_prepare.py`` completes, ``run.sh`` executes:
.. code-block:: bash
# Source WandB environment
source llamafactory_data/wandb_env.sh
# Run LLaMA-Factory training
llamafactory-cli train llamafactory_data/train_config.yaml
The trained model is first saved to ``/checkpoints/``, then the final checkpoint is copied to ``/model.pt/``. For example:
.. code-block:: text
/data/exp1/
├── checkpoints/
│ └── model_20250115_143022/ # Timestamped checkpoint from LLaMA-Factory
│ ├── config.json
│ ├── model.safetensors
│ ├── trainer_state.json # Used for checkpoint resume detection
│ └── optimizer.pt
└── model.pt/ # Final copy used by vLLM for next iteration
├── config.json
└── model.safetensors