Update Script: update_prepare.py (Optional Read)
================================================

The ``scripts/update_prepare.py`` script prepares training data from collected trajectories and generates LLaMA-Factory configuration for model fine-tuning.

Overview
--------

This script:

* Loads trajectories from the replay buffer
* Converts successful trajectories to LLaMA-Factory ShareGPT format
* Applies recency bias for sample weighting
* Creates dataset files and training configuration
* Sets up WandB logging for training runs

The script is invoked by ``run.sh`` during the update phase.

Entry Point
-----------

.. code-block:: bash

   python scripts/update_prepare.py [hydra_overrides...]

The script uses the Hydra config name ``update`` by default. In practice, ``run.sh`` always overrides this with ``--config-name update_online`` to use ``scripts/config/main/update_online.yaml``.

.. note::

   The script uses Hydra for configuration. Its ``@hydra.main`` decorator defaults to ``config_name="update"``, but only ``update_online.yaml`` exists in the config directory. Running without ``--config-name update_online`` will fail. Always use ``run.sh`` or pass the flag explicitly:

   .. code-block:: bash

      # This will fail (no update.yaml exists):
      python scripts/update_prepare.py

      # This works:
      python scripts/update_prepare.py --config-name update_online save_path=/data/exp1 data_path=/data/shared

Key Components
--------------

DataPreparationManager
^^^^^^^^^^^^^^^^^^^^^^

The main class that handles data preparation:

.. code-block:: python

   data_manager = DataPreparationManager(
       agent=agent,
       save_path=config.save_path,
       algorithm_config=config.algorithm_config
   )

**Key Methods:**

* ``prepare_all_data()``: Samples and converts trajectories to training format
* ``create_llamafactory_config()``: Generates YAML config for LLaMA-Factory
* ``_convert_to_llamafactory_format()``: Converts single samples to ShareGPT format

Data Preparation Pipeline
-------------------------

1. Load Trajectories
^^^^^^^^^^^^^^^^^^^^

Trajectories are loaded from ``/train_trajectories/``:

.. code-block:: python

   train_trajectories = load_all_trajectories(
       base_dir=config.save_path,
       split='train',
       last_n_iterations=4  # Memory optimization
   )

2. Clean Trajectories
^^^^^^^^^^^^^^^^^^^^^

Invalid trajectories are filtered out:

* Empty trajectories
* Trajectories with fewer than 2 steps
* Trajectories with None values in action/observation/response

3. Create Replay Buffer
^^^^^^^^^^^^^^^^^^^^^^^

The ReplayBuffer handles trajectory sampling:

.. code-block:: python

   replay_buffer = ReplayBuffer(
       trajectories=train_trajectories,
       agent=agent,
       filter_successful_only=True,
       filter_same_screenshot=True
   )

4. Sample Training Data
^^^^^^^^^^^^^^^^^^^^^^^

Positive samples are extracted with recency bias: more recent iterations are sampled more frequently. For example, with ``recency_bias_power=2`` and iterations 0-3 available, the sampling weights are proportional to ``(iteration + 1)^2``:

.. code-block:: text

   Iteration 0: weight 1  → ~3% of samples
   Iteration 1: weight 4  → ~13% of samples
   Iteration 2: weight 9  → ~30% of samples
   Iteration 3: weight 16 → ~53% of samples

.. code-block:: python

   positive_samples = replay_buffer.get_training_samples(
       num_samples=positive_samples_to_train,
       recency_bias_power=recency_bias_power
   )

5. Convert to ShareGPT Format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each sample is converted to LLaMA-Factory's ShareGPT format:

.. code-block:: json

   {
     "conversations": [
       {"from": "human", "value": "Task: ..."},
       {"from": "gpt", "value": "Action: click(...)"}
     ],
     "system": "You are a web agent...",
     "images": ["/path/to/screenshot.png"]
   }

LLaMA-Factory Configuration
---------------------------

The script generates a complete training configuration:

**Training Hyperparameters:**

.. code-block:: yaml

   stage: sft
   do_train: true
   finetuning_type: full
   mask_history: true  # Only train on last turn
   cutoff_len: 16384
   per_device_train_batch_size: 3
   gradient_accumulation_steps: 4
   learning_rate: 1e-6
   num_train_epochs: 2

**Checkpoint Resume:**

The script automatically detects and resumes from checkpoints:

* Checks for ``trainer_state.json`` and optimizer states
* Updates scheduler learning rate if config changed
* Adjusts ``max_steps`` for dataset size changes between iterations

**WandB Integration:**

.. code-block:: yaml

   report_to: wandb
   run_name: webgym-

Environment variables are written to ``wandb_env.sh`` for ``run.sh`` to source. For example:

.. code-block:: bash

   # Contents of wandb_env.sh (generated by update_prepare.py):
   export WANDB_PROJECT='rl'
   export WANDB_ENTITY='your-entity-name'
   export WANDB_RUN_NAME='webgym-your-run-name'
   export WANDB_RESUME='allow'
   export WANDB_RUN_ID='existing-run-id'  # Only if resuming

Configuration
-------------

Key configuration options in ``update_online.yaml``:

**Log Config:**

.. code-block:: yaml

   log_config:
     run_name: 'webgym-'
     wandb_key_env_var: "WANDB_API_KEY"
     entity_name: ""

**Algorithm Config:**

.. code-block:: yaml

   algorithm_config:
     model_output_name: "model.pt"
     positive_samples_to_train: 1800
     recency_bias_power: 2
     val_split_ratio: 0.05

     # Training hyperparameters
     cutoff_len: 16384
     per_device_train_batch_size: 3
     per_device_eval_batch_size: 3
     gradient_accumulation_steps: 4
     learning_rate: 1e-6
     max_grad_norm: 1.0
     weight_decay: 0.01
     num_train_epochs: 2
     warmup_steps: 30
     lr_scheduler_type: "constant_with_warmup"
     logging_steps: 1
     bf16: True

     # Evaluation
     do_eval: false
     eval_strategy: "epoch"

     # Save settings
     save_strategy: "steps"
     save_steps: 999999
     save_total_limit: 1
     save_only_model: False

     # Data loading
     preprocessing_num_workers: 16
     dataloader_num_workers: 2
     dataloader_pin_memory: True
     remove_unused_columns: False
     min_token_length: 10

     # Other
     gradient_checkpointing: False
     plot_loss: False
     deepspeed_config_filename: "ds_config_b200_zero1.json"
     report_to: "wandb"

Output
------

The script generates files in ``/llamafactory_data/``:

* ``finetune_train.json``: Training dataset in ShareGPT format
* ``finetune_val.json``: Validation dataset (if ``val_split_ratio`` > 0)
* ``dataset_info.json``: Dataset registry for LLaMA-Factory
* ``train_config.yaml``: Complete training configuration
* ``wandb_env.sh``: WandB environment variables

Next Steps
----------

After ``update_prepare.py`` completes, ``run.sh`` executes:

.. code-block:: bash

   # Source WandB environment
   source llamafactory_data/wandb_env.sh

   # Run LLaMA-Factory training
   llamafactory-cli train llamafactory_data/train_config.yaml

The trained model is first saved to ``/checkpoints/``, then the final checkpoint is copied to ``/model.pt/``. For example:

.. code-block:: text

   /data/exp1/
   ├── checkpoints/
   │   └── model_20250115_143022/    # Timestamped checkpoint from LLaMA-Factory
   │       ├── config.json
   │       ├── model.safetensors
   │       ├── trainer_state.json    # Used for checkpoint resume detection
   │       └── optimizer.pt
   └── model.pt/                     # Final copy used by vLLM for next iteration
       ├── config.json
       └── model.safetensors
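
For readers who want to check the recency-bias percentages quoted in the sampling step of the pipeline, the arithmetic can be reproduced with a few lines of Python. This is only a sketch of the weighting rule as described (weights proportional to ``(iteration + 1) ** recency_bias_power``), not the actual ``ReplayBuffer`` code; the helper name ``recency_weights`` is hypothetical and does not appear in the script.

```python
def recency_weights(iterations, power):
    """Return normalized sampling weights proportional to (iteration + 1) ** power.

    Hypothetical helper for illustration; not part of update_prepare.py.
    """
    raw = [(it + 1) ** power for it in iterations]
    total = sum(raw)
    return [w / total for w in raw]


# Iterations 0-3 with recency_bias_power=2, as in the example in the text.
weights = recency_weights(range(4), power=2)
for it, w in enumerate(weights):
    print(f"Iteration {it}: {w:.1%} of samples")
```

With ``power=0`` every iteration gets the same weight, and larger values concentrate sampling further on the most recent iterations.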