Configuration Reference
=======================

WebGym Configs
--------------

The system uses `Hydra <https://hydra.cc/>`_ for hierarchical configuration management.
Config files are located in ``scripts/config/main/``.

**Config File Hierarchy:**

.. code-block:: text

   default.yaml          # Base config (paths, policy, context)
       ├── rollout.yaml      # Base rollout config (env, timeouts, workers)
       │   ├── rollout_train.yaml   # Train rollout (train tasks, difficulty)
       │   └── rollout_test.yaml    # Eval rollout (test tasks, difficulty)
       └── update_online.yaml       # Training config (hyperparameters, data sampling)

Each config file uses ``defaults:`` to inherit from parent configs. Settings in child
configs override parent values. For example, ``rollout_train.yaml`` inherits from
``rollout.yaml`` which inherits from ``default.yaml``:

.. code-block:: yaml

   # default.yaml defines base policy settings:
   policy_config:
     temperature: 1
     max_new_tokens: 3072

   # rollout.yaml inherits default.yaml and adds env settings:
   defaults:
     - default
   env_config:
     server_size: 64

   # rollout_train.yaml inherits default.yaml and rollout.yaml, then overrides specifics:
   defaults:
     - default
     - rollout
   env_config:
     split: "train"
     train_tasks_rollout_size: 1024
   # temperature and server_size are still inherited from the parents

``save_path``
   Root directory for outputs (set via ``--log-path``)

``data_path``
   Path to read-only shared data (task files, HuggingFace cache, etc.; set via ``--data-path``)

``model_config``
   Model selection:

   * ``model_type``: Model type (``qwen3-instruct`` or ``qwen3-think``)
   * ``prompt_version``: Prompt version (``vanilla`` or ``complete``)

``policy_config``
   Model inference settings:

   * ``base_model``: HuggingFace model name (set by run.sh)
   * ``checkpoint_path``: Path to model checkpoint
   * ``max_new_tokens``: Maximum tokens to generate (default: 3072)
   * ``temperature``: Sampling temperature (default: 1)
   * ``top_p``: Top-p sampling (default: 0.99)
   * ``top_k``: Top-k sampling (default: 2)
   * ``min_p``: Min-p sampling (default: 0)

``context_config``
   Interaction mode settings:

   * ``interaction_mode``: How to represent UI elements (``coordinates`` or ``set_of_marks``)

``env_config``
   Environment and HTTP settings:

   * ``vllm_server_url``: vLLM endpoint (default: ``http://localhost:8999``)
   * ``wait_timeout``: HTTP queue timeout in seconds (default: 2400)
   * ``operation_timeout``: HTTP operation timeout (default: 120)
   * ``vllm_timeout``: vLLM request timeout (default: 240)
   * ``evaluation_workers``: Concurrent trajectory evaluation workers (default: 16)
   * ``screenshot_comparison_workers``: Concurrent screenshot comparison workers (default: 32)
   * ``completion_threshold``: Fraction of tasks to complete before killing stragglers (default: 0.95).
     For example, with 1024 tasks and ``completion_threshold=0.95``, once 973 tasks finish, the remaining
     51 slow tasks enter the grace period.
   * ``completion_grace_period``: Seconds to wait after the threshold is reached before force-killing
     remaining tasks (default: 30). Continuing the example above, the 51 remaining tasks get 30 more
     seconds to finish; any still running after that are terminated.
   * ``instance_lifetime_max``: Maximum browser instance lifetime in minutes (default: 50)
   * ``task_timeout_minutes``: Maximum task timeout in minutes (default: 300)
   * ``server_size``: Browser instances per batch (default: 64)
   * ``verbose``: Enable verbose logging (default: False)
   * ``use_rich_actree``: Use rich accessibility tree format (default: False)
   * ``max_retries``: Maximum HTTP retries (default: 2)
   * ``http_pools``: Connection pool sizes per operation type. Train and test configs use different pool sizes:

     .. code-block:: yaml

        # Train (rollout_train.yaml):
        http_pools:
          # Once per task
          metadata: 128          # Screen dimensions
          navigate: 128          # Initial navigation to website
          allocate: 4            # Instance allocation
          release: 4             # Instance cleanup/release
          # Once per step
          screenshot: 128        # Screenshot capture (most frequent, can be slow)
          ac_tree: 128           # Accessibility tree retrieval
          page_metadata: 128     # Page title/URL (once per step)
          execute: 128           # Action execution (click, type, scroll)

        # Test (rollout_test.yaml):
        http_pools:
          navigate: 32           # Initial navigation to website
          screenshot: 128        # Screenshot capture
          ac_tree: 64            # Accessibility tree retrieval
          metadata: 32           # Screen dimensions
          page_metadata: 64      # Page title/URL (once per step)
          execute: 64            # Action execution (click, type, scroll)
          allocate: 4            # Instance allocation
          release: 4             # Instance cleanup/release
   * ``max_vllm_sessions``: Concurrent vLLM requests (default: 128 train, 64 test) (experimental, not supported yet, so use ``server_size`` instead)
   * ``split``: Dataset split (``train`` or ``test``)
   * ``train_difficulty_max_steps``: Max steps per difficulty (easy: 10, medium: 20, hard: 30)
   * ``test_difficulty_max_steps``: Max steps per difficulty (easy: 30, medium: 50, hard: 70)
   * ``train_tasks_rollout_size``: Tasks per train batch (default: 1024)
   * ``test_tasks_rollout_size``: Tasks per eval batch (default: -1 for all in test config, 0 in train config)
   * ``test_tasks_repeat_times``: Repeat each test task N times (default: 1 in test config, 0 in train config)
   * ``train_tasks_sampler``: Task sampling strategy (``uniform`` or ``ratio``)
   * ``train_tasks``: Train task file (default: ``train.jsonl``)
   * ``test_tasks``: Test task file (default: ``test.jsonl``)

``openai_config``
   Evaluator API settings (supports OpenAI and Gemini). Supports per-task model configuration.

   **Default settings** (apply to all tasks unless overridden):

   * ``model``: Default evaluation model (default: ``gemini-3-flash-preview``)
   * ``openai_api_key_env_var``: Environment variable for API key (default: ``GEMINI_API_KEY``)
   * ``base_url``: Base URL for API (default: ``https://generativelanguage.googleapis.com/v1beta/openai/``)

   **Per-task overrides** (optional, each can specify ``model``, ``openai_api_key_env_var``, ``base_url``):

   * ``keypoint_detection``: For judging which screenshots to submit (N-1 calls per trajectory)
   * ``blocking_detection``: For detecting CAPTCHA/blocking pages (1 call per trajectory)
   * ``evaluation``: For criterion_a, criterion_b, and reference_answer (2 + R calls per trajectory)

   **Supported Gemini Models:**

   * ``gemini-3-flash-preview``: More capable model for complex evaluation tasks
   * ``gemini-2.5-flash-lite``: Faster, cheaper model for high-volume operations (keypoint detection, blocking detection)

   **Example: Using Gemini with per-task configuration (default):**

   .. code-block:: yaml

      openai_config:
        # Default configuration for evaluation
        model: "gemini-3-flash-preview"
        openai_api_key_env_var: "GEMINI_API_KEY"
        base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"

        # Use lighter model for high-volume operations
        keypoint_detection:
          model: "gemini-2.5-flash-lite"
        blocking_detection:
          model: "gemini-2.5-flash-lite"
        evaluation:
          model: "gemini-3-flash-preview"

   **Example: Using different providers per task:**

   .. code-block:: yaml

      openai_config:
        # Default: OpenAI for evaluation
        model: "gpt-4o-mini"
        openai_api_key_env_var: "OPENAI_API_KEY"

        # Use Gemini for high-volume keypoint detection (cheaper)
        keypoint_detection:
          model: "gemini-3-flash-preview"
          openai_api_key_env_var: "GEMINI_API_KEY"
          base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"

   Set your API keys: ``export GEMINI_API_KEY="your-key"`` and/or ``export OPENAI_API_KEY="your-key"``

``log_config``
   Logging settings:

   * ``run_name``: WandB run name prefix
   * ``wandb_key_env_var``: Environment variable name containing WandB API key (default: ``WANDB_API_KEY``)
   * ``entity_name``: WandB entity/team name

``algorithm_config``
   Training hyperparameters:

   * ``positive_samples_to_train``: Positive samples per training iteration (default: 1800)
   * ``recency_bias_power``: Bias toward recent trajectories (default: 2)
   * ``cutoff_len``: Maximum sequence length (default: 16384)
   * ``per_device_train_batch_size``: Batch size per GPU (default: 3)
   * ``gradient_accumulation_steps``: Gradient accumulation (default: 4)
   * ``learning_rate``: Learning rate (default: 1e-6)
   * ``num_train_epochs``: Epochs per iteration (default: 2)
   * ``warmup_steps``: LR warmup steps (default: 30)
   * ``lr_scheduler_type``: Scheduler type (default: ``constant_with_warmup``)
   * ``val_split_ratio``: Validation data fraction (default: 0.05)
   * ``deepspeed_config_filename``: DeepSpeed config file (default: ``ds_config_b200_zero1.json``)
   * ``report_to``: Logging backend (default: ``wandb``)

DeepSpeed Configs
-----------------

DeepSpeed config files are located in ``scripts/config/deepspeed/``. These control
distributed training memory optimization and offloading strategies.

**Available Configurations:**

.. list-table::
   :header-rows: 1
   :widths: 40 20 40

   * - File
     - ZeRO Stage
     - Use Case
   * - ``ds_config_h100_zero1.json``
     - 1
     - H100 GPUs, optimizer state partitioning only
   * - ``ds_config_h100_zero2.json``
     - 2
     - H100 GPUs, optimizer + gradient partitioning
   * - ``ds_config_h100_zero3.json``
     - 3
     - H100 GPUs, full parameter partitioning + CPU offload
   * - ``ds_config_b200_zero1.json``
     - 1
     - B200 GPUs, optimizer state partitioning only (default)
   * - ``ds_config_b200_zero2.json``
     - 2
     - B200 GPUs, optimizer + gradient partitioning
   * - ``ds_config_b200_zero3.json``
     - 3
     - B200 GPUs, full partitioning

**ZeRO Stages:**

* **Stage 1** - Optimizer state partitioning. Lowest memory savings, highest speed.
* **Stage 2** - Optimizer + gradient partitioning. Moderate memory savings.
* **Stage 3** - Full parameter partitioning. Maximum memory savings. CPU offload is enabled in the H100 config but not the B200 config.

**Key Settings:**

``bf16.enabled``
   Use bfloat16 precision (recommended for H100/B200)

``activation_checkpointing``
   Recompute activations during backward pass to save memory

``zero_optimization.stage``
   ZeRO optimization level (1, 2, or 3)

``zero_optimization.offload_param``
   Offload model parameters to CPU (ZeRO-3 only)

``zero_optimization.offload_optimizer``
   Offload optimizer states to CPU

``overlap_comm``
   Overlap communication with computation for better throughput

To use a different config, modify ``update_online.yaml``:

.. code-block:: yaml

   algorithm_config:
     deepspeed_config_filename: "ds_config_h100_zero2.json"  # Use ZeRO-2 instead