Configuration Reference

WebGym Configs

The system uses Hydra for hierarchical configuration management. Config files are located in scripts/config/main/.

Config File Hierarchy:

default.yaml          # Base config (paths, policy, context)
    ├── rollout.yaml      # Base rollout config (env, timeouts, workers)
    │   ├── rollout_train.yaml   # Train rollout (train tasks, difficulty)
    │   └── rollout_test.yaml    # Eval rollout (test tasks, difficulty)
    └── update_online.yaml       # Training config (hyperparameters, data sampling)

Each config file uses defaults: to inherit from parent configs. Settings in child configs override parent values. For example, rollout_train.yaml inherits from rollout.yaml which inherits from default.yaml:

# default.yaml defines base policy settings:
policy_config:
  temperature: 1
  max_new_tokens: 3072

# rollout.yaml inherits default.yaml and adds env settings:
defaults:
  - default
env_config:
  server_size: 64

# rollout_train.yaml inherits default.yaml and rollout.yaml, then overrides specifics:
defaults:
  - default
  - rollout
env_config:
  split: "train"
  train_tasks_rollout_size: 1024
# temperature and server_size are still inherited from the parents
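
The override order in the fragments above can be illustrated with a plain-Python deep merge. This is only a sketch of the merge semantics, not Hydra's implementation: the helper name `deep_merge` is hypothetical, and the three dictionaries simply mirror the YAML fragments with their `defaults:` lists left out.

```python
from copy import deepcopy


def deep_merge(parent: dict, child: dict) -> dict:
    """Return a new dict where values in `child` override those in `parent`.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Fragments mirroring the YAML above (the `defaults:` lists are omitted).
default_cfg = {"policy_config": {"temperature": 1, "max_new_tokens": 3072}}
rollout_cfg = {"env_config": {"server_size": 64}}
rollout_train_cfg = {
    "env_config": {"split": "train", "train_tasks_rollout_size": 1024}
}

# Compose in defaults-list order: default, then rollout, then the file itself.
composed = default_cfg
for layer in (rollout_cfg, rollout_train_cfg):
    composed = deep_merge(composed, layer)

print(composed["policy_config"]["temperature"])  # inherited from default.yaml
print(composed["env_config"])  # server_size inherited, split and size added
```

Later layers win on conflicting keys, which is why a child config only needs to list the values it changes.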

save_path

Root directory for outputs (set via --log-path)

data_path

Path to read-only shared data (task files, HuggingFace cache, etc.; set via --data-path)

model_config

Model selection:

  • model_type: Model type (qwen3-instruct or qwen3-think)

  • prompt_version: Prompt version (vanilla or complete)

policy_config

Model inference settings:

  • base_model: HuggingFace model name (set by run.sh)

  • checkpoint_path: Path to model checkpoint

  • max_new_tokens: Maximum tokens to generate (default: 3072)

  • temperature: Sampling temperature (default: 1)

  • top_p: Top-p sampling (default: 0.99)

  • top_k: Top-k sampling (default: 2)

  • min_p: Min-p sampling (default: 0)

context_config

Interaction mode settings:

  • interaction_mode: How to represent UI elements (coordinates or set_of_marks)

env_config

Environment and HTTP settings:

  • vllm_server_url: vLLM endpoint (default: http://localhost:8999)

  • wait_timeout: HTTP queue timeout in seconds (default: 2400)

  • operation_timeout: HTTP operation timeout in seconds (default: 120)

  • vllm_timeout: vLLM request timeout in seconds (default: 240)

  • evaluation_workers: Concurrent trajectory evaluation workers (default: 16)

  • screenshot_comparison_workers: Concurrent screenshot comparison workers (default: 32)

  • completion_threshold: Fraction of tasks to complete before killing stragglers (default: 0.95). For example, with 1024 tasks and completion_threshold=0.95, once 973 tasks finish, the remaining 51 slow tasks enter the grace period.

  • completion_grace_period: Seconds to wait after the threshold is reached before force-killing remaining tasks (default: 30). Continuing the example above, the 51 remaining tasks get 30 more seconds to finish; any still running after that are terminated.

  • instance_lifetime_max: Maximum browser instance lifetime in minutes (default: 50)

  • task_timeout_minutes: Maximum task timeout in minutes (default: 300)

  • server_size: Browser instances per batch (default: 64)

  • verbose: Enable verbose logging (default: False)

  • use_rich_actree: Use rich accessibility tree format (default: False)

  • max_retries: Maximum HTTP retries (default: 2)

  • http_pools: Connection pool sizes per operation type. Train and test configs use different pool sizes:

    # Train (rollout_train.yaml):
    http_pools:
      # Once per task
      metadata: 128          # Screen dimensions
      navigate: 128          # Initial navigation to website
      allocate: 4            # Instance allocation
      release: 4             # Instance cleanup/release
      # Once per step
      screenshot: 128        # Screenshot capture (most frequent, can be slow)
      ac_tree: 128           # Accessibility tree retrieval
      page_metadata: 128     # Page title/URL (once per step)
      execute: 128           # Action execution (click, type, scroll)
    
    # Test (rollout_test.yaml):
    http_pools:
      navigate: 32           # Initial navigation to website
      screenshot: 128        # Screenshot capture
      ac_tree: 64            # Accessibility tree retrieval
      metadata: 32           # Screen dimensions
      page_metadata: 64      # Page title/URL (once per step)
      execute: 64            # Action execution (click, type, scroll)
      allocate: 4            # Instance allocation
      release: 4             # Instance cleanup/release
    
  • max_vllm_sessions: Concurrent vLLM requests (default: 128 for train, 64 for test). Experimental and not yet supported; use server_size instead.

  • split: Dataset split (train or test)

  • train_difficulty_max_steps: Max steps per difficulty (easy: 10, medium: 20, hard: 30)

  • test_difficulty_max_steps: Max steps per difficulty (easy: 30, medium: 50, hard: 70)

  • train_tasks_rollout_size: Tasks per train batch (default: 1024)

  • test_tasks_rollout_size: Tasks per eval batch (default: -1, meaning all tasks, in the test config; 0 in the train config)

  • test_tasks_repeat_times: Repeat each test task N times (default: 1 in test config, 0 in train config)

  • train_tasks_sampler: Task sampling strategy (uniform or ratio)

  • train_tasks: Train task file (default: train.jsonl)

  • test_tasks: Test task file (default: test.jsonl)
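
The completion_threshold example in the list above works out as follows. This is a minimal sketch, assuming the required task count is rounded up to the next whole task (which is what the 973-of-1024 figure implies); the function name `straggler_cutoff` is hypothetical.

```python
import math


def straggler_cutoff(total_tasks: int, completion_threshold: float) -> tuple[int, int]:
    """Return (tasks needed to trigger the grace period, tasks still running then)."""
    needed = math.ceil(total_tasks * completion_threshold)
    return needed, total_tasks - needed


needed, stragglers = straggler_cutoff(1024, 0.95)
print(needed, stragglers)  # 973 51
```

Once the first number is reached, the remaining tasks have completion_grace_period seconds before they are terminated.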

openai_config

Evaluator API settings (supports OpenAI and Gemini). Supports per-task model configuration.

Default settings (apply to all tasks unless overridden):

  • model: Default evaluation model (default: gemini-3-flash-preview)

  • openai_api_key_env_var: Environment variable for API key (default: GEMINI_API_KEY)

  • base_url: Base URL for API (default: https://generativelanguage.googleapis.com/v1beta/openai/)

Per-task overrides (optional, each can specify model, openai_api_key_env_var, base_url):

  • keypoint_detection: For judging which screenshots to submit (N-1 calls per trajectory)

  • blocking_detection: For detecting CAPTCHA/blocking pages (1 call per trajectory)

  • evaluation: For criterion_a, criterion_b, and reference_answer (2 + R calls per trajectory)
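
Adding up the per-trajectory counts listed above gives (N - 1) + 1 + (2 + R) = N + R + 2 evaluator calls. The sketch below tallies them per task. N and R are not defined in this section, so they are passed through as plain parameters, and the function name `evaluator_calls` is hypothetical.

```python
def evaluator_calls(n: int, r: int) -> dict[str, int]:
    """Per-trajectory evaluator calls, using the counts from the list above."""
    calls = {
        "keypoint_detection": n - 1,
        "blocking_detection": 1,
        "evaluation": 2 + r,
    }
    calls["total"] = sum(calls.values())
    return calls


print(evaluator_calls(n=10, r=1))
# {'keypoint_detection': 9, 'blocking_detection': 1, 'evaluation': 3, 'total': 13}
```

For larger N, keypoint detection accounts for most of the calls, so it is the task where a cheaper model saves the most.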

Supported Gemini Models:

  • gemini-3-flash-preview: More capable model for complex evaluation tasks

  • gemini-2.5-flash-lite: Faster, cheaper model for high-volume operations (keypoint detection, blocking detection)

Example: Using Gemini with per-task configuration (default):

openai_config:
  # Default configuration for evaluation
  model: "gemini-3-flash-preview"
  openai_api_key_env_var: "GEMINI_API_KEY"
  base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"

  # Use lighter model for high-volume operations
  keypoint_detection:
    model: "gemini-2.5-flash-lite"
  blocking_detection:
    model: "gemini-2.5-flash-lite"
  evaluation:
    model: "gemini-3-flash-preview"

Example: Using different providers per task:

openai_config:
  # Default: OpenAI for evaluation
  model: "gpt-4o-mini"
  openai_api_key_env_var: "OPENAI_API_KEY"

  # Use Gemini for high-volume keypoint detection (cheaper)
  keypoint_detection:
    model: "gemini-3-flash-preview"
    openai_api_key_env_var: "GEMINI_API_KEY"
    base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
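
The fallback from per-task overrides to the top-level defaults can be sketched as below. This is an illustration of the documented behavior, not the project's actual loader: `resolve_task_config` and `DEFAULT_KEYS` are hypothetical names, and a key that is set nowhere is returned as None here, whereas the real code may fall back to a documented or client default.

```python
DEFAULT_KEYS = ("model", "openai_api_key_env_var", "base_url")


def resolve_task_config(openai_config: dict, task: str) -> dict:
    """Merge the top-level defaults with the optional override for `task`."""
    resolved = {key: openai_config.get(key) for key in DEFAULT_KEYS}
    resolved.update(openai_config.get(task) or {})
    return resolved


# Mirrors the "different providers per task" example above.
openai_config = {
    "model": "gpt-4o-mini",
    "openai_api_key_env_var": "OPENAI_API_KEY",
    "keypoint_detection": {
        "model": "gemini-3-flash-preview",
        "openai_api_key_env_var": "GEMINI_API_KEY",
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
    },
}

for task in ("keypoint_detection", "blocking_detection", "evaluation"):
    print(task, resolve_task_config(openai_config, task)["model"])
```

Only keypoint_detection picks up the Gemini settings; the two tasks without an override keep the OpenAI defaults.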

Set your API keys: export GEMINI_API_KEY="your-key" and/or export OPENAI_API_KEY="your-key"

log_config

Logging settings:

  • run_name: WandB run name prefix

  • wandb_key_env_var: Environment variable name containing WandB API key (default: WANDB_API_KEY)

  • entity_name: WandB entity/team name

algorithm_config

Training hyperparameters:

  • positive_samples_to_train: Positive samples per training iteration (default: 1800)

  • recency_bias_power: Bias toward recent trajectories (default: 2)

  • cutoff_len: Maximum sequence length (default: 16384)

  • per_device_train_batch_size: Batch size per GPU (default: 3)

  • gradient_accumulation_steps: Gradient accumulation (default: 4)

  • learning_rate: Learning rate (default: 1e-6)

  • num_train_epochs: Epochs per iteration (default: 2)

  • warmup_steps: LR warmup steps (default: 30)

  • lr_scheduler_type: Scheduler type (default: constant_with_warmup)

  • val_split_ratio: Validation data fraction (default: 0.05)

  • deepspeed_config_filename: DeepSpeed config file (default: ds_config_b200_zero1.json)

  • report_to: Logging backend (default: wandb)
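
For orientation, the two batch-related defaults above combine into an effective batch size per optimizer step. The sketch assumes standard data-parallel training; the GPU count of 8 is an assumption (it is not stated in this reference) and the function name is hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to each optimizer step under data parallelism."""
    return per_device * grad_accum * num_gpus


# Defaults from algorithm_config; num_gpus=8 is an assumed node size.
print(effective_batch_size(per_device=3, grad_accum=4, num_gpus=8))  # 96
```

Changing per_device_train_batch_size to fit a different GPU can be offset by adjusting gradient_accumulation_steps so that the product stays the same.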

DeepSpeed Configs

DeepSpeed config files are located in scripts/config/deepspeed/. These control distributed training memory optimization and offloading strategies.

Available Configurations:

File                        ZeRO Stage  Use Case
--------------------------  ----------  ------------------------------------------------------
ds_config_h100_zero1.json   1           H100 GPUs, optimizer state partitioning only
ds_config_h100_zero2.json   2           H100 GPUs, optimizer + gradient partitioning
ds_config_h100_zero3.json   3           H100 GPUs, full parameter partitioning + CPU offload
ds_config_b200_zero1.json   1           B200 GPUs, optimizer state partitioning only (default)
ds_config_b200_zero2.json   2           B200 GPUs, optimizer + gradient partitioning
ds_config_b200_zero3.json   3           B200 GPUs, full partitioning

ZeRO Stages:

  • Stage 1 - Optimizer state partitioning. Lowest memory savings, highest speed.

  • Stage 2 - Optimizer + gradient partitioning. Moderate memory savings.

  • Stage 3 - Full parameter partitioning. Maximum memory savings. CPU offload is enabled in the H100 config but not the B200 config.
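
A rough estimate makes the trade-off between stages concrete. The sketch below uses the model-state accounting from the ZeRO paper for mixed-precision Adam (2 bytes of parameters, 2 bytes of gradients, and 12 bytes of optimizer state per parameter). It ignores activations, buffers, and fragmentation, and the 8B-parameter model on 8 GPUs is a hypothetical example, not a measurement of these configs.

```python
def zero_model_state_gib(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GiB for a ZeRO stage (0-3)."""
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus   # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus   # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 also partitions parameters
    return num_params * (params + grads + optim) / 2**30


# Hypothetical 8B-parameter model on 8 GPUs.
for stage in (1, 2, 3):
    print(stage, round(zero_model_state_gib(8e9, 8, stage), 1))
```

The estimate shrinks with each stage, while the extra communication needed to gather partitioned state is what costs speed.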

Key Settings:

  • bf16.enabled: Use bfloat16 precision (recommended for H100/B200)

  • activation_checkpointing: Recompute activations during backward pass to save memory

  • zero_optimization.stage: ZeRO optimization level (1, 2, or 3)

  • zero_optimization.offload_param: Offload model parameters to CPU (ZeRO-3 only)

  • zero_optimization.offload_optimizer: Offload optimizer states to CPU

  • overlap_comm: Overlap communication with computation for better throughput

To use a different config, modify update_online.yaml:

algorithm_config:
  deepspeed_config_filename: "ds_config_h100_zero2.json"  # Use ZeRO-2 instead
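
Before switching files, it can help to check what a given DeepSpeed config actually enables. The sketch below summarizes the key settings from a parsed config. The JSON shown is a hypothetical example written for illustration, not the contents of any file in scripts/config/deepspeed/, and `summarize_ds_config` is a hypothetical helper.

```python
import json


def summarize_ds_config(cfg: dict) -> dict:
    """Pull out the key settings listed above from a parsed DeepSpeed config."""
    zero = cfg.get("zero_optimization", {})
    return {
        "bf16": cfg.get("bf16", {}).get("enabled", False),
        "zero_stage": zero.get("stage", 0),
        # DeepSpeed offload sections carry a "device" key ("cpu", "nvme", "none").
        "offload_param": zero.get("offload_param", {}).get("device", "none"),
        "offload_optimizer": zero.get("offload_optimizer", {}).get("device", "none"),
    }


# Hypothetical file contents, inlined so the example is self-contained.
example = json.loads("""
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu"},
    "offload_optimizer": {"device": "cpu"}
  }
}
""")
print(summarize_ds_config(example))
```

To inspect a real file, replace the inline string with `json.load(open(path))` for the chosen config under scripts/config/deepspeed/.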