Configuration Reference
WebGym Configs
The system uses Hydra for hierarchical configuration management.
Config files are located in scripts/config/main/.
Config File Hierarchy:
default.yaml # Base config (paths, policy, context)
├── rollout.yaml # Base rollout config (env, timeouts, workers)
│ ├── rollout_train.yaml # Train rollout (train tasks, difficulty)
│ └── rollout_test.yaml # Eval rollout (test tasks, difficulty)
└── update_online.yaml # Training config (hyperparameters, data sampling)
Each config file uses defaults: to inherit from parent configs. Settings in child
configs override parent values. For example, rollout_train.yaml inherits from
rollout.yaml which inherits from default.yaml:
# default.yaml defines base policy settings:
policy_config:
  temperature: 1
  max_new_tokens: 3072

# rollout.yaml inherits default.yaml and adds env settings:
defaults:
  - default
env_config:
  server_size: 64

# rollout_train.yaml inherits default.yaml and rollout.yaml, then overrides specifics:
defaults:
  - default
  - rollout
env_config:
  split: "train"
  train_tasks_rollout_size: 1024
# temperature and server_size are still inherited from the parents
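The override order described above can be illustrated with a plain recursive dictionary merge. This is a simplified sketch of the idea, not Hydra's actual composition logic: `deep_merge` is a hypothetical helper, and the dictionaries contain only the keys shown in the example.

```python
from copy import deepcopy


def deep_merge(parent: dict, child: dict) -> dict:
    """Return a new dict in which child values override parent values.

    Nested dicts are merged key by key; everything else is replaced.
    """
    merged = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Only the keys from the example above; the real files contain more.
default_cfg = {"policy_config": {"temperature": 1, "max_new_tokens": 3072}}
rollout_cfg = {"env_config": {"server_size": 64}}
rollout_train_cfg = {
    "env_config": {"split": "train", "train_tasks_rollout_size": 1024}
}

# Parents first, children last, so the most specific file wins.
effective = deep_merge(deep_merge(default_cfg, rollout_cfg), rollout_train_cfg)
print(effective)
```

The merged result keeps `temperature` and `server_size` from the parents and adds the train-specific keys under `env_config`.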
save_path
Root directory for outputs (set via --log-path)

data_path
Path to read-only shared data (task files, HuggingFace cache, etc.; set via --data-path)

model_config
Model selection:
- model_type: Model type (qwen3-instruct or qwen3-think)
- prompt_version: Prompt version (vanilla or complete)

policy_config
Model inference settings:
- base_model: HuggingFace model name (set by run.sh)
- checkpoint_path: Path to model checkpoint
- max_new_tokens: Maximum tokens to generate (default: 3072)
- temperature: Sampling temperature (default: 1)
- top_p: Top-p sampling (default: 0.99)
- top_k: Top-k sampling (default: 2)
- min_p: Min-p sampling (default: 0)
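To make the four sampling parameters concrete, here is a minimal sketch of how temperature, top-k, top-p and min-p are commonly applied to a token distribution. It is an illustration only, not the inference engine's implementation: `filter_probs` is a hypothetical function, and the order of the filters and the point at which probabilities are renormalized vary between engines.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Return {token_index: probability} after the common sampling filters.

    temperature must be > 0. top_k=0, top_p=1.0 and min_p=0.0 disable
    the respective filter.
    """
    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate tokens, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = list(order)

    if top_k > 0:  # keep only the k most probable tokens
        keep = keep[:top_k]

    if top_p < 1.0:  # keep the smallest prefix whose mass reaches top_p
        prefix, cumulative = [], 0.0
        for i in keep:
            prefix.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep = prefix

    if min_p > 0.0:  # drop tokens far less likely than the best token
        threshold = min_p * probs[order[0]]
        keep = [i for i in keep if probs[i] >= threshold]

    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


# The documented defaults: temperature 1, top_p 0.99, top_k 2, min_p 0.
dist = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1, top_k=2, top_p=0.99, min_p=0)
print(dist)
```

With `top_k: 2`, at most two candidate tokens survive at each step regardless of the other settings.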
context_config
Interaction mode settings:
- interaction_mode: How to represent UI elements (coordinates or set_of_marks)
env_config
Environment and HTTP settings:
- vllm_server_url: vLLM endpoint (default: http://localhost:8999)
- wait_timeout: HTTP queue timeout in seconds (default: 2400)
- operation_timeout: HTTP operation timeout (default: 120)
- vllm_timeout: vLLM request timeout (default: 240)
- evaluation_workers: Concurrent trajectory evaluation workers (default: 16)
- screenshot_comparison_workers: Concurrent screenshot comparison workers (default: 32)
- completion_threshold: Fraction of tasks to complete before killing stragglers (default: 0.95). For example, with 1024 tasks and completion_threshold=0.95, once 973 tasks finish, the remaining 51 slow tasks enter the grace period.
- completion_grace_period: Seconds to wait after the threshold is reached before force-killing remaining tasks (default: 30). Continuing the example above, the 51 remaining tasks get 30 more seconds to finish; any still running after that are terminated.
- instance_lifetime_max: Maximum browser instance lifetime in minutes (default: 50)
- task_timeout_minutes: Maximum task timeout in minutes (default: 300)
- server_size: Browser instances per batch (default: 64)
- verbose: Enable verbose logging (default: False)
- use_rich_actree: Use rich accessibility tree format (default: False)
- max_retries: Maximum HTTP retries (default: 2)
- http_pools: Connection pool sizes per operation type. Train and test configs use different pool sizes:

    # Train (rollout_train.yaml):
    http_pools:
      # Once per task
      metadata: 128        # Screen dimensions
      navigate: 128        # Initial navigation to website
      allocate: 4          # Instance allocation
      release: 4           # Instance cleanup/release
      # Once per step
      screenshot: 128      # Screenshot capture (most frequent, can be slow)
      ac_tree: 128         # Accessibility tree retrieval
      page_metadata: 128   # Page title/URL (once per step)
      execute: 128         # Action execution (click, type, scroll)

    # Test (rollout_test.yaml):
    http_pools:
      navigate: 32         # Initial navigation to website
      screenshot: 128      # Screenshot capture
      ac_tree: 64          # Accessibility tree retrieval
      metadata: 32         # Screen dimensions
      page_metadata: 64    # Page title/URL (once per step)
      execute: 64          # Action execution (click, type, scroll)
      allocate: 4          # Instance allocation
      release: 4           # Instance cleanup/release

- max_vllm_sessions: Concurrent vLLM requests (default: 128 train, 64 test). Experimental and not yet supported; use server_size instead.
- split: Dataset split (train or test)
- train_difficulty_max_steps: Max steps per difficulty (easy: 10, medium: 20, hard: 30)
- test_difficulty_max_steps: Max steps per difficulty (easy: 30, medium: 50, hard: 70)
- train_tasks_rollout_size: Tasks per train batch (default: 1024)
- test_tasks_rollout_size: Tasks per eval batch (default: -1, meaning all tasks, in the test config; 0 in the train config)
- test_tasks_repeat_times: Repeat each test task N times (default: 1 in the test config, 0 in the train config)
- train_tasks_sampler: Task sampling strategy (uniform or ratio)
- train_tasks: Train task file (default: train.jsonl)
- test_tasks: Test task file (default: test.jsonl)
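The completion_threshold arithmetic from the example can be reproduced in a few lines. This is a sketch under one assumption: the threshold count is rounded up, which matches the 973 in the example but is not confirmed by the source. `straggler_plan` is a hypothetical helper, not a function in the codebase.

```python
import math


def straggler_plan(total_tasks, completion_threshold=0.95, completion_grace_period=30):
    """Compute when the grace period starts and how many tasks it affects.

    Assumption: the number of tasks required to reach the threshold is
    rounded up (ceil).
    """
    finished_needed = math.ceil(total_tasks * completion_threshold)
    return {
        "finished_needed": finished_needed,            # tasks that must finish first
        "stragglers": total_tasks - finished_needed,   # tasks entering the grace period
        "grace_seconds": completion_grace_period,      # time left before termination
    }


plan = straggler_plan(1024)
print(plan)
```

For 1024 tasks with the default settings this reproduces the figures in the example: 973 finished tasks trigger the grace period, and 51 stragglers get 30 more seconds.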
openai_config
Evaluator API settings (supports OpenAI and Gemini), with optional per-task model configuration.

Default settings (apply to all tasks unless overridden):
- model: Default evaluation model (default: gemini-3-flash-preview)
- openai_api_key_env_var: Environment variable for API key (default: GEMINI_API_KEY)
- base_url: Base URL for API (default: https://generativelanguage.googleapis.com/v1beta/openai/)

Per-task overrides (optional; each can specify model, openai_api_key_env_var, base_url):
- keypoint_detection: For judging which screenshots to submit (N-1 calls per trajectory)
- blocking_detection: For detecting CAPTCHA/blocking pages (1 call per trajectory)
- evaluation: For criterion_a, criterion_b, and reference_answer (2 + R calls per trajectory)
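Adding up the per-task call counts gives the total evaluator load per trajectory: (N - 1) + 1 + (2 + R) = N + R + 2. The sketch below does this arithmetic; N and R are used exactly as named in the list above, which does not define them further, and `evaluator_calls` is a hypothetical helper.

```python
def evaluator_calls(n: int, r: int) -> dict:
    """Evaluator API calls per trajectory, using the counts listed above.

    n and r correspond to N and R in the list; their precise meaning is
    not defined in this section.
    """
    calls = {
        "keypoint_detection": n - 1,
        "blocking_detection": 1,
        "evaluation": 2 + r,
    }
    calls["total"] = sum(calls.values())  # equals n + r + 2
    return calls


print(evaluator_calls(n=10, r=3))
```

Because keypoint detection scales with N, it usually dominates the call volume, which is why the examples route it to a cheaper model.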
Supported Gemini Models:
- gemini-3-flash-preview: More capable model for complex evaluation tasks
- gemini-2.5-flash-lite: Faster, cheaper model for high-volume operations (keypoint detection, blocking detection)
Example: Using Gemini with per-task configuration (default):
openai_config:
  # Default configuration for evaluation
  model: "gemini-3-flash-preview"
  openai_api_key_env_var: "GEMINI_API_KEY"
  base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
  # Use lighter model for high-volume operations
  keypoint_detection:
    model: "gemini-2.5-flash-lite"
  blocking_detection:
    model: "gemini-2.5-flash-lite"
  evaluation:
    model: "gemini-3-flash-preview"
Example: Using different providers per task:
openai_config:
  # Default: OpenAI for evaluation
  model: "gpt-4o-mini"
  openai_api_key_env_var: "OPENAI_API_KEY"
  # Use Gemini for high-volume keypoint detection (cheaper)
  keypoint_detection:
    model: "gemini-3-flash-preview"
    openai_api_key_env_var: "GEMINI_API_KEY"
    base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
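The "defaults unless overridden" rule for per-task settings can be sketched as a small lookup. This is an illustration of the documented behavior, not the project's code: `resolve_task_config` and `api_key_for` are hypothetical helpers, and the dictionary mirrors the multi-provider example above.

```python
import os

SHARED_KEYS = ("model", "openai_api_key_env_var", "base_url")


def resolve_task_config(openai_config: dict, task: str) -> dict:
    """Start from the top-level defaults, then apply the task's overrides."""
    resolved = {key: openai_config.get(key) for key in SHARED_KEYS}
    resolved.update(openai_config.get(task) or {})
    return resolved


def api_key_for(resolved: dict):
    """Read the API key from the environment variable named in the config."""
    return os.environ.get(resolved["openai_api_key_env_var"])


openai_config = {
    "model": "gpt-4o-mini",
    "openai_api_key_env_var": "OPENAI_API_KEY",
    "keypoint_detection": {
        "model": "gemini-3-flash-preview",
        "openai_api_key_env_var": "GEMINI_API_KEY",
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
    },
}

keypoint_cfg = resolve_task_config(openai_config, "keypoint_detection")
evaluation_cfg = resolve_task_config(openai_config, "evaluation")
print(keypoint_cfg["model"], evaluation_cfg["model"])
```

Here keypoint detection resolves to the Gemini settings, while evaluation, which has no override block, falls back to the top-level model and key variable.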
Set your API keys:
export GEMINI_API_KEY="your-key"
and/or
export OPENAI_API_KEY="your-key"

log_config
Logging settings:
- run_name: WandB run name prefix
- wandb_key_env_var: Environment variable name containing WandB API key (default: WANDB_API_KEY)
- entity_name: WandB entity/team name

algorithm_config
Training hyperparameters:
- positive_samples_to_train: Positive samples per training iteration (default: 1800)
- recency_bias_power: Bias toward recent trajectories (default: 2)
- cutoff_len: Maximum sequence length (default: 16384)
- per_device_train_batch_size: Batch size per GPU (default: 3)
- gradient_accumulation_steps: Gradient accumulation (default: 4)
- learning_rate: Learning rate (default: 1e-6)
- num_train_epochs: Epochs per iteration (default: 2)
- warmup_steps: LR warmup steps (default: 30)
- lr_scheduler_type: Scheduler type (default: constant_with_warmup)
- val_split_ratio: Validation data fraction (default: 0.05)
- deepspeed_config_filename: DeepSpeed config file (default: ds_config_b200_zero1.json)
- report_to: Logging backend (default: wandb)
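Two quantities follow directly from these hyperparameters: the effective batch size per optimizer step and the learning rate during warmup. The sketch below computes both under stated assumptions: the GPU count of 8 is hypothetical (it is not given in this reference), and the warmup is modeled as a linear ramp to a constant rate, which is the usual meaning of constant_with_warmup.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    """Sequences contributing to one optimizer step across all GPUs."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


def lr_at_step(step, learning_rate=1e-6, warmup_steps=30):
    """Linear warmup from 0 to learning_rate, then constant.

    Assumption: this mirrors the usual constant_with_warmup schedule.
    """
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    return learning_rate


# num_gpus=8 is a hypothetical value chosen for illustration.
batch = effective_batch_size(3, 4, num_gpus=8)
print(batch)            # 3 * 4 * 8 sequences per optimizer step
print(lr_at_step(15))   # halfway through warmup
print(lr_at_step(30))   # warmup finished, constant from here on
```

With the defaults, each GPU contributes 3 x 4 = 12 sequences per optimizer step, so the effective batch scales linearly with the number of GPUs.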
DeepSpeed Configs
DeepSpeed config files are located in scripts/config/deepspeed/. These control
distributed training memory optimization and offloading strategies.
Available Configurations:
| File | ZeRO Stage | Use Case |
|---|---|---|
| | 1 | H100 GPUs, optimizer state partitioning only |
| ds_config_h100_zero2.json | 2 | H100 GPUs, optimizer + gradient partitioning |
| | 3 | H100 GPUs, full parameter partitioning + CPU offload |
| ds_config_b200_zero1.json | 1 | B200 GPUs, optimizer state partitioning only (default) |
| | 2 | B200 GPUs, optimizer + gradient partitioning |
| | 3 | B200 GPUs, full partitioning |
ZeRO Stages:
Stage 1 - Optimizer state partitioning. Lowest memory savings, highest speed.
Stage 2 - Optimizer + gradient partitioning. Moderate memory savings.
Stage 3 - Full parameter partitioning. Maximum memory savings. CPU offload is enabled in the H100 config but not the B200 config.
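The memory trade-off between stages can be estimated with the standard ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for optimizer states (fp32 weights, momentum, variance). The sketch below applies that model; it ignores activations, buffers and CPU offload, and `zero_model_state_bytes` is a hypothetical helper rather than a DeepSpeed API.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model states under a given ZeRO stage.

    Assumes mixed-precision Adam: 2 B params + 2 B grads + 12 B optimizer
    states per parameter. Stage 0 means no partitioning.
    """
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optimizer = 12.0 * num_params
    if stage >= 1:   # Stage 1: partition optimizer states
        optimizer /= num_gpus
    if stage >= 2:   # Stage 2: also partition gradients
        grads /= num_gpus
    if stage >= 3:   # Stage 3: also partition parameters
        params /= num_gpus
    return params + grads + optimizer


# Illustrative numbers: a 7.5B-parameter model on 64 GPUs.
for stage in (0, 1, 2, 3):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

The estimate shows why Stage 1 already removes most of the footprint (optimizer states dominate) and why Stage 3 is reserved for models that do not fit otherwise.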
Key Settings:
- bf16.enabled: Use bfloat16 precision (recommended for H100/B200)
- activation_checkpointing: Recompute activations during backward pass to save memory
- zero_optimization.stage: ZeRO optimization level (1, 2, or 3)
- zero_optimization.offload_param: Offload model parameters to CPU (ZeRO-3 only)
- zero_optimization.offload_optimizer: Offload optimizer states to CPU
- overlap_comm: Overlap communication with computation for better throughput
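To show how these settings fit together in one file, the sketch below builds a minimal DeepSpeed-style config dictionary and serializes it to JSON. The values are illustrative and are not the contents of the shipped config files; `build_ds_config` is a hypothetical helper. Placing `overlap_comm` under `zero_optimization` follows DeepSpeed's config layout.

```python
import json


def build_ds_config(stage: int, offload_to_cpu: bool = False) -> dict:
    """Assemble a minimal config using the settings described above."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")

    zero = {"stage": stage, "overlap_comm": True}
    if offload_to_cpu:
        zero["offload_optimizer"] = {"device": "cpu"}
        if stage == 3:  # parameter offload is only available with ZeRO-3
            zero["offload_param"] = {"device": "cpu"}

    return {
        "bf16": {"enabled": True},
        "zero_optimization": zero,
        # Illustrative value; the shipped files may configure this differently.
        "activation_checkpointing": {"partition_activations": False},
    }


config = build_ds_config(stage=3, offload_to_cpu=True)
print(json.dumps(config, indent=2))
```

Requesting offload with Stage 1 or 2 in this sketch adds only `offload_optimizer`, reflecting that parameter offload is a ZeRO-3 feature.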
To use a different config, modify update_online.yaml:
algorithm_config:
  deepspeed_config_filename: "ds_config_h100_zero2.json"  # Use ZeRO-2 instead
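Since the setting holds only a filename, the full path is presumably formed by joining it with the DeepSpeed config directory named above. A minimal sketch of that lookup, assuming this join is how the path is built (`deepspeed_config_path` is a hypothetical helper):

```python
from pathlib import Path

# Directory stated in this reference; the join itself is an assumption.
DEEPSPEED_DIR = Path("scripts/config/deepspeed")


def deepspeed_config_path(algorithm_config: dict, repo_root: Path = Path(".")) -> Path:
    """Return the path of the DeepSpeed config selected in algorithm_config."""
    return repo_root / DEEPSPEED_DIR / algorithm_config["deepspeed_config_filename"]


path = deepspeed_config_path({"deepspeed_config_filename": "ds_config_h100_zero2.json"})
print(path.as_posix())
```

Checking `path.exists()` before launching a run is a cheap way to catch a mistyped filename.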