Replay Buffer

The replay buffer module (webgym/data/) handles the storage and sampling of data for RL training.

Module Structure

webgym/data/
├── __init__.py
├── replay_buffer.py        # Main replay buffer class
├── components.py           # Data structures (Task, Action, Reward, etc.)
└── response_decomposer.py  # Response parsing utilities

ReplayBuffer

The ReplayBuffer class (replay_buffer.py) extends PyTorch’s Dataset and provides:

Trajectory storage and management
Filtering for successful/unsuccessful trajectories
Same-screenshot step filtering
Support for distributed training

Initialization:

from webgym.data import ReplayBuffer

replay_buffer = ReplayBuffer(
    trajectories=trajectory_list,
    agent=web_agent,
    capacity=None,  # None = unlimited
    filter_successful_only=False,
    include_reward_in_sample=True,
    shuffle=False,
    filter_same_screenshot=True
)

Key Parameters:

trajectories: List of trajectory data to process
agent: WebAgent instance for context management
capacity: Maximum number of samples to store (optional; None means unlimited)
filter_successful_only: If True, only samples from successful trajectories are accessible
include_reward_in_sample: Include reward information in each sample (default: True)
shuffle: Whether to shuffle the samples (default: False)
filter_same_screenshot: Filter out steps where screenshots haven’t changed (default: True)
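
To make the filter_same_screenshot option more concrete, here is a minimal sketch of one plausible reading of it: a step is dropped when its screenshot is byte-identical to the previous step's. This is not the actual ReplayBuffer code. The (action_string, screenshot_bytes) step shape and the drop_unchanged_screenshot_steps helper are hypothetical, and the real implementation may compare screenshots differently or drop a different step of the pair.

```python
import hashlib
from typing import List, Optional, Tuple

# Hypothetical step representation: (action_string, screenshot_bytes).
Step = Tuple[str, bytes]


def drop_unchanged_screenshot_steps(steps: List[Step]) -> List[Step]:
    """Keep a step only if its screenshot differs from the previous step's."""
    kept: List[Step] = []
    previous_digest: Optional[str] = None
    for action_string, screenshot in steps:
        # Hash the raw bytes so the comparison is cheap and exact.
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest != previous_digest:
            kept.append((action_string, screenshot))
        previous_digest = digest
    return kept


# Illustrative data: the second step did not change the page.
steps = [("click", b"page-a"), ("scroll", b"page-a"), ("type", b"page-b")]
filtered = drop_unchanged_screenshot_steps(steps)
```

In this sketch only exact duplicates are removed; a perceptual comparison would be needed to catch screenshots that differ by a few pixels.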

Data Components

webgym/data/components.py defines core data structures:

Task: Task description and metadata (task_name, domain, subdomain, website, difficulty, evaluator_reference, reference_answer, attempt_level, task_id, max_steps, trajectory_index)
Observation: Screenshot and page state (task, image_path, ac_tree, page_metadata)
Action: Agent action (action, action_string)
Response: Model response (raw_response, answering_tokens, raw_prompt)
Reward: Task completion reward (reward, evaluation, is_blocked, submit, submission_judgment)
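
The structures above could be sketched as plain dataclasses. The field names are taken from the list; the types, the use of dataclasses, and the example values are assumptions made for illustration and may not match components.py.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Task:
    # Field names come from the list above; every type here is an assumption.
    task_name: str
    domain: str
    subdomain: str
    website: str
    difficulty: Any
    evaluator_reference: Any
    reference_answer: Any
    attempt_level: Any
    task_id: Any
    max_steps: int
    trajectory_index: int


@dataclass
class Observation:
    task: Task
    image_path: str
    ac_tree: Any
    page_metadata: Any


@dataclass
class Action:
    action: Any
    action_string: str


@dataclass
class Response:
    raw_response: str
    answering_tokens: Any
    raw_prompt: Any


@dataclass
class Reward:
    reward: float
    evaluation: Any
    is_blocked: bool
    submit: bool
    submission_judgment: Any


# Illustrative values only; none of these come from the project.
task = Task(
    task_name="find_price", domain="shopping", subdomain="electronics",
    website="https://example.com", difficulty=1, evaluator_reference=None,
    reference_answer=None, attempt_level=0, task_id="task-0",
    max_steps=10, trajectory_index=0,
)
observation = Observation(task=task, image_path="step_0.png", ac_tree=None, page_metadata={})
reward = Reward(reward=1.0, evaluation=None, is_blocked=False, submit=True, submission_judgment=None)
```

Note that in this sketch each Observation carries a reference to its Task, mirroring the task field listed for Observation.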

Key Methods (ReplayBuffer):

get_training_samples(num_samples=None, recency_bias_power=1.0): Returns training-eligible samples, drawn from the steps of successful trajectories with recency-weighted sampling. When recency_bias_power is 1.0, sampling is uniform at random.
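
The exact weighting used by get_training_samples is not given here. As a rough illustration, the sketch below uses one possible scheme that reduces to uniform sampling at recency_bias_power=1.0: the sample at position i (oldest first) gets weight ((i + 1) / n) ** (recency_bias_power - 1). The formula, the helper names recency_weights and sample_with_recency_bias, the rng parameter, and sampling with replacement are all assumptions, not the library's implementation.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def recency_weights(n: int, recency_bias_power: float) -> List[float]:
    """Assumed weighting: later positions get larger weights when power > 1."""
    return [((i + 1) / n) ** (recency_bias_power - 1.0) for i in range(n)]


def sample_with_recency_bias(
    samples: Sequence[T],
    num_samples: Optional[int] = None,
    recency_bias_power: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Draw samples (with replacement), ordered oldest to newest on input."""
    rng = rng or random.Random()
    n = len(samples)
    if n == 0:
        return []
    k = n if num_samples is None else num_samples
    # At power 1.0 every weight equals 1.0, so this is uniform sampling.
    weights = recency_weights(n, recency_bias_power)
    return rng.choices(samples, weights=weights, k=k)


# Illustrative usage with a seeded generator for repeatability.
batch = sample_with_recency_bias(
    ["s0", "s1", "s2", "s3"], num_samples=3, recency_bias_power=2.0,
    rng=random.Random(0),
)
```

Under this assumed scheme, a power of 2.0 makes the newest of four samples four times as likely to be drawn as the oldest.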