Replay Buffer

The replay buffer module (webgym/data/) manages the storage and sampling of training data for RL.

Module Structure

webgym/data/
├── __init__.py
├── replay_buffer.py        # Main replay buffer class
├── components.py           # Data structures (Task, Action, Reward, etc.)
└── response_decomposer.py  # Response parsing utilities

ReplayBuffer

The ReplayBuffer class (replay_buffer.py) extends PyTorch’s Dataset and provides:

  • Trajectory storage and management

  • Filtering for successful/unsuccessful trajectories

  • Same-screenshot step filtering

  • Support for distributed training

Initialization:

from webgym.data import ReplayBuffer

replay_buffer = ReplayBuffer(
    trajectories=trajectory_list,
    agent=web_agent,
    capacity=None,  # None = unlimited
    filter_successful_only=False,
    include_reward_in_sample=True,
    shuffle=False,
    filter_same_screenshot=True
)

Key Parameters:

  • trajectories: List of trajectory data to process

  • agent: WebAgent instance for context management

  • capacity: Maximum number of samples to store (optional; None means unlimited)

  • filter_successful_only: If True, only samples from successful trajectories are accessible

  • include_reward_in_sample: Include reward information in each sample (default: True)

  • shuffle: Shuffle samples (default: False)

  • filter_same_screenshot: Filter out steps where the screenshot has not changed (default: True)
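
To illustrate what same-screenshot filtering could look like, the sketch below keeps only the steps whose screenshot differs from the previous step's. It is not the ReplayBuffer implementation: the function name filter_unchanged_steps, the use of raw image bytes, and the SHA-256 comparison are all assumptions made for this example.

```python
import hashlib


def filter_unchanged_steps(screenshots: list[bytes]) -> list[int]:
    """Return indices of steps whose screenshot differs from the previous step's.

    Hypothetical helper: the real filter may compare screenshots differently.
    """
    kept = []
    previous_digest = None
    for index, image_bytes in enumerate(screenshots):
        digest = hashlib.sha256(image_bytes).hexdigest()
        # Keep the step only if the page visibly changed since the last step.
        if digest != previous_digest:
            kept.append(index)
        previous_digest = digest
    return kept


steps = [b"page-a", b"page-a", b"page-b", b"page-b", b"page-a"]
print(filter_unchanged_steps(steps))  # [0, 2, 4]
```

Comparing digests rather than pixel data keeps the check cheap, at the cost of treating any byte-level difference as a change.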

Data Components

webgym/data/components.py defines core data structures:

  • Task: Task description and metadata (task_name, domain, subdomain, website, difficulty, evaluator_reference, reference_answer, attempt_level, task_id, max_steps, trajectory_index)

  • Observation: Screenshot and page state (task, image_path, ac_tree, page_metadata)

  • Action: Agent action (action, action_string)

  • Response: Model response (raw_response, answering_tokens, raw_prompt)

  • Reward: Task completion reward (reward, evaluation, is_blocked, submit, submission_judgment)
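
As a rough picture of how two of these structures might be declared, the sketch below models Action and Reward as dataclasses. The field names come from the list above; the field types and default values are assumptions, and the actual definitions in components.py may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Action:
    action: Any         # parsed action; the concrete type is an assumption
    action_string: str  # textual form of the action


@dataclass
class Reward:
    reward: float
    evaluation: Optional[str] = None           # assumed type and default
    is_blocked: bool = False                   # assumed default
    submit: bool = False                       # assumed default
    submission_judgment: Optional[str] = None  # assumed type and default


click = Action(action={"type": "click", "target": 3}, action_string="click [3]")
step_reward = Reward(reward=1.0, submit=True)
print(click.action_string, step_reward.reward, step_reward.is_blocked)
```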

Key Methods:

get_training_samples(num_samples=None, recency_bias_power=1.0)

Returns training-eligible samples, drawn from the steps of successful trajectories, using recency-weighted sampling. When recency_bias_power is 1.0 (the default), sampling is uniform at random.
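
One weighting scheme consistent with this description is sketched below: each sample gets weight recency_bias_power ** position, with position 0 the oldest, so a power of 1.0 yields equal weights and uniform sampling. This is an assumed formula, not the documented one, and the helper names recency_weights and sample_recent are hypothetical.

```python
import random


def recency_weights(num_items: int, recency_bias_power: float = 1.0) -> list[float]:
    # Assumed scheme: weight = power ** position, where position 0 is the oldest.
    # A power of 1.0 gives every item weight 1.0, i.e. uniform sampling.
    return [recency_bias_power ** position for position in range(num_items)]


def sample_recent(samples, num_samples, recency_bias_power=1.0, seed=None):
    """Draw num_samples items with replacement, favoring later items."""
    rng = random.Random(seed)
    weights = recency_weights(len(samples), recency_bias_power)
    return rng.choices(samples, weights=weights, k=num_samples)


print(recency_weights(4, 1.0))  # [1.0, 1.0, 1.0, 1.0]
print(recency_weights(4, 2.0))  # [1.0, 2.0, 4.0, 8.0]
print(len(sample_recent(list(range(10)), num_samples=5, recency_bias_power=2.0, seed=0)))  # 5
```

Note that this sketch samples with replacement; whether get_training_samples does so is not stated here.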