Replay Buffer
The replay buffer module (webgym/data/) stores trajectory data and serves samples from it for RL training.
Module Structure
webgym/data/
├── __init__.py
├── replay_buffer.py # Main replay buffer class
├── components.py # Data structures (Task, Action, Reward, etc.)
└── response_decomposer.py # Response parsing utilities
ReplayBuffer
The ReplayBuffer class (replay_buffer.py) extends PyTorch’s Dataset and provides:
Trajectory storage and management
Filtering for successful/unsuccessful trajectories
Same-screenshot step filtering
Support for distributed training
Initialization:
from webgym.data import ReplayBuffer

# trajectory_list and web_agent are assumed to be created elsewhere
replay_buffer = ReplayBuffer(
    trajectories=trajectory_list,
    agent=web_agent,
    capacity=None,  # None = unlimited
    filter_successful_only=False,
    include_reward_in_sample=True,
    shuffle=False,
    filter_same_screenshot=True,
)
Key Parameters:
trajectories: List of trajectory data to process
agent: WebAgent instance for context management
capacity: Maximum number of samples to store (optional)
filter_successful_only: If True, only samples from successful trajectories are accessible
include_reward_in_sample: Include reward information in each sample (default: True)
shuffle: Shuffle samples (default: False)
filter_same_screenshot: Filter out steps where screenshots haven’t changed (default: True)
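To make the filter_same_screenshot option more concrete, here is a minimal sketch of one plausible reading of it: a step is dropped when its screenshot is byte-identical to the screenshot of the preceding step. The function name, the dict-based step layout, and the "screenshot" key are all hypothetical and are not taken from replay_buffer.py; the actual comparison used by ReplayBuffer may differ.

```python
import hashlib
from typing import Any, Dict, List


def drop_unchanged_screenshot_steps(steps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only steps whose screenshot differs from the previous step's.

    Assumption: each step is a dict holding raw image bytes under the
    hypothetical key "screenshot". This is an illustration only.
    """
    kept: List[Dict[str, Any]] = []
    previous_digest = None
    for step in steps:
        # Hash the image bytes so the comparison is cheap and exact.
        digest = hashlib.sha256(step["screenshot"]).hexdigest()
        if digest != previous_digest:
            kept.append(step)
        previous_digest = digest
    return kept


steps = [
    {"id": 0, "screenshot": b"page-a"},
    {"id": 1, "screenshot": b"page-a"},  # unchanged, dropped
    {"id": 2, "screenshot": b"page-b"},
    {"id": 3, "screenshot": b"page-a"},  # differs from step 2, kept
]
kept_ids = [step["id"] for step in drop_unchanged_screenshot_steps(steps)]
```

Comparing only against the immediately preceding step means a page that is revisited later still counts as a change, which seems closer to the idea of "the action had an effect" than global de-duplication would be.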
Data Components
webgym/data/components.py defines core data structures:
Task: Task description and metadata (task_name, domain, subdomain, website, difficulty, evaluator_reference, reference_answer, attempt_level, task_id, max_steps, trajectory_index)
Observation: Screenshot and page state (task, image_path, ac_tree, page_metadata)
Action: Agent action (action, action_string)
Response: Model response (raw_response, answering_tokens, raw_prompt)
Reward: Task completion reward (reward, evaluation, is_blocked, submit, submission_judgment)
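As a rough picture of how these structures fit together, the sketch below models them as Python dataclasses. Only the class and field names come from the list above. The use of dataclasses, every type annotation, and the absence of defaults are assumptions made for illustration and may not match components.py.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class Task:
    # Field names from the documentation; all types are assumed.
    task_name: str
    domain: str
    subdomain: str
    website: str
    difficulty: Any
    evaluator_reference: Any
    reference_answer: Any
    attempt_level: Any
    task_id: Any
    max_steps: int
    trajectory_index: int


@dataclass
class Observation:
    task: Task
    image_path: str
    ac_tree: Any          # presumably an accessibility tree (assumption)
    page_metadata: Any


@dataclass
class Action:
    action: Any
    action_string: str


@dataclass
class Response:
    raw_response: str
    answering_tokens: Any
    raw_prompt: Any


@dataclass
class Reward:
    reward: float
    evaluation: Any
    is_blocked: bool
    submit: bool
    submission_judgment: Any


# Build one Reward and list its field names.
example_reward = Reward(
    reward=1.0,
    evaluation=None,
    is_blocked=False,
    submit=True,
    submission_judgment=None,
)
reward_field_names = [f.name for f in fields(Reward)]
```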
Key Methods:
get_training_samples(num_samples=None, recency_bias_power=1.0): Returns training-eligible samples using recency-weighted sampling. Samples are drawn from the steps of successful trajectories. If recency_bias_power is 1.0, sampling is uniform at random.
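The documentation does not state the weighting formula, so the following is only a sketch of one scheme that is consistent with the description: each sample at position i (0 = oldest) gets weight ((i + 1) / n) ** (recency_bias_power - 1), which reduces to uniform sampling when the power is 1.0 and favors newer samples as the power grows. The helper names, the oldest-first ordering, sampling with replacement, and the seed parameter are all assumptions, not the behavior of ReplayBuffer.get_training_samples.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def recency_weights(n: int, recency_bias_power: float = 1.0) -> List[float]:
    """Weights for n samples ordered oldest to newest (assumed ordering).

    With recency_bias_power == 1.0 the exponent is 0, so every weight is 1.0
    and sampling is uniform.
    """
    return [((i + 1) / n) ** (recency_bias_power - 1.0) for i in range(n)]


def sample_with_recency_bias(
    samples: Sequence[T],
    num_samples: Optional[int] = None,
    recency_bias_power: float = 1.0,
    seed: Optional[int] = None,
) -> List[T]:
    """Draw samples with replacement, weighting newer items more heavily."""
    if not samples:
        return []
    # Assumption: num_samples=None means "draw as many as are stored".
    k = len(samples) if num_samples is None else num_samples
    rng = random.Random(seed)
    weights = recency_weights(len(samples), recency_bias_power)
    return rng.choices(samples, weights=weights, k=k)


uniform = recency_weights(4, recency_bias_power=1.0)
biased = recency_weights(4, recency_bias_power=2.0)
batch = sample_with_recency_bias(list(range(10)), num_samples=5,
                                 recency_bias_power=2.0, seed=0)
```

Here the input is taken to be already restricted to training-eligible steps; filtering for successful trajectories would happen before this call.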