RL Pipeline Overview
The WebGym RL pipeline consists of several interconnected components located in webgym/.
Component Summary
- Context Management (webgym/context/): Handles conversation building and response parsing for different model types and interaction modes.
- Replay Buffer (webgym/data/): Manages trajectory storage, sampling, and filtering for training.
- Rollout Collection (webgym/environment/): Orchestrates parallel browser interactions and trajectory collection.
- Policy Configuration (webgym/models/): Defines the WebAgent and model interfaces for action generation.
- Utilities (webgym/utils/): Shared utilities, including blocklist management, image processing, task sampling, task history tracking, and trajectory storage.

  webgym/utils/
  ├── __init__.py
  ├── blocklist_manager.py     # Blocked website management
  ├── image_utils.py           # Image encoding/decoding for vision models
  ├── rollout_sampler.py       # Task selection strategies for rollouts
  ├── task_history_manager.py  # Task attempt history tracking
  └── trajectory_storage.py    # Incremental trajectory file storage

- WandB Logger (webgym/logging/): Provides experiment tracking and logging integration.
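
To make the Replay Buffer's role (storage, filtering, sampling) more concrete, here is a minimal sketch. It is not the code in webgym/data/. The names `Trajectory` and `ToyReplayBuffer`, the record fields, and the reward-threshold filter are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical trajectory record; field names are illustrative only."""
    task_id: str
    steps: list = field(default_factory=list)
    reward: float = 0.0


class ToyReplayBuffer:
    """Illustrative stand-in, not the class shipped in webgym/data/."""

    def __init__(self, seed=0):
        self._items = []
        self._rng = random.Random(seed)  # seeded for reproducible sampling

    def add(self, trajectory):
        """Store one collected trajectory."""
        self._items.append(trajectory)

    def filter(self, min_reward):
        """Return stored trajectories whose reward is at least min_reward."""
        return [t for t in self._items if t.reward >= min_reward]

    def sample(self, k, min_reward=0.0):
        """Sample up to k filtered trajectories without replacement."""
        pool = self.filter(min_reward)
        return self._rng.sample(pool, min(k, len(pool)))

    def __len__(self):
        return len(self._items)
```

A training step would then draw a batch with something like `buffer.sample(32, min_reward=1.0)`. The actual filtering criteria used by WebGym may differ.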
Data Flow
┌─────────────────┐
│  Task Sampler   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌─────────────────┐
│   AsyncWebGym   │◄────►│    WebAgent     │
│  (environment)  │      │    (policy)     │
└────────┬────────┘      └────────┬────────┘
         │                        │
         │                        ▼
         │               ┌─────────────────┐
         │               │   vLLM Server   │
         │               └─────────────────┘
         ▼
┌─────────────────┐
│  Replay Buffer  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLaMA-Factory  │
│    Training     │
└─────────────────┘
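
The flow in the diagram can be sketched as a simple collection loop. This is a sequential toy version under stated assumptions: `sample_task`, `StubAgent`, `run_episode`, and `collect` are hypothetical stand-ins, the real environment runs browser sessions in parallel, and the real policy would query the vLLM server instead of following a fixed rule.

```python
import random


def sample_task(tasks, rng):
    """Task Sampler stand-in: uniform choice is an assumption."""
    return rng.choice(tasks)


class StubAgent:
    """Stands in for WebAgent; a real policy would call the vLLM server."""

    def act(self, observation):
        # Fixed rule for illustration: act twice, then stop.
        return "click" if observation["step"] < 2 else "stop"


def run_episode(task, agent, max_steps=5):
    """Stands in for AsyncWebGym: step a fake environment until the agent stops."""
    steps = []
    for step in range(max_steps):
        observation = {"task": task, "step": step}
        action = agent.act(observation)
        steps.append((observation, action))
        if action == "stop":
            break
    return {"task": task, "steps": steps}


def collect(tasks, agent, num_episodes, seed=0):
    """Sample tasks, roll them out, and append each trajectory to a buffer."""
    rng = random.Random(seed)
    replay_buffer = []  # a plain list standing in for the Replay Buffer
    for _ in range(num_episodes):
        task = sample_task(tasks, rng)
        replay_buffer.append(run_episode(task, agent))
    return replay_buffer


buffer = collect(["task-1", "task-2"], StubAgent(), num_episodes=3)
```

In the pipeline described above, the contents of the buffer would then be handed to LLaMA-Factory for training. That step is omitted here because its interface is not described in this overview.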