RL Pipeline Overview

The WebGym RL pipeline consists of several interconnected components located in webgym/.

Component Summary

Context Management (webgym/context/)

Handles conversation building and response parsing for different model types and interaction modes.

Replay Buffer (webgym/data/)

Manages trajectory storage, sampling, and filtering for training.

Rollout Collection (webgym/environment/)

Orchestrates parallel browser interactions and trajectory collection.

Policy Configuration (webgym/models/)

Defines the WebAgent and model interfaces for action generation.

Utilities (webgym/utils/)

Shared utilities including blocklist management, image processing, task sampling, task history tracking, and trajectory storage.

webgym/utils/
├── __init__.py
├── blocklist_manager.py      # Blocked website management
├── image_utils.py            # Image encoding/decoding for vision models
├── rollout_sampler.py        # Task selection strategies for rollouts
├── task_history_manager.py   # Task attempt history tracking
└── trajectory_storage.py     # Incremental trajectory file storage

WandB Logger (webgym/logging/)

Provides experiment tracking and logging integration.

Data Flow

┌─────────────────┐
│  Task Sampler   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌─────────────────┐
│  AsyncWebGym    │◄────►│   WebAgent      │
│  (environment)  │      │   (policy)      │
└────────┬────────┘      └────────┬────────┘
         │                        │
         │                        ▼
         │               ┌─────────────────┐
         │               │  vLLM Server    │
         │               └─────────────────┘
         ▼
┌─────────────────┐
│  Replay Buffer  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  LLaMA-Factory  │
│  Training       │
└─────────────────┘