Context Management

The context management module (webgym/context/) handles conversation building and response parsing for the web agent.

Module Structure

webgym/context/
├── __init__.py
├── context_manager.py      # Main context manager class
├── universal_prompt.py     # Prompt templates
└── parsers/                # Action parsers
    ├── __init__.py
    ├── base_parser.py          # Abstract parser interface
    ├── coordinates_parser.py   # Coordinates mode parser
    └── set_of_marks_parser.py  # Set-of-marks mode parser

ContextManager

The ContextManager class (context_manager.py) is the central component that:

  • Manages conversation building for different model types

  • Handles response parsing

  • Supports different interaction modes (coordinates vs set-of-marks)

Initialization:

from webgym.context import ContextManager

context_manager = ContextManager(
    context_config={'interaction_mode': 'coordinates'},  # default is 'set_of_marks'
    model_config={'model_type': 'qwen3-instruct'},
    verbose=True
)

Key Methods:

build_conversation(task, trajectory, current_observation, **kwargs)

Builds a model-specific conversation from the current state.

parse_response(raw_response)

Parses the model’s raw response into structured action data.

get_interaction_mode()

Returns the current interaction mode ('coordinates' or 'set_of_marks').

Note

The interaction_mode value must use underscores: set_of_marks (not set-of-marks with hyphens).

Interaction Modes

Coordinates Mode:

Actions are specified using pixel coordinates (x, y). Example: click(500, 300)

Set-of-Marks Mode:

Actions reference numbered UI elements. Example: click([15]) (clicks element #15)

Parsers

Located in webgym/context/parsers/, parsers convert raw model responses into executable actions:

  • Extract action type (click, type, scroll, etc.)

  • Parse action parameters

  • Validate action format