Viewer

The analysis tools are located in analysis/:

analysis/
├── view_trajs.py           # Trajectory viewer (Gradio web interface)
├── visualize_results.py    # Training/test metrics visualizer (matplotlib)
└── font_manager.py         # Font downloading and caching for Unicode support

Trajectory Viewer

A Gradio-based web interface for inspecting agent trajectories step-by-step.

python analysis/view_trajs.py <split> [OPTIONS]

Arguments:

split (required): train or test
--data-path <path> (required): Data directory path
--log-path <path> (required): Logs directory path
--show-prompt: Display full prompt and response for each step
--position <first|last>: Load the first or last N iterations (use together with --num-iterations)
--num-iterations <N>: Number of iterations to load
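
The argument list above can be sketched as an argparse parser. This is only an illustration of the documented command line, not the actual contents of view_trajs.py: it assumes argparse is used, and select_iterations is a hypothetical helper showing one plausible reading of how --position and --num-iterations combine.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(prog="view_trajs.py")
    parser.add_argument("split", choices=["train", "test"])
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--log-path", required=True)
    parser.add_argument("--show-prompt", action="store_true")
    parser.add_argument("--position", choices=["first", "last"])
    parser.add_argument("--num-iterations", type=int)
    return parser


def select_iterations(iterations, position=None, num_iterations=None):
    """Hypothetical helper: keep the first or last N items, or all of them."""
    items = list(iterations)
    if position is None or num_iterations is None:
        # Without both flags there is nothing to restrict.
        return items
    if num_iterations <= 0:
        return []
    if position == "first":
        return items[:num_iterations]
    return items[-num_iterations:]
```

Under these assumptions, passing --position last --num-iterations 5 would keep only the five most recent iterations.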

Examples:

python analysis/view_trajs.py train --data-path /home/v-baihao/data --log-path /home/v-baihao/logs
python analysis/view_trajs.py train --data-path /home/v-baihao/data --log-path /home/v-baihao/logs --position last --num-iterations 5 --show-prompt

Features:

Interactive step navigation with screenshot display
Action coordinate visualization (red dots on screenshots)
Task metadata display (difficulty, domain, subdomain, website)
Reward and evaluation info per trajectory
Accessibility tree inspection
Submission judgment display

Interface Panels:

The metadata panel shows trajectory ID, task description, difficulty, domain, website, evaluator reference, and accessibility tree.

The screenshot view shows consecutive steps side by side, with action coordinates, submission status, and submission judgments.

The final-answer view shows the agent's final answer with claim-by-claim Criterion B verification against screenshots.

With --show-prompt, each step also shows prompt details:

Full model prompt in JSON format (system message, image inputs, task instructions)
Model output showing memory state, progress tracker, intention, action, and submission judgment

Visualizer

Generates matplotlib plots for training progress and test performance.

python analysis/visualize_results.py [OPTIONS]

Options:

--data-path <path> (required): Data directory path
--log-path <path> (required): Logs directory path
--mode <mode>: train-only, test-only, or train-test (default: train-test)
--ema <float>: EMA smoothing factor (default: 1.0, no smoothing)
--run <name>: Specific run name to visualize
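
To illustrate the --ema option, here is a minimal sketch of exponential moving average smoothing. It assumes the factor weights the newest value, which is consistent with the documented default of 1.0 meaning no smoothing. The function name ema_smooth is hypothetical, and the actual implementation in visualize_results.py may differ.

```python
def ema_smooth(values, alpha=1.0):
    """Exponential moving average where alpha weights the newest value."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    previous = None
    for value in values:
        if previous is None:
            # The first point has no history, so it is kept as-is.
            previous = value
        else:
            previous = alpha * value + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```

With alpha=1.0 the output equals the input. Smaller values give more weight to history and produce smoother curves.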

Output:

Generates metrics.png with a 3x8 grid (24 subplots). Difficulty groups are color-coded: Easy (green), Medium (orange), Hard (red), Overall (black).

Row 1 - Training: Success rate, avg chars/response, avg memory chars, avg steps, % GoBack actions, samples collected, % same screenshot steps, % same screenshot (success only)

Row 2 - OOD Test: Same metrics as Row 1 but for the out-of-distribution test set, plus a step-limited success rate

Row 3 - Diversity/Error: Tasks seen before, tasks with websites seen before, duplicate websites, train/test crash rates, block rate comparison

Task Monitor

The task monitor (webgym/environment/task_monitor.py) provides real-time visualization of parallel task execution during rollouts.

Key Features:

Real-time web dashboard with Server-Sent Events (SSE) updates at http://0.0.0.0:5000
Fine-grained operation tracking per task
Color-coded progress grid with automatic downsampling for >512 tasks
Non-blocking lock acquisition to avoid impacting task execution
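
The non-blocking lock acquisition mentioned above can be sketched with the standard library. The class below is hypothetical and is not taken from task_monitor.py. It assumes a status update is simply skipped when the lock is busy, so the reporting task never waits. The real monitor may instead defer or retry such updates.

```python
import threading


class BestEffortStatus:
    """Hypothetical status store that skips updates instead of blocking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def try_update(self, task_id, status):
        # acquire(blocking=False) returns immediately with True or False.
        if not self._lock.acquire(blocking=False):
            return False  # Lock is busy: give up rather than stall the task.
        try:
            self._status[task_id] = status
            return True
        finally:
            self._lock.release()

    def snapshot(self):
        # Readers may block briefly; only task-side updates must not wait.
        with self._lock:
            return dict(self._status)
```

The trade-off in this sketch is that the dashboard may occasionally show a slightly stale status in exchange for never delaying a running task.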

Usage:

from webgym.environment.task_monitor import TaskMonitor

# Create a monitor for 100 tasks of up to 10 steps each and enable the dashboard on port 5000
monitor = TaskMonitor(total_tasks=100, max_steps=10, enable_web_dashboard=True, web_port=5000)
monitor.start_monitoring()

# Report the lifecycle of a single task
monitor.start_task("task_0001", task_name="Example Task")
monitor.set_task_navigating("task_0001", url="https://example.com")
monitor.update_task_step("task_0001", step=1)
monitor.finish_task("task_0001", success=True)

# Shut the monitor down once all tasks are done
monitor.stop_monitoring()

Task Status Codes:

N: Not Started, W: Waiting, 0-9: Running (step number), F: Finished, X: Failed
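
As a small illustration of the status codes above, the following sketch maps a single-character code to its label. The names STATUS_LABELS and describe_status are hypothetical and are not part of the TaskMonitor API.

```python
# Labels copied from the documented status codes.
STATUS_LABELS = {
    "N": "Not Started",
    "W": "Waiting",
    "F": "Finished",
    "X": "Failed",
}


def describe_status(code: str) -> str:
    """Translate one status code into a human-readable label."""
    if len(code) == 1 and code in "0123456789":
        # Digits encode the step a running task is currently on.
        return f"Running (step {code})"
    try:
        return STATUS_LABELS[code]
    except KeyError:
        raise ValueError(f"unknown status code: {code!r}") from None
```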

Operation Codes:

Per-step: Xm metadata, Xs screenshot, Xt AC tree, Xa vLLM action, Xe executing
Per-task: Xn navigating, Xr reward

Integration: AsyncWebGym initializes the monitor automatically with enable_web_dashboard=True and web_port=5000.

API:

# Status updates
monitor.start_allocation_wait(task_id, task_name="")
monitor.start_task(task_id, task_name="")
monitor.update_task_step(task_id, step)
monitor.set_task_navigating(task_id, url="")
monitor.set_task_getting_metadata(task_id)
monitor.set_task_taking_screenshot(task_id, step=None)
monitor.set_task_getting_ac_tree(task_id)
monitor.set_task_getting_action(task_id)
monitor.set_task_executing_action(task_id, action="")
monitor.set_task_computing_reward(task_id)
monitor.set_task_normal_phase(task_id)
monitor.finish_task(task_id, success=True, error_message="")

# Queries
summary = monitor.get_progress_summary()
snapshot = monitor.get_status_snapshot()
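
When calling the API by hand, it is easy to forget finish_task on an error path. The sketch below wraps start_task and finish_task in a context manager. monitored_task is a hypothetical helper, not part of webgym. It relies only on the two method signatures listed above and works with any object that provides them.

```python
from contextlib import contextmanager


@contextmanager
def monitored_task(monitor, task_id, task_name=""):
    """Hypothetical helper: report start and finish around a block of work."""
    monitor.start_task(task_id, task_name=task_name)
    try:
        yield monitor
    except Exception as exc:
        # Mark the task as failed, then let the caller see the error.
        monitor.finish_task(task_id, success=False, error_message=str(exc))
        raise
    else:
        monitor.finish_task(task_id, success=True)
```

Inside the with block, the per-step calls such as update_task_step and set_task_taking_screenshot can be made as usual on the yielded monitor.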