Viewer ====== The analysis tools are located in ``analysis/``: .. code-block:: text analysis/ ├── view_trajs.py # Trajectory viewer (Gradio web interface) ├── visualize_results.py # Training/test metrics visualizer (matplotlib) └── font_manager.py # Font downloading and caching for Unicode support Trajectory Viewer ----------------- A Gradio-based web interface for inspecting agent trajectories step-by-step. .. code-block:: bash python analysis/view_trajs.py [OPTIONS] **Arguments:** * ``split`` (required): ``train`` or ``test`` * ``--data-path `` (required): Data directory path * ``--log-path `` (required): Logs directory path * ``--show-prompt``: Display full prompt and response for each step * ``--position ``: Load first or last N iterations (with ``--num-iterations``) * ``--num-iterations ``: Number of iterations to load **Examples:** .. code-block:: bash python analysis/view_trajs.py train --data-path /home/v-baihao/data --log-path /home/v-baihao/logs python analysis/view_trajs.py train --data-path /home/v-baihao/data --log-path /home/v-baihao/logs --position last --num-iterations 5 --show-prompt **Features:** * Interactive step navigation with screenshot display * Action coordinate visualization (red dots on screenshots) * Task metadata display (difficulty, domain, subdomain, website) * Reward and evaluation info per trajectory * Accessibility tree inspection * Submission judgment display **Interface Panels:** .. image:: ../../figures/viewer_1.png :alt: Viewer metadata panel The metadata panel shows trajectory ID, task description, difficulty, domain, website, evaluator reference, and accessibility tree. .. image:: ../../figures/viewer_2.png :alt: Step-by-step screenshots Side-by-side screenshots of consecutive steps with action coordinates, submission status, and submission judgments. .. image:: ../../figures/viewer_3.png :alt: Answer and evaluation panel Agent's final answer with claim-by-claim Criterion B verification against screenshots. .. image:: ../../figures/viewer_4.png :alt: Step-specific prompt details (with --show-prompt) Full model prompt in JSON format (system message, image inputs, task instructions). .. image:: ../../figures/viewer_5.png :alt: Raw model response Model output showing memory state, progress tracker, intention, action, and submission judgment. Visualizer ========== Generates matplotlib plots for training progress and test performance. .. code-block:: bash python analysis/visualize_results.py [OPTIONS] **Options:** * ``--data-path `` (required): Data directory path * ``--log-path `` (required): Logs directory path * ``--mode ``: ``train-only``, ``test-only``, or ``train-test`` (default: ``train-test``) * ``--ema ``: EMA smoothing factor (default: 1.0, no smoothing) * ``--run ``: Specific run name to visualize **Output:** Generates ``metrics.png`` with a 3x8 grid (24 subplots). Difficulty groups are color-coded: Easy (green), Medium (orange), Hard (red), Overall (black). .. image:: ../../figures/visualizer_example.png :alt: Example metrics.png output *Row 1 - Training:* Success rate, avg chars/response, avg memory chars, avg steps, % GoBack actions, samples collected, % same screenshot steps, % same screenshot (success only) *Row 2 - OOD Test:* Same metrics as Row 1 but for out-of-distribution test set, plus step-limited success rate *Row 3 - Diversity/Error:* Tasks seen before, tasks with websites seen before, duplicate websites, train/test crash rates, block rate comparison Task Monitor ============ The task monitor (``webgym/environment/task_monitor.py``) provides real-time visualization of parallel task execution during rollouts. .. image:: ../../figures/task_monitor.png :alt: Task Monitor dashboard **Key Features:** * Real-time web dashboard with SSE updates at ``http://0.0.0.0:5000`` * Fine-grained operation tracking per task * Color-coded progress grid with automatic downsampling for >512 tasks * Non-blocking lock acquisition to avoid impacting task execution **Usage:** .. code-block:: python from webgym.environment.task_monitor import TaskMonitor monitor = TaskMonitor(total_tasks=100, max_steps=10, enable_web_dashboard=True, web_port=5000) monitor.start_monitoring() monitor.start_task("task_0001", task_name="Example Task") monitor.set_task_navigating("task_0001", url="https://example.com") monitor.update_task_step("task_0001", step=1) monitor.finish_task("task_0001", success=True) monitor.stop_monitoring() **Task Status Codes:** * ``N``: Not Started, ``W``: Waiting, ``0-9``: Running (step number), ``F``: Finished, ``X``: Failed **Operation Codes** (per-step: ``Xm`` metadata, ``Xs`` screenshot, ``Xt`` AC tree, ``Xa`` vLLM action, ``Xe`` executing; per-task: ``Xn`` navigating, ``Xr`` reward): **Integration:** Automatically initialized by ``AsyncWebGym`` with ``enable_web_dashboard=True`` and ``web_port=5000``. **API:** .. code-block:: python # Status updates monitor.start_allocation_wait(task_id, task_name="") monitor.start_task(task_id, task_name="") monitor.update_task_step(task_id, step) monitor.set_task_navigating(task_id, url="") monitor.set_task_getting_metadata(task_id) monitor.set_task_taking_screenshot(task_id, step=None) monitor.set_task_getting_ac_tree(task_id) monitor.set_task_getting_action(task_id) monitor.set_task_executing_action(task_id, action="") monitor.set_task_computing_reward(task_id) monitor.set_task_normal_phase(task_id) monitor.finish_task(task_id, success=True, error_message="") # Queries summary = monitor.get_progress_summary() snapshot = monitor.get_status_snapshot()