# GPT-4o-mini OSWorld Benchmark Evaluation

This repository contains a comprehensive evaluation of GPT-4o-mini on the OSWorld benchmark for multimodal desktop automation tasks.

📄 Read the Full Evaluation Report (REPORT.md) for a comprehensive analysis of the baseline and accessibility tree experiments.

## 📋 Overview

**Project Goal:** Evaluate GPT-4o-mini's capability as a computer-use agent on real desktop tasks using the OSWorld benchmark.

**OSWorld** is a benchmark for testing multimodal agents on open-ended tasks in real computer environments (an Ubuntu desktop with actual applications).

**GPT-4o-mini** is OpenAI's efficient multimodal model with vision capabilities.

πŸ—οΈ Project Structure

```
.
├── agents/                           # Agent implementations
│   ├── baseline/
│   │   └── osworld_gpt4o_mini_agent.py       # Screenshot-only agent
│   └── a11y_experiment/
│       └── osworld_gpt4o_mini_agent_a11y.py  # Accessibility tree agent
├── scripts/                          # Evaluation and analysis scripts
│   ├── run_evaluation.py            # Baseline evaluation runner
│   ├── run_evaluation_a11y.py       # A11y evaluation runner
│   ├── test_single_task.py          # Single-task test runner
│   ├── analyze_results.py           # Results analysis tool
│   └── compare_a11y_results.py      # Baseline vs. A11y comparison
├── logs/                             # Evaluation execution logs
│   ├── full_evaluation.log          # Baseline run (4,907 lines)
│   └── a11y_evaluation.log          # A11y run (4,045 lines)
├── results/                          # Evaluation results
│   ├── gpt4o_mini_osworld_style/    # Baseline results
│   ├── gpt4o_mini_a11y/             # A11y experiment results
│   └── {domain}/{task_id}/
│       ├── traj.jsonl
│       ├── result.txt
│       └── step_*.png
├── cache/                            # Task setup files
├── vmware_vm_data/                   # VM snapshots (excluded from git)
├── requirements.txt                  # Python dependencies
├── selected_tasks.json               # 10 curated evaluation tasks
├── REPORT.md                         # Comprehensive evaluation report
├── README.md                         # This file
└── interaction_records.md            # Development log
```

## 🚀 Quick Start

### Prerequisites

1. **OSWorld Environment:**
   - OSWorld repository at /Users/suraj/Desktop/OSWorld
   - VMware Fusion installed and configured
   - Conda environment `osworld` set up
   - VM tested and working
2. **API Key:**

   ```bash
   export OPENAI_API_KEY='your-api-key-here'
   ```
3. **Dependencies:**

   ```bash
   cd "/Users/suraj/Desktop/OSWorld Benchmark GPT-4o-mini Eval"
   pip install -r requirements.txt
   ```

### Run Evaluation

**Baseline (Screenshot Only):**

```bash
python scripts/run_evaluation.py \
  --provider_name vmware \
  --task_list selected_tasks.json \
  --max_steps 15 \
  --result_dir ./results/gpt4o_mini_osworld_style
```

**With Accessibility Tree Overlay:**

```bash
python scripts/run_evaluation_a11y.py \
  --provider_name vmware \
  --task_list selected_tasks.json \
  --max_steps 15 \
  --result_dir ./results/gpt4o_mini_a11y
```

**Compare Results:**

```bash
python scripts/compare_a11y_results.py \
  --baseline ./results/gpt4o_mini_osworld_style \
  --a11y ./results/gpt4o_mini_a11y
```

## 📊 Task Selection

10 diverse tasks covering:

  • Web Browsing (Chrome): 2 tasks - Browser configuration
  • Office Suite: 2 tasks - Document & spreadsheet editing
  • File Operations (OS): 2 tasks - File recovery and management
  • Multimedia (VLC): 1 task - Video playback
  • Development (VS Code): 1 task - Code editor operations
  • Image Editing (GIMP): 1 task - Image manipulation
  • Email (Thunderbird): 1 task - Email client operations

## 🤖 Agent Architecture

We implemented two agent versions:

### 1. Baseline Agent (`agents/baseline/osworld_gpt4o_mini_agent.py`)

Vision-only approach:

  1. Input: Screenshot (1920x1080) + Task instruction
  2. Processing: GPT-4o-mini analyzes screen state
  3. Output: Structured JSON with thought, action, and stop flag
  4. Action Space: PyAutoGUI (mouse clicks, keyboard input)
  5. Execution: Actions executed in VM, new screenshot captured
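The per-step prediction above can be sketched as a single request-building function. This is an illustrative reconstruction, not the repository's code: the helper name, system prompt text, and five-action history window are assumptions; only the multimodal message shape of the Chat Completions API is standard.

```python
import base64

def build_step_messages(screenshot_png: bytes, instruction: str, history: list) -> list:
    """Assemble one chat-completions request: task text plus the current screenshot."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return [
        {"role": "system", "content": (
            "You control an Ubuntu desktop via pyautogui. Respond with JSON: "
            '{"thought": ..., "action": ..., "stop": ...}.')},
        {"role": "user", "content": [
            # Task instruction plus a short window of recent actions
            {"type": "text", "text": f"Task: {instruction}\nRecent actions: {history[-5:]}"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{image_b64}",
                "detail": "high",  # request high-detail image analysis
            }},
        ]},
    ]
```

The resulting list would then be passed as `messages` to `client.chat.completions.create(model="gpt-4o-mini", ...)`, and the reply parsed back into thought/action/stop.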

### 2. Accessibility Tree Agent (`agents/a11y_experiment/osworld_gpt4o_mini_agent_a11y.py`)

Enhanced with structured UI data:

  1. Input: Screenshot + Filtered Accessibility Tree + Task instruction
  2. Filtering: Extract only interactive elements (buttons, menus, text fields)
    • Raw XML: 50K+ tokens
    • Filtered text: 85-1125 elements, 1-2K tokens
    • Format: `[role] 'name' at (x,y) size: WxH state: enabled`
  3. Processing: GPT-4o-mini analyzes both visual and structural UI information
  4. Output: Same structured JSON format
  5. Execution: Same PyAutoGUI action space
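The filtering step described above can be sketched as follows, assuming an AT-SPI-style XML dump with `role`, `name`, and geometry attributes; the attribute names and role whitelist are illustrative, not the agent's actual parser.

```python
import xml.etree.ElementTree as ET

# Roles treated as interactive; the real agent's whitelist may differ.
INTERACTIVE_ROLES = {"push button", "menu item", "check box", "combo box", "text"}

def filter_tree(raw_xml: str) -> list:
    """Render interactive nodes one per line as
    [role] 'name' at (x, y) size: WxH state: enabled"""
    lines = []
    for node in ET.fromstring(raw_xml).iter():
        role = node.get("role", "")
        if role not in INTERACTIVE_ROLES:
            continue  # drop panels, fillers, decorations: this is the 50K->2K reduction
        state = "enabled" if node.get("enabled", "true") == "true" else "disabled"
        lines.append(
            f"[{role}] '{node.get('name', '')}' "
            f"at ({node.get('x', '0')}, {node.get('y', '0')}) "
            f"size: {node.get('w', '0')}x{node.get('h', '0')} state: {state}"
        )
    return lines
```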

Key Features (Both Versions):

  • Multimodal vision processing
  • Structured JSON output format
  • Action history tracking
  • Self-termination capability
  • Comprehensive error handling

A11y-Specific Features:

  • Semantic element identification from tree
  • Coordinate information embedded in filtered tree text
  • Token-optimized representation (50K→2K reduction)
  • Role-based filtering (only interactive elements)

## 📈 Results

Evaluation Complete (November 19, 2025)

📊 View the Full Evaluation Log (logs/full_evaluation.log; 4,907 lines — complete trace with screenshots, reasoning, and actions)

### Baseline (Screenshot Only)

  • Success Rate: 0/10 (0.00%)
  • Average Steps: 11.3 per task
  • Total Steps: 113 across all tasks
  • Tasks Reaching Max Steps: 6/10 (60%)
  • Closest to Success: Task 2 - Chrome cookies (85%), Task 8 - VS Code (70%), Task 9 - GIMP (65%)

Primary Failure Modes:

  1. Navigation loops (80%) - Repeating Tab sequences without progress
  2. Dialog traversal (70%) - Couldn't locate buttons in complex dialogs
  3. Workflow incompleteness (30%) - Action succeeded but save/export failed
  4. Task scope errors (10%) - Misunderstood target scope

### With Accessibility Tree (Screenshot + Filtered A11y Tree)

Evaluation Complete (November 20, 2025)

📊 View the A11y Evaluation Log (logs/a11y_evaluation.log; 4,045 lines — complete trace with filtered accessibility trees)

  • Success Rate: 0/10 (0.00%) - No improvement over baseline
  • Tasks Analyzed: 6/10 (Tasks 1-6 completed, 7-10 had VM failures)
  • Average Steps: 15 per task (all hit max steps)
  • Token Efficiency: Reduced raw tree from 50K→2K tokens via filtering

Key Findings:

  • ✅ Semantic Understanding: EXCELLENT — the agent correctly identified all target UI elements by name
  • ✅ Filtering Success: Reduced the accessibility tree from 50K+ tokens to 85-1125 interactive elements per screen
  • ❌ Coordinate Extraction: COMPLETE FAILURE — despite explicit coordinates in the tree, the agent always clicked (0,0) or generic positions
  • ❌ Infinite Loops: All tasks repeated identical failed actions 15 times without adaptation
  • Critical Insight: Providing data ≠ the agent using data. The bottleneck is extraction and application, not availability

**Example Pattern (All 6 Tasks):**

```
[INFO] Filtered 157 interactive elements from accessibility tree
[INFO] Tree contains: [push button] 'Search engine' at (320, 180) size: 150x30
[INFO] Agent Response: "I see the 'Search engine' menu item. I will click it."
[INFO] Agent Action: pyautogui.click(0, 0)  # ❌ Should use (320, 180)!
[INFO] Reward: 0.00
# Repeats 15 times without change
```

**Comparison with Baseline:**

| Aspect | Baseline | A11y Tree | Improvement |
| --- | --- | --- | --- |
| Element Identification | ⚠️ Poor (guessing) | ✅ Excellent (named) | +100% |
| Coordinate Accuracy | ❌ Imprecise (vision) | ❌ Wrong (not extracted) | ±0% |
| Success Rate | 0% | 0% | ±0% |
| Action Loops | Moderate | Severe (15×) | -20% |

See REPORT.md Section 6 for a detailed analysis of the accessibility tree experiment.

## 📚 Documentation

See REPORT.md for the comprehensive evaluation report and interaction_records.md for the development log.

πŸ” Output Format

After running evaluation:

```json
{
  "task_id": "...",
  "domain": "chrome",
  "instruction": "...",
  "success": true,
  "eval_score": 1.0,
  "num_steps": 8,
  "elapsed_time": 45.2,
  "trajectory_path": "..."
}
```

Each task generates:

  • JSON trajectory with all steps
  • Screenshots at each step
  • Thoughts and actions taken
  • Success/failure status
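Given the per-task layout above (`result.txt` holding the evaluator score, `traj.jsonl` holding one step per line), aggregate statistics can be computed with a short script in the spirit of `analyze_results.py`. This sketch assumes that file layout and is not the repository's actual tool.

```python
from pathlib import Path

def summarize(result_dir: str) -> dict:
    """Aggregate success rate and step counts from results/{domain}/{task_id}/."""
    scores, steps = [], []
    for result_file in Path(result_dir).glob("*/*/result.txt"):
        scores.append(float(result_file.read_text().strip()))  # evaluator score, 1.0 = success
        traj = result_file.with_name("traj.jsonl")
        if traj.exists():
            steps.append(sum(1 for _ in traj.open()))  # one JSON line per step
    n = len(scores)
    return {
        "tasks": n,
        "success_rate": sum(s >= 1.0 for s in scores) / n if n else 0.0,
        "avg_steps": sum(steps) / len(steps) if steps else 0.0,
    }
```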

## 🎯 Evaluation Methodology

  1. Environment: Fresh VM snapshot for each task
  2. Agent: GPT-4o-mini with vision (temperature=0.5)
  3. Action Space: PyAutoGUI for direct control
  4. Termination: Max 15 steps or agent signals completion
  5. Evaluation: OSWorld's automated task-specific evaluators

## 🔧 Agent Implementation

Core Components:

  1. Vision Processing:

    • Screenshot encoding to base64
    • High-detail image analysis
    • Image history management
  2. Prompt Engineering:

    • Detailed system instructions
    • PyAutoGUI API documentation
    • Structured output schema
    • Step-by-step reasoning
  3. Action Parsing:

    • JSON extraction from responses
    • Fallback regex parsing
    • Validation and error handling
  4. Execution Loop:

    • Screenshot → Predict → Execute → Repeat
    • 3-second pause between actions
    • Trajectory recording
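Component 3 (action parsing) can be sketched as follows; the function name and exact fallback pattern are assumptions, but the shape follows the JSON-first, regex-fallback strategy described above.

```python
import json
import re

def parse_action(response_text: str) -> dict:
    """Extract {thought, action, stop} from a model reply, with a regex fallback."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match:
        try:
            obj = json.loads(match.group(0))
            if "action" in obj:
                return {"thought": obj.get("thought", ""),
                        "action": obj["action"],
                        "stop": bool(obj.get("stop", False))}
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to regex scraping
    # Fallback: scrape a bare pyautogui call out of free-form text
    call = re.search(r"pyautogui\.\w+\([^)]*\)", response_text)
    return {"thought": "", "action": call.group(0) if call else "", "stop": False}
```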

πŸ“ Task Examples

### Task 1: Change Search Engine (Chrome)

  • Instruction: "Make Bing the main search engine"
  • Domain: Web browser configuration
  • Complexity: Medium (5-8 steps)

### Task 2: Recover File (OS)

  • Instruction: "Recover deleted poster from Trash"
  • Domain: File operations
  • Complexity: Simple (2-4 steps)

### Task 3: Play Video (VLC)

  • Instruction: "Play music video from desktop"
  • Domain: Multimedia application
  • Complexity: Simple (2-3 steps)

## 🚨 Known Limitations

  1. Vision-only: No accessibility tree or DOM access
  2. Coordinate-based: Requires precise pixel coordinates
  3. Sequential: No parallel task decomposition
  4. No verification: Cannot detect action failures
  5. Limited memory: Only recent action history

## 🔬 Future Improvements

Prompt Engineering:

  • Add few-shot examples
  • Include error recovery strategies
  • Chain-of-thought for complex tasks

Tool Integration:

  • Accessibility tree parsing
  • OCR for text detection
  • Object detection for UI elements

System Design:

  • Multi-agent architecture
  • Closed-loop verification
  • State tracking and monitoring

## 📖 References

  1. OSWorld Paper: Xie, T., et al. (2024). "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv:2404.07972

  2. OSWorld GitHub: https://github.com/xlang-ai/OSWorld

  3. GPT-4o-mini: https://platform.openai.com/docs/models/gpt-4o-mini

## 📧 Project Information

**Created:** November 18, 2025
**Baseline Evaluation:** November 19, 2025 (0/10 success)
**A11y Evaluation:** November 20, 2025 (0/10 success)
**Purpose:** Research evaluation of GPT-4o-mini on desktop automation tasks with and without accessibility tree data
**Status:** Both evaluations complete. Key finding: semantic understanding ≠ coordinate execution

## 🔬 Key Research Insights

  1. Vision-Only Limitation: Pure screenshot analysis is insufficient for precise GUI coordinate prediction
  2. Data Availability ≠ Data Usage: Providing structured UI data (accessibility trees) improves element identification but doesn't solve coordinate extraction
  3. Extraction Gap: LLMs struggle to pull structured values back out of text embedded in their prompts; architectural changes are needed beyond prompt engineering
  4. Recommended Approach: Hybrid systems with dedicated grounding modules or direct API access to accessibility trees, rather than LLM-mediated coordinate extraction
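As a concrete illustration of insight 4, coordinate lookup can be made deterministic: the model only names the target element, and a plain regex (not the LLM) pulls the coordinates out of the filtered tree text. The helper below is a hypothetical sketch, assuming the tree line format shown earlier.

```python
import re

def ground_click(element_name, filtered_tree):
    """Find an element's center in lines like:
    [push button] 'Search engine' at (320, 180) size: 150x30 state: enabled
    Returns (x, y) or None, so the LLM never has to copy coordinates itself."""
    pattern = (r"\[[^\]]+\] '" + re.escape(element_name) +
               r"' at \((\d+),\s*(\d+)\) size: (\d+)x(\d+)")
    m = re.search(pattern, filtered_tree)
    if m is None:
        return None
    x, y, w, h = map(int, m.groups())
    return (x + w // 2, y + h // 2)  # click the element's center
```

With this in place, the failure pattern above ("I will click it" followed by `pyautogui.click(0, 0)`) would instead resolve 'Search engine' to its actual on-screen center.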

πŸ™ Acknowledgments

  • OSWorld team for the benchmark framework
  • OpenAI for GPT-4o-mini API access
  • Desktop automation community for PyAutoGUI

Evaluation Complete: See REPORT.md for detailed analysis and recommendations.

## About

Evaluation of GPT-4o-mini on the OSWorld desktop automation benchmark, comparing screenshot-only and accessibility-tree-enhanced approaches across 10 tasks (Chrome, LibreOffice, file ops, etc.). Documents critical coordinate extraction failures and provides architectural recommendations for GUI agents.
