# GPT-4o-mini OSWorld Benchmark Evaluation

This repository contains a comprehensive evaluation of GPT-4o-mini on the OSWorld benchmark for multimodal desktop automation tasks.

📄 Read the Full Evaluation Report (REPORT.md) for a comprehensive analysis of the baseline and accessibility tree experiments.

## 📋 Overview

**Project Goal:** Evaluate GPT-4o-mini's capability as a computer-use agent on real desktop tasks using the OSWorld benchmark.

**OSWorld** is a benchmark for testing multimodal agents on open-ended tasks in real computer environments (an Ubuntu desktop with actual applications).

**GPT-4o-mini** is OpenAI's efficient multimodal model with vision capabilities.

πŸ—οΈ Project Structure

```
.
├── agents/                           # Agent implementations
│   ├── baseline/
│   │   └── osworld_gpt4o_mini_agent.py       # Screenshot-only agent
│   └── a11y_experiment/
│       └── osworld_gpt4o_mini_agent_a11y.py  # Accessibility tree agent
├── scripts/                          # Evaluation and analysis scripts
│   ├── run_evaluation.py            # Baseline evaluation runner
│   ├── run_evaluation_a11y.py       # A11y evaluation runner
│   ├── test_single_task.py          # Single-task test runner
│   ├── analyze_results.py           # Results analysis tool
│   └── compare_a11y_results.py      # Baseline vs. A11y comparison
├── logs/                             # Evaluation execution logs
│   ├── full_evaluation.log          # Baseline run (4,907 lines)
│   └── a11y_evaluation.log          # A11y run (4,045 lines)
├── results/                          # Evaluation results
│   ├── gpt4o_mini_osworld_style/    # Baseline results
│   ├── gpt4o_mini_a11y/             # A11y experiment results
│   └── {domain}/{task_id}/
│       ├── traj.jsonl
│       ├── result.txt
│       └── step_*.png
├── cache/                            # Task setup files
├── vmware_vm_data/                   # VM snapshots (excluded from git)
├── requirements.txt                  # Python dependencies
├── selected_tasks.json               # 10 curated evaluation tasks
├── REPORT.md                         # Comprehensive evaluation report
├── README.md                         # This file
└── interaction_records.md            # Development log
```

## 🚀 Quick Start

### Prerequisites

1. **OSWorld Environment:**
   - OSWorld repository at /Users/suraj/Desktop/OSWorld
   - VMware Fusion installed and configured
   - Conda environment `osworld` set up
   - VM tested and working
2. **API Key:**

   ```bash
   export OPENAI_API_KEY='your-api-key-here'
   ```
3. **Dependencies:**

   ```bash
   cd "/Users/suraj/Desktop/OSWorld Benchmark GPT-4o-mini Eval"
   pip install -r requirements.txt
   ```

### Run Evaluation

**Baseline (Screenshot Only):**

```bash
python scripts/run_evaluation.py \
  --provider_name vmware \
  --task_list selected_tasks.json \
  --max_steps 15 \
  --result_dir ./results/gpt4o_mini_osworld_style
```

**With Accessibility Tree Overlay:**

```bash
python scripts/run_evaluation_a11y.py \
  --provider_name vmware \
  --task_list selected_tasks.json \
  --max_steps 15 \
  --result_dir ./results/gpt4o_mini_a11y
```

**Compare Results:**

```bash
python scripts/compare_a11y_results.py \
  --baseline ./results/gpt4o_mini_osworld_style \
  --a11y ./results/gpt4o_mini_a11y
```

## 📊 Task Selection

10 diverse tasks covering:

  • Web Browsing (Chrome): 2 tasks - Browser configuration
  • Office Suite: 2 tasks - Document & spreadsheet editing
  • File Operations (OS): 2 tasks - File recovery and management
  • Multimedia (VLC): 1 task - Video playback
  • Development (VS Code): 1 task - Code editor operations
  • Image Editing (GIMP): 1 task - Image manipulation
  • Email (Thunderbird): 1 task - Email client operations

## 🤖 Agent Architecture

We implemented two agent versions:

### 1. Baseline Agent (`agents/baseline/osworld_gpt4o_mini_agent.py`)

Vision-only approach:

  1. Input: Screenshot (1920x1080) + Task instruction
  2. Processing: GPT-4o-mini analyzes screen state
  3. Output: Structured JSON with thought, action, and stop flag
  4. Action Space: PyAutoGUI (mouse clicks, keyboard input)
  5. Execution: Actions executed in VM, new screenshot captured
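The per-step prediction above can be sketched as a single request-building function. This is an illustrative reconstruction, not the repository's code: the helper name, system prompt text, and five-action history window are assumptions; only the multimodal message shape of the Chat Completions API is standard.

```python
import base64

def build_step_messages(screenshot_png: bytes, instruction: str, history: list) -> list:
    """Assemble one chat-completions request: task text plus the current screenshot."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return [
        {"role": "system", "content": (
            "You control an Ubuntu desktop via pyautogui. Respond with JSON: "
            '{"thought": ..., "action": ..., "stop": ...}.')},
        {"role": "user", "content": [
            # Task instruction plus a short window of recent actions
            {"type": "text", "text": f"Task: {instruction}\nRecent actions: {history[-5:]}"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{image_b64}",
                "detail": "high",  # request high-detail image analysis
            }},
        ]},
    ]
```

The resulting list would then be passed as `messages` to `client.chat.completions.create(model="gpt-4o-mini", ...)`, and the reply parsed back into thought/action/stop.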

### 2. Accessibility Tree Agent (`agents/a11y_experiment/osworld_gpt4o_mini_agent_a11y.py`)

Enhanced with structured UI data:

  1. Input: Screenshot + Filtered Accessibility Tree + Task instruction
  2. Filtering: Extract only interactive elements (buttons, menus, text fields)
    • Raw XML: 50K+ tokens
    • Filtered text: 85-1125 elements, 1-2K tokens
    • Format: `[role] 'name' at (x,y) size: WxH state: enabled`
  3. Processing: GPT-4o-mini analyzes both visual and structural UI information
  4. Output: Same structured JSON format
  5. Execution: Same PyAutoGUI action space
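The filtering step described above can be sketched as follows, assuming an AT-SPI-style XML dump with `role`, `name`, and geometry attributes; the attribute names and role whitelist are illustrative, not the agent's actual parser.

```python
import xml.etree.ElementTree as ET

# Roles treated as interactive; the real agent's whitelist may differ.
INTERACTIVE_ROLES = {"push button", "menu item", "check box", "combo box", "text"}

def filter_tree(raw_xml: str) -> list:
    """Render interactive nodes one per line as
    [role] 'name' at (x, y) size: WxH state: enabled"""
    lines = []
    for node in ET.fromstring(raw_xml).iter():
        role = node.get("role", "")
        if role not in INTERACTIVE_ROLES:
            continue  # drop panels, fillers, decorations: this is the 50K->2K reduction
        state = "enabled" if node.get("enabled", "true") == "true" else "disabled"
        lines.append(
            f"[{role}] '{node.get('name', '')}' "
            f"at ({node.get('x', '0')}, {node.get('y', '0')}) "
            f"size: {node.get('w', '0')}x{node.get('h', '0')} state: {state}"
        )
    return lines
```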

Key Features (Both Versions):

  • Multimodal vision processing
  • Structured JSON output format
  • Action history tracking
  • Self-termination capability
  • Comprehensive error handling

A11y-Specific Features:

  • Semantic element identification from tree
  • Coordinate information embedded in filtered tree text
  • Token-optimized representation (50K→2K reduction)
  • Role-based filtering (only interactive elements)

## 📈 Results

Evaluation Complete (November 19, 2025)

📊 View the Full Evaluation Log (logs/full_evaluation.log; 4,907 lines — complete trace with screenshots, reasoning, and actions)

### Baseline (Screenshot Only)

  • Success Rate: 0/10 (0.00%)
  • Average Steps: 11.3 per task
  • Total Steps: 113 across all tasks
  • Tasks Reaching Max Steps: 6/10 (60%)
  • Closest to Success: Task 2 - Chrome cookies (85%), Task 8 - VS Code (70%), Task 9 - GIMP (65%)

Primary Failure Modes:

  1. Navigation loops (80%) - Repeating Tab sequences without progress
  2. Dialog traversal (70%) - Couldn't locate buttons in complex dialogs
  3. Workflow incompleteness (30%) - Action succeeded but save/export failed
  4. Task scope errors (10%) - Misunderstood target scope

### With Accessibility Tree (Screenshot + Filtered A11y Tree)

Evaluation Complete (November 20, 2025)

📊 View the A11y Evaluation Log (logs/a11y_evaluation.log; 4,045 lines — complete trace with filtered accessibility trees)

  • Success Rate: 0/10 (0.00%) - No improvement over baseline
  • Tasks Analyzed: 6/10 (Tasks 1-6 completed, 7-10 had VM failures)
  • Average Steps: 15 per task (all hit max steps)
  • Token Efficiency: Reduced raw tree from 50K→2K tokens via filtering

Key Findings:

  • ✅ Semantic Understanding: EXCELLENT — the agent correctly identified all target UI elements by name
  • ✅ Filtering Success: Reduced the accessibility tree from 50K+ tokens to 85-1125 interactive elements per screen
  • ❌ Coordinate Extraction: COMPLETE FAILURE — despite explicit coordinates in the tree, the agent always clicked (0,0) or generic positions
  • ❌ Infinite Loops: All tasks repeated identical failed actions 15 times without adaptation
  • Critical Insight: Providing data ≠ the agent using data. The bottleneck is extraction and application, not availability

**Example Pattern (All 6 Tasks):**

```
[INFO] Filtered 157 interactive elements from accessibility tree
[INFO] Tree contains: [push button] 'Search engine' at (320, 180) size: 150x30
[INFO] Agent Response: "I see the 'Search engine' menu item. I will click it."
[INFO] Agent Action: pyautogui.click(0, 0)  # ❌ Should use (320, 180)!
[INFO] Reward: 0.00
# Repeats 15 times without change
```

**Comparison with Baseline:**

| Aspect | Baseline | A11y Tree | Improvement |
| --- | --- | --- | --- |
| Element Identification | ⚠️ Poor (guessing) | ✅ Excellent (named) | +100% |
| Coordinate Accuracy | ❌ Imprecise (vision) | ❌ Wrong (not extracted) | ±0% |
| Success Rate | 0% | 0% | ±0% |
| Action Loops | Moderate | Severe (15×) | -20% |

See REPORT.md Section 6 for a detailed analysis of the accessibility tree experiment.

## 📚 Documentation

See REPORT.md for the comprehensive evaluation report and interaction_records.md for the development log.

πŸ” Output Format

After running evaluation:

```json
{
  "task_id": "...",
  "domain": "chrome",
  "instruction": "...",
  "success": true,
  "eval_score": 1.0,
  "num_steps": 8,
  "elapsed_time": 45.2,
  "trajectory_path": "..."
}
```

Each task generates:

  • JSON trajectory with all steps
  • Screenshots at each step
  • Thoughts and actions taken
  • Success/failure status
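Given the per-task layout above (`result.txt` holding the evaluator score, `traj.jsonl` holding one step per line), aggregate statistics can be computed with a short script in the spirit of `analyze_results.py`. This sketch assumes that file layout and is not the repository's actual tool.

```python
from pathlib import Path

def summarize(result_dir: str) -> dict:
    """Aggregate success rate and step counts from results/{domain}/{task_id}/."""
    scores, steps = [], []
    for result_file in Path(result_dir).glob("*/*/result.txt"):
        scores.append(float(result_file.read_text().strip()))  # evaluator score, 1.0 = success
        traj = result_file.with_name("traj.jsonl")
        if traj.exists():
            steps.append(sum(1 for _ in traj.open()))  # one JSON line per step
    n = len(scores)
    return {
        "tasks": n,
        "success_rate": sum(s >= 1.0 for s in scores) / n if n else 0.0,
        "avg_steps": sum(steps) / len(steps) if steps else 0.0,
    }
```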

## 🎯 Evaluation Methodology

  1. Environment: Fresh VM snapshot for each task
  2. Agent: GPT-4o-mini with vision (temperature=0.5)
  3. Action Space: PyAutoGUI for direct control
  4. Termination: Max 15 steps or agent signals completion
  5. Evaluation: OSWorld's automated task-specific evaluators

## 🔧 Agent Implementation

Core Components:

  1. Vision Processing:

    • Screenshot encoding to base64
    • High-detail image analysis
    • Image history management
  2. Prompt Engineering:

    • Detailed system instructions
    • PyAutoGUI API documentation
    • Structured output schema
    • Step-by-step reasoning
  3. Action Parsing:

    • JSON extraction from responses
    • Fallback regex parsing
    • Validation and error handling
  4. Execution Loop:

    • Screenshot → Predict → Execute → Repeat
    • 3-second pause between actions
    • Trajectory recording
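Component 3 (action parsing) can be sketched as follows; the function name and exact fallback pattern are assumptions, but the shape follows the JSON-first, regex-fallback strategy described above.

```python
import json
import re

def parse_action(response_text: str) -> dict:
    """Extract {thought, action, stop} from a model reply, with a regex fallback."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match:
        try:
            obj = json.loads(match.group(0))
            if "action" in obj:
                return {"thought": obj.get("thought", ""),
                        "action": obj["action"],
                        "stop": bool(obj.get("stop", False))}
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to regex scraping
    # Fallback: scrape a bare pyautogui call out of free-form text
    call = re.search(r"pyautogui\.\w+\([^)]*\)", response_text)
    return {"thought": "", "action": call.group(0) if call else "", "stop": False}
```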

πŸ“ Task Examples

### Task 1: Change Search Engine (Chrome)

  • Instruction: "Make Bing the main search engine"
  • Domain: Web browser configuration
  • Complexity: Medium (5-8 steps)

### Task 2: Recover File (OS)

  • Instruction: "Recover deleted poster from Trash"
  • Domain: File operations
  • Complexity: Simple (2-4 steps)

### Task 3: Play Video (VLC)

  • Instruction: "Play music video from desktop"
  • Domain: Multimedia application
  • Complexity: Simple (2-3 steps)

## 🚨 Known Limitations

  1. Vision-only: No accessibility tree or DOM access
  2. Coordinate-based: Requires precise pixel coordinates
  3. Sequential: No parallel task decomposition
  4. No verification: Cannot detect action failures
  5. Limited memory: Only recent action history

## 🔬 Future Improvements

Prompt Engineering:

  • Add few-shot examples
  • Include error recovery strategies
  • Chain-of-thought for complex tasks

Tool Integration:

  • Accessibility tree parsing
  • OCR for text detection
  • Object detection for UI elements

System Design:

  • Multi-agent architecture
  • Closed-loop verification
  • State tracking and monitoring

## 📖 References

  1. OSWorld Paper: Xie, T., et al. (2024). "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv:2404.07972

  2. OSWorld GitHub: https://github.com/xlang-ai/OSWorld

  3. GPT-4o-mini: https://platform.openai.com/docs/models/gpt-4o-mini

## 📧 Project Information

**Created:** November 18, 2025
**Baseline Evaluation:** November 19, 2025 (0/10 success)
**A11y Evaluation:** November 20, 2025 (0/10 success)
**Purpose:** Research evaluation of GPT-4o-mini on desktop automation tasks with and without accessibility tree data
**Status:** Both evaluations complete. Key finding: semantic understanding ≠ coordinate execution

## 🔬 Key Research Insights

  1. Vision-Only Limitation: Pure screenshot analysis is insufficient for precise GUI coordinate prediction
  2. Data Availability ≠ Data Usage: Providing structured UI data (accessibility trees) improves element identification but doesn't solve coordinate extraction
  3. Extraction Gap: LLMs struggle to pull structured values back out of text embedded in their prompts; architectural changes are needed beyond prompt engineering
  4. Recommended Approach: Hybrid systems with dedicated grounding modules or direct API access to accessibility trees, rather than LLM-mediated coordinate extraction
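As a concrete illustration of insight 4, coordinate lookup can be made deterministic: the model only names the target element, and a plain regex (not the LLM) pulls the coordinates out of the filtered tree text. The helper below is a hypothetical sketch, assuming the tree line format shown earlier.

```python
import re

def ground_click(element_name, filtered_tree):
    """Find an element's center in lines like:
    [push button] 'Search engine' at (320, 180) size: 150x30 state: enabled
    Returns (x, y) or None, so the LLM never has to copy coordinates itself."""
    pattern = (r"\[[^\]]+\] '" + re.escape(element_name) +
               r"' at \((\d+),\s*(\d+)\) size: (\d+)x(\d+)")
    m = re.search(pattern, filtered_tree)
    if m is None:
        return None
    x, y, w, h = map(int, m.groups())
    return (x + w // 2, y + h // 2)  # click the element's center
```

With this in place, the failure pattern above ("I will click it" followed by `pyautogui.click(0, 0)`) would instead resolve 'Search engine' to its actual on-screen center.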

πŸ™ Acknowledgments

  • OSWorld team for the benchmark framework
  • OpenAI for GPT-4o-mini API access
  • Desktop automation community for PyAutoGUI

Evaluation Complete: See REPORT.md for detailed analysis and recommendations.

## About

Evaluation of GPT-4o-mini on the OSWorld desktop automation benchmark, comparing screenshot-only and accessibility-tree-enhanced approaches across 10 tasks (Chrome, LibreOffice, file ops, etc.). Documents critical coordinate extraction failures and provides architectural recommendations for GUI agents.
