This repository contains a comprehensive evaluation of GPT-4o-mini on the OSWorld benchmark for multimodal desktop automation tasks.
Read the Full Evaluation Report - Comprehensive analysis of baseline and accessibility tree experiments
Project Goal: Evaluate GPT-4o-mini's capability as a computer-use agent on real desktop tasks using the OSWorld benchmark.
OSWorld is a benchmark for testing multimodal agents on open-ended tasks in real computer environments (Ubuntu desktop with actual applications).
GPT-4o-mini is OpenAI's efficient multimodal model with vision capabilities.
```
.
├── agents/                          # Agent implementations
│   ├── baseline/
│   │   └── osworld_gpt4o_mini_agent.py       # Screenshot-only agent
│   └── a11y_experiment/
│       └── osworld_gpt4o_mini_agent_a11y.py  # Accessibility tree agent
├── scripts/                         # Evaluation and analysis scripts
│   ├── run_evaluation.py            # Baseline evaluation runner
│   ├── run_evaluation_a11y.py       # A11y evaluation runner
│   ├── test_single_task.py          # Single task test runner
│   ├── analyze_results.py           # Results analysis tool
│   └── compare_a11y_results.py      # Baseline vs A11y comparison
├── logs/                            # Evaluation execution logs
│   ├── full_evaluation.log          # Baseline run (4,907 lines)
│   └── a11y_evaluation.log          # A11y run (4,045 lines)
├── results/                         # Evaluation results
│   ├── gpt4o_mini_osworld_style/    # Baseline results
│   └── gpt4o_mini_a11y/             # A11y experiment results
│       └── {domain}/{task_id}/
│           ├── traj.jsonl
│           ├── result.txt
│           └── step_*.png
├── cache/                           # Task setup files
├── vmware_vm_data/                  # VM snapshots (excluded from git)
├── requirements.txt                 # Python dependencies
├── selected_tasks.json              # 10 curated evaluation tasks
├── REPORT.md                        # Comprehensive evaluation report
├── README.md                        # This file
└── interaction_records.md           # Development log
```
OSWorld Environment:
- OSWorld repository at `/Users/suraj/Desktop/OSWorld`
- VMware Fusion installed and configured
- Conda environment `osworld` set up
- VM tested and working
API Key:
```shell
export OPENAI_API_KEY='your-api-key-here'
```
Dependencies:
```shell
cd "/Users/suraj/Desktop/OSWorld Benchmark GPT-4o-mini Eval"
pip install -r requirements.txt
```
Baseline (Screenshot Only):
```shell
python scripts/run_evaluation.py \
    --provider_name vmware \
    --task_list selected_tasks.json \
    --max_steps 15 \
    --result_dir ./results/gpt4o_mini_osworld_style
```

With Accessibility Tree Overlay:
```shell
python scripts/run_evaluation_a11y.py \
    --provider_name vmware \
    --task_list selected_tasks.json \
    --max_steps 15 \
    --result_dir ./results/gpt4o_mini_a11y
```

Compare Results:
```shell
python scripts/compare_a11y_results.py \
    --baseline ./results/gpt4o_mini_osworld_style \
    --a11y ./results/gpt4o_mini_a11y
```

10 diverse tasks covering:
- Web Browsing (Chrome): 2 tasks - Browser configuration
- Office Suite: 2 tasks - Document & spreadsheet editing
- File Operations (OS): 2 tasks - File recovery and management
- Multimedia (VLC): 1 task - Video playback
- Development (VS Code): 1 task - Code editor operations
- Image Editing (GIMP): 1 task - Image manipulation
- Email (Thunderbird): 1 task - Email client operations
We implemented two agent versions:
Vision-only approach:
- Input: Screenshot (1920x1080) + Task instruction
- Processing: GPT-4o-mini analyzes screen state
- Output: Structured JSON with thought, action, and stop flag
- Action Space: PyAutoGUI (mouse clicks, keyboard input)
- Execution: Actions executed in VM, new screenshot captured
Enhanced with structured UI data:
- Input: Screenshot + Filtered Accessibility Tree + Task instruction
- Filtering: Extract only interactive elements (buttons, menus, text fields)
  - Raw XML: 50K+ tokens
  - Filtered text: 85-1125 elements, 1-2K tokens
- Format: `[role] 'name' at (x,y) size: WxH state: enabled`
- Processing: GPT-4o-mini analyzes both visual and structural UI information
- Output: Same structured JSON format
- Execution: Same PyAutoGUI action space
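The filtering step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `filter_tree` helper, the XML attribute names, and the set of interactive roles are all assumptions.

```python
import xml.etree.ElementTree as ET

# Roles treated as interactive (assumed set; the real agent may differ)
INTERACTIVE_ROLES = {"push button", "menu item", "text", "check box", "combo box"}

def filter_tree(raw_xml: str) -> list[str]:
    """Reduce a raw accessibility-tree XML dump to one line per
    interactive element, in the prompt format:
    [role] 'name' at (x,y) size: WxH state: enabled"""
    lines = []
    for node in ET.fromstring(raw_xml).iter():
        role = node.get("role", "")
        if role not in INTERACTIVE_ROLES:
            continue  # drop panels, frames, labels, etc.
        x, y = node.get("x", "?"), node.get("y", "?")
        w, h = node.get("w", "?"), node.get("h", "?")
        state = "enabled" if node.get("enabled", "true") == "true" else "disabled"
        lines.append(f"[{role}] '{node.get('name', '')}' at ({x},{y}) "
                     f"size: {w}x{h} state: {state}")
    return lines

xml = """<root>
  <node role="panel" name="Settings"/>
  <node role="push button" name="Search engine" x="320" y="180" w="150" h="30"/>
</root>"""
print(filter_tree(xml))
# → ["[push button] 'Search engine' at (320,180) size: 150x30 state: enabled"]
```

Only the button survives the filter; the non-interactive panel is dropped, which is where the 50K-to-2K token reduction comes from.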
Key Features (Both Versions):
- Multimodal vision processing
- Structured JSON output format
- Action history tracking
- Self-termination capability
- Comprehensive error handling
A11y-Specific Features:
- Semantic element identification from tree
- Coordinate information embedded in filtered tree text
- Token-optimized representation (50K→2K reduction)
- Role-based filtering (only interactive elements)
Evaluation Complete (November 19, 2025)
View Full Evaluation Log (4,907 lines - complete trace with screenshots, reasoning, and actions)
- Success Rate: 0/10 (0.00%)
- Average Steps: 11.3 per task
- Total Steps: 113 across all tasks
- Tasks Reaching Max Steps: 6/10 (60%)
- Closest to Success: Task 2 - Chrome cookies (85%), Task 8 - VS Code (70%), Task 9 - GIMP (65%)
Primary Failure Modes:
- Navigation loops (80%) - Repeating Tab sequences without progress
- Dialog traversal (70%) - Couldn't locate buttons in complex dialogs
- Workflow incompleteness (30%) - Action succeeded but save/export failed
- Task scope errors (10%) - Misunderstood target scope
Evaluation Complete (November 20, 2025)
View A11y Evaluation Log (4,045 lines - complete trace with filtered accessibility trees)
- Success Rate: 0/10 (0.00%) - No improvement over baseline
- Tasks Analyzed: 6/10 (Tasks 1-6 completed, 7-10 had VM failures)
- Average Steps: 15 per task (all hit max steps)
- Token Efficiency: Reduced raw tree from 50K→2K tokens via filtering
Key Findings:
- ✅ Semantic Understanding: EXCELLENT - Agent correctly identified all target UI elements by name
- ✅ Filtering Success: Reduced accessibility tree from 50K+ tokens to 85-1125 interactive elements (1-2K tokens) per screen
- ❌ Coordinate Extraction: COMPLETE FAILURE - Despite explicit coordinates in the tree, the agent always clicked (0,0) or generic positions
- ❌ Infinite Loops: All tasks repeated identical failed actions 15 times without adaptation
- Critical Insight: Providing data ≠ Agent using data. The bottleneck is extraction and application, not availability
Example Pattern (All 6 Tasks):
```
[INFO] Filtered 157 interactive elements from accessibility tree
[INFO] Tree contains: [push button] 'Search engine' at (320, 180) size: 150x30
[INFO] Agent Response: "I see the 'Search engine' menu item. I will click it."
[INFO] Agent Action: pyautogui.click(0, 0)  # ← Should use (320, 180)!
[INFO] Reward: 0.00
# Repeats 15 times without change
```
Comparison with Baseline:
| Aspect | Baseline | A11y Tree | Improvement |
|---|---|---|---|
| Element Identification | — | ✅ Excellent (named) | +100% |
| Coordinate Accuracy | ❌ Imprecise (vision) | ❌ Wrong (not extracted) | ±0% |
| Success Rate | 0% | 0% | ±0% |
| Action Loops | Moderate | Severe (15×) | -20% |
See REPORT.md Section 6 for detailed analysis of the accessibility tree experiment.
- SETUP_GUIDE.md - Detailed setup and troubleshooting
- REPORT.md - Comprehensive evaluation report with methodology
- interaction_records.md - Development log and actions taken
After running evaluation:
```json
{
  "task_id": "...",
  "domain": "chrome",
  "instruction": "...",
  "success": true,
  "eval_score": 1.0,
  "num_steps": 8,
  "elapsed_time": 45.2,
  "trajectory_path": "..."
}
```

Each task generates:
- JSON trajectory with all steps
- Screenshots at each step
- Thoughts and actions taken
- Success/failure status
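Aggregating these per-task outputs into a success rate could look like the sketch below. It assumes the `{domain}/{task_id}/` layout shown in the project structure and that each `result.txt` holds the task's numeric eval score; the `summarize` helper is illustrative, not the actual `analyze_results.py`.

```python
import json
from pathlib import Path

def summarize(result_dir: str) -> dict:
    """Walk {domain}/{task_id}/ folders and compute the success rate.
    Assumes each task folder holds a result.txt containing the eval
    score as a number (an assumption about the file's contents)."""
    scores = []
    for result_file in Path(result_dir).glob("*/*/result.txt"):
        scores.append(float(result_file.read_text().strip()))
    total = len(scores)
    successes = sum(1 for s in scores if s >= 1.0)
    return {
        "tasks": total,
        "successes": successes,
        "success_rate": successes / total if total else 0.0,
    }

print(json.dumps(summarize("./results/gpt4o_mini_osworld_style"), indent=2))
```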
- Environment: Fresh VM snapshot for each task
- Agent: GPT-4o-mini with vision (temperature=0.5)
- Action Space: PyAutoGUI for direct control
- Termination: Max 15 steps or agent signals completion
- Evaluation: OSWorld's automated task-specific evaluators
Core Components:

- Vision Processing:
  - Screenshot encoding to base64
  - High-detail image analysis
  - Image history management
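Encoding a screenshot for the vision model looks roughly like this (a sketch: the payload follows OpenAI's chat-completions image format, and the `encode_screenshot` helper name and file path are assumptions):

```python
import base64

def encode_screenshot(path: str) -> dict:
    """Return an image content part for a chat-completions message,
    using high detail so small UI elements stay legible."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"},
    }
```

The returned dict is appended to the user message's content list alongside the task instruction text.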
- Prompt Engineering:
  - Detailed system instructions
  - PyAutoGUI API documentation
  - Structured output schema
  - Step-by-step reasoning
- Action Parsing:
  - JSON extraction from responses
  - Fallback regex parsing
  - Validation and error handling
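Robust parsing of the model's reply might look like this. It is a sketch: the `thought`/`action`/`stop` field names follow the structured output schema described earlier, but the exact parsing code is an assumption.

```python
import json
import re

def parse_response(text: str) -> dict:
    """Extract the structured {thought, action, stop} object from a
    model reply, falling back to regex when the JSON is wrapped in
    prose or a markdown fence."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: grab the outermost {...} span in the text
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Last resort: treat the whole reply as a thought with no action
    return {"thought": text, "action": None, "stop": False}
```

The last-resort branch keeps the loop alive instead of crashing on an unparseable reply, which matters when a run spans 15 steps per task.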
- Execution Loop:
  - Screenshot → Predict → Execute → Repeat
  - 3-second pause between actions
  - Trajectory recording
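The loop itself, stripped of VM details, is a short skeleton; here `capture`, `predict`, and `execute` are placeholders for the real environment and model calls:

```python
import time

def run_episode(capture, predict, execute, max_steps=15, pause=3.0):
    """Screenshot -> Predict -> Execute -> Repeat, recording a trajectory.
    Stops when the agent sets its stop flag or max_steps is reached."""
    trajectory = []
    for step in range(max_steps):
        screenshot = capture()
        decision = predict(screenshot, trajectory)  # {thought, action, stop}
        trajectory.append({"step": step, "screenshot": screenshot, **decision})
        if decision.get("stop"):
            break  # agent self-terminated
        if decision.get("action"):
            execute(decision["action"])
        time.sleep(pause)  # let the UI settle before the next screenshot
    return trajectory
```

The default `max_steps=15` and 3-second pause mirror the methodology above; the trajectory list is what gets serialized to `traj.jsonl`.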
Task 1: Change Search Engine (Chrome)
- Instruction: "Make Bing the main search engine"
- Domain: Web browser configuration
- Complexity: Medium (5-8 steps)
Task 2: Recover File (OS)
- Instruction: "Recover deleted poster from Trash"
- Domain: File operations
- Complexity: Simple (2-4 steps)
Task 3: Play Video (VLC)
- Instruction: "Play music video from desktop"
- Domain: Multimedia application
- Complexity: Simple (2-3 steps)
- Vision-only: No accessibility tree or DOM access
- Coordinate-based: Requires precise pixel coordinates
- Sequential: No parallel task decomposition
- No verification: Cannot detect action failures
- Limited memory: Only recent action history
Prompt Engineering:
- Add few-shot examples
- Include error recovery strategies
- Chain-of-thought for complex tasks
Tool Integration:
- Accessibility tree parsing
- OCR for text detection
- Object detection for UI elements
System Design:
- Multi-agent architecture
- Closed-loop verification
- State tracking and monitoring
- OSWorld Paper: Xie, T., et al. (2024). "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv:2404.07972
- OSWorld GitHub: https://github.com/xlang-ai/OSWorld
- GPT-4o-mini: https://platform.openai.com/docs/models/gpt-4o-mini
Created: November 18, 2025
Baseline Evaluation: November 19, 2025 (0/10 success)
A11y Evaluation: November 20, 2025 (0/10 success)
Purpose: Research evaluation of GPT-4o-mini on desktop automation tasks with and without accessibility tree data
Status: Both evaluations complete - Key finding: Semantic understanding ≠ Coordinate execution
- Vision-Only Limitation: Pure screenshot analysis insufficient for precise GUI coordinate prediction
- Data Availability ≠ Data Usage: Providing structured UI data (accessibility trees) improves element identification but doesn't solve coordinate extraction
- Extraction Gap: LLMs struggle to parse structured data from natural language format - architectural changes needed beyond prompt engineering
- Recommended Approach: Hybrid systems with dedicated grounding modules or direct API access to accessibility trees, not LLM-mediated coordinate extraction
- OSWorld team for the benchmark framework
- OpenAI for GPT-4o-mini API access
- Desktop automation community for PyAutoGUI
Evaluation Complete: See REPORT.md for detailed analysis and recommendations.