OSWorld Benchmark Leaderboard
OSWorld: a first-of-its-kind, scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, with 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.
Leaderboard
Top 18 models on OSWorld Benchmark Leaderboard (scores from public evaluations).
- 1. Claude Opus 4.6: 72.7%
- 2. Claude Sonnet 4.6: 72.5%
- 3. Qwen3 VL 235B A22B Instruct: 66.7%
- 4. Claude Opus 4.5: 66.3%
- 5. GLM-5V-Turbo: 62.3%
- 6. Claude Sonnet 4.5: 61.4%
- 7. Claude Haiku 4.5: 50.7%
- 8. Qwen3 VL 32B Thinking: 41.0%
- 9. Qwen3 VL 235B A22B Thinking: 38.1%
- 10. Qwen3 VL 8B Thinking: 33.9%
- 10. Qwen3 VL 8B Instruct: 33.9%
- 12. Qwen3 VL 32B Instruct: 32.6%
- 13. Qwen3 VL 4B Thinking: 31.4%
- 14. Qwen3 VL 30B A3B Thinking: 30.6%
- 15. Qwen3 VL 30B A3B Instruct: 30.3%
- 16. Qwen3 VL 4B Instruct: 26.2%
- 17. Qwen2.5 VL 72B Instruct: 8.8%
- 18. Qwen2.5 VL 32B Instruct: 5.9%
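The list has two entries at rank 10 and none at rank 11. That pattern matches standard competition ranking, where tied scores share a rank and the next rank is skipped. The sketch below reproduces the numbering from the listed scores. It assumes the leaderboard uses this scheme, which the page does not state, and `competition_ranks` is a hypothetical helper name, not part of OSWorld.

```python
def competition_ranks(scores):
    """Return 1-based ranks for scores already sorted in descending order.

    Tied scores share the rank of the first entry in the tie, and the
    following rank is skipped (1, 2, 2, 4, ...). This is an assumed
    scheme, inferred from the 10, 10, 12 pattern in the list.
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # no tie: rank is the 1-based position
    return ranks


# Scores copied from the leaderboard above, in listed order.
scores = [72.7, 72.5, 66.7, 66.3, 62.3, 61.4, 50.7, 41.0, 38.1,
          33.9, 33.9, 32.6, 31.4, 30.6, 30.3, 26.2, 8.8, 5.9]

ranks = competition_ranks(scores)
for rank, score in zip(ranks, scores):
    print(f"{rank:>2}  {score:.1f}%")
```

With this scheme, the two Qwen3 VL 8B variants at 33.9% both get rank 10 and the next entry gets rank 12, as in the list.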