OSWorld Benchmark Leaderboard

OSWorld is a first-of-its-kind, scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, with 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.

Leaderboard

Top 18 models on the OSWorld Benchmark Leaderboard (scores from public evaluations).

  1. Claude Opus 4.6: 72.7%
  2. Claude Sonnet 4.6: 72.5%
  3. Qwen3 VL 235B A22B Instruct: 66.7%
  4. Claude Opus 4.5: 66.3%
  5. GLM-5V-Turbo: 62.3%
  6. Claude Sonnet 4.5: 61.4%
  7. Claude Haiku 4.5: 50.7%
  8. Qwen3 VL 32B Thinking: 41.0%
  9. Qwen3 VL 235B A22B Thinking: 38.1%
  10. Qwen3 VL 8B Thinking: 33.9% (tied for 10th)
  11. Qwen3 VL 8B Instruct: 33.9% (tied for 10th)
  12. Qwen3 VL 32B Instruct: 32.6%
  13. Qwen3 VL 4B Thinking: 31.4%
  14. Qwen3 VL 30B A3B Thinking: 30.6%
  15. Qwen3 VL 30B A3B Instruct: 30.3%
  16. Qwen3 VL 4B Instruct: 26.2%
  17. Qwen2.5 VL 72B Instruct: 8.8%
  18. Qwen2.5 VL 32B Instruct: 5.9%
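
Two entries in the list share a score of 33.9% and both sit at rank 10, with the next entry at rank 12. That pattern matches standard competition ranking, where tied scores share a rank and the following rank is skipped. The sketch below reproduces it from the scores listed above; the names `LEADERBOARD` and `competition_ranks` are illustrative assumptions, and this is not the site's actual ranking code.

```python
# Scores copied from the list above, already sorted from highest to lowest.
LEADERBOARD = [
    ("Claude Opus 4.6", 72.7),
    ("Claude Sonnet 4.6", 72.5),
    ("Qwen3 VL 235B A22B Instruct", 66.7),
    ("Claude Opus 4.5", 66.3),
    ("GLM-5V-Turbo", 62.3),
    ("Claude Sonnet 4.5", 61.4),
    ("Claude Haiku 4.5", 50.7),
    ("Qwen3 VL 32B Thinking", 41.0),
    ("Qwen3 VL 235B A22B Thinking", 38.1),
    ("Qwen3 VL 8B Thinking", 33.9),
    ("Qwen3 VL 8B Instruct", 33.9),
    ("Qwen3 VL 32B Instruct", 32.6),
    ("Qwen3 VL 4B Thinking", 31.4),
    ("Qwen3 VL 30B A3B Thinking", 30.6),
    ("Qwen3 VL 30B A3B Instruct", 30.3),
    ("Qwen3 VL 4B Instruct", 26.2),
    ("Qwen2.5 VL 72B Instruct", 8.8),
    ("Qwen2.5 VL 32B Instruct", 5.9),
]


def competition_ranks(scores):
    """Return 1-based ranks for scores sorted in descending order.

    Tied scores share the rank of the first entry in the tie, and the
    rank after a tie is skipped (1, 2, 2, 4, ...).
    """
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            ranks.append(position)
    return ranks


ranks = competition_ranks([score for _, score in LEADERBOARD])

for rank, (model, score) in zip(ranks, LEADERBOARD):
    print(f"{rank:>2}  {model}: {score}%")
```

The exact equality check works here because the tied scores are parsed from identical literals; scores computed from raw task counts may need rounding before comparison.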

