OSWorld Benchmark Leaderboard

OSWorld is a scalable, real-computer environment for multimodal agents, described as the first of its kind. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS, and comprises 369 computer tasks involving real web and desktop applications, OS file I/O, and multi-application workflows.

Leaderboard

Top 18 models on the OSWorld Benchmark Leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	Claude Opus 4.6	72.7%	—
2	Claude Sonnet 4.6	72.5%	—
3	Qwen3 VL 235B A22B Instruct	66.7%	—
4	Claude Opus 4.5	66.3%	—
5	GLM-5V-Turbo	62.3%	—
6	Claude Sonnet 4.5	61.4%	—
7	Claude Haiku 4.5	50.7%	—
8	Qwen3 VL 32B Thinking	41.0%	—
9	Qwen3 VL 235B A22B Thinking	38.1%	—
10	Qwen3 VL 8B Thinking	33.9%	—
10	Qwen3 VL 8B Instruct	33.9%	—
12	Qwen3 VL 32B Instruct	32.6%	—
13	Qwen3 VL 4B Thinking	31.4%	—
14	Qwen3 VL 30B A3B Thinking	30.6%	—
15	Qwen3 VL 30B A3B Instruct	30.3%	—
16	Qwen3 VL 4B Instruct	26.2%	—
17	Qwen2.5 VL 72B Instruct	8.8%	—
18	Qwen2.5 VL 32B Instruct	5.9%	—
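
The two models tied at 33.9% both hold rank 10, and the next model is ranked 12, which is consistent with standard competition ranking (tied entries share a rank and the following rank is skipped). The sketch below is only an illustration of that convention, not the leaderboard's actual code; the function name `competition_ranks` is hypothetical, and the scores are copied from the table above.

```python
def competition_ranks(scores):
    """Return 1-based ranks for scores, highest first; tied scores share a rank."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for i, score in enumerate(ordered):
        if i > 0 and score == ordered[i - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # Otherwise the rank is the 1-based position, which skips
            # the ranks consumed by any preceding tie.
            ranks.append(i + 1)
    return ranks


# Scores (in percent) from the leaderboard table, in table order.
scores = [72.7, 72.5, 66.7, 66.3, 62.3, 61.4, 50.7, 41.0, 38.1,
          33.9, 33.9, 32.6, 31.4, 30.6, 30.3, 26.2, 8.8, 5.9]

ranks = competition_ranks(scores)
for rank, score in zip(ranks, sorted(scores, reverse=True)):
    print(f"{rank}\t{score}%")
```

Under this convention the table has 18 rows but no rank 11.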

