Terminal-Bench 2.0 Benchmark Leaderboard

Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents can operate a computer through a terminal. It evaluates whether models can autonomously complete real-world, end-to-end tasks, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks, including work on cybersecurity vulnerabilities.

Leaderboard

Top 40 models on the Terminal-Bench 2.0 leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	GPT-5.5	82.7%	—
2	Claude Mythos Preview	82.0%	—
3	GPT-5.3 Codex	77.3%	—
4	Gemini 3.5 Flash	76.2%	—
5	GPT-5.4	75.1%	—
6	Claude Opus 4.7	69.4%	—
7	GLM-5.1	69.0%	—
8	Gemini 3.1 Pro	68.5%	—
9	DeepSeek-V4-Pro-Max	67.9%	—
10	Kimi K2.6	66.7%	—
11	Claude Opus 4.6	65.4%	—
12	GPT-5.2 Codex	64.0%	—
13	Qwen3.6 Plus	61.6%	—
14	GPT-5.4 mini	60.0%	—
15	Qwen3.6-27B	59.3%	—
15	Claude Opus 4.5	59.3%	—
17	Claude Sonnet 4.6	59.1%	—
18	Muse Spark	59.0%	—
19	MiMo-V2-Pro	57.1%	—
20	MiniMax M2.7	57.0%	—
21	DeepSeek-V4-Flash-Max	56.9%	—
22	GLM-5	56.2%	—
23	Gemini 3 Pro	54.2%	—
24	GPT-5.1 Codex	52.8%	—
25	Qwen3.5-397B-A17B	52.5%	—
26	Qwen3.6-35B-A3B	51.5%	—
27	Step-3.5-Flash	51.0%	—
28	Kimi K2.5	50.8%	—
29	Qwen3.5-122B-A10B	49.4%	—
30	Gemini 3 Flash	47.6%	—
31	DeepSeek-V3.2 (Thinking)	46.4%	—
31	DeepSeek-V3.2	46.4%	—
31	DeepSeek-V3.2-Speciale	46.4%	—
34	GPT-5.4 nano	46.3%	—
35	Qwen3.5-27B	41.6%	—
36	GLM-4.7	41.0%	—
37	Qwen3.5-35B-A3B	40.5%	—
38	MiMo-V2-Flash	38.5%	—
39	Qwen3-Coder 480B A35B Instruct	37.5%	—
40	Nemotron 3 Super (120B A12B)	31.0%	—
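
Tied scores in the table share a rank, and the following rank is skipped (15, 15, 17 and 31, 31, 31, 34). That pattern matches standard competition ranking, although the page does not state its ranking rule, so treat this as an inference. A minimal Python sketch of that scheme, where the function name `competition_ranks` and its `start` parameter are hypothetical and the scores are copied from the rows above:

```python
def competition_ranks(scores, start=1):
    """Rank scores from highest to lowest.

    Tied scores share the rank of their first position, and the next
    distinct score takes the rank it would have had without the tie.
    `start` is the rank given to the highest score in the list.
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=start):
        # Keep only the first (best) position seen for each score.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Rows 14-17 of the table: (model, score in percent).
rows = [
    ("GPT-5.4 mini", 60.0),
    ("Qwen3.6-27B", 59.3),
    ("Claude Opus 4.5", 59.3),
    ("Claude Sonnet 4.6", 59.1),
]

ranks = competition_ranks([score for _, score in rows], start=14)
for rank, (model, score) in zip(ranks, rows):
    print(f"{rank}\t{model}\t{score}%")
```

The two models at 59.3% both receive rank 15, and the next model receives rank 17, as in the table.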
