Terminal-Bench 2.0 Benchmark Leaderboard
Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents use tools to operate a computer through a terminal. It evaluates how well models handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks such as handling cybersecurity vulnerabilities.
Leaderboard
Top 40 models on the Terminal-Bench 2.0 leaderboard, ranked by score (scores from public evaluations). Models with equal scores share a rank.
- 1. GPT-5.5 (82.7%)
- 2. Claude Mythos Preview (82.0%)
- 3. GPT-5.3 Codex (77.3%)
- 4. Gemini 3.5 Flash (76.2%)
- 5. GPT-5.4 (75.1%)
- 6. Claude Opus 4.7 (69.4%)
- 7. GLM-5.1 (69.0%)
- 8. Gemini 3.1 Pro (68.5%)
- 9. DeepSeek-V4-Pro-Max (67.9%)
- 10. Kimi K2.6 (66.7%)
- 11. Claude Opus 4.6 (65.4%)
- 12. GPT-5.2 Codex (64.0%)
- 13. Qwen3.6 Plus (61.6%)
- 14. GPT-5.4 mini (60.0%)
- 15. Qwen3.6-27B (59.3%)
- 15. Claude Opus 4.5 (59.3%)
- 17. Claude Sonnet 4.6 (59.1%)
- 18. Muse Spark (59.0%)
- 19. MiMo-V2-Pro (57.1%)
- 20. MiniMax M2.7 (57.0%)
- 21. DeepSeek-V4-Flash-Max (56.9%)
- 22. GLM-5 (56.2%)
- 23. Gemini 3 Pro (54.2%)
- 24. GPT-5.1 Codex (52.8%)
- 25. Qwen3.5-397B-A17B (52.5%)
- 26. Qwen3.6-35B-A3B (51.5%)
- 27. Step-3.5-Flash (51.0%)
- 28. Kimi K2.5 (50.8%)
- 29. Qwen3.5-122B-A10B (49.4%)
- 30. Gemini 3 Flash (47.6%)
- 31. DeepSeek-V3.2 (Thinking) (46.4%)
- 31. DeepSeek-V3.2 (46.4%)
- 31. DeepSeek-V3.2-Speciale (46.4%)
- 34. GPT-5.4 nano (46.3%)
- 35. Qwen3.5-27B (41.6%)
- 36. GLM-4.7 (41.0%)
- 37. Qwen3.5-35B-A3B (40.5%)
- 38. MiMo-V2-Flash (38.5%)
- 39. Qwen3-Coder 480B A35B Instruct (37.5%)
- 40. Nemotron 3 Super (120B A12B) (31.0%)