Terminal-Bench 2.0 Benchmark Leaderboard

Terminal-Bench 2.0 is an updated benchmark that tests how well AI agents can use tools to operate a computer through a terminal. It evaluates whether models can complete real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, administering systems, running data science workflows, and handling security tasks involving cybersecurity vulnerabilities.

Leaderboard

The top 40 models on the Terminal-Bench 2.0 leaderboard, ordered by score. Scores come from public evaluations.

  1. GPT-5.5: 82.7%
  2. Claude Mythos Preview: 82.0%
  3. GPT-5.3 Codex: 77.3%
  4. Gemini 3.5 Flash: 76.2%
  5. GPT-5.4: 75.1%
  6. Claude Opus 4.7: 69.4%
  7. GLM-5.1: 69.0%
  8. Gemini 3.1 Pro: 68.5%
  9. DeepSeek-V4-Pro-Max: 67.9%
  10. Kimi K2.6: 66.7%
  11. Claude Opus 4.6: 65.4%
  12. GPT-5.2 Codex: 64.0%
  13. Qwen3.6 Plus: 61.6%
  14. GPT-5.4 mini: 60.0%
  15. Qwen3.6-27B: 59.3% (tied for 15th)
  16. Claude Opus 4.5: 59.3% (tied for 15th)
  17. Claude Sonnet 4.6: 59.1%
  18. Muse Spark: 59.0%
  19. MiMo-V2-Pro: 57.1%
  20. MiniMax M2.7: 57.0%
  21. DeepSeek-V4-Flash-Max: 56.9%
  22. GLM-5: 56.2%
  23. Gemini 3 Pro: 54.2%
  24. GPT-5.1 Codex: 52.8%
  25. Qwen3.5-397B-A17B: 52.5%
  26. Qwen3.6-35B-A3B: 51.5%
  27. Step-3.5-Flash: 51.0%
  28. Kimi K2.5: 50.8%
  29. Qwen3.5-122B-A10B: 49.4%
  30. Gemini 3 Flash: 47.6%
  31. DeepSeek-V3.2 (Thinking): 46.4% (tied for 31st)
  32. DeepSeek-V3.2: 46.4% (tied for 31st)
  33. DeepSeek-V3.2-Speciale: 46.4% (tied for 31st)
  34. GPT-5.4 nano: 46.3%
  35. Qwen3.5-27B: 41.6%
  36. GLM-4.7: 41.0%
  37. Qwen3.5-35B-A3B: 40.5%
  38. MiMo-V2-Flash: 38.5%
  39. Qwen3-Coder 480B A35B Instruct: 37.5%
  40. Nemotron 3 Super (120B A12B): 31.0%
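
Models with equal scores in the list above share a rank (59.3% at rank 15, 46.4% at rank 31), and the rank after a tie skips ahead (17 and 34). This looks like standard competition ranking. The Python sketch below reproduces that numbering for two slices of the list. The function name `competition_ranks` and its `first_rank` parameter are illustrative assumptions, not part of any Terminal-Bench tooling.

```python
def competition_ranks(scores, first_rank=1):
    """Rank scores that are already sorted in descending order.

    Equal neighbouring scores share a rank, and the rank after a tie
    skips ahead by the number of tied entries (1, 2, 2, 4 style).
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: the rank is the entry's position in the list.
            ranks.append(first_rank + i)
    return ranks


# List positions 14-17: GPT-5.4 mini, Qwen3.6-27B, Claude Opus 4.5, Claude Sonnet 4.6
print(competition_ranks([60.0, 59.3, 59.3, 59.1], first_rank=14))  # [14, 15, 15, 17]

# List positions 31-34: the three DeepSeek-V3.2 variants, then GPT-5.4 nano
print(competition_ranks([46.4, 46.4, 46.4, 46.3], first_rank=31))  # [31, 31, 31, 34]
```

The sketch compares scores exactly as published, to one decimal place, so two models whose underlying scores differ only beyond that precision would still be treated as tied.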
