Terminal-Bench Benchmark Leaderboard

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks involving vulnerabilities. The benchmark consists of a dataset of roughly 100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

Leaderboard

The top 23 models on the Terminal-Bench leaderboard, with scores taken from public evaluations.

  1. Claude Sonnet 4.5: 50.0%
  2. MiniMax M2.1: 47.9%
  3. Kimi K2-Thinking-0905: 47.1%
  4. MiniMax M2: 46.3%
  5. Claude Opus 4.1: 43.3%
  6. Claude Haiku 4.5: 41.0%
  7. GLM-4.6: 40.5%
  8. LongCat-Flash-Chat: 39.5%
  9. Claude Opus 4: 39.2%
  10. DeepSeek-V3.2-Exp: 37.7%
  11. GLM-4.5: 37.5%
  12. Claude Sonnet 4: 35.5%
  13. Claude 3.7 Sonnet: 35.2%
  14. LongCat-Flash-Lite: 33.8%
  15. GLM-4.7: 33.3%
  16. DeepSeek-V3.1: 31.3%
  17. MiMo-V2-Flash: 30.5%
  18. GLM-4.5-Air: 30.0% (tied for 18th)
  19. Kimi K2 Instruct: 30.0% (tied for 18th)
  20. Nemotron 3 Super (120B A12B): 25.8%
  21. Kimi K2-Instruct-0905: 25.0%
  22. Nemotron 3 Nano (30B A3B): 8.5%
  23. DeepSeek-R1-0528: 5.7%
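
The two entries at 30.0% share rank 18, and the next entry is ranked 20 rather than 19. That pattern matches standard competition ranking, where tied scores share a rank and the following rank is skipped. The Python sketch below reproduces it for entries 17 through 20. It is an illustration only: the function name `competition_rank` and the `OFFSET` constant are hypothetical, and the assumption that the leaderboard site ranks this way is inferred from the list itself, not confirmed by the source.

```python
def competition_rank(scores):
    """Return (rank, name, score) tuples in descending score order.

    Tied scores share a rank and the next rank is skipped (1, 2, 2, 4).
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous entry: reuse its rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


# Scores copied from entries 17 through 20 of the list above.
scores = {
    "MiMo-V2-Flash": 30.5,
    "GLM-4.5-Air": 30.0,
    "Kimi K2 Instruct": 30.0,
    "Nemotron 3 Super (120B A12B)": 25.8,
}

# Hypothetical helper constant: sixteen higher-scoring entries precede this excerpt.
OFFSET = 16

for rank, name, score in competition_rank(scores):
    print(f"{rank + OFFSET}. {name}: {score:.1f}%")
```

Under this scheme the list position and the displayed rank differ only after a tie, which is why position 19 carries rank 18.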
