Terminal-Bench Benchmark Leaderboard
Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, data science workflows, and security tasks (including cybersecurity vulnerabilities). The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.
Leaderboard
The top 23 models on the Terminal-Bench Benchmark Leaderboard (scores from public evaluations).
- 1. Claude Sonnet 4.5: 50.0%
- 2. MiniMax M2.1: 47.9%
- 3. Kimi K2-Thinking-0905: 47.1%
- 4. MiniMax M2: 46.3%
- 5. Claude Opus 4.1: 43.3%
- 6. Claude Haiku 4.5: 41.0%
- 7. GLM-4.6: 40.5%
- 8. LongCat-Flash-Chat: 39.5%
- 9. Claude Opus 4: 39.2%
- 10. DeepSeek-V3.2-Exp: 37.7%
- 11. GLM-4.5: 37.5%
- 12. Claude Sonnet 4: 35.5%
- 13. Claude 3.7 Sonnet: 35.2%
- 14. LongCat-Flash-Lite: 33.8%
- 15. GLM-4.7: 33.3%
- 16. DeepSeek-V3.1: 31.3%
- 17. MiMo-V2-Flash: 30.5%
- 18. GLM-4.5-Air: 30.0%
- 18. Kimi K2 Instruct: 30.0%
- 20. Nemotron 3 Super (120B A12B): 25.8%
- 21. Kimi K2-Instruct-0905: 25.0%
- 22. Nemotron 3 Nano (30B A3B): 8.5%
- 23. DeepSeek-R1-0528: 5.7%
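Two models share rank 18 and the next model is ranked 20, which is consistent with standard competition ranking (tied entries share a rank, and the following rank is skipped). The page does not state how ranks are computed, so the following is only a minimal sketch under that assumption; the function name `competition_ranks` and its `start` parameter are hypothetical, chosen for illustration.

```python
def competition_ranks(scores, start=1):
    """Assign competition-style ranks to scores already sorted in descending order.

    Tied scores share the rank of the first entry in the tie, and the ranks
    that would have followed are skipped (for example 17, 18, 18, 20).
    This is an assumed scheme, not the leaderboard's documented method.
    """
    ranks = []
    for index, score in enumerate(scores):
        if index > 0 and score == scores[index - 1]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: rank equals the entry's position in the list.
            ranks.append(start + index)
    return ranks


# Scores taken from positions 17 to 20 of the leaderboard above.
tail_scores = [30.5, 30.0, 30.0, 25.8]
tail_ranks = competition_ranks(tail_scores, start=17)
print(tail_ranks)
```

Under this assumption the two 30.0% entries both receive rank 18 and the 25.8% entry receives rank 20, matching the ranks shown in the list.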