Terminal-Bench Benchmark Leaderboard

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can autonomously handle real-world, end-to-end tasks, including compiling code, training models, setting up servers, administering systems, running data science workflows, and handling security tasks such as cybersecurity vulnerabilities. The benchmark consists of a dataset of ~100 hand-crafted, human-verified tasks and an execution harness that connects language models to a terminal sandbox.

Leaderboard

Top 23 models on the Terminal-Bench leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	Claude Sonnet 4.5	50.0%	—
2	MiniMax M2.1	47.9%	—
3	Kimi K2-Thinking-0905	47.1%	—
4	MiniMax M2	46.3%	—
5	Claude Opus 4.1	43.3%	—
6	Claude Haiku 4.5	41.0%	—
7	GLM-4.6	40.5%	—
8	LongCat-Flash-Chat	39.5%	—
9	Claude Opus 4	39.2%	—
10	DeepSeek-V3.2-Exp	37.7%	—
11	GLM-4.5	37.5%	—
12	Claude Sonnet 4	35.5%	—
13	Claude 3.7 Sonnet	35.2%	—
14	LongCat-Flash-Lite	33.8%	—
15	GLM-4.7	33.3%	—
16	DeepSeek-V3.1	31.3%	—
17	MiMo-V2-Flash	30.5%	—
18	GLM-4.5-Air	30.0%	—
18	Kimi K2 Instruct	30.0%	—
20	Nemotron 3 Super (120B A12B)	25.8%	—
21	Kimi K2-Instruct-0905	25.0%	—
22	Nemotron 3 Nano (30B A3B)	8.5%	—
23	DeepSeek-R1-0528	5.7%	—
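
The ranks in the table appear to follow standard competition ranking: the two models tied at 30.0% share rank 18, and the next model is ranked 20 rather than 19. The sketch below reproduces that numbering for an excerpt of the rows. The function name `competition_rank` and the `OFFSET` constant are illustrative assumptions, not part of Terminal-Bench or its harness.

```python
from typing import List, Tuple


def competition_rank(scores: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign standard competition ranks: tied scores share a rank, and the
    next distinct score skips ahead by the number of tied entries."""
    # sorted() is stable, so tied models keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, model, score))
    return ranked


# Excerpt of the leaderboard rows (ranks 17 to 21 in the table).
rows = [
    ("MiMo-V2-Flash", 30.5),
    ("GLM-4.5-Air", 30.0),
    ("Kimi K2 Instruct", 30.0),
    ("Nemotron 3 Super (120B A12B)", 25.8),
    ("Kimi K2-Instruct-0905", 25.0),
]

OFFSET = 16  # assumption for this excerpt: sixteen models score higher

for rank, model, score in competition_rank(rows):
    print(f"{rank + OFFSET}\t{model}\t{score:.1f}%")
```

Run on the excerpt, this gives ranks 17, 18, 18, 20, 21, matching the table.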
