LiveCodeBench Benchmark Leaderboard

LiveCodeBench is a holistic, contamination-free evaluation benchmark for code-focused large language models. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates models in four scenarios: code generation, self-repair, code execution, and test output prediction. Each problem is annotated with its release date, so a model can be evaluated only on problems released after its training cutoff.
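The release-date filtering described above can be sketched as follows. This is a minimal illustration, not LiveCodeBench's actual code: the `Problem` record, its field names, the sample problems, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    problem_id: str
    source: str
    release_date: date


def unseen_problems(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up sample data for illustration only.
problems = [
    Problem("p1", "LeetCode", date(2023, 5, 14)),
    Problem("p2", "AtCoder", date(2024, 2, 3)),
    Problem("p3", "CodeForces", date(2024, 9, 21)),
]

cutoff = date(2023, 12, 31)  # assumed cutoff for a hypothetical model
held_out = unseen_problems(problems, cutoff)
print([p.problem_id for p in held_out])  # ['p2', 'p3']
```

Using a strict comparison keeps a problem released on the cutoff date itself out of the evaluation set, which is the more conservative choice when the exact cutoff is uncertain.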

Leaderboard

Top 50 models on the LiveCodeBench leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	DeepSeek-V4-Pro-Max	93.5%	—
2	DeepSeek-V4-Flash-Max	91.6%	—
3	DeepSeek-V3.2 (Thinking)	83.3%	—
3	DeepSeek-V3.2	83.3%	—
5	MiniMax M2	83.0%	—
6	LongCat-Flash-Thinking-2601	82.8%	—
7	Nemotron 3 Super (120B A12B)	81.2%	—
8	Grok-3 Mini	80.4%	—
9	Grok 4 Fast	80.0%	—
10	Grok-3	79.4%	—
10	Grok-4 Heavy	79.4%	—
10	LongCat-Flash-Thinking	79.4%	—
13	Grok-4	79.0%	—
14	MiniMax M2.1	78.0%	—
15	DeepSeek-V3.2-Exp	74.1%	—
16	DeepSeek-R1-0528	73.3%	—
17	GLM-4.5	72.9%	—
18	Nemotron Nano 9B v2	71.1%	—
19	Qwen3 235B A22B	70.7%	—
19	GLM-4.5-Air	70.7%	—
21	Gemini 2.5 Pro Preview 06-05	69.0%	—
22	Mercury 2	67.0%	—
23	Llama 3.1 Nemotron Ultra 253B v1	66.3%	—
24	Qwen3 32B	65.7%	—
25	MiniMax M1 80K	65.0%	—
26	Ministral 3 (14B Reasoning 2512)	64.6%	—
27	Mistral Small 4	63.6%	—
28	QwQ-32B	63.4%	—
29	Qwen3 30B A3B	62.6%	—
30	MiniMax M1 40K	62.3%	—
31	Ministral 3 (8B Reasoning 2512)	61.6%	—
32	DeepSeek R1 Distill Llama 70B	57.5%	—
33	DeepSeek R1 Distill Qwen 32B	57.2%	—
34	DeepSeek-V3.1	56.4%	—
35	Qwen2.5 72B Instruct	55.5%	—
36	Ministral 3 (3B Reasoning 2512)	54.8%	—
37	Phi 4 Reasoning	53.8%	—
38	Kimi K2-Instruct-0905	53.7%	—
39	Phi 4 Reasoning Plus	53.1%	—
39	DeepSeek R1 Distill Qwen 14B	53.1%	—
41	Magistral Small 2506	51.3%	—
42	Magistral Medium	50.3%	—
43	QwQ-32B-Preview	50.0%	—
43	DeepSeek R1 Zero	50.0%	—
45	DeepSeek-V3 0324	49.2%	—
46	LongCat-Flash-Chat	48.0%	—
47	Llama 4 Maverick	43.4%	—
48	DeepSeek R1 Distill Llama 8B	39.6%	—
49	DeepSeek-V3	37.6%	—
49	DeepSeek R1 Distill Qwen 7B	37.6%	—
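
Judging from the tied rows, the Rank column appears to follow standard competition ranking: models with equal scores share a rank, and the next rank skips ahead by the number of tied entries. That is an inference from the table, not a documented rule. A minimal sketch, with the hypothetical helper `competition_ranks`, applied to the top five scores:

```python
def competition_ranks(scores):
    """Assign competition ranks (1, 2, 2, 4, ...) to scores sorted in descending order."""
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            # Same score as the previous row: share its rank.
            ranks.append(ranks[-1])
        else:
            # New score: the rank equals the 1-based position in the list.
            ranks.append(position)
    return ranks


top_scores = [93.5, 91.6, 83.3, 83.3, 83.0]  # first five scores from the table
print(competition_ranks(top_scores))  # [1, 2, 3, 3, 5]
```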

Models tracked

Models with livecodebench in their evaluation profile.
