MathVista Benchmark Leaderboard

MathVista evaluates the mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples drawn from 28 existing multimodal datasets and 3 newly created ones (IQTest, FunctionQA, and PaperQA). The examples combine diverse mathematical and visual tasks to test whether a model can understand complex figures and reason rigorously about them.

Leaderboard

Top 36 models on the MathVista Benchmark Leaderboard (scores from public evaluations).

Rank	Model	Score	Lab
1	o3	86.8%	—
2	o4-mini	84.3%	—
3	Step3-VL-10B	84.0%	—
4	Kimi-k1.5	74.9%	—
5	Llama 4 Maverick	73.7%	—
6	GPT-4.1 mini	73.1%	—
7	GPT-4.5	72.3%	—
8	GPT-4.1	72.2%	—
9	o1	71.8%	—
10	QvQ-72B-Preview	71.4%	—
11	Llama 4 Scout	70.7%	—
12	Pixtral Large	69.4%	—
13	Grok-2	69.0%	—
14	Grok-2 mini	68.1%	—
14	Gemini 1.5 Pro	68.1%	—
16	Qwen2.5-Omni-7B	67.9%	—
17	Claude 3.5 Sonnet	67.7%	—
18	Mistral Small 3.2 24B Instruct	67.1%	—
19	Gemini 1.5 Flash	65.8%	—
20	GPT-4o	63.8%	—
21	DeepSeek VL2	62.8%	—
22	Phi-4-multimodal-instruct	62.4%	—
23	GPT-4o	61.4%	—
24	DeepSeek VL2 Small	60.7%	—
25	Pixtral-12B	58.0%	—
26	Llama 3.2 90B Instruct	57.3%	—
27	GPT-4o mini	56.7%	—
28	GPT-4.1 nano	56.2%	—
29	Gemini 1.5 Flash 8B	54.7%	—
30	DeepSeek VL2 Tiny	53.6%	—
31	Grok-1.5	52.8%	—
31	Grok-1.5V	52.8%	—
33	Llama 3.2 11B Instruct	51.5%	—
34	Gemini 1.0 Pro	46.6%	—
35	Phi-3.5-vision-instruct	43.9%	—
36	GPT-3.5 Turbo	0.0%	—
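
The Rank column above appears to follow standard competition ranking: tied scores share a rank and the next rank is skipped (14, 14, 16 and 31, 31, 33). The page does not state how ranks are computed, so the following is only a minimal sketch under that assumption. The function name `competition_ranks` is hypothetical.

```python
def competition_ranks(scores):
    """Return 1-based ranks for scores already sorted in descending order.

    Assumption: ties share the rank of the first tied entry, and the
    rank after a tie skips ahead (standard competition ranking).
    """
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # no tie: rank equals 1-based position
    return ranks


# Scores from rows 13 through 17 of the table (Grok-2 to Claude 3.5 Sonnet).
scores = [69.0, 68.1, 68.1, 67.9, 67.7]

# Offset by 12 because these rows start at rank 13 in the full table.
ranks = [r + 12 for r in competition_ranks(scores)]
print(ranks)
```

Under this assumption the two 68.1% entries both land on rank 14 and the following entry on rank 16, which matches the table.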

