Qwen2.5 VL 72B Instruct

Qwen2.5-VL is Qwen's new flagship vision-language model and a significant improvement over Qwen2-VL. It excels at recognizing objects; analyzing text, charts, and layouts in images; acting as a visual agent; understanding long videos (over 1 hour) and pinpointing events within them; performing visual localization with bounding boxes or points; and generating structured outputs from documents.

Context —

Benchmarks

GPQA
MMLU
MMLU-Pro
AIME 2025
MATH
HumanEval
MMMU
LiveCodeBench
SWE-Bench Verified
