CharXiv-R Benchmark Leaderboard

CharXiv-R is the reasoning component of the CharXiv benchmark, focusing on complex reasoning questions that require synthesizing information across visual chart elements. It evaluates multimodal large language models on their ability to understand and reason about scientific charts from arXiv papers through various reasoning tasks.

Leaderboard

The top 36 models on CharXiv-R, ranked by score (scores from public evaluations).

  1. Claude Mythos Preview: 93.2%
  2. Claude Opus 4.7: 91.0%
  3. Kimi K2.6: 86.7%
  4. Muse Spark: 86.4%
  5. Gemini 3.5 Flash: 84.2%
  6. GPT-5.2: 82.1%
  7. GPT-5.5 Instant: 81.6%
  8. Qwen3.6 Plus: 81.5%
  9. Gemini 3 Pro: 81.4%
  10. GPT-5: 81.1%
  11. Gemini 3 Flash: 80.3%
  12. Qwen3.5-27B: 79.5%
  13. o3: 78.6%
  14. Qwen3.6-27B: 78.4%
  15. Qwen3.6-35B-A3B: 78.0%
  16. Qwen3.5-35B-A3B: 77.5% (tied)
  16. Kimi K2.5: 77.5% (tied)
  18. Claude Opus 4.6: 77.4%
  19. Qwen3.5-122B-A10B: 77.2%
  20. Gemini 3.1 Flash-Lite: 73.2%
  21. o4-mini: 72.0%
  22. Qwen3 VL 235B A22B Thinking: 66.1%
  23. Qwen3 VL 32B Thinking: 65.2%
  24. Qwen3 VL 32B Instruct: 62.8%
  25. Qwen3 VL 235B A22B Instruct: 62.1%
  26. GPT-4o: 58.8%
  27. GPT-4.1 mini: 56.8%
  28. GPT-4.1: 56.7%
  29. Qwen3 VL 30B A3B Thinking: 56.6%
  30. GPT-4.5: 55.4%
  31. Qwen3 VL 8B Thinking: 53.0%
  32. Qwen3 VL 4B Thinking: 50.3%
  33. Qwen3 VL 30B A3B Instruct: 48.9%
  34. Qwen3 VL 8B Instruct: 46.4%
  35. GPT-4.1 nano: 40.5%
  36. Qwen3 VL 4B Instruct: 39.7%
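
The list assigns rank 16 to two models with the same score and then continues at 18, which is consistent with standard competition ranking. The following is a minimal Python sketch of that convention, assuming it is what the leaderboard uses (the page does not say). The function name `rank_with_ties` and the tuple layout are hypothetical, chosen for illustration; the sample scores are copied from the list above.

```python
def rank_with_ties(entries):
    """Rank (name, score) pairs by descending score.

    Entries with equal scores share the same rank, and the next
    distinct score takes its 1-based position in the sorted list
    (standard competition ranking: 1, 2, 2, 4, ...).
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous entry: reuse its rank
        else:
            rank = position
        ranked.append((rank, name, score))
    return ranked


# Four consecutive entries from the leaderboard, ranked relative to each other.
sample = [
    ("Qwen3.6-35B-A3B", 78.0),
    ("Qwen3.5-35B-A3B", 77.5),
    ("Kimi K2.5", 77.5),
    ("Claude Opus 4.6", 77.4),
]
for rank, name, score in rank_with_ties(sample):
    print(f"{rank}. {name}: {score}%")
```

Within this four-entry sample the two tied models share rank 2 and the next entry gets rank 4, mirroring the 16, 16, 18 pattern in the full list.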
