Toolathlon Benchmark Leaderboard
Toolathlon (short for Tool Decathlon) is a benchmark for evaluating how well AI agents use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten different tool-use scenarios.
Leaderboard
Top 19 models on the Toolathlon benchmark, ranked by score (scores from public evaluations). Models with equal scores share a rank.
- 1. Gemini 3.5 Flash: 56.5%
- 2. GPT-5.5: 55.6%
- 3. GPT-5.4: 54.6%
- 4. DeepSeek-V4-Pro-Max: 51.8%
- 5. Kimi K2.6: 50.0%
- 6. Gemini 3 Flash: 49.4%
- 7. DeepSeek-V4-Flash-Max: 47.8%
- 8. GPT-5.2: 46.3%
- 8. MiniMax M2.7: 46.3%
- 10. MiniMax M2.1: 43.5%
- 11. GPT-5.4 mini: 42.9%
- 12. GLM-5.1: 40.7%
- 13. Qwen3.6 Plus: 39.8%
- 14. Qwen3.5-397B-A17B: 38.3%
- 15. GPT-5.4 nano: 35.5%
- 16. DeepSeek-V3.2-Speciale: 35.2%
- 16. DeepSeek-V3.2: 35.2%
- 16. DeepSeek-V3.2 (Thinking): 35.2%
- 19. Qwen3.6-35B-A3B: 26.9%
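The rank numbers in the list repeat for tied scores and then skip ahead (8, 8, then 10; 16, 16, 16, then 19), which matches what is usually called standard competition ranking. The sketch below shows one way such ranks could be computed from a descending list of scores. The function name `competition_ranks` is hypothetical, and this is an illustration of the numbering pattern, not the leaderboard's actual implementation.

```python
def competition_ranks(scores):
    """Return standard competition ranks for scores sorted in descending order.

    Tied scores share the rank of the first item in the tie, and the next
    distinct score takes its 1-based position (so ranks go 1, 2, 2, 4).
    """
    ranks = []
    for position, score in enumerate(scores, start=1):
        if ranks and score == scores[position - 2]:
            # Same score as the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # New score: the rank is the 1-based position in the list.
            ranks.append(position)
    return ranks


# Scores taken from four consecutive entries of the list (47.8% to 43.5%).
sample_scores = [47.8, 46.3, 46.3, 43.5]
sample_ranks = competition_ranks(sample_scores)
```

Here the ranks are relative to the start of `sample_scores`, so the two 46.3% entries share a rank and the entry after them skips one, mirroring the 8, 8, 10 pattern in the full list.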