Tau2 Telecom Benchmark Leaderboard
The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a decentralized partially observable Markov decision process (Dec-POMDP). Both the agent and the user call tools in shared telecommunications troubleshooting scenarios, which tests the agent's ability to coordinate and communicate.
Leaderboard
Top 30 models on the Tau2 Telecom Benchmark (scores from public evaluations). Models with equal scores share a rank.
- 1. LongCat-Flash-Thinking-2601: 99.3%
- 1. Claude Opus 4.6: 99.3%
- 3. GPT-5.4: 98.9%
- 4. GPT-5.2: 98.7%
- 5. Claude Opus 4.5: 98.2%
- 6. GPT-5.5: 98.0%
- 7. Claude Sonnet 4.6: 97.9%
- 8. MiMo-V2-Pro: 96.8%
- 9. GPT-5: 96.7%
- 10. GPT-5.1: 95.6%
- 10. GPT-5.1 Instant: 95.6%
- 10. GPT-5.1 Thinking: 95.6%
- 13. GPT-5.4 mini: 93.4%
- 14. GPT-5.4 nano: 92.5%
- 15. Muse Spark: 91.5%
- 16. MiniMax M2: 87.0%
- 16. MiniMax M2.1: 87.0%
- 18. LongCat-Flash-Thinking: 83.1%
- 19. Claude Haiku 4.5: 83.0%
- 20. LongCat-Flash-Chat: 73.7%
- 21. LongCat-Flash-Lite: 72.8%
- 22. Kimi K2-Instruct-0905: 65.8%
- 22. Kimi K2 Instruct: 65.8%
- 24. Nemotron 3 Super (120B A12B): 64.4%
- 25. o3: 58.2%
- 26. Qwen3-235B-A22B-Thinking-2507: 45.6%
- 27. Qwen3-Next-80B-A3B-Thinking: 43.9%
- 28. Nemotron 3 Nano (30B A3B): 42.2%
- 29. GPT-4o: 23.5%
- 30. Qwen3-Next-80B-A3B-Instruct: 13.2%
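The tied ranks in the list above (two models at rank 1 followed by rank 3, three at rank 10 followed by rank 13) are consistent with standard competition ranking, where tied entries share the lowest rank and the following ranks are skipped. The page does not state how it computes ranks, so the sketch below is an assumption about the scheme rather than the leaderboard's actual code. The function name `competition_rank` is hypothetical, and the sample uses the top four entries from the list.

```python
def competition_rank(scores):
    """Assign 1-based ranks to (name, score) pairs.

    Tied scores share the lowest rank and the next ranks are skipped
    (1, 1, 3, ...). This is an assumed scheme, not the site's own code.
    """
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(scores, key=lambda pair: pair[1], reverse=True)
    ranked = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous entry: reuse its rank
        else:
            rank = position       # new score: rank equals 1-based position
        ranked.append((rank, name, score))
    return ranked


# Top four entries taken from the leaderboard above.
sample = [
    ("LongCat-Flash-Thinking-2601", 99.3),
    ("Claude Opus 4.6", 99.3),
    ("GPT-5.4", 98.9),
    ("GPT-5.2", 98.7),
]

for rank, name, score in competition_rank(sample):
    print(f"{rank}. {name}: {score}%")
```

Applied to the sample, the two 99.3% entries both receive rank 1 and GPT-5.4 receives rank 3, matching the numbering shown in the list.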
Models tracked
Models with tau2-telecom in their evaluation profile.
- No models linked yet.