Tau2 Telecom Benchmark Leaderboard

The τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP. Both the agent and the user call tools in shared telecommunications troubleshooting scenarios, which tests how well the agent coordinates and communicates.

Leaderboard

Top 30 models on the Tau2 Telecom Benchmark, with scores taken from public evaluations.

  1. LongCat-Flash-Thinking-2601: 99.3% (rank 1, tied)
  2. Claude Opus 4.6: 99.3% (rank 1, tied)
  3. GPT-5.4: 98.9%
  4. GPT-5.2: 98.7%
  5. Claude Opus 4.5: 98.2%
  6. GPT-5.5: 98.0%
  7. Claude Sonnet 4.6: 97.9%
  8. MiMo-V2-Pro: 96.8%
  9. GPT-5: 96.7%
  10. GPT-5.1: 95.6% (rank 10, tied)
  11. GPT-5.1 Instant: 95.6% (rank 10, tied)
  12. GPT-5.1 Thinking: 95.6% (rank 10, tied)
  13. GPT-5.4 mini: 93.4%
  14. GPT-5.4 nano: 92.5%
  15. Muse Spark: 91.5%
  16. MiniMax M2: 87.0% (rank 16, tied)
  17. MiniMax M2.1: 87.0% (rank 16, tied)
  18. LongCat-Flash-Thinking: 83.1%
  19. Claude Haiku 4.5: 83.0%
  20. LongCat-Flash-Chat: 73.7%
  21. LongCat-Flash-Lite: 72.8%
  22. Kimi K2-Instruct-0905: 65.8% (rank 22, tied)
  23. Kimi K2 Instruct: 65.8% (rank 22, tied)
  24. Nemotron 3 Super (120B A12B): 64.4%
  25. o3: 58.2%
  26. Qwen3-235B-A22B-Thinking-2507: 45.6%
  27. Qwen3-Next-80B-A3B-Thinking: 43.9%
  28. Nemotron 3 Nano (30B A3B): 42.2%
  29. GPT-4o: 23.5%
  30. Qwen3-Next-80B-A3B-Instruct: 13.2%
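
Tied models in the list above share a rank, and the next rank is skipped (two models at rank 1 are followed by rank 3). This looks like standard competition ranking, although the page does not say how ranks are assigned. The Python sketch below shows one way to reproduce that numbering from (model, score) pairs; the function name `competition_ranks` is hypothetical and is not part of any τ²-Bench tooling.

```python
from typing import List, Tuple


def competition_ranks(
    scores: List[Tuple[str, float]],
) -> List[Tuple[int, str, float]]:
    """Assign standard competition ranks ("1, 1, 3, 4") to (model, score) pairs.

    Hypothetical helper for illustration; the leaderboard's real ranking
    logic is not documented on the page.
    """
    # sorted() is stable, so models with equal scores keep their input order.
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and score == ranked[-1][2]:
            rank = ranked[-1][0]  # same score as the previous row: share its rank
        else:
            rank = position  # otherwise the rank is the 1-based position
        ranked.append((rank, model, score))
    return ranked


# The first four rows of the leaderboard above.
sample = [
    ("LongCat-Flash-Thinking-2601", 99.3),
    ("Claude Opus 4.6", 99.3),
    ("GPT-5.4", 98.9),
    ("GPT-5.2", 98.7),
]

for rank, model, score in competition_ranks(sample):
    print(f"{rank}. {model}: {score}%")
```

Comparing scores for exact equality is enough here because they are published to one decimal place; scores computed from raw runs would likely need rounding before the comparison.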

