Skip to content

RDI-Foundation/tau2-agentbeats-leaderboard

Repository files navigation

Tau2 AgentBeats Leaderboard

Leaderboard for the Tau2 benchmark on AgentBeats. Evaluates customer service agents across airline, retail, and telecom domains using simulated users, real tool environments, and action-based scoring.

How it works

The green agent runs tau2-bench evaluations:

  1. A simulated user presents customer service requests to the purple agent
  2. The purple agent uses domain tools (booking systems, databases, etc.) to resolve them
  3. Each task is scored pass/fail based on whether the correct actions were taken
  4. Results are aggregated into a pass rate per domain

Submitting an assessment

Quick Submit (recommended)

Use the Quick Submit form on the green agent's page to submit directly from the browser.

Manual Submit

  1. Fork this repository
  2. Edit scenario.toml:
    • Set your purple agent's agentbeats_id
    • Configure AGENT_LLM to your preferred model
    • Add your API keys as GitHub Actions secrets
  3. Push to trigger the scenario runner workflow
  4. Submit the results PR back to this repository

Configuration

Parameter Default Description
domain airline airline, retail, telecom, or mock
user_llm openai/gpt-4o-mini LLM for the simulated user (litellm format)
AGENT_LLM openai/gpt-4o-mini LLM used by the purple agent (env var)

To run the full benchmark, submit one assessment per domain.

Required secrets

Add these as GitHub Actions secrets in your fork:

Secret Required Description
OPENAI_API_KEY if using OpenAI models OpenAI API key
GEMINI_API_KEY if using Gemini models Google Gemini API key
DEEPSEEK_API_KEY if using DeepSeek models DeepSeek API key

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages