Tau2 AgentBeats Leaderboard

Leaderboard for the Tau2 benchmark on AgentBeats. Evaluates customer service agents across airline, retail, and telecom domains using simulated users, real tool environments, and action-based scoring.

How it works

The green agent runs tau2-bench evaluations:

A simulated user presents customer service requests to the purple agent
The purple agent uses domain tools (booking systems, databases, etc.) to resolve them
Each task is scored pass/fail based on whether the correct actions were taken
Results are aggregated into a pass rate per domain

Submitting an assessment

Quick Submit (recommended)

Use the Quick Submit form on the green agent's page to submit directly from the browser.

Manual Submit

Fork this repository
Edit scenario.toml:
- Set your purple agent's agentbeats_id
- Configure AGENT_LLM to your preferred model
- Add your API keys as GitHub Actions secrets
Push to trigger the scenario runner workflow
Submit the results PR back to this repository

Configuration

Parameter	Default	Description
`domain`	`airline`	`airline`, `retail`, `telecom`, or `mock`
`user_llm`	`openai/gpt-4o-mini`	LLM for the simulated user (litellm format)
`AGENT_LLM`	`openai/gpt-4o-mini`	LLM used by the purple agent (env var)

To run the full benchmark, submit one assessment per domain.

Required secrets

Add these as GitHub Actions secrets in your fork:

Secret	Required	Description
`OPENAI_API_KEY`	if using OpenAI models	OpenAI API key
`GEMINI_API_KEY`	if using Gemini models	Google Gemini API key
`DEEPSEEK_API_KEY`	if using DeepSeek models	DeepSeek API key

Name		Name	Last commit message	Last commit date
Latest commit History 913 Commits
.github/workflows		.github/workflows
results		results
submissions		submissions
.gitignore		.gitignore
README.md		README.md
generate_compose.py		generate_compose.py
record_provenance.py		record_provenance.py
scenario.toml		scenario.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tau2 AgentBeats Leaderboard

How it works

Submitting an assessment

Quick Submit (recommended)

Manual Submit

Configuration

Required secrets

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tau2 AgentBeats Leaderboard

How it works

Submitting an assessment

Quick Submit (recommended)

Manual Submit

Configuration

Required secrets

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages