# tau2-bench **Repository Path**: mirrors_huggingface/tau2-bench ## Basic Information - **Project Name**: tau2-bench - **Description**: τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: trl-internal - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-07-31 - **Last Updated**: 2026-06-27 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment [![python](https://img.shields.io/badge/Python-3.10%2B-blue.svg?style=flat&logo=python&logoColor=white)](https://www.python.org) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![arXiv](http://img.shields.io/badge/cs.AI-arXiv%3A2506.07982-B31B1B.svg?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2506.07982) [![blog](https://img.shields.io/badge/blog-tau2--bench-green)](https://sierra.ai/blog/benchmarking-agents-in-collaborative-real-world-scenarios) [![Twitter](https://img.shields.io/twitter/url/https/twitter.com/sierra.svg?style=social&label=Follow%20%40SierraPlatform)](https://x.com/SierraPlatform/status/1932464265207889974) [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?logo=linkedin&logoColor=white)](https://www.linkedin.com/posts/sierra_last-year-we-introduced-%F0%9D%9C%8F-bench-a-benchmark-activity-7338229693898231809-F8L4?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAdc8goBmhEsiEo1_t_XSJbAnY4_zMfAWcE)
System Overview
Figure 1: τ²-bench allows users to interact with the agent and the environment
Trajectory
Figure 2: Trajectory of a conversation between an agent and a user
## 🍴 Fork instructions To run the code in this project, first, create a Python virtual environment using e.g. `uv`. To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/getting-started/installation/). ```shell uv self update uv venv taubench --python 3.11 && source taubench/bin/activate && uv pip install --upgrade pip ``` Next, install vLLM: ```shell uv pip install vllm==0.10.0 ``` Then install tau2: ```bash uv pip install -e . ``` You can then run simulations with local models via: ```sh ./run_tau2_local.sh \ --model_id Qwen/Qwen3-4B-Instruct-2507 \ --domain retail \ --num_trials 4 ``` ## Overview $\tau^2$-bench implements a simulation framework for evaluating customer service agents across various domains. Each domain specifies: - a policy that the agent must follow - a set of tools that the agent can use - a set of tasks to evaluate the agent's performance - Optionally: A set of tools that the user simulator can use Domains are: - `mock` - `airline` - `retail` - `telecom` All the information that an agent developer needs to build an agent for a domain can be accessed through the domain's API docs. See [View domain documentation](#view-domain-documentation) for more details. ## Installation 1. Clone the repository: ```bash git clone https://github.com/sierra-research/tau2-bench cd tau2-bench ``` 2. Create a new environment (optional) $\tau^2$-bench requires Python 3.10 or higher. You may create and activate a new environment: ```bash python -m venv .venv source .venv/bin/activate ``` 3. Install tau2 ```bash pip install -e . ``` This will enable you to run the `tau2` command. **Note:** If you use `pip install .` (without `-e`), you'll need to set the `TAU2_DATA_DIR` environment variable to point to your data directory: ```bash export TAU2_DATA_DIR=/path/to/your/tau2-bench/data ``` **Check your data directory setup:** After installation, you can verify that your data directory is correctly configured by running: ```bash tau2 check-data ``` This command will check if the data directory exists and print instructions if it is missing. To remove all the generated files and the virtual environment, run: ```bash make clean ``` ## Quick Start ### Setup LLM API keys We use [LiteLLM](https://github.com/BerriAI/litellm) to manage LLM APIs, so you can use any LLM provider supported by LiteLLM. To provide your API keys, copy `.env.example` as `.env` and edit it to include your API keys. ### Run agent evaluation To run a test evaluation on only 5 tasks with 1 trial per task, run: ```bash tau2 run \ --domain airline \ --agent-llm gpt-4.1 \ --user-llm gpt-4.1 \ --num-trials 1 \ --num-tasks 5 ``` Results will be saved in `data/tau2/simulations/`. ## Command Line Interface The `tau2` command provides a unified interface for all functionality: ### Running Benchmark ```bash tau2 run \ --domain \ --agent-llm \ --user-llm \ --num-trials \ --task-ids \ --max-concurrency \ ... ``` ### Viewing Results ```bash tau2 view ``` This tool allows you to: - Browse simulation files (in `data/tau2/simulations/`) - View agent performance metrics - View a particular simulation - View task details ### View domain documentation ```bash tau2 domain ``` Visit http://127.0.0.1:8004/redoc to see the domain policy and API documentation. ![domain_viewer1](figs/domain_viewer.png) ### Check data configuration ```bash tau2 check-data ``` This command checks if your data directory is properly configured and all required files are present. ## Experiments ### Running Ablation Studies (No User, or Agent with Oracle Plan) `telecom` domain enables running ablation studies. 1. Running an LLM in `no-user` mode. In this mode, the LLM is given all the tools and the information upfront. Just choose `llm_agent_solo` as the agent and `dummy_user` as the user. ```bash tau2 run \ --domain telecom \ --agent llm_agent_solo \ --agent-llm gpt-4.1 \ --user dummy_user \ ... ``` 2. Running an LLM in `oracle-plan` mode. In this mode, the LLM is given an oracle plan ahead of time alleviating the need for action planning. Just choose `llm_agent_gt` as the agent. ```bash tau2 run \ --domain telecom \ --agent llm_agent_gt \ --agent-llm gpt-4.1 \ --user-llm gpt-4.1 \ ... ``` ### Running Telecom Domain with Workflow Policy To test the impact of policy format, we provide an additional "workflow" policy for the telecom domain. To run using this policy, use the `telecom-workflow` domain. ```bash tau2 run \ --domain telecom-workflow \ --agent-llm gpt-4.1 \ --user-llm gpt-4.1 \ ... ``` ## Domains For all the details see the domains [README](src/tau2/domains/README.md). ### Basics - Code is located in `src/tau2/domains/` - Data is located in `data/tau2/domains/` - Each domain has its own configuration and task definitions #### View domain-specific policy and API docs: Run the following command to see the domain policy and API documentation. ```bash tau2 env ``` Then visit http://127.0.0.1:8004/redoc ### Environment CLI (beta) An interactive command-line interface for directly querying and testing domain environments. Features: - Interactive query interface with domain-specific tools - Support for multiple domains (airline, mock, etc.) - Session management with history To use: ```bash make env-cli ``` Available commands: - `:q` - quit the program - `:d` - change domain - `:n` - start new session (clears history) Example usage: ```bash $ make env-cli Welcome to the Environment CLI! Connected to airline domain. Query (:n new session, :d change domain, :q quit)> What flights are available from SF to LA tomorrow? Assistant: Let me check the flight availability for you... [Flight details will appear here] ``` The Environment CLI is useful for: - Testing domain tools and queries - Debugging environment responses - Exploring available domain functionality - Quick domain interaction without starting the full server stack ## Run tests To run the test suite use the command ```sh make test ``` ## Config To configure the framework, see the [config](src/tau2/config.py) file. ### LLM Calls caching LLM call caching is disabled by default. To enable LLM calls caching: - Make sure `redis` is running. - Update the redis config in `config.py` if necessary. - Set `LLM_CACHE_ENABLED` to `True` in `config.py` ## Evaluate Your Own Agent For local or remote agent evaluation, see our [agent developer guide](src/tau2/agent/README.md). ## Orchestration Sequence Diagram ```mermaid sequenceDiagram participant O as Orchestrator participant A as Agent participant U as UserSimulator participant E as Environment Note over O: Initialize(task) rect rgb(100, 150, 150) O->>A: get_init_state_info(message_history) A->>O: agent_state_info O->>U: get_init_state_info(message_history) U->>O: user_state_info O->>E: set_state(initialization_data, initialization_actions, message_history) end Note over O: Start simulation loop Pass messages between Agent, User, and Environment alt Agent/Env to User rect rgb(200, 150, 150) O->>U: generate_next_message(msg, user_state_info) U-->>O: (user_msg, user_state_info) end Note over O: Check if user_msg is STOP else User/Env to Agent rect rgb(100, 200, 100) O->>A: generate_next_message(msg, agent_state_info) A-->>O: (assistant_msg, agent_state_info) Note over O: Check if too many errors end else User/Agent to Environment rect rgb(150, 150, 200) O->>E: get_response(tool_call) E-->>O: tool_message end end Note over O: Check if max turns reached. end Note over O: Return simulation run ``` ## Citation ```bibtex @misc{barres2025tau2, title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment}, author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan}, year={2025}, eprint={2506.07982}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2506.07982}, } ```