# EvalBench

EvalBench is a flexible framework designed to measure the quality of generative AI (GenAI) workflows around database-specific tasks. It currently provides a comprehensive set of tools and modules for evaluating models on NL2SQL tasks, including the ability to run and score DQL, DML, and DDL queries across multiple supported databases. Its modular, plug-and-play architecture allows you to seamlessly integrate custom components while leveraging a robust evaluation pipeline, result storage, scoring strategies, and dashboarding capabilities.

---

## Getting Started

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/evalbench/blob/main/docs/examples/sqlite_example.ipynb)

Follow the steps below to run EvalBench on your local VM:

> *Note*: EvalBench requires Python 3.10 or higher.

### 1. Clone the Repository

Clone the EvalBench repository from GitHub:

```bash
git clone git@github.com:GoogleCloudPlatform/evalbench.git
```

### 2. Set Up a Virtual Environment

Navigate to the repository directory and create a virtual environment:

```bash
cd evalbench
python3 -m venv venv
source venv/bin/activate
```

### 3. Install Dependencies

Install the required Python dependencies:

```bash
pip install -r requirements.txt
```

Due to a proto conflict between google-cloud packages, you may need to force-install common-protos:

```bash
pip install --force-reinstall googleapis-common-protos==1.64.0
```

### 4. Configure GCP Authentication (For Vertex AI | Gemini Examples)

If gcloud is not already installed, follow the steps in the [gcloud installation guide](https://cloud.google.com/sdk/docs/install#installation_instructions). Then, authenticate using the Google Cloud CLI:

```bash
gcloud auth application-default login
```

This step sets up the necessary credentials for accessing Vertex AI resources in your GCP project. You can globally set your GCP project and region using:

```bash
export EVAL_GCP_PROJECT_ID=your_project_id_here
export EVAL_GCP_PROJECT_REGION=your_region_here
```

### 5. Set Your Evaluation Configuration

For a quick start, let's run NL2SQL on some SQLite DQL queries.

First, read through [example_run_config.yaml](/datasets/bat/example_run_config.yaml) to see the configuration settings we will be running. Now, configure your evaluation by setting the `EVAL_CONFIG` environment variable. For example, to run a configuration using the `db_blog` dataset on SQLite:

```bash
export EVAL_CONFIG=datasets/bat/example_run_config.yaml
```

### 6. Run EvalBench

Start the evaluation process using the provided shell script:

```bash
./evalbench/run.sh
```
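Before kicking off a run, a few optional pre-flight checks can catch setup problems early. These are plain shell commands rather than part of EvalBench itself, and the credential check only matters for the Vertex AI | Gemini examples:

```bash
# Optional pre-flight checks (standard tooling, not part of EvalBench)
python3 --version        # should report 3.10 or higher

# Confirm Application Default Credentials work (Vertex AI | Gemini examples only)
gcloud auth application-default print-access-token > /dev/null && echo "ADC OK"

# Confirm the run config referenced by EVAL_CONFIG exists
test -f "$EVAL_CONFIG" && echo "Run config found: $EVAL_CONFIG"
```

---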
## Overview

EvalBench's architecture is built around a modular design that supports diverse evaluation needs:

- **Modular and Plug-and-Play:** Easily integrate custom scoring modules, data processors, and dashboard components.
- **Flexible Evaluation Pipeline:** Seamlessly run DQL, DML, and DDL tasks while using a consistent base pipeline.
- **Result Storage and Reporting:** Store results in various formats (e.g., CSV, BigQuery) and visualize performance with built-in dashboards.
- **Customizability:** Configure and extend EvalBench to measure the performance of GenAI workflows tailored to your specific requirements.

EvalBench makes it quick to create experiments and A/B test improvements (available when the BigQuery reporting mode is set in the run config):

*(Screenshot: EvalBench reporting)*

This includes being able to measure and quantify improvements on specific databases or dialects:

*(Screenshot: EvalBench reporting by databases / dialects)*

It also lets you dig deeper into the exact details of improvements and regressions, including highlighting the changes, how they impacted the score, and an LLM-annotated explanation of the scoring changes when an LLM rater is used:

*(Screenshot: EvalBench reporting by databases / dialects)*
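As a concrete (if minimal) example of the customizability above: since `EVAL_CONFIG` is just a path, one way to start your own experiment is to copy the quickstart config and edit it. The file name below is arbitrary, and the available settings are covered in the run-config documentation linked in the next section:

```bash
# Sketch of starting a custom run config; "my_run_config.yaml" is an
# arbitrary name, and the schema is documented in docs/configs/run-config.md.
cp datasets/bat/example_run_config.yaml my_run_config.yaml
# ...edit my_run_config.yaml (e.g., dataset, reporting mode)...
export EVAL_CONFIG=my_run_config.yaml
./evalbench/run.sh
```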

A complete guide to EvalBench's available functionality can be found in the [run-config documentation](/docs/configs/run-config.md). Please explore the repository to learn more about customizing your evaluation workflows, integrating new metrics, and leveraging the full potential of EvalBench.

---

For additional documentation, examples, and support, please refer to the [EvalBench documentation](https://github.com/GoogleCloudPlatform/evalbench).

Enjoy evaluating your GenAI models!