# ml-agent-evaluator

Software accompanying the research paper, [Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?](https://arxiv.org/abs/2507.17015). This project explores the use of agent-style capabilities (such as tool use) to improve LLM-as-a-judge systems. The basic setup is illustrated below.

illustration

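To make the setup concrete, here is a dependency-free toy sketch of the idea (this is *not* the project's API — `toy_pairwise_judge`, `fact_check`, and `KNOWN_FACTS` are purely illustrative names): a pairwise judge that consults an external fact-checking tool before picking a response.

```python
# Toy illustration of a tool-augmented pairwise judge (NOT the ageval API).
from typing import Callable


def toy_pairwise_judge(
    text_a: str,
    text_b: str,
    fact_check: Callable[[str], bool],
) -> dict:
    """Prefer the response whose claims the external tool can verify."""
    a_ok, b_ok = fact_check(text_a), fact_check(text_b)
    if a_ok and not b_ok:
        preferred = "text_a"
    elif b_ok and not a_ok:
        preferred = "text_b"
    else:
        preferred = "tie"  # tool inconclusive; a real system falls back to the plain judge
    return {"preferred_text": preferred, "fact_check": {"a": a_ok, "b": b_ok}}


# Stand-in "tool": a tiny knowledge base instead of a real web search.
KNOWN_FACTS = {"Apple Intelligence is a real product."}

result = toy_pairwise_judge(
    "Apple Intelligence is a real product.",
    "Apple never announced Apple Intelligence.",
    fact_check=lambda t: t in KNOWN_FACTS,
)
print(result["preferred_text"])  # -> text_a
```

The real evaluator replaces the lookup with live tools (web search, fact checking, code execution) and an LLM to aggregate their evidence, as described in the paper.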
## Installation

### Standard installation

1. Clone the repo and the submodules:
   ```
   git clone --recurse-submodules git@github.com:apple/ml-agent-evaluator.git
   ```
2. Install the package:
   ```
   pip install -e ".[api_models,app]"
   ```
3. Prepare the datasets (downloads the remote dataset and merges it with checked-in labels):
   ```
   python data/prepare.py -m use
   ```

**[OPTIONAL] Steps for running experiment scripts**

4. Add API secrets to an `ageval_secrets.toml` file at the root of the cloned repo, including or excluding keys as necessary:
   ```
   OPENAI_API_KEY=""
   ANTHROPIC_API_KEY=""
   TORI_TASK_ID=""
   ```
   For some tools, additional secrets are expected:
   - fact_checking
     ```
     RAPID_API_KEY=""
     ```

If you want to submit a PR, run `python data/prepare.py -m git` to move the data back to its hashed versions.

## Usage

### Python evaluator

A simple example using the main agent evaluator:

```python
import ageval.evaluators.agent

agent_evaluator = ageval.evaluators.agent.PairwiseEvaluatorAgent()
result = agent_evaluator.evaluate(
    text_a="Apple intelligence is a real product by Apple.",
    text_b="Apple never announced a product called Apple Intelligence.",
)
print(f"Preferred text: {result['preferred_text']}")
print(f"Reasoning: {result['overall_result']['reasoning']}")
```

> ```
> Preferred text: text_a
> Reasoning: Based on the fact-check results, Text A is the better text
> as it accurately states that Apple Intelligence is a real product introduced
> by Apple, supported by official announcements and discussions. Text B, on
> the other hand, falsely claims that Apple never announced a product called
> Apple Intelligence, which is contradicted by search results showing multiple
> references to Apple Intelligence.
> ```

Other available evaluators include:

```python
# Basic LLM-as-a-judge evaluators (without tools), prompted to pick the best response
import ageval.evaluators.basic

ageval.evaluators.basic.BasicPairwiseEvaluator()
ageval.evaluators.basic.BasicSingleEvaluator()

# Evaluators based on the original SAFE implementation
# IMPORTANT: see the installation section for special requirements
import ageval.evaluators.safe

ageval.evaluators.safe.SafePairwiseEvaluator()
ageval.evaluators.safe.SafeSingleEvaluator()
```

### Running experiments via command line

> To test your setup, re-run the experiments and reproduce the numbers in the [demo experiment notebook](./notebooks/0001_demo_experiment.ipynb).

> Important: all commands should be run at the root of the repo to ensure all data paths resolve correctly.

Run the main experiments using the command:

```
ageval
```

For example, you can run a short demo experiment comparing agent vs baseline results on TruthfulQA data:

```
ageval -m -cd "./exp/configs/0001_truthful_qa_openai_test"
```

In general, we recommend using `yaml` configs to configure experiments (instead of relying solely on command-line arguments). See the full list of [pre-set configs here](exp/configs). To run an experiment with a yaml config, use:

```
ageval -cd
```

For a full set of experiment parameters, add `--help` to your run command. You can also run:

```
ageval -cd exp/configs/default --help
```

> Note: you must set *some* config as above, otherwise the help command won't work correctly (due to changes in the Hydra naming conventions and how it accesses the default config).

### Additional tricks

To run the process in the background:

```
nohup ageval &
```

For debugging, change the `log_level` parameter to `DEBUG`:

```
ageval log_level="DEBUG" ...
```

By default, experiments are logged to Weights & Biases (wandb). To avoid this, disable wandb with the corresponding option:

```
ageval wandb.disabled=true
```

### Using app to inspect annotator results

illustration

This project comes with a simple Gradio app to help interpret and debug annotator outputs. To start the app, simply run:

```
ageval-app
```

For development, run the app in dev mode (to enable auto-reload on code changes):

```
gradio src/ageval/analysis/app.py
```

### Running original SAFE experiments

To run the experiments, use:

```
conda activate ./env_safe
ageval-safe-original --help
```

This will give you all the options for running the experiment.

## Development

Install the development dependencies (along with all other dependencies):

```
pip install -e ".[api_models,app,dev]"
```

Add pre-commit hooks (black formatter) to avoid formatting conflicts:

```
pre-commit install
```

To run the tests, use (in the root dir):

```
RAPID_API_KEY= OPENAI_API_KEY= pytest src/
```

To skip tests that require API keys to be set, run:

```
pytest src/ -m "not api"
```

## FAQ

#### I get a "Broken Pipe" error from wandb when running multirun experiments?

Try using the [joblib launcher](https://hydra.cc/docs/plugins/joblib_launcher/); this seems to have helped in the past.

*More questions? Create a GitHub issue!*

## Special installation variants

Additional installation steps for special use cases (i.e. using the original SAFE implementation).

### Special installation variant 1: for running SAFE experiments

Running the original search-augmented factuality evaluation (SAFE) implementation requires some more manual installation steps:

1. Since the original SAFE implementation has very specific requirements, run the following commands to set up the SAFE dependencies in a separate environment:
   ```
   conda create -p ./env_safe python=3.10
   conda activate ./env_safe
   pip install -r third_party/long_form_factuality/requirements.txt
   pip install -e .[safe]
   ```
2. To be able to run SAFE with more recent OpenAI models, add the model (e.g. `("gpt-4o-2024-05-13", 1)`) to the `SUPPORTED_MODELS_AND_SETTINGS` variable in `env_safe/lib/python3.10/site-packages/langfun/core/llms/openai.py`.
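The exact shape of `SUPPORTED_MODELS_AND_SETTINGS` depends on the installed langfun version, but going by the tuple in step 2, the edit might look roughly like the sketch below. This is illustrative only: the existing entry shown is a placeholder, and you should match whatever structure the file actually uses.

```python
# Illustrative only: assumes SUPPORTED_MODELS_AND_SETTINGS is a collection of
# (model_name, setting) pairs, as suggested by the tuple in step 2 above.
SUPPORTED_MODELS_AND_SETTINGS = [
    ("gpt-4", 1),  # placeholder for the entries already present in the file
]

# Add the newer model so SAFE can call it.
SUPPORTED_MODELS_AND_SETTINGS.append(("gpt-4o-2024-05-13", 1))
```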
## License

All files under the `data` directory with extensions "csv", "txt", "json" and "jsonl", as provided in the github.com/apple/ml-agent-evaluator repository, are released under CC-BY-NC-ND, given in the LICENSE_DATA file. All other files are released under the license given in the LICENSE file.