# Introduction

This repository contains the code for the paper "Minerva: A Programmable Memory Test Benchmark for Language Models" [(PDF)](https://arxiv.org/abs/2502.03358).

Minerva is a programmable benchmark for evaluating how effectively Large Language Models (LLMs) utilize their memory/context. The benchmark provides a structured way to assess a range of memory-related capabilities of LLMs.

## Test Categories

Minerva comprises six categories of memory tests:

- Search
- Recall and Edit
- Match and Compare
- Spot the Differences
- Compute on Sets and Lists
- Stateful Processing

It also includes composite tests that integrate multiple atomic skills to simulate real-world scenarios:

- Processing Data Blocks
- Theory of Mind

In total, Minerva consists of 21 distinct tasks spanning these categories.

## Benchmark Snapshot

A complete snapshot of the benchmark dataset used in the paper is available in the `resource/minerva_snapshot` directory.

## Programmability

Minerva is a fully programmable benchmark that allows researchers to customize and extend the test suite. Users can leverage the provided code to generate new test samples with varying parameters, enabling more thorough and tailored evaluations of LLM memory capabilities (see the sketch under "Example: Consuming Generated Tests" below).

# Quick Start

## Generate Tests

To generate new memory test data:

```bash
# Generate all tests
python src/generate_test.py --output_dir ./memory_tests

# Generate tests for a specific category
python src/generate_test.py --output_dir ./memory_tests --task_category recall_and_edit

# Generate a specific test
python src/generate_test.py --output_dir ./memory_tests --task_name snapshot_unique_words

# List all available tasks
python src/generate_test.py --list-tasks
```

## Run Evaluation

To evaluate an LLM on the memory tests, we provide a sample script that calls the model through the Azure OpenAI API. First set up your Azure credentials in `src/azure_api_config.yaml`.

```bash
# Run all tests with a specific model
python src/run_test.py --task_dir ./memory_tests --result_dir ./results --model_name gpt-4o --llm_aip_config src/azure_api_config.yaml

# Run a specific category
python src/run_test.py --task_dir ./memory_tests --result_dir ./results --task_category search

# Run a specific test
python src/run_test.py --task_dir ./memory_tests --result_dir ./results --task_name string_search_word
```

# Citation

If you use Minerva in your research, please cite:

```
@inproceedings{xia2025minerva,
  title={Minerva: A Programmable Memory Test Benchmark for Language Models},
  author={Menglin Xia and Victor R{\"u}hle and Saravan Rajmohan and Reza Shokri},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=ib9drlZllP}
}
```

# License

The project code is licensed under the MIT License; for complete terms, please refer to the LICENSE file. The Minerva benchmark snapshot data is synthetically generated by the Minerva program and is licensed under the Community Data License Agreement (CDLA-2.0).
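# Example: Consuming Generated Tests

The snippet below is a minimal sketch of how the files produced by `src/generate_test.py` might be fed to a custom model outside of `src/run_test.py`. It is not part of the repository: the JSON layout, the field names `prompt`/`input`/`answer`, and the `my_model` function are assumptions made for illustration; check the actual schema emitted by the generator before adapting it.

```python
"""Sketch: iterating over generated Minerva test files with a custom model."""
import json
from pathlib import Path

TEST_DIR = Path("./memory_tests")  # directory passed to generate_test.py --output_dir


def my_model(prompt: str) -> str:
    """Hypothetical placeholder for an arbitrary LLM call."""
    return "model response"


def iter_samples(test_dir: Path):
    """Yield (file_stem, sample_dict) pairs from every JSON file found under test_dir."""
    for path in sorted(test_dir.rglob("*.json")):
        with path.open() as f:
            data = json.load(f)
        samples = data if isinstance(data, list) else [data]
        for sample in samples:
            yield path.stem, sample


for task_name, sample in iter_samples(TEST_DIR):
    if not isinstance(sample, dict):
        continue
    # Field names below are assumed, not taken from the repository.
    prompt = sample.get("prompt") or sample.get("input")
    if prompt is None:
        continue
    prediction = my_model(prompt)
    print(f"[{task_name}] reference={sample.get('answer')!r} prediction={prediction!r}")
```

For the evaluation workflow used in the paper, prefer `src/run_test.py` as shown in the Quick Start above.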
# Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.