# SecRL

**Repository Path**: mirrors_microsoft/SecRL

## Basic Information

- **Project Name**: SecRL
- **Description**: Benchmarking LLM agents on Cyber Threat Investigation.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2025-08-22
- **Last Updated**: 2026-04-11

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2507.14201)
[![Hugging Face](https://img.shields.io/badge/Dataset-orange?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/anandmudgerikar/excytin-bench)
[![Blog](https://img.shields.io/badge/Blog-5C2D91?style=for-the-badge&logo=rss&logoColor=white)](https://www.microsoft.com/en-us/security/blog/2025/10/14/microsoft-raises-the-bar-a-smarter-way-to-measure-ai-for-cybersecurity/)

## 🎉 News

- **[2025/11/25]**: We updated the evaluation chart with Claude Opus-4.5 and GPT-5.1!
- **[2025/11/04]**: We updated the evaluation chart with Claude Sonnet-4.5 and Haiku-4.5. These are base-model results with default parameters; we will share results with [extended thinking](https://docs.claude.com/en/docs/build-with-claude/extended-thinking) and [larger context windows](https://docs.claude.com/en/docs/build-with-claude/context-windows#1m-token-context-window) soon!
- **[2025/10/28]**: Check out [open-source research](https://www.arc.computer/blog/inference-time-continual-learning) applying in-context continual-learning methods to the ExCyTIn framework, achieving significant cost reductions and enabling cross-incident knowledge transfer.
- **[2025/10/14]**: Check out our latest [blog post](https://www.microsoft.com/en-us/security/blog/2025/10/14/microsoft-raises-the-bar-a-smarter-way-to-measure-ai-for-cybersecurity/)!
- **[2025/10/05]**: We updated the evaluation chart with Qwen-235B and Grok-4 (the GPT-5 family is also updated)!

-----

We present the first benchmark to test LLM-based agents on threat hunting in the form of security question-answering pairs. The environment consists of two main components:

1. A MySQL database that an agent can interact with to retrieve information.
2. A set of generated questions and answers for testing, available in the `secgym/questions/tests` folder or from [Hugging Face](https://huggingface.co/datasets/anandmudgerikar/excytin-bench).

![ExCyTIn-Bench](./secgym/assets/overview.png)

## 🛠️ Environment Setup

1. Download the database from Hugging Face. Please download `data_anonymized.tar.gz` from this [link](https://huggingface.co/datasets/anandmudgerikar/excytin-bench) and put the folder `data_anonymized` under `secgym/database/`.

2. We use a MySQL Docker container for the database. Please first install Docker Desktop and docker-compose, then pull the MySQL image:

   ```bash
   docker pull mysql:9.0
   ```

3. Make sure Docker Desktop is running, then run the following command to set up the MySQL containers for 8 different databases:

   ```bash
   bash scripts/setup_docker.sh
   ```

   It will run the command `python secgym/database/setup_database.py --csv --port --sql_file --container_name ` for 8 different incidents, creating 8 different containers. Note that these containers are bound to the CSV files in the `data_anonymized` folder. This will take up 10GB of disk space; check the volumes with `docker system df -v`. To set up Docker for a single database that contains all the data (all 8 attacks), please uncomment the first command in `setup_docker.sh`. Note that this will take up 33GB of disk space.
4. Set up the environment using conda or venv with Python 3.11 and install the requirements with `pip install -e . --use-pep517`. The following is an example using conda:

   ```bash
   conda create -n excytin python=3.11
   conda activate excytin
   pip install -e . --use-pep517
   ```

   If you run into consistent installation errors (which may be caused by updated versions of some packages), you can try installing the requirements with `pip install -r requirements_freeze.txt`, which is the frozen version of the requirements.

5. LLM setup. We use [AG2](https://docs.ag2.ai/latest/) for API calling. Set up your API key in the `secgym/myconfig.py` file. You can follow the instructions [here](https://autogen-ai.github.io/autogen/docs/notebooks/autogen_uniformed_api_calling#config-list-setup).

## 🏃‍♂️ Runs

1. Run the baseline. `--trial_run` will run only 2 questions from 1 incident for testing purposes. The results will be saved in the `experiments/final_results` folder.

   ```bash
   python experiments/run_exp.py --trial_run
   ```

## 🤖 Question Generation Process

All the questions are generated from graphs constructed from the database. The generation process is as follows:

1. The `SecurityIncident` and `SecurityAlert` logs are used to construct a graph for each incident; check out this [notebook](notebooks/extract_construct_graph.ipynb) for more details.
2. We run a train-test split on the constructed graph. Run the [question_split.ipynb](notebooks/question_split.ipynb) notebook to get the split (saved to `experiments/split_files`). The train and test sets are split based on a proposed path relevance score.
3. We use an LLM to generate questions based on the constructed graph. We already provide questions generated with OpenAI o1 for the 8 different incidents in the `secgym/questions/tests` folder.
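The graph construction in step 1 above can be sketched roughly as follows. This is a minimal, stdlib-only illustration: the record fields (`name`, `entities`) and the bipartite alert-entity layout are assumptions for demonstration, not the repo's actual schema — see the linked notebook for the real construction.

```python
from collections import defaultdict

# Toy records in the spirit of SecurityAlert logs; the "name" and
# "entities" fields are illustrative assumptions, not the real schema.
alerts = [
    {"name": "Suspicious sign-in", "entities": ["user:alice", "ip:10.0.0.5"]},
    {"name": "Malware detected", "entities": ["host:vm-01", "user:alice"]},
]

def build_graph(alerts):
    """Build a bipartite adjacency map linking alert nodes and entity nodes."""
    adj = defaultdict(set)
    for alert in alerts:
        node = "alert:" + alert["name"]
        for entity in alert["entities"]:
            adj[node].add(entity)   # alert -> entity edge
            adj[entity].add(node)   # entity -> alert edge (undirected graph)
    return adj

graph = build_graph(alerts)
# Shared entities connect alerts: "user:alice" appears in both records,
# linking the sign-in and the malware alert into one incident graph.
print(sorted(graph["user:alice"]))
# → ['alert:Malware detected', 'alert:Suspicious sign-in']
```

Paths through such a graph (alert → entity → alert) are what the question generation and the path relevance score in step 2 operate over.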
If you want to rerun the question generation process, please use the following command:

```bash
python experiments/run_qa_gen.py --model gpt-4.1 --solution_model gpt-4.1 --relevant_type low_split --qa_path secgym/qagen/graph_files
```

Note that this script uses `gpt-4.1` for question and solution generation. After all the questions are generated, you should see new files in the `secgym/questions` folder like `incident_i_qa.json`, where `i` is the incident number.

**Note**: All results from the paper use the questions in the `secgym/questions/o1/test` folder. The train questions under `secgym/questions/o1/train` are only partial and are used by Expel to collect new rules. Please use the question-answer pairs from `secgym/questions/o1/test` for benchmarking against the results shown in the paper.

We highly recommend using the latest models to generate the question-answer dataset yourselves before running any hill-climbing training experiments, so as to minimize noise and bias during training. Currently, the latest question-answer pairs are generated using OpenAI o3 with low-correlation paths and can be found in `secgym/questions/o3/`.

## 📊 Results

Below are the evaluation results of the LLM agents on the test questions. We set temperature = 0 and max_step = 25. GPT-4o is used for evaluation. The full evaluation logs for the latest models can be found under the `latest_experiments` folder. The full evaluation logs for older models can be downloaded from [this link](https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/ykw5399_psu_edu/EXOMtXyFSRNGvKsLZGPIAfwBZhkKMr11oROccOydbWyioA?e=XzpQLa). They can also be found under this [branch](https://github.com/microsoft/SecRL/tree/before_cleanup_all_history) in the `final_results` folder (along with the original code).
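If you want to consume the generated QA files above in your own evaluation harness, loading them might look like the sketch below. This is a hedged example: the `question`/`answer` keys are an assumed schema, so adjust it to match the actual files under `secgym/questions/`.

```python
import json
import os
import tempfile

def load_qa_pairs(path):
    """Load a QA JSON file and return a list of (question, answer) tuples.

    Assumes each record has "question" and "answer" keys; adapt to the
    real schema of the files under secgym/questions/.
    """
    with open(path) as f:
        records = json.load(f)
    return [(r["question"], r["answer"]) for r in records]

# Self-contained demo using a throwaway file instead of a real QA file:
sample = [{"question": "Which host triggered the alert?", "answer": "vm-01"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    tmp_path = f.name
pairs = load_qa_pairs(tmp_path)
os.unlink(tmp_path)
print(pairs)  # → [('Which host triggered the alert?', 'vm-01')]
```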
![ExCyTIn-Bench](./secgym/assets/updated_eval_results_12_25.png)

## 📝 Citation

If you find this work useful, please cite our paper:

```bibtex
@article{wu2025excytin,
  title={ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation},
  author={Wu, Yiran and Velazco, Mauricio and Zhao, Andrew and Luj{\'a}n, Manuel Ra{\'u}l Mel{\'e}ndez and Movva, Srisuma and Roy, Yogesh K and Nguyen, Quang and Rodriguez, Roberto and Wu, Qingyun and Albada, Michael and others},
  journal={arXiv preprint arXiv:2507.14201},
  year={2025}
}
```