# EfficientRAG
**Repository Path**: devine/EfficientRAG
## Basic Information
- **Project Name**: EfficientRAG
- **Description**: https://github.com/microsoft/EfficientRAG
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-17
- **Last Updated**: 2025-03-17
## Categories & Tags
**Categories**: Uncategorized
**Tags**: RAG
## README
Official code repository for the EMNLP 2024 paper **EfficientRAG: Efficient Retriever for Multi-Hop Question Answering**.
EfficientRAG is a framework that trains a Labeler and a Filter to perform multi-hop retrieval-augmented generation (RAG) without calling an LLM at each retrieval iteration.
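At a high level (per the paper), the Labeler annotates useful tokens in each retrieved chunk and tags whether another hop is needed, while the Filter composes the next-hop query from those tokens, so only the final answer generation requires an LLM. The sketch below is a minimal conceptual illustration; all function names, tag strings, and the control flow are placeholders, not the repository's actual API.
```python
# Conceptual sketch of the EfficientRAG loop; every name below is illustrative.
from typing import List, Tuple

def retrieve(query: str, topk: int = 10) -> List[str]:
    # Placeholder: dense retrieval (e.g. Contriever) over the passage corpus.
    return [f"(retrieved chunk for: {query})"]

def label(query: str, chunk: str) -> Tuple[str, List[str]]:
    # Placeholder: the Labeler tags the chunk (<CONTINUE> / <TERMINATE>, names
    # illustrative) and annotates its useful tokens.
    return "<TERMINATE>", []

def next_query(query: str, tokens: List[str]) -> str:
    # Placeholder: the Filter composes the next-hop query from labeled tokens.
    return " ".join([query, *tokens])

def efficient_rag(question: str, max_hops: int = 3) -> List[str]:
    query, evidence = question, []
    for _ in range(max_hops):
        chunks = retrieve(query)
        labeled = [label(query, c) for c in chunks]   # small models, no LLM call
        evidence.extend(chunks)
        useful = [t for tag, toks in labeled if tag == "<CONTINUE>" for t in toks]
        if not useful:                                # every chunk says terminate
            break
        query = next_query(query, useful)             # next-hop query, still no LLM call
    return evidence                                   # handed to the generator once
```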
## Updates
* 2024-09-12: open-sourced the code
## Setup
You can now download our synthesized data from this [link](https://box.nju.edu.cn/f/a86b512077c7489b8da3/).
You should unzip the `EfficientRAG.zip` file and place all the data under the `data` directory.
Within this directory, the `negative_sampling_extracted` folder contains our final synthesized data, which is referenced in [2.4 Negative Sampling](https://github.com/NIL-zhuang/EfficientRAG-official?tab=readme-ov-file#24-negative-sampling).
Additionally, the `efficient_rag` directory includes two folders: `labeler` and `filter`, which store the training data constructed for the model, as referenced in [2.5 Training Data](https://github.com/NIL-zhuang/EfficientRAG-official?tab=readme-ov-file#25-training-data).
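Assuming the archive unpacks as described, the relevant parts of the `data` directory should look roughly like this:
```
data/
├── negative_sampling_extracted/   # final synthesized data (see 2.4 Negative Sampling)
└── efficient_rag/
    ├── labeler/                   # Labeler training data (see 2.5 Training Data)
    └── filter/                    # Filter training data (see 2.5 Training Data)
```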
### 1. Installation
First install PyTorch >= 2.1.0, then install the dependent Python libraries by running:
```bash
pip install -r requirements.txt
```
Alternatively, you can create a conda environment with Python >= 3.9:
```bash
# "efficientrag" is a placeholder environment name; use any name you like
conda create -n efficientrag python=3.9 pip
conda activate efficientrag
pip install -r requirements.txt
```
### Preparation
1. Download the datasets from [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa), [2WikiMQA](https://github.com/Alab-NII/2wikimultihop) and [MuSiQue](https://huggingface.co/datasets/dgslibisey/MuSiQue). Separate them into train, dev, and test sets, then put them under `data/dataset`.
2. Download the retriever model [Contriever](https://huggingface.co/facebook/contriever-msmarco) and the base model [DeBERTa](https://huggingface.co/microsoft/deberta-v3-large), and put them under `model_cache`.
3. Prepare the corpus by extracting the documents and constructing embeddings:
```bash
python src/retrievers/multihop_data_extractor.py --dataset hotpotQA
```
```bash
python src/retrievers/passage_embedder.py \
--passages data/corpus/hotpotQA/corpus.jsonl \
--output_dir data/corpus/hotpotQA/contriever \
--model_type contriever
```
4. Deploy [LLaMA-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) with the [vLLM](https://github.com/vllm-project/vllm) framework, and configure it in `src/language_models/llama.py` (see the client sketch below).
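For reference, vLLM exposes an OpenAI-compatible server, so `src/language_models/llama.py` can be pointed at a local endpoint. The snippet below is a hedged sketch of such a client; the port, base URL, and the wiring into `llama.py` are assumptions, not the repository's actual configuration.
```python
# Hypothetical client for a locally deployed vLLM endpoint (port/base URL assumed).
# The server can be launched, for example, with:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def chat(prompt: str) -> str:
    # Single chat completion against the locally served LLaMA-3-70B-Instruct model.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(chat("Return OK if you can read this."))
```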
### 2. Training Data Construction
We use the hotpotQA training set as an example below; 2WikiMQA and MuSiQue can be constructed in the same way (a driver sketch that loops over all three datasets follows step 2.5).
#### 2.1 Query Decompose
```bash
python src/data_synthesize/query_decompose.py \
--dataset hotpotQA \
--split train \
--model llama3
```
#### 2.2 Token Labeling
```bash
python src/data_synthesize/token_labeling.py \
--dataset hotpotQA \
--split train \
--model llama3
```
```bash
python src/data_synthesize/token_extraction.py \
--data_path data/synthesized_token_labeling/hotpotQA/train.jsonl \
--save_path data/token_extracted/hotpotQA/train.jsonl \
--verbose
```
#### 2.3 Next Query Filtering
```bash
python src/data_synthesize/next_hop_query_construction.py \
--dataset hotpotQA \
--split train \
--model llama
```
```bash
python src/data_synthesize/next_hop_query_filtering.py \
--data_path data/synthesized_next_query/hotpotQA/train.jsonl \
--save_path data/next_query_extracted/hotpotQA/train.jsonl \
--verbose
```
#### 2.4 Negative Sampling
```bash
python src/data_synthesize/negative_sampling.py \
--dataset hotpotQA \
--split train \
--retriever contriever
```
```bash
python src/data_synthesize/negative_sampling_labeled.py \
--dataset hotpotQA \
--split train \
--model llama
```
```bash
python src/data_synthesize/negative_token_extraction.py \
--dataset hotpotQA \
--split train \
--verbose
```
#### 2.5 Training Data
```bash
python src/data_synthesize/training_data_synthesize.py \
--dataset hotpotQA \
--split train
```
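Since every step above takes the dataset and split as flags, the whole synthesis pipeline can be scripted across datasets. The driver below is only a convenience sketch: the dataset identifiers for 2WikiMQA/MuSiQue and the intermediate file paths for non-hotpotQA datasets are assumptions that follow the hotpotQA pattern shown above.
```python
import subprocess

# Dataset identifiers and the intermediate file layout for non-hotpotQA datasets are
# assumptions; check each script's --dataset choices and output paths before running.
DATASETS = ["hotpotQA", "2WikiMQA", "musique"]
SPLIT = "train"

def commands(ds: str):
    # Scripts and flags are copied from the commands in this section, in order.
    return [
        ["src/data_synthesize/query_decompose.py", "--dataset", ds, "--split", SPLIT, "--model", "llama3"],
        ["src/data_synthesize/token_labeling.py", "--dataset", ds, "--split", SPLIT, "--model", "llama3"],
        ["src/data_synthesize/token_extraction.py",
         "--data_path", f"data/synthesized_token_labeling/{ds}/{SPLIT}.jsonl",
         "--save_path", f"data/token_extracted/{ds}/{SPLIT}.jsonl"],
        ["src/data_synthesize/next_hop_query_construction.py", "--dataset", ds, "--split", SPLIT, "--model", "llama"],
        ["src/data_synthesize/next_hop_query_filtering.py",
         "--data_path", f"data/synthesized_next_query/{ds}/{SPLIT}.jsonl",
         "--save_path", f"data/next_query_extracted/{ds}/{SPLIT}.jsonl"],
        ["src/data_synthesize/negative_sampling.py", "--dataset", ds, "--split", SPLIT, "--retriever", "contriever"],
        ["src/data_synthesize/negative_sampling_labeled.py", "--dataset", ds, "--split", SPLIT, "--model", "llama"],
        ["src/data_synthesize/negative_token_extraction.py", "--dataset", ds, "--split", SPLIT],
        ["src/data_synthesize/training_data_synthesize.py", "--dataset", ds, "--split", SPLIT],
    ]

for dataset in DATASETS:
    for cmd in commands(dataset):
        subprocess.run(["python", *cmd], check=True)
```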
## Training
Train the Filter model:
```bash
python src/efficient_rag/filter_training.py \
--dataset hotpotQA \
--save_path saved_models/filter
```
Train the Labeler model:
```bash
python src/efficient_rag/labeler_training.py \
--dataset hotpotQA \
--tags 2
```
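For a rough idea of what the Labeler operates on, the sketch below loads the DeBERTa-v3 base model from `model_cache` as a two-tag token classifier (matching `--tags 2`). This mirrors the presumed architecture only; the folder name inside `model_cache` and the repository's actual checkpoint format are assumptions.
```python
# Sketch: a DeBERTa-v3 token classifier with two tag types, as the Labeler is presumed
# to be; the path below and the checkpoint layout are assumptions, not the repo's API.
from transformers import AutoModelForTokenClassification, AutoTokenizer

BASE = "model_cache/deberta-v3-large"  # assumed location from the Preparation step
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForTokenClassification.from_pretrained(BASE, num_labels=2)

inputs = tokenizer("Who directed the film that won Best Picture in 1998?",
                   return_tensors="pt")
logits = model(**inputs).logits   # shape (1, seq_len, 2): one score per tag per token
print(logits.argmax(dim=-1))      # per-token tag ids (head is untrained here)
```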
## Inference
Run the EfficientRAG retrieval procedure:
```bash
python src/efficientrag_retrieve.py \
--dataset hotpotQA \
--retriever contriever \
--labels 2 \
--labeler_ckpt <> \
--filter_ckpt <> \
--topk 10
```
Use LLaMA-3-8B-Instruct as the generator:
```bash
python src/efficientrag_qa.py \
--fpath <> \
--model llama-8B \
--dataset hotpotQA
```
## Citation
If you find this paper or the code useful, please cite:
```bibtex
@inproceedings{zhuang2024efficientrag,
  title={EfficientRAG: Efficient Retriever for Multi-Hop Question Answering},
  author={Zhuang, Ziyuan and Zhang, Zhiyang and Cheng, Sitao and Yang, Fangkai and Liu, Jia and Huang, Shujian and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and Zhang, Qi},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={3392--3411},
  year={2024}
}
```