# FrugalRAG

**A Retrieval-Augmented Generation approach for efficient multi-hop question answering**

[![arXiv](https://img.shields.io/badge/arXiv-2507.07634-b31b1b.svg)](https://arxiv.org/abs/2507.07634)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

> **Paper:** *FrugalRAG: Learning to retrieve and reason for multi-hop QA*

## Overview

Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. We argue instead that a viable path for RL in multi-hop QA is to use test-time scaling judiciously, optimizing both final-answer accuracy and the efficiency of reaching that answer.

We propose FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches that require 100× more data, our method achieves competitive performance with only 1,000 examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency–accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it generalizes zero-shot and surpasses SLM-based and other baselines. These results demonstrate that RL, used not to increase reasoning steps but to optimize them, is an effective solution for scalable, efficient RAG.

## Installation

### Prerequisites

- Python 3.10+
- CUDA-compatible GPU
- 16GB+ GPU memory (recommended)

### Environment Setup

We recommend using Conda for environment management:

```bash
git clone https://github.com/microsoft/FrugalRAG.git
cd FrugalRAG
conda env create -n frag --file environment.yaml
conda activate frag
pip install vllm==0.8.3 --no-deps
```

> See the [dataset setup](src/data/README.md) and [training guide](src/train/README.md) before running evaluation.

## Quick Start

### 1. Start the Language Model Server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --gpu-memory-utilization 0.70 \
    --tensor-parallel-size 1 \
    --port 7501
```
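The server exposes an OpenAI-compatible API, so a single chat request is enough to confirm it is up before starting the other services. The snippet below is a sanity-check sketch (ours, not part of the repository) and assumes the defaults from the command above:

```python
# Sanity check for the vLLM server from step 1 (illustrative, not part of the repo).
# Assumes the OpenAI-compatible endpoint is listening on localhost:7501.
import requests

resp = requests.post(
    "http://localhost:7501/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Reply with the word: ready"}],
        "max_tokens": 8,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```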
### 2. Start the Retrieval Backend

**Option A: ColBERT (Abstracts)**

```bash
CUDA_VISIBLE_DEVICES=1 PORT=8000 python -m src.search.serve_colbert \
    --index_root ../data/index/ \
    --index wiki17.nbits.local \
    --colbert_path ../data/colbertv2.0 \
    --collection_path ../data/wiki.abstracts.2017/collection.tsv
```

**Option B: ColBERT (Abstracts and Body)**

```bash
CUDA_VISIBLE_DEVICES=1 PORT=8000 python -m src.search.serve_colbert \
    --index_root ../data/index/ \
    --index wiki18.nbits.local \
    --colbert_path ../data/colbertv2.0 \
    --collection_path ../data/wiki.2018/collection.tsv
```

**Option C: E5**

```bash
# Convert the FAISS flat index to PyTorch shards for fast execution
python src/search/shard_embeddings.py

# Start the server
INDEX_DIR=../data/e5-base-v2/pytorch-shards/ \
E5_MODEL_NAME_OR_PATH="intfloat/e5-base-v2" \
TOP_K=5 \
uvicorn src.search.start_e5_server_main:app --port 8001
```

## Evaluation

### Running Evaluation

Ensure all required services are running, then run:

```bash
python -m src.evaluation.eval_mp \
    --model_name_or_path [MODEL_PATH] \
    --output_path [OUTPUT_PATH] \
    --prompt_path [PROMPT_PATH] \
    --answer_model [BASE_MODEL_NAME] \
    --port 7501 7502 \
    --search_port 8000 \
    --dataset_name [DATASET_NAME] \
    --input_file [DEV_FILE_PATH]

# Extract the final answer with a CoT prompt
python -m src.evaluation.eval_mp \
    --model_name_or_path [MODEL_PATH] \
    --output_path [OUTPUT_PATH] \
    --prompt_path [PROMPT_PATH] \
    --answer_model [BASE_MODEL_NAME] \
    --port 7501 7502 \
    --search_port 8000 \
    --dataset_name [DATASET_NAME] \
    --input_file [DEV_FILE_PATH] \
    --answer_only True
```

Run MBE (make sure the paths are set in the `grade_all.py` script):

```bash
python -m src.evaluation.grade_all
```

### Available Evaluation Metrics

The evaluation framework automatically computes:

- **Exact Match (EM)**: Binary accuracy for correct answers
- **Match**: Whether the gold answer appears in the generated answer
- **F1 Score**: Token-level overlap between predicted and gold answers (see the sketch below)
- **Cost Efficiency**: Retrieval operations per query
- **Recall/Support F1**: Retrieval performance
- **MBE**: LLM judge score
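EM and token-level F1 for QA are conventionally computed SQuAD-style over normalized answer strings. The sketch below illustrates that standard recipe; it is our own example, not the repository's exact implementation:

```python
# Illustrative SQuAD-style EM/F1 (not the repo's exact implementation).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    # Binary: normalized prediction must equal the normalized gold answer.
    return normalize(prediction) == normalize(gold)

def f1_score(prediction: str, gold: str) -> float:
    # Token-level overlap between the normalized prediction and gold answer.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # True
print(round(f1_score("Paris, France", "Paris"), 2))     # 0.67
```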
## Configuration

### Configuration Files

Configuration files are organized in the `configs/` directory:

```
configs/
├── create_data/              # Data generation configs
│   └── colbert/
│       ├── hotpot_qwen7b_finish.json
│       └── hotpot_qwen7b_nofinish.json
├── sft/                      # Supervised fine-tuning configs
│   └── colbert/
│       ├── hotpot_qwen7b_m5_0.90.json
│       └── hotpot_qwen7b_m5_nofinish.json
├── grpo/                     # GRPO reinforcement learning configs
│   └── colbert/
│       └── hotpot_qwen7b_m5_0.90.json
└── default_config.yaml       # Accelerate configuration
```

> Set the correct paths and port numbers in these files so the models run smoothly.

### Key Configuration Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `model_name_or_path` | Base model for training | `"Qwen/Qwen2.5-7B-Instruct"` |
| `search_port` | Retrieval server port | `8000` |
| `port` | Model server ports (reasoner, answer generator) | `[7501, 7502]` |
| `max_iters` | Maximum reasoning iterations | `5` |
| `ndocs` | Documents retrieved per iteration | `3` or `5` |
| `dataset_name` | Target dataset (`hotpot`, `2wiki`, `musique`) | `"hotpot"` |

---

## Troubleshooting

### Common Issues & Solutions

#### Import Errors in ColBERTv2

**Issue**: `ImportError: cannot import name 'AdamW' from 'transformers'`

**Solution**: Comment out the import in the relevant files; we use a newer `transformers` version, which no longer exports `AdamW`. If the optimizer is actually needed, `torch.optim.AdamW` is a near drop-in replacement.

#### Device Mismatch During SFT

**Issue**: `RuntimeError: Expected all tensors to be on the same device`

**Solution**: In the `transformers` package file `loss_utils.py`, line 38, use:

```python
if reduction == "sum":
    # num_items_in_batch may live on a different device (e.g., CPU);
    # move it to the loss's device before dividing.
    loss = loss / num_items_in_batch.to(loss.device)
return loss
```

#### NCCL Error

**Issue**:

```
Exception: Call to collective_rpc method failed: Weight update group already initialized. Call close_communicator first.
```

**Solution**: Rerun `trl vllm-serve`; it sometimes fails to call `close_communicator` on its own.

### Performance Optimization

- **Memory usage**: Adjust `--gpu-memory-utilization` to match your GPU memory
- **Training speed**: Use DeepSpeed ZeRO-3 for large-model training
- **Inference speed**: Use `--tensor-parallel-size` for multi-GPU inference

## Citation

If you use this repository, please cite our paper:

```bibtex
@misc{java2025frugalraglearningretrievereason,
      title={FrugalRAG: Learning to retrieve and reason for multi-hop QA},
      author={Abhinav Java and Srivathsan Koundinyan and Nagarajan Natarajan and Amit Sharma},
      year={2025},
      eprint={2507.07634},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07634},
}
```

## Acknowledgments

- [ColBERT](https://github.com/stanford-futuredata/ColBERT) for the efficient retrieval framework
- [DSPy](https://github.com/stanfordnlp/dspy) for the programming framework for language models
- [vLLM](https://github.com/vllm-project/vllm) for high-performance LLM inference