# LLaVA-Rad **Repository Path**: mirrors_microsoft/LLaVA-Rad ## Basic Information - **Project Name**: LLaVA-Rad - **Description**: Official implementation of LLaVa-Rad, a small multimodal model for chest X-ray findings generation. - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-02-22 - **Last Updated**: 2026-03-07 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # 🏥 LLaVA-Rad [![Code License](https://img.shields.io/badge/Code%20License-Microsoft%20Research-red)](LICENSE) [![Data](https://img.shields.io/badge/PhysioNet-Data-228B22)](https://physionet.org/content/llava-rad-mimic-cxr-annotation/1.0.0/) [![🤗](https://img.shields.io/badge/🤗-Model-FFA500)](https://huggingface.co/microsoft/llava-rad/) [![Eval](https://img.shields.io/badge/eval-CheXprompt-purple)](https://github.com/microsoft/chexprompt/) [![Preprint](https://img.shields.io/badge/arXiv-Preprint-blue)](https://arxiv.org/abs/2403.08002) [![Peer Reviewed Paper](https://img.shields.io/badge/Peer%20Reviewed%20Paper-Nature%20Communications-cyan)](https://doi.org/10.1038/s41467-025-58344-x) ## Introduction Official implementation of LLaVa-Rad, introduced in ["A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings"](https://doi.org/10.1038/s41467-025-58344-x). LLaVA-Rad can take in as input a frontal chest X-ray and optionally a reason for exam and will output the corresponding findings. **Note:** if you are interested in radiologist aligned evaluation of generated reports, we recommend you use the [CheXprompt](https://github.com/microsoft/chexprompt) codebase. ## Contents - [Introduction](#introduction) - [Requirements](#requirements) - [Installation](#installation) - [Train](#train) - [0. Preparation](#0-preparation) - [1. Pretrain (Alignment)](#1-pretrain-alignment) - [2. Fine-tuning (LoRA)](#2-fine-tuning-lora) - [Inference](#inference) - [Evaluation](#evaluation) - [Citation](#citation) - [License and Usage Notices](#license-and-usage-notices) - [Acknowledgements](#acknowledgements) ## Requirements We trained and tested LLaVA-Rad using Python 3.10. For optimal inference, we recommend using a GPU environment. LLaVA-Rad has been tested on NVIDIA V100 and A100 GPUs with CUDA 11.x (or newer) drivers, on recent versions of Ubuntu. ## Installation Follow these steps to set up LLaVA-Rad: 1. Clone the repository and navigate to the project folder: ```Shell git clone https://github.com/microsoft/LLaVA-Rad.git cd LLaVA-Rad ``` 2. Create and activate a virtual environment (Python 3.10): ```Shell conda create -n llavarad python=3.10 -y conda activate llavarad ``` 3. Upgrade pip and install the package: ```Shell pip install --upgrade pip # enable PEP 660 support pip install -e . ``` 4. [Optional] Install additional dependencies for training: ```bash pip install ninja pip install flash-attn --no-build-isolation ``` ## Train When starting from scratch, the following checkpoints are needed: - A pre-trained LM checkpoint, e.g., [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) - By default, we use a customized domain-specific ViT, BiomedCLIP-CXR. See [README.md](./llava/model/multimodal_encoder/open_clip_encoder/README.md) for details. ### 0. Preparation Before running the commands below, you need to have the data, image folder, and the above checkpoints ready. **0.1 Data** To download the data, sign the data use agreement and follow the instructions for download at [LLaVA-Rad MIMIC-CXR Annotations on PhysioNet](https://physionet.org/content/llava-rad-mimic-cxr-annotation/1.0.0/). This will include reports with extracted sections in LLaVA format, split into train/dev/test. **0.2 Images** You need to download the [MIMIC-CXR-JPG images from PhysioNet](https://physionet.org/content/mimic-cxr-jpg/2.0.0/) by signing the data use agreement and following the instructions. **0.3 Model weights** You can find the pretrained model weights for BiomedCLIP-CXR and LLaVA-Rad at https://huggingface.co/microsoft/llava-rad. **Notes before proceeding:** - Change the paths in the scripts below according to where you downloaded the data. - Batch size is set for 4-GPU machines. If your machine has a difference number of GPUs, please change batch size. Training commands have been tested on a single 80GB A100 and 4x80GB H100, using torch 2.4.1 and cuda 11.8 with flash attention 2.7.2.post1. ### 1. Pretrain (Alignment) At this stage, we only train the projection layer (which aligns the vision features with text features). The vision encoder and LLM are all frozen. ```bash bash scripts/pretrain.sh ``` We get a pretrained projector `mm_projector.bin` after pretraining. ### 2. Fine-tuning (LoRA) Once we have a pretrained projector, we can do fine-tuning. The command below fine-tunes the projector and LoRA of LLM: ```bash bash scripts/finetune_lora.sh ``` ## Inference Before running the command below, you need to change the script accordingly. ```bash bash scripts/eval.sh ``` **Note:** To reproduce the evaluation results from the manuscript on the MIMIC-CXR dataset, changing the script means uncommenting and updating the paths for `query_file` and `image_folder`. In the manuscript, the Open-I and CheXpert chest X-ray images and reports are also used for evaluation. These datasets are available at their corresponding sources: [Open-I](https://openi.nlm.nih.gov/faq) | [CheXpert](https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1). ## Evaluation If you have run inference using multiple GPUs and have a resulting set of chunks with results, make sure you concatenate prediction chunks into a single file before running the following command: ```bash cd llava/eval/rr_eval python run.py ${YOUR_PREDICTION_FILE} ``` ## Citation ```bibtex @Article{ZambranoChaves2025, author={Zambrano Chaves, Juan Manuel and Huang, Shih-Cheng and Xu, Yanbo and Xu, Hanwen and Usuyama, Naoto and Zhang, Sheng and Wang, Fei and Xie, Yujia and Khademi, Mahmoud and Yang, Ziyi and Awadalla, Hany and Gong, Julia and Hu, Houdong and Yang, Jianwei and Li, Chunyuan and Gao, Jianfeng and Gu, Yu and Wong, Cliff and Wei, Mu and Naumann, Tristan and Chen, Muhao and Lungren, Matthew P. and Chaudhari, Akshay and Yeung-Levy, Serena and Langlotz, Curtis P. and Wang, Sheng and Poon, Hoifung}, title={A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings}, journal={Nature Communications}, year={2025}, month={Apr}, day={01}, volume={16}, number={1}, pages={3108}, issn={2041-1723}, doi={10.1038/s41467-025-58344-x}, url={https://doi.org/10.1038/s41467-025-58344-x} } ``` ## License and Usage Notices The data, code, and model checkpoints are licensed and intended for research use only. The code and model checkpoints are subject to additional restrictions as determined by the Terms of Use of LLaMA, Vicuna, and GPT-4 respectively. Code and model checkpoints may be used for research purposes and should not be used in direct clinical care or for any clinical decision making purpose. ## Acknowledgements Our codebase heavily relies on [LLaVA](https://github.com/haotian-liu/LLaVA) v1.5. Please check out their repo for more information, and consider citing them in addition to our manuscript if you use this codebase. ```bibtex @misc{liu2023improvedllava, title={Improved Baselines with Visual Instruction Tuning}, author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae}, publisher={arXiv:2310.03744}, year={2023}, } ```