# Retention-Score **Repository Path**: mirrors_ibm/Retention-Score ## Basic Information - **Project Name**: Retention-Score - **Description**: Code repo for AAAI 2025 paper "Retention Score: Quantifying Jailbreak Risks for Vision Language Models" - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-12-23 - **Last Updated**: 2026-04-26 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Retention Score: Quantifying Jailbreak Risks for Vision Language Models This is the official implementation of the paper "Retention Score: Quantifying Jailbreak Risks for Vision Language Models", accepted at AAAI 2025. ## Table of Contents - [Code Explanation](#code-explanation) - [Detailed Implementation for Models and Dataset](#detailed-implementation-for-models-and-dataset) - [Evaluation Settings](#evaluation-settings) - [Reference](#reference) ## Code Explanation - `generation_code/*`: Contains utilities and model descriptions for generating adversarial examples. - `evaluation_code/*`: Contains the evaluation code for assessing the robustness of the generated examples. - `minigpt_adversarial_generation.py`: Script to create adversarial images for any images in the specified directory. - `minigpt_real_our.py`: Generates responses for each prompt and image using the specified datasets. - `get_metric.py`: Utilizes the Perspective API to generate evaluation metrics. - `cal_score_acc.py`: Computes the retention score and ASR (Adversarial Success Rate). - `gemini_evaluation.py` and `gpt4v_evaluation.py`: Scripts for API evaluation. ## Detailed Implementation for Models and Dataset 1. Create two directories named `samples` and `models` to store generated samples and robust models. 2. Install the required packages: ``` pip install -r requirements.txt ``` This will install all necessary dependencies for the project. 3. To generate samples, follow these steps: - Use the diffusion generator to generate images and save them into the specified directory. - Refer to the provided examples for generating adversarial images. - Use the specified datasets for generating responses and save the results to a JSONL file. ## Evaluation Settings 1. **Image Evaluation**: - Use the `minigpt_adversarial_generation.py` to create adversarial images. - Use `get_metric.py` with the Perspective API to generate evaluation metrics. - Use `cal_score_acc.py` to compute the retention score and ASR. 2. **Text Evaluation**: - Use the paraphrasing model to get paraphrased prompts for adversarial behavior. - Generate responses using the specified scripts and evaluate the scores for each response. You can modify the sample size in each sub-setting to change the number of samples for evaluation. ## Reference ```bibtex @misc{li2024retentionscorequantifyingjailbreak, title={Retention Score: Quantifying Jailbreak Risks for Vision Language Models}, author={Zaitang Li and Pin-Yu Chen and Tsung-Yi Ho}, year={2024}, eprint={2412.17544}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2412.17544}, }