# MoMA
**Repository Path**: ByteDance/MoMA
## Basic Information
- **Project Name**: MoMA
- **Description**: MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-04-24
- **Last Updated**: 2026-01-25
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# ___***MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation***___
#### ECCV 2024 Accepted
---
## Introduction
We present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity preservation, and prompt faithfulness. We commit to making our work open-source, thereby providing universal access to these advancements.

## Release
- [2024/04/20] 🔥 We release the model code on GitHub.
- [2024/04/22] 🔥 We add a HuggingFace repository and release the checkpoints.
- [2024/05/21] 🔥 We launch an [Online Demo](https://huggingface.co/spaces/yizhezhu/MoMA_zeroGPU) on HuggingFace Space! You don't need to provide masks. Our demo takes care of it!
## Installation
1. Install LLaVA:
Please install it from the [official repository](https://github.com/haotian-liu/LLaVA#install).
2. Clone our MoMA repository and install its requirements:
```
cd ..
git clone https://github.com/bytedance/MoMA.git
cd MoMA
pip install -r requirements.txt
```
(We also provide a `requirements_freeze.txt`, generated by `pip freeze`.)
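After installation, a quick sanity check can confirm that PyTorch sees your GPU (a minimal sketch; PyTorch is installed by the requirements above):
```python
import torch

# Quick environment check: MoMA inference requires a CUDA-capable GPU.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info(0)
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"GPU memory: {free / 1024**3:.1f} GB free / {total / 1024**3:.1f} GB total")
```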
## Memory Requirements
We support 8-bit and 4-bit inference, which reduces memory consumption. Set the flags according to your GPU memory (see the sketch after this list):
+ If you have 22 GB or more GPU memory:
```args.load_8bit, args.load_4bit = False, False```
+ If you have 18 GB or more GPU memory:
```args.load_8bit, args.load_4bit = True, False```
+ If you have 14 GB or more GPU memory:
```args.load_8bit, args.load_4bit = False, True```
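The sketch below chooses these flags automatically from the detected GPU memory; `args` is stubbed with a `SimpleNamespace` here and stands in for the argument object used by `run_evaluate_MoMA.py`:
```python
import torch
from types import SimpleNamespace

# Stand-in for the argument object used by run_evaluate_MoMA.py.
args = SimpleNamespace(load_8bit=False, load_4bit=False)

# Total memory of GPU 0, in GB.
gpu_mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

if gpu_mem_gb >= 22:
    args.load_8bit, args.load_4bit = False, False  # full precision
elif gpu_mem_gb >= 18:
    args.load_8bit, args.load_4bit = True, False   # 8-bit inference
else:
    args.load_8bit, args.load_4bit = False, True   # 4-bit inference (14 GB or more)
```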
## Download Models
**You don't have to download any checkpoints**; our code automatically downloads them from our [HuggingFace repositories](https://huggingface.co/KunpengSong/MoMA_llava_7b/tree/main), which include the following (an optional pre-download sketch is shown after the list):
```
VAE: stabilityai--sd-vae-ft-mse
StableDiffusion: Realistic_Vision_V4.0_noVAE
MoMA:
  Multi-modal LLM: MoMA_llava_7b (13 GB)
  Attentions and mappings: attn_adapters_projectors.th (151 MB)
```
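If you prefer to pre-fetch the MoMA checkpoints (for example, on a machine that has no network access at run time), a minimal sketch using `huggingface_hub` is shown below; with a normal connection, the automatic download is all you need:
```python
from huggingface_hub import snapshot_download

# Optionally cache the MoMA checkpoints (MLLM + adapters) ahead of time.
local_dir = snapshot_download(repo_id="KunpengSong/MoMA_llava_7b")
print("MoMA checkpoints cached at:", local_dir)
```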
## How to Use
### Jupyter-notebook
+ [**run_MoMA_notebook.ipynb**](run_MoMA_notebook.ipynb)
### Python code
+ [**run_evaluate_MoMA.py**](run_evaluate_MoMA.py)
Run:
```CUDA_VISIBLE_DEVICES=0 python run_evaluate_MoMA.py```
(Generated images will be saved in the `output` folder; a sketch for scripting this run is shown below.)
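If you want to drive the run from Python (for example, to pin a GPU and then inspect the results), the sketch below wraps the same command; the `output` folder name follows the note above:
```python
import os
import pathlib
import subprocess

# Run the evaluation script on GPU 0, exactly as in the command above.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
subprocess.run(["python", "run_evaluate_MoMA.py"], env=env, check=True)

# List the generated images (saved in the "output" folder per the note above).
for image_path in sorted(pathlib.Path("output").iterdir()):
    print(image_path)
```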
## Example Results
New context:

New texture:

**Hyperparameters:**
- When changing the context, you can increase `strength` to get more accurate details. In most cases `strength=1.0` works best; we recommend keeping `strength` no greater than `1.2`.
- When changing the texture, you can tune `strength` to balance detail accuracy against prompt fidelity: decrease `strength` for better prompt fidelity. In most cases `strength=0.4` works best; we recommend keeping `strength` no greater than `0.6`. (A small sketch of these recommended ranges follows the list.)
## Citation
If you find our work useful for your research and applications, please consider citing us by:
```BibTeX
@article{song2024moma,
  title={MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation},
  author={Song, Kunpeng and Zhu, Yizhe and Liu, Bingchen and Yan, Qing and Elgammal, Ahmed and Yang, Xiao},
  journal={arXiv preprint arXiv:2404.05674},
  year={2024}
}
```