# MMaDA

Multimodal Large Diffusion Language Models (NeurIPS 2025)

[Paper on arXiv](https://arxiv.org/abs/2505.15809) · [Demo on Hugging Face](https://huggingface.co/spaces/Gen-Verse/MMaDA)

## 🌌 Introduction

MMaDA is a new family of **multimodal diffusion foundation models** designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a **unified diffusion architecture** with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a **mixed long chain-of-thought (CoT) fine-tuning** strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call **UniGRPO**, tailored for diffusion foundation models. Utilizing diversified reward modeling, **UniGRPO** unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
MMaDA decoding demo

MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising; a toy sketch of the block-wise text decoding scheme is shown below.
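To make the two decoding schemes concrete, here is a minimal, self-contained sketch of block-wise masked decoding in the spirit of the semi-autoregressive text sampler: the sequence starts fully masked, blocks are filled left to right, and within each block a few denoising steps reveal the most confident predictions first. The model, mask id, and confidence schedule below are toy placeholders chosen for illustration only; the actual sampler lives in `generate.py`.

```python
# A toy illustration of semi-autoregressive (block-wise) masked decoding --
# NOT MMaDA's actual sampler; see generate.py for the real code.
import torch

VOCAB_SIZE = 32          # toy vocabulary, for illustration only
MASK_ID = VOCAB_SIZE     # hypothetical mask-token id outside the toy vocab


def toy_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the diffusion LM: returns per-position logits (seq_len, vocab)."""
    torch.manual_seed(int(tokens.sum()))  # deterministic toy behaviour
    return torch.randn(tokens.shape[0], VOCAB_SIZE)


def semi_autoregressive_decode(seq_len: int = 16, block_size: int = 4, steps_per_block: int = 4):
    """Blocks are filled left to right (the "semi-autoregressive" part); inside a
    block, tokens are revealed over several denoising steps, keeping the
    lowest-confidence predictions masked until later steps."""
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for start in range(0, seq_len, block_size):
        block = slice(start, start + block_size)
        for step in range(steps_per_block):
            masked = tokens[block] == MASK_ID
            if not masked.any():
                break
            logits = toy_model(tokens)
            probs, preds = logits.softmax(dim=-1).max(dim=-1)
            # Reveal the most confident fraction of the still-masked positions.
            k = max(1, int(masked.sum()) // (steps_per_block - step))
            confidence = probs[block].masked_fill(~masked, -1.0)
            reveal = confidence.topk(k).indices
            tokens[start + reveal] = preds[block][reveal]
    return tokens


if __name__ == "__main__":
    print(semi_autoregressive_decode())
```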

## 📰 Latest Updates

* **[2025-11-13]** We release **[MMaDA-Parallel](https://arxiv.org/abs/2511.09611)**, a new class of multimodal dLLMs for thinking-aware image editing and generation.
* **[2025-09-09]** We open-source a comprehensive RL framework for dLLMs, **[dLLM-RL](https://github.com/Gen-Verse/dLLM-RL)**, along with released SOTA instruct and long-CoT models **[TraDo-8B-Instruct](https://huggingface.co/Gen-Verse/TraDo-8B-Instruct), [TraDo-4B-Instruct](https://huggingface.co/Gen-Verse/TraDo-4B-Instruct), and [TraDo-8B-Thinking](https://huggingface.co/Gen-Verse/TraDo-8B-Thinking)**.
* **[2025-06-02]** We open-source our **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**.
* **[2025-05-24]** We add support for MPS inference, tested on an M4 chip.
* **[2025-05-22]** We release the inference and training code of MMaDA for text generation, multimodal generation, and image generation.
* **[2025-05-22]** We open-source our **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**.
* **[2025-05-22]** We release our [research paper](https://arxiv.org/abs/2505.15809) and [demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) for the first unified multimodal diffusion model: MMaDA.

## 🧬 MMaDA Series Overview

MMaDA includes a series of checkpoints reflecting different training stages:

1. **[MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base)**: After pretraining and instruction tuning. Capable of basic text generation, image generation, and image captioning, with basic **thinking abilities**.
2. **[MMaDA-8B-MixCoT](https://huggingface.co/Gen-Verse/MMaDA-8B-MixCoT)**: After mixed long chain-of-thought (CoT) fine-tuning. Capable of **complex** textual, multimodal, and image-generation reasoning.
3. **MMaDA-8B-Max (coming soon)**: After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation. Will be released in the future.
4. **[MMaDA-Parallel-A](https://huggingface.co/tyfeld/MMaDA-Parallel-A) and [MMaDA-Parallel-M](https://huggingface.co/tyfeld/MMaDA-Parallel-M)**: A **parallel thinking-aware** multimodal diffusion model that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory.

A minimal checkpoint-loading sketch is given below, just before the Quick Start section.

Overview of MMaDA's capabilities.
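The checkpoints listed above are hosted on Hugging Face. As a rough orientation only, the sketch below shows one common way to load a checkpoint that ships custom modeling code, using `transformers` with `trust_remote_code=True`; the model class, generation API, and whether this path applies to MMaDA are assumptions here, so treat the repository's own scripts (`generate.py`, `inference_mmu.py`, `inference_t2i.py`) as the authoritative entry points.

```python
# Hypothetical loading sketch -- the repo's own scripts are the supported entry points.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Gen-Verse/MMaDA-8B-MixCoT"  # or "Gen-Verse/MMaDA-8B-Base"

# trust_remote_code lets transformers import any custom model class shipped
# with the checkpoint (an assumption that such code is provided there).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

print(model.config)  # inspect the loaded config; sampling itself is driven by the scripts
```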

## ⚙️ Quick Start

First, set up the environment:

```
pip install -r requirements.txt
```

Launch the local Gradio demo:

```
python app.py
```

Or try it online via our [Hugging Face demo](https://huggingface.co/spaces/Gen-Verse/MMaDA).

## 🚀 Inference

For batch-level inference, we provide our inference scripts here.

### 1. Text Generation

For text generation, we follow LLaDA's configuration and generation script. Simply run:

```bash
python generate.py
```

### 2. MultiModal Generation

For multimodal generation and text-to-image generation, first log in to your wandb account:

```
wandb login
```

Run the inference demo for multimodal generation; you can view the results on wandb:

```
python3 inference_mmu.py \
    config=configs/mmada_demo.yaml \
    mmu_image_root=./mmu_validation \
    mmu_prompts_file=./mmu_validation/prompts_with_vqa.json
```

### 3. Text-to-Image Generation

For text-to-image generation, first log in to your wandb account:

```
wandb login
```

Run the inference demo for text-to-image generation; you can view the results on wandb:

```
python3 inference_t2i.py \
    config=configs/mmada_demo.yaml \
    batch_size=1 \
    validation_prompts_file=validation_prompts/text2image_prompts.txt \
    guidance_scale=3.5 \
    generation_timesteps=15 \
    mode='t2i'
```
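Both the inference commands above and the training commands below pass a `config=...` YAML path plus optional `key=value` overrides straight on the command line (e.g. `batch_size=1`, `guidance_scale=3.5`). The snippet below is only a sketch of how such overrides are commonly merged with OmegaConf; it is an assumption about these scripts' config handling, shown for illustration, not the repository's actual parsing code.

```python
# Illustrative only: a common OmegaConf pattern for `script.py config=path.yaml key=value` CLIs.
from omegaconf import OmegaConf


def load_config():
    cli = OmegaConf.from_cli()                # picks up key=value args, e.g. batch_size=1
    base = OmegaConf.load(cli.pop("config"))  # the YAML named by the `config=` argument
    return OmegaConf.merge(base, cli)         # CLI overrides win over YAML defaults


if __name__ == "__main__":
    cfg = load_config()
    print(OmegaConf.to_yaml(cfg))
```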
## 🔧 Training

**Update your training data paths in `configs/xx.yaml`.**

### Stage 0: Prepare your accelerate configs

Please first prepare your accelerate configs. You can simply run

```
accelerate config
```

Or use our provided configs in `accelerate_configs`:

```
├── accelerate_configs/
|   ├── 1_gpu.yaml
|   └── 8_node_8_gpus_deepspeed_zero2.yaml (for 8 * 8 GPUs)
```

### Stage 1.1: Pre-training on ImageNet

First, we use LLaDA-8B-Instruct to initialize our model and train it on ImageNet for basic visual capabilities.

```
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada.py config=configs/mmada_pretraining_stage1_llada_instruct.yaml
```

### Stage 1.2: Pre-training on an Image-Text Dataset

Then we replace the ImageNet dataset from Stage 1.1 with an image-text dataset. Please change the pretrained model path in `mmada_pretraining_stage2_llada_instruct.yaml` to your checkpoint from Stage 1.1.

```
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage2.py config=configs/mmada_pretraining_stage2_llada_instruct.yaml
```

### Stage 1.3: Pre-training on Text Instruction Following

In this stage, we begin training on text instruction following and include the corresponding validations. Please change the pretrained model path in `mmada_pretraining_stage3_llada_instruct.yaml` to your checkpoint from Stage 1.2.

```
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage3.py config=configs/mmada_pretraining_stage3_llada_instruct.yaml
```

### Stage 2.1: Mix-CoT Training (Text Only)

In this stage, we begin our Mix-CoT fine-tuning with text reasoning first, along with improved image quality. Please change the pretrained model path in `mmada_pretraining_stage3_llada_instruct_512_cot.yaml` to your checkpoint from Stage 1.3 and prepare your CoT data.

```
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage_cot_sft.py config=configs/mmada_pretraining_stage3_llada_instruct_512_cot.yaml
```

### Stage 2.2: Mix-CoT Training (with MultiModal Reasoning)

In this stage, we include multimodal reasoning, along with improved image quality. Please change the pretrained model path in `mmada_pretraining_stage4_llada_instruct.yaml` to your checkpoint from Stage 2.1 and prepare your CoT data.

```
accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage4.py config=configs/mmada_pretraining_stage4_llada_instruct.yaml
```

### Stage 3: UniGRPO RL

[Will be released once we finish our code transition to OpenRLHF.]

## 📊 Evaluation

Please refer to [evaluation/eval.md](evaluation/eval.md) for more details.

## 📖 Citation

```
@article{yang2025mmada,
  title={MMaDA: Multimodal Large Diffusion Language Models},
  author={Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal={arXiv preprint arXiv:2505.15809},
  year={2025}
}
```

## 🤝 Acknowledgments

This work is heavily based on [Show-o](https://github.com/showlab/Show-o), [LLaDA](https://github.com/ML-GSAI/LLaDA), [maskgit](https://github.com/google-research/maskgit), [transformers](https://github.com/huggingface/transformers), [accelerate](https://github.com/huggingface/accelerate) and [webdataset](https://github.com/webdataset/webdataset). Thanks to all the authors for their great work.