# SiD-DiT: Score Identity Distillation for DiT-based Flow Matching Models

|![SiD-DiT](Examples/sid_dit_teaser.png "Four-Step Text-to-Image Generation")|
|:--:|
|*Four-step text-to-image generation samples from SiD-DiT*|

Welcome to the official implementation of **Score Identity Distillation (SiD)** for DiT-based diffusion and flow-matching models. This repository enables **fast, few-step text-to-image generation** via scalable, generalizable distillation techniques.

The same set of hyperparameters works across **all major DiT-based Flow Matching models**, including:

- **SANA** (Rectified Flow and TrigFlow; 0.6B and 1.6B)
- **Stable Diffusion 3-Medium**
- **Stable Diffusion 3.5-Medium**
- **Stable Diffusion 3.5-Large**
- **FLUX.1-dev** (512×512 and 1024×1024)

The research paper is available at [SiD-DiT](https://arxiv.org/abs/2509.25127).
---

## Installation

### Step 1: Environment Setup

```bash
conda env create -f sid_dit_environment.yml
conda init
source ~/.bashrc
conda activate sid_dit
```

### Step 2: Install Dependencies

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install \
  accelerate==1.8.1 blobfile==3.0.0 click==8.2.1 \
  datasets==2.19.0 diffusers==0.33.1 ftfy==6.3.1 \
  huggingface-hub==0.33.0 numpy==1.26.4 open-clip-torch==2.32.0 \
  pillow==10.3.0 requests==2.31.0 safetensors==0.5.3 \
  scipy==1.13.0 timm==0.9.16 tokenizers==0.21.1 tqdm==4.66.4 \
  transformers==4.52.4 wcwidth==0.2.13 protobuf==6.31.1 sentencepiece==0.2.0
```

---

## Dataset Setup

Because SiD is a data-free framework, training only requires a set of text prompts. By default, we use the text prompts from the [midjourney-v6-llava](https://huggingface.co/datasets) Hugging Face dataset. You can also use Aesthetic6+, Aesthetic6.25+, Aesthetic6.5+, or any other list of prompts, as long as they do not include COCO captions.

Save the prompts at the following path: `/data/datasets/aesthetics_6_plus/aesthetics_6_plus.txt`. You may use a different data path, but update the path in `run_sid_dit.sh` accordingly.

To evaluate metrics on the fly, prepare the MSCOCO validation set. Please follow the setup of [SiD-LSG](https://github.com/mingyuanzhou/SiD-LSG) for MSCOCO data preparation.

---

## Hugging Face Access

Login using your token:

```bash
huggingface-cli login --token <YOUR_TOKEN>
```

Ensure your token has access to the **SD3** and **SD3.5** models. This is not needed for SANA.

---

## Launch Training

Run (with sd3-medium as an example):

```bash
sh run_sid_dit.sh sd3-medium 1_minus_sigma 8
```

- Check `run_sid_dit.sh` for all available model options and configurations.
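The Dataset Setup step above expects a plain text file with one prompt per line. A minimal sketch of producing such a file (the sample prompts and the relative output path are placeholders; the training script expects `/data/datasets/aesthetics_6_plus/aesthetics_6_plus.txt` by default):

```python
from pathlib import Path

# Hypothetical sample prompts; in practice, use prompts extracted from
# midjourney-v6-llava or an Aesthetic6+ prompt list (no COCO captions).
prompts = [
    "a watercolor painting of a lighthouse at sunset",
    "a macro photo of a dewdrop on a fern leaf",
    "a cozy reading nook lit by a warm lamp",
]

# Relative path for illustration; adjust to match run_sid_dit.sh.
out_path = Path("data/datasets/aesthetics_6_plus/aesthetics_6_plus.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text("\n".join(prompts) + "\n", encoding="utf-8")
```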
---

## Training Configurations

Two FSDP training variants are supported:

- **AMP (Autocast + bf16) + FSDP**
- **Pure BF16 + FSDP**

Example: in `run_sid_dit_sd3.sh`, use the following flags:

**AMP + FSDP**

```bash
--fp16 0 \
--bf16 0 \
--autocast_bf16 1
```

- Uses: `lr = glr = 1e-6`, Adam `eps = 1e-8`

**Pure BF16 + FSDP**

```bash
--fp16 0 \
--bf16 1 \
--autocast_bf16 1
```

- Uses: `lr = glr = 1e-5`, Adam `eps = 1e-4`

> These hyperparameters are plug-and-play across all supported models.

All models, **except FLUX.1-dev at 1024×1024 resolution**, can be trained on a **single node with 8×80GB A100 or H100 GPUs**, with **rapid convergence** typically achieved within a few hours. Longer training can yield incremental gains, but improvements taper off after the initial convergence phase.

For **FLUX.1-dev at 1024×1024**, we **recommend using B100 GPUs**. Although it is technically possible to fit the model onto eight 80GB GPUs using `cpu_offloading`, we have observed inconsistencies in FSDP gradient updates when `cpu_offloading` is enabled, leading to behavior that diverges from the non-offloaded baseline. While `cpu_offloading` is supported in the codebase, it **has not been fully debugged** and should be used with caution.

---

## Generate Samples

```bash
python generate_sid_fewstep_sd3.py --outdir=<OUTPUT_DIR> --seeds='1,2,3,4,5' \
  --batch=4 --network=<CHECKPOINT_PATH> --text_prompts='prompts/fig1-captions.txt' \
  --pretrained_model_name_or_path='stabilityai/stable-diffusion-3-medium-diffusers'
```

Check the `prompts` folder for text prompts to reproduce the results shown in the paper.

---

## Features

**SiD-DiT** compresses large DiT-based models into **fast**, **few-step generators** with high visual fidelity and broad applicability.
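As a toy caricature of what few-step generation means (this is not the repository's sampler): a flow-matching generator integrates a learned velocity field from noise at `t = 1` down to data at `t = 0` in a handful of Euler steps. The sketch below uses a closed-form "oracle" velocity for the straight-line (rectified-flow) path `x_t = (1 - t) * x0 + t * z`, for which the exact velocity is `v(x_t, t) = (x_t - x0) / t`; a distilled model approximates such a field.

```python
X0 = 0.5  # the "data" point the oracle flows toward (toy stand-in)

def velocity(x: float, t: float) -> float:
    """Oracle velocity for the straight path toward X0 (toy stand-in
    for the distilled network)."""
    return (x - X0) / t

def sample_few_step(z: float, num_steps: int = 4) -> float:
    """Euler-integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data)."""
    x, dt = z, 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity(x, t)
    return x

print(sample_few_step(2.0))  # recovers X0 = 0.5 with the oracle velocity
```

With the oracle velocity the straight path is integrated exactly, so four steps already land on the data point; a learned, approximate velocity is what makes the number of steps matter in practice.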
Key features include:

- **Few-step generation** (default: 4 steps)
- **Flexible noise scheduling**:
  - `fresh` (default)
  - `fixed` and `ddim`: equally effective in practice, and often preferred in tasks requiring deterministic latent inputs
- **Configurable loss weighting schemes**:
  - Default: `1_minus_sigma`
  - Alternatives: `sid_default`, `1_over_sigma`, and other variants
  - Each weighting function biases the output differently (e.g., toward higher contrast or saturation). Choose based on aesthetic preference.
  - We favor `1_minus_sigma` for brighter, more "sunny" visuals.
- **Distributed training** via FSDP (Fully Sharded Data Parallel)
- **Support for AMP and BF16** training modes
- **Automatic FID & CLIP evaluation** on COCO-2014 for checkpoint selection
  - These metrics are useful for tracking progress **within the same teacher**, but may not be reliable when comparing across different teachers or at higher resolutions.

> **Note**: SiD is **data-free by default**, requiring only **text prompts** for distillation. In this repository, the default configuration uses the [midjourney-v6-llava](https://huggingface.co/datasets) Hugging Face dataset, which provides synthetic text–image pairs; under data-free settings, **only the prompts** are used. For training with **adversarial losses**, the corresponding images are also utilized. Be aware that the synthetic images in midjourney-v6-llava are often of **lower quality than the outputs of SiD-distilled models** (e.g., from SD3, SD3.5, FLUX). As such, **we do not recommend enabling Diffusion GAN training (`--train_diffusiongan 1`) unless your provided image data is of demonstrably higher quality than your distilled model outputs**.
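As a rough illustration of how sigma-dependent loss weighting reshapes the training signal (the formulas below are inferred from the option names, not taken from the codebase, and `uniform` is a hypothetical baseline added for contrast):

```python
def loss_weight(sigma: float, scheme: str = "1_minus_sigma") -> float:
    """Illustrative per-noise-level loss weights; exact formulas in the
    codebase may differ."""
    if scheme == "1_minus_sigma":
        return 1.0 - sigma             # down-weights high-noise levels
    if scheme == "1_over_sigma":
        return 1.0 / max(sigma, 1e-4)  # emphasizes low-noise levels
    if scheme == "uniform":
        return 1.0                     # hypothetical flat baseline
    raise ValueError(f"unknown weighting scheme: {scheme}")

for s in (0.1, 0.5, 0.9):
    print(s, loss_weight(s, "1_minus_sigma"), loss_weight(s, "1_over_sigma"))
```

The point is only that each scheme concentrates the loss on different noise levels, which is why the choice shifts the look of the outputs (contrast, saturation, brightness).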
---

## Background & References

Please cite our work if you find it helpful:

```bibtex
@misc{zhou2025scoredistillationflowmatching,
  title={Score Distillation of Flow Matching Models},
  author={Mingyuan Zhou and Yi Gu and Huangjie Zheng and Liangchen Song and Guande He and Yizhe Zhang and Wenze Hu and Yinfei Yang},
  year={2025},
  eprint={2509.25127},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25127},
}
```

SiD-DiT builds on the Score identity Distillation research series:

- **Few-Step Diffusion via Score Identity Distillation** [arXiv:2505.12674](https://arxiv.org/abs/2505.12674)

```bibtex
@misc{zhou2025fewstepdiffusionscoreidentity,
  title={Few-Step Diffusion via Score Identity Distillation},
  author={Mingyuan Zhou and Yi Gu and Zhendong Wang},
  year={2025},
  eprint={2505.12674},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

- **Guided Score Identity Distillation for Data-Free One-Step Text-to-Image Generation** [arXiv:2406.01561](https://arxiv.org/abs/2406.01561)

```bibtex
@inproceedings{zhou2025guided,
  title={Guided Score Identity Distillation for Data-Free One-Step Text-to-Image Generation},
  author={Zhou, Mingyuan and Wang, Zhendong and Zheng, Huangjie and Huang, Hai},
  booktitle={ICLR 2025},
  year={2025}
}
```

- **Adversarial Score Identity Distillation: Rapidly Surpassing the Teacher in One Step** [OpenReview](https://openreview.net/forum?id=lS2SGfWizd)

```bibtex
@inproceedings{zhou2025adversarial,
  title={Adversarial Score Identity Distillation: Rapidly Surpassing the Teacher in One Step},
  author={Mingyuan Zhou and Huangjie Zheng and Yi Gu and Zhendong Wang and Hai Huang},
  booktitle={ICLR 2025},
  year={2025}
}
```

- **Score Identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation** [arXiv:2404.04057](https://arxiv.org/abs/2404.04057)

```bibtex
@inproceedings{zhou2024score,
  title={Score Identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation},
  author={Zhou, Mingyuan and Zheng, Huangjie and Wang, Zhendong and Yin, Mingzhang and Huang, Hai},
  booktitle={ICML 2024},
  year={2024}
}
```

---