# DecomposedAttention
**Repository Path**: ByteDance/DecomposedAttention
## Basic Information
- **Project Name**: DecomposedAttention
- **Description**: The official repo for "D-Attn: Decomposed Attention for Large Vision-and-Language Models"
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-23
- **Last Updated**: 2026-01-26
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# $\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models
*A large vision-and-language model with linear computational complexity in the vision modality and stronger vision-and-language (VL) capabilities.*
**$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models** [[Paper](https://arxiv.org/abs/2502.01906)]
[Chia-Wen Kuo](https://scholar.google.com/citations?user=iip65VkAAAAJ&hl=en&oi=ao), [Sijie Zhu](https://scholar.google.com/citations?user=8aO4k80AAAAJ&hl=en&oi=ao), [Fan Chen](https://scholar.google.com/citations?hl=en&user=yNgP-0oAAAAJ), [Xiaohui Shen](https://scholar.google.com/citations?user=pViZYwIAAAAJ&hl=en&oi=ao), [Longyin Wen](https://scholar.google.com/citations?user=PO9WFl0AAAAJ&hl=en&oi=ao)
**Vidi: Large Multimodal Models for Video Understanding and Editing** [[Webpage](https://bytedance.github.io/vidi-website/)] [[Paper](https://arxiv.org/abs/2504.15681)] [[Code](https://github.com/bytedance/vidi)]
Intelligent Editing Team, ByteDance Inc.
## Contents
- [Install](#install)
- [Model Weights](#model-weights)
- [Training](#training)
- [Evaluation](#evaluation)
## Install
1. Clone this repository and navigate to the DecomposedAttention folder
```bash
git clone https://github.com/bytedance/DecomposedAttention
cd DecomposedAttention
```
2. Install packages
```bash
conda create -n dattn python=3.11 -y
conda activate dattn
bash run/install.sh
```
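If the install finishes without errors, a quick sanity check like the one below can confirm the environment is usable. This is a sketch that assumes `run/install.sh` sets up PyTorch with CUDA support, which the README does not state explicitly; adjust it to whatever the script actually installs.
```bash
# optional sanity check (assumes run/install.sh installs PyTorch; adjust if it does not)
conda activate dattn
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
nvidia-smi  # confirm the GPUs are visible to the driver
```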
## Model Weights
Coming soon.
## Training
### Data preparation
Download the JSON annotation files [here](https://huggingface.co/datasets/cwkuo/D-Attn/tree/main), including `blip_laion_cc_sbu_558k.json` for the alignment stage, `shrcap_filtered.json` for the pre-training stage, and `llava_gpt4v_filtered.json` for the instruction-tuning (SFT) stage.
Download the images following [ShareGPT4V](https://github.com/ShareGPT4Omni/ShareGPT4V/blob/master/docs/Data.md), including `LAION-CC-SBU-558K`, `COCO`, `WebData`, `SAM`, `GQA`, `OCR-VQA`, `TextVQA`, and `VisualGenome`.
Organize the downloaded data as follows (a download sketch for the annotation files is given after the tree):
```none
DecomposedAttention
├── ...
├── data
│ ├── blip_laion_cc_sbu_558k.json
│ ├── shrcap_filtered.json
│ ├── llava_gpt4v_filtered.json
│ ├── train
│ │ ├── llava
│ │ │ ├── llava_pretrain
│ │ │ │ ├── images
│ │ ├── coco
│ │ │ ├── train2017
│ │ ├── sam
│ │ │ ├── images
│ │ ├── gqa
│ │ │ ├── images
│ │ ├── ocr_vqa
│ │ │ ├── images
│ │ ├── textvqa
│ │ │ ├── train_images
│ │ ├── vg
│ │ │ ├── VG_100K
│ │ │ ├── VG_100K_2
│ │ ├── share_textvqa
│ │ │ ├── images
│ │ ├── web-celebrity
│ │ │ ├── images
│ │ ├── web-landmark
│ │ │ ├── images
│ │ ├── wikiart
│ │ │ ├── images
├── ...
```
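The three annotation files can be fetched directly from the Hugging Face dataset repository linked above, for example with `huggingface-cli`. This is a minimal sketch assuming `huggingface_hub` is installed (e.g. by `run/install.sh` or via `pip install -U huggingface_hub`); the image datasets still need to be downloaded and unpacked following the ShareGPT4V instructions.
```bash
# download the three annotation JSON files into ./data (requires huggingface_hub)
mkdir -p data/train
huggingface-cli download cwkuo/D-Attn \
    blip_laion_cc_sbu_558k.json shrcap_filtered.json llava_gpt4v_filtered.json \
    --repo-type dataset --local-dir data
```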
### Mistral 7B v0.3
```bash
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints
# multimodal alignment stage
bash run/mistral_aln.sh
# multimodal pre-training stage
bash run/mistral_pt.sh
# instruction tuning stage
bash run/mistral_it.sh
```
### Gemma 2 9B
```bash
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints
# multimodal alignment stage
bash run/gemma_aln.sh
# multimodal pre-training stage
bash run/gemma_pt.sh
# instruction tuning stage
bash run/gemma_it.sh
```
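The three stages are listed in the order they run (alignment, then pre-training, then instruction tuning, with each later stage presumably loading the previous stage's checkpoint). A small wrapper like the sketch below runs them back-to-back and keeps per-stage logs; the `logs/` paths are illustrative, not part of the repo's scripts.
```bash
# run alignment, pre-training, and instruction tuning back-to-back for Gemma 2 9B
set -e
mkdir -p checkpoints logs
for stage in aln pt it; do
    bash run/gemma_${stage}.sh 2>&1 | tee logs/gemma_${stage}.log
done
```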
## Evaluation
### Data preparation
Follow [LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md) to download `ScienceQA`, `MME`, `GQA`, `POPE`, `TextVQA`, `SEED-Bench`, `LLaVA-Bench-in-the-Wild`, `MM-Vet`, `VQAv2`, `MMBench`, and `VizWiz`.
Follow [MMStar](https://github.com/MMStar-Benchmark/MMStar) to download the `MMStar` benchmark.
Organize the downloaded benchmarks as follows:
```none
DecomposedAttention
├── ...
├── data
│ ├── val
│ │ ├── scienceqa
│ │ ├── MME
│ │ ├── gqa
│ │ ├── pope
│ │ ├── textvqa
│ │ ├── seed_bench
│ │ ├── llava-bench-in-the-wild
│ │ ├── mm-vet
│ │ ├── vqav2
│ │ ├── mmbench
│ │ ├── vizwiz
├── ...
```
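A minimal sketch to pre-create this folder layout before unpacking the downloaded benchmarks into it:
```bash
# create the expected benchmark folders under ./data/val
mkdir -p data/val/{scienceqa,MME,gqa,pope,textvqa,seed_bench,llava-bench-in-the-wild,mm-vet,vqav2,mmbench,vizwiz}
```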
### Evaluate all
```bash
# suppose we have 8 GPUs on a machine
# evaluate Mistral 7B v0.3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/mistral_eval.sh
# evaluate Gemma 2 9B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/gemma_eval.sh
```
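If fewer than 8 GPUs are available, restricting `CUDA_VISIBLE_DEVICES` should still work, assuming the evaluation scripts shard the benchmarks across whatever GPUs are exposed; this is an assumption about the scripts, which define the exact behavior.
```bash
# example: evaluate Mistral 7B v0.3 on 4 GPUs only (assumes the script adapts to the visible GPU count)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run/mistral_eval.sh
```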
## Citation
If you find $\mathcal{D}$-Attn useful for your research and applications, please cite our works:
```bibtex
@article{kuo2025rethinking,
  title={D-Attn: Decomposed Attention for Large Vision-and-Language Models},
  author={Kuo, Chia-Wen and Zhu, Sijie and Chen, Fan and Shen, Xiaohui and Wen, Longyin},
  journal={arXiv preprint arXiv:2502.01906},
  year={2025}
}

@article{team2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={{Vidi Team} and Liu, Celong and Kuo, Chia-Wen and Du, Dawei and Chen, Fan and Chen, Guang and Yuan, Jiamin and Zhang, Lingxi and Guo, Lu and Li, Lusha and others},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}
```