# DecomposedAttention
**Repository Path**: ByteDance/DecomposedAttention
## Basic Information
- **Project Name**: DecomposedAttention
- **Description**: The official repo for "D-Attn: Decomposed Attention for Large Vision-and-Language Models"
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-23
- **Last Updated**: 2026-01-26
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# $\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models
*A large vision-and-language model with linear computational complexity in the vision modality and stronger vision-and-language (VL) capabilities.*
**$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models** [[Paper](https://arxiv.org/abs/2502.01906)]
[Chia-Wen Kuo](https://scholar.google.com/citations?user=iip65VkAAAAJ&hl=en&oi=ao), [Sijie Zhu](https://scholar.google.com/citations?user=8aO4k80AAAAJ&hl=en&oi=ao), [Fan Chen](https://scholar.google.com/citations?hl=en&user=yNgP-0oAAAAJ), [Xiaohui Shen](https://scholar.google.com/citations?user=pViZYwIAAAAJ&hl=en&oi=ao), [Longyin Wen](https://scholar.google.com/citations?user=PO9WFl0AAAAJ&hl=en&oi=ao)
**Vidi: Large Multimodal Models for Video Understanding and Editing** [[Webpage](https://bytedance.github.io/vidi-website/)] [[Paper](https://arxiv.org/abs/2504.15681)] [[Code](https://github.com/bytedance/vidi)]
Intelligent Editing Team, ByteDance Inc.
## Contents
- [Install](#install)
- [Model Weights](#model-weights)
- [Training](#training)
- [Evaluation](#evaluation)
## Install
1. Clone this repository and navigate to the DecomposedAttention folder
```bash
git clone https://github.com/bytedance/DecomposedAttention
cd DecomposedAttention
```
2. Install packages
```bash
conda create -n dattn python=3.11 -y
conda activate dattn
bash run/install.sh
```
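If the install finishes without errors, a quick sanity check like the one below can confirm the environment is usable. This is a sketch that assumes `run/install.sh` sets up PyTorch with CUDA support, which the README does not state explicitly; adjust it to whatever the script actually installs.
```bash
# optional sanity check (assumes run/install.sh installs PyTorch; adjust if it does not)
conda activate dattn
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
nvidia-smi  # confirm the GPUs are visible to the driver
```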
## Model Weights
Coming soon.
## Training
### Data preparation
Download the JSON annotation files [here](https://huggingface.co/datasets/cwkuo/D-Attn/tree/main), including `blip_laion_cc_sbu_558k.json` for the alignment stage, `shrcap_filtered.json` for the pre-training stage, and `llava_gpt4v_filtered.json` for the instruction-tuning (SFT) stage.
Download the images following [ShareGPT4V](https://github.com/ShareGPT4Omni/ShareGPT4V/blob/master/docs/Data.md), including `LAION-CC-SBU-558K`, `COCO`, `WebData`, `SAM`, `GQA`, `OCR-VQA`, `TextVQA`, and `VisualGenome`.
Organize the downloaded data as follows (a download sketch for the annotation files is given after the tree):
```none
DecomposedAttention
├── ...
├── data
│ ├── blip_laion_cc_sbu_558k.json
│ ├── shrcap_filtered.json
│ ├── llava_gpt4v_filtered.json
│ ├── train
│ │ ├── llava
│ │ │ ├── llava_pretrain
│ │ │ │ ├── images
│ │ ├── coco
│ │ │ ├── train2017
│ │ ├── sam
│ │ │ ├── images
│ │ ├── gqa
│ │ │ ├── images
│ │ ├── ocr_vqa
│ │ │ ├── images
│ │ ├── textvqa
│ │ │ ├── train_images
│ │ ├── vg
│ │ │ ├── VG_100K
│ │ │ ├── VG_100K_2
│ │ ├── share_textvqa
│ │ │ ├── images
│ │ ├── web-celebrity
│ │ │ ├── images
│ │ ├── web-landmark
│ │ │ ├── images
│ │ ├── wikiart
│ │ │ ├── images
├── ...
```
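The three annotation files can be fetched directly from the Hugging Face dataset repository linked above, for example with `huggingface-cli`. This is a minimal sketch assuming `huggingface_hub` is installed (e.g. by `run/install.sh` or via `pip install -U huggingface_hub`); the image datasets still need to be downloaded and unpacked following the ShareGPT4V instructions.
```bash
# download the three annotation JSON files into ./data (requires huggingface_hub)
mkdir -p data/train
huggingface-cli download cwkuo/D-Attn \
    blip_laion_cc_sbu_558k.json shrcap_filtered.json llava_gpt4v_filtered.json \
    --repo-type dataset --local-dir data
```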
### Mistral 7B v0.3
```bash
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints
# multimodal alignment stage
bash run/mistral_aln.sh
# multimodal pre-training stage
bash run/mistral_pt.sh
# instruction tuning stage
bash run/mistral_it.sh
```
### Gemma 2 9B
```bash
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints
# multimodal alignment stage
bash run/gemma_aln.sh
# multimodal pre-training stage
bash run/gemma_pt.sh
# instruction tuning stage
bash run/gemma_it.sh
```
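The three stages are listed in the order they run (alignment, then pre-training, then instruction tuning, with each later stage presumably loading the previous stage's checkpoint). A small wrapper like the sketch below runs them back-to-back and keeps per-stage logs; the `logs/` paths are illustrative, not part of the repo's scripts.
```bash
# run alignment, pre-training, and instruction tuning back-to-back for Gemma 2 9B
set -e
mkdir -p checkpoints logs
for stage in aln pt it; do
    bash run/gemma_${stage}.sh 2>&1 | tee logs/gemma_${stage}.log
done
```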
## Evaluation
### Data preparation
Follow [LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md) to download `ScienceQA`, `MME`, `GQA`, `POPE`, `TextVQA`, `SEED-Bench`, `LLaVA-Bench-in-the-Wild`, `MM-Vet`, `VQAv2`, `MMBench`, and `VizWiz`.
Follow [MMStar](https://github.com/MMStar-Benchmark/MMStar) to download the `MMStar` benchmark.
Organize the downloaded benchmarks as follows:
```none
DecomposedAttention
├── ...
├── data
│ ├── val
│ │ ├── scienceqa
│ │ ├── MME
│ │ ├── gqa
│ │ ├── pope
│ │ ├── textvqa
│ │ ├── seed_bench
│ │ ├── llava-bench-in-the-wild
│ │ ├── mm-vet
│ │ ├── vqav2
│ │ ├── mmbench
│ │ ├── vizwiz
├── ...
```
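A minimal sketch to pre-create this folder layout before unpacking the downloaded benchmarks into it:
```bash
# create the expected benchmark folders under ./data/val
mkdir -p data/val/{scienceqa,MME,gqa,pope,textvqa,seed_bench,llava-bench-in-the-wild,mm-vet,vqav2,mmbench,vizwiz}
```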
### Evaluate all
```bash
# suppose we have 8 GPUs on a machine
# evaluate Mistral 7B v0.3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/mistral_eval.sh
# evaluate Gemma 2 9B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/gemma_eval.sh
```
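If fewer than 8 GPUs are available, restricting `CUDA_VISIBLE_DEVICES` should still work, assuming the evaluation scripts shard the benchmarks across whatever GPUs are exposed; this is an assumption about the scripts, which define the exact behavior.
```bash
# example: evaluate Mistral 7B v0.3 on 4 GPUs only (assumes the script adapts to the visible GPU count)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run/mistral_eval.sh
```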
## Citation
If you find $\mathcal{D}$-Attn useful for your research and applications, please cite our works:
```bibtex
@article{kuo2025rethinking,
  title={D-Attn: Decomposed Attention for Large Vision-and-Language Models},
  author={Kuo, Chia-Wen and Zhu, Sijie and Chen, Fan and Shen, Xiaohui and Wen, Longyin},
  journal={arXiv preprint arXiv:2502.01906},
  year={2025}
}

@article{team2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={{Vidi Team} and Liu, Celong and Kuo, Chia-Wen and Du, Dawei and Chen, Fan and Chen, Guang and Yuan, Jiamin and Zhang, Lingxi and Guo, Lu and Li, Lusha and others},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}
```