# $\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models

*A large vision-and-language model with linear computational complexity for the vision modality and stronger VL capabilities.*

**$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models** [[Paper](https://arxiv.org/abs/2502.01906)]
[Chia-Wen Kuo](https://scholar.google.com/citations?user=iip65VkAAAAJ&hl=en&oi=ao), [Sijie Zhu](https://scholar.google.com/citations?user=8aO4k80AAAAJ&hl=en&oi=ao), [Fan Chen](https://scholar.google.com/citations?hl=en&user=yNgP-0oAAAAJ), [Xiaohui Shen](https://scholar.google.com/citations?user=pViZYwIAAAAJ&hl=en&oi=ao), [Longyin Wen](https://scholar.google.com/citations?user=PO9WFl0AAAAJ&hl=en&oi=ao)

**Vidi: Large Multimodal Models for Video Understanding and Editing** [[Webpage](https://bytedance.github.io/vidi-website/)] [[Paper](https://arxiv.org/abs/2504.15681)] [[Code](https://github.com/bytedance/vidi)]
[Intelligent Editing Team, ByteDance Inc.]()

## Contents

- [Install](#install)
- [Model Weights](#model-weights)
- [Training](#training)
- [Evaluation](#evaluation)

## Install

1. Clone this repository and navigate to the DecomposedAttention folder

```bash
git clone https://github.com/bytedance/DecomposedAttention
cd DecomposedAttention
```

2. Install packages

```Shell
conda create -n dattn python=3.11 -y
conda activate dattn
bash run/install.sh
```

## Model Weights

Coming soon.

## Training

### Data preparation

Download the JSON annotation files [here](https://huggingface.co/datasets/cwkuo/D-Attn/tree/main), including `blip_laion_cc_sbu_558k.json` for the alignment stage, `shrcap_filtered.json` for pre-training, and `llava_gpt4v_filtered.json` for SFT.

Download images following [ShareGPT4V](https://github.com/ShareGPT4Omni/ShareGPT4V/blob/master/docs/Data.md), including `LAION-CC-SBU-558K`, `COCO`, `WebData`, `SAM`, `GQA`, `OCR-VQA`, `TextVQA`, and `VisualGenome`.

Organize the downloaded data as follows:

```none
DecomposedAttention
├── ...
├── data
│   ├── blip_laion_cc_sbu_558k.json
│   ├── shrcap_filtered.json
│   ├── llava_gpt4v_filtered.json
│   ├── train
│   │   ├── llava
│   │   │   ├── llava_pretrain
│   │   │   │   ├── images
│   │   ├── coco
│   │   │   ├── train2017
│   │   ├── sam
│   │   │   ├── images
│   │   ├── gqa
│   │   │   ├── images
│   │   ├── ocr_vqa
│   │   │   ├── images
│   │   ├── textvqa
│   │   │   ├── train_images
│   │   ├── vg
│   │   │   ├── VG_100K
│   │   │   ├── VG_100K_2
│   │   ├── share_textvqa
│   │   │   ├── images
│   │   ├── web-celebrity
│   │   │   ├── images
│   │   ├── web-landmark
│   │   │   ├── images
│   │   ├── wikiart
│   │   │   ├── images
├── ...
```

### Mistral 7B v0.3

```bash
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints

# multimodal alignment stage
bash run/mistral_aln.sh

# multimodal pre-training stage
bash run/mistral_pt.sh

# instruction tuning stage
bash run/mistral_it.sh
```

### Gemma 2 9B

```bash
# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints

# multimodal alignment stage
bash run/gemma_aln.sh

# multimodal pre-training stage
bash run/gemma_pt.sh

# instruction tuning stage
bash run/gemma_it.sh
```

## Evaluation

### Data preparation

Follow [LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md) to download `ScienceQA`, `MME`, `GQA`, `POPE`, `TextVQA`, `SEED-Bench`, `LLaVA-Bench-in-the-Wild`, `MM-Vet`, `VQAv2`, `MMBench`, and `VizWiz`. Follow [MMStar](https://github.com/MMStar-Benchmark/MMStar) to download the `MMStar` benchmark.

Organize the downloaded benchmarks as follows:

```none
DecomposedAttention
├── ...
├── data
│   ├── val
│   │   ├── scienceqa
│   │   ├── MME
│   │   ├── gqa
│   │   ├── pope
│   │   ├── textvqa
│   │   ├── seed_bench
│   │   ├── llava-bench-in-the-wild
│   │   ├── mm-vet
│   │   ├── vqav2
│   │   ├── mmbench
│   │   ├── vizwiz
├── ...
```
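Before launching any training or evaluation stage, it can help to confirm that the layout above is in place. The snippet below is a minimal sanity-check sketch (a hypothetical helper, not one of the repo's scripts); the paths mirror the directory trees above, so trim the list if you only downloaded a subset of the data.

```bash
# Sanity-check sketch: report any expected annotation files or data folders
# that are missing from ./data, assuming the layout shown above.
for p in \
    data/blip_laion_cc_sbu_558k.json \
    data/shrcap_filtered.json \
    data/llava_gpt4v_filtered.json \
    data/train/llava/llava_pretrain/images \
    data/train/coco/train2017 \
    data/train/sam/images \
    data/train/gqa/images \
    data/train/ocr_vqa/images \
    data/train/textvqa/train_images \
    data/train/vg/VG_100K \
    data/train/vg/VG_100K_2 \
    data/val/scienceqa \
    data/val/MME \
    data/val/gqa \
    data/val/pope \
    data/val/textvqa \
    data/val/seed_bench \
    data/val/llava-bench-in-the-wild \
    data/val/mm-vet \
    data/val/vqav2 \
    data/val/mmbench \
    data/val/vizwiz
do
    [ -e "$p" ] || echo "missing: $p"
done
```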
### Evaluate all

```bash
# suppose we have 8 GPUs on a machine

# evaluate Mistral 7B v0.3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/mistral_eval.sh

# evaluate Gemma 2 9B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/gemma_eval.sh
```

## Citation

If you find $\mathcal{D}$-Attn useful for your research and applications, please cite our works:

```bibtex
@article{kuo2025rethinking,
  title={D-Attn: Decomposed Attention for Large Vision-and-Language Models},
  author={Kuo, Chia-Wen and Zhu, Sijie and Chen, Fan and Shen, Xiaohui and Wen, Longyin},
  journal={arXiv preprint arXiv:2502.01906},
  year={2025}
}

@article{team2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={Vidi Team and Liu, Celong and Kuo, Chia-Wen and Du, Dawei and Chen, Fan and Chen, Guang and Yuan, Jiamin and Zhang, Lingxi and Guo, Lu and Li, Lusha and others},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}
```