# CAS-ViT

**Repository Path**: vt-developer/CAS-ViT

## Basic Information

- **Project Name**: CAS-ViT
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-12-23
- **Last Updated**: 2024-12-23

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

[![paper](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2408.03703)
[![Code](https://img.shields.io/badge/Project-Website-87CEEB)](https://github.com/Tianfang-Zhang/CAS-ViT)

πŸ“Œ Official implementation of our proposed method CAS-ViT.

---

Comparison of diverse self-attention mechanisms. (a) is the classical multi-head self-attention in ViT. (b) is the separable self-attention in MobileViTv2, which reduces the feature metric from a matrix to a vector. (c) is the swift self-attention in SwiftFormer, which achieves efficient feature association using only **Q** and **K**. (d) is the proposed convolutional additive self-attention.
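To make the idea in (d) concrete, below is a minimal PyTorch sketch of a convolutional additive token mixer: **Q** and **K** interact by element-wise addition of convolutional context maps rather than by a **QK**α΅€ matrix product, so the cost stays linear in the number of tokens. This is an illustrative interpretation only, not the repository's implementation; the module name, the depthwise-conv-plus-sigmoid context functions, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


class ConvAdditiveTokenMixer(nn.Module):
    """Illustrative convolutional additive token mixer (not the official code)."""

    def __init__(self, dim: int):
        super().__init__()
        # 1x1 convs play the role of the Q/K/V projections.
        self.to_q = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_k = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_v = nn.Conv2d(dim, dim, kernel_size=1)
        # Lightweight depthwise context functions with a gating nonlinearity
        # (an assumption; the paper may use different context functions).
        self.q_context = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.Sigmoid(),
        )
        self.k_context = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Additive interaction: element-wise sum of the gated Q and K contexts,
        # followed by a depthwise conv, modulates V (no attention matrix).
        context = self.fuse(self.q_context(q) + self.k_context(k))
        return self.proj(context * v)


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)                  # (B, C, H, W) feature map
    print(ConvAdditiveTokenMixer(64)(x).shape)      # torch.Size([1, 64, 56, 56])
```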

**Upper:** Illustration of the classification backbone network. Four stages downsample the original image to 1/4, 1/8, 1/16, and 1/32 resolution. **Lower:** Block architecture, with $N_i$ blocks stacked in each stage.

## Model Zoo

You can download the pretrained weights and configs from the [Model Zoo](./MODEL_ZOO.md).

## Requirements

```bash
torch==1.8.0
torchvision==0.9.1
timm==0.5.4
mmcv-full==1.5.3
mmdet==2.24
mmsegmentation==0.24
```

## Classification

### 1. Data Preparation

Download the ImageNet-1K dataset and organize it as follows:

```
β”‚imagenet/
β”œβ”€β”€train/
β”‚  β”œβ”€β”€ n01440764
β”‚  β”‚   β”œβ”€β”€ n01440764_10026.JPEG
β”‚  β”‚   β”œβ”€β”€ n01440764_10027.JPEG
β”‚  β”‚   β”œβ”€β”€ ......
β”‚  β”œβ”€β”€ ......
β”œβ”€β”€val/
β”‚  β”œβ”€β”€ ILSVRC2012_val_00000293.JPEG
β”‚  β”œβ”€β”€ ILSVRC2012_val_00002138.JPEG
β”‚  β”œβ”€β”€ ......
```

Image paths are loaded from `./classification/data/imagenet1k/train.txt`.
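As a quick sanity check independent of the repository's txt-based file list, the `train/` split shown above follows the standard class-subfolder layout, so it can also be read directly with torchvision's `ImageFolder`. The paths, crop size, and batch size below are illustrative assumptions; note that the `val/` split as shown is flat and would need the file list or a per-class reorganization.

```python
import torch
from torchvision import datasets, transforms

# Standard ImageNet preprocessing (224x224 center crop, ImageNet mean/std).
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# train/ uses class subfolders (n01440764/...), so ImageFolder can read it directly.
train_set = datasets.ImageFolder("imagenet/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                     shuffle=True, num_workers=8)
```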
### 2. Evaluation

Download the pretrained weights from the [Model Zoo](./MODEL_ZOO.md) and run the following command for evaluation on the ImageNet-1K dataset.

```shell
MODEL=rcvit_m # model to evaluate: rcvit_{xs, s, m, t}
python main.py --model ${MODEL} --eval True --resume --input_size 384 --data_path
```

The CAS-ViT-M checkpoint should give:

```
* Acc@1 81.430 Acc@5 95.664 loss 0.907
```

### 3. Training

On a single machine with 8 GPUs, run the following command to train:

```shell
python -m torch.distributed.launch --nproc_per_node 8 main.py \
    --data_path \
    --output_dir \
    --model rcvit_m \
    --lr 6e-3 --batch_size 128 --drop_path 0.1 \
    --model_ema True --model_ema_eval True \
    --use_amp True --multi_scale_sampler
```

### 4. Finetuning

On a single machine with 8 GPUs, run the following command to finetune:

```shell
python -m torch.distributed.launch --nproc_per_node 8 main.py \
    --data_path \
    --output_dir \
    --finetune \
    --input_size 384 --epoch 30 --batch_size 64 \
    --lr 5e-5 --min_lr 5e-5 --weight_decay 0.05 \
    --drop_path 0 --model_ema True \
    --model_ema_eval True --use_amp True \
    --auto_resume False --multi_scale_sampler
```

## Object Detection and Instance Segmentation

### 1. Data preparation

Prepare COCO according to the guidelines in [MMDetection](https://github.com/open-mmlab/mmdetection/tree/v2.24.0).

### 2. Evaluation

To evaluate CAS-ViT + RetinaNet on COCO val 2017 on a single machine with 8 GPUs, run the following command:

```shell
python -m torch.distributed.launch --nproc_per_node 8 test.py \
     \
     \
    --launcher pytorch
```

### 3. Training

To train CAS-ViT-M + RetinaNet on COCO on a single machine with 8 GPUs, run the following command:

```shell
python -m torch.distributed.launch --nproc_per_node 8 train.py \
    --launcher pytorch
```

## Semantic Segmentation

### 1. Data preparation

Prepare ADE20K according to the guidelines in [MMSegmentation](https://github.com/open-mmlab/mmsegmentation/tree/0.x).

### 2. Evaluation

To evaluate CAS-ViT + Semantic FPN on ADE20K on a single machine with 8 GPUs, run the following command:

```shell
python -m torch.distributed.launch --nproc_per_node 8 tools/test.py \
     \
     \
    --launcher pytorch
```

### 3. Training

To train CAS-ViT-M + Semantic FPN on ADE20K on a single machine with 8 GPUs, run the following command:

```shell
python -m torch.distributed.launch --nproc_per_node 8 tools/train.py \
    --launcher pytorch
```

## Citation

```bibtex
@article{zhang2024cas,
  title={CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications},
  author={Zhang, Tianfang and Li, Lei and Zhou, Yang and Liu, Wentao and Qian, Chen and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2408.03703},
  year={2024}
}
```

## Acknowledgment

Our code was built based on [ConvNeXt](https://github.com/facebookresearch/ConvNeXt), [EdgeNeXt](https://github.com/mmaaz60/EdgeNeXt/tree/main), [PoolFormer](https://github.com/sail-sg/poolformer/tree/main), [MMDetection](https://github.com/open-mmlab/mmdetection/tree/v2.24.0) and [MMSegmentation](https://github.com/open-mmlab/mmsegmentation/tree/0.x). Thanks for their public repositories and excellent contributions!