# RegionViT

**Repository Path**: mirrors_ibm/RegionViT

## Basic Information

- **Project Name**: RegionViT
- **Description**: Open-source release of the research work published on arXiv: https://arxiv.org/abs/2106.02689
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-07-14
- **Last Updated**: 2025-09-06

## README

# RegionViT: Regional-to-Local Attention for Vision Transformers

This repository is the official implementation of RegionViT: Regional-to-Local Attention for Vision Transformers. [ArXiv](https://arxiv.org/abs/2106.02689)

We provide the code for [Image Classification](#image-classification) and [Object Detection](#object-detection).

If you use the code and models from this repo, please cite our work. Thanks!

```
@inproceedings{
chen2021regionvit,
title={{RegionViT: Regional-to-Local Attention for Vision Transformers}},
author={Chun-Fu (Richard) Chen and Rameswar Panda and Quanfu Fan},
booktitle={ArXiv},
year={2021}
}
```

## Image Classification

### Installation

To install requirements:

```setup
pip install -r requirements.txt
```

### Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout for torchvision's [`datasets.ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder); the training and validation data are expected to be in the `train/` and `val/` folders, respectively:

```
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
```

### Model Zoo

We provide models trained on ImageNet-1K. The checkpoints can be found [here](https://github.com/IBM/RegionViT/releases/tag/weights-v0.1).

| Name | Acc@1 | #FLOPs (G) | #Params | URL |
| --- | --- | --- | --- | --- |
| RegionViT-Ti | 80.4 | 2.4 | 13.8M | [model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RegionViT-Ti.pth) |
| RegionViT-S | 82.6 | 5.3 | 30.6M | [model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RegionViT-S.pth) |
| RegionViT-M | 83.1 | 7.4 | 41.2M | [model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RegionViT-M.pth) |
| RegionViT-B | 83.2 | 13.0 | 72.7M | [model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RegionViT-B.pth) |

### Training

To train RegionViT-S on ImageNet on a single node with 8 GPUs for 300 epochs, run:

```shell script
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model regionvit_small_224 --batch-size 256 --data-path /path/to/imagenet
```

The model names of the other variants are `regionvit_tiny_224`, `regionvit_medium_224` and `regionvit_base_224`.

### Multinode training

Distributed training is available via Slurm and `submitit`. To train the RegionViT-S model on ImageNet on 4 nodes with 8 GPUs each for 300 epochs:

```
python run_with_submitit.py --model regionvit_small_224 --data-path /path/to/imagenet --batch-size 256 --warmup-epochs 50
```

Note that some Slurm settings might need to be changed to match your cluster.
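Before launching the full distributed evaluation below, it can help to sanity-check the data layout and a downloaded checkpoint. A minimal sketch using only `torch` and `torchvision`; the `'model'` key lookup follows the DeiT-style checkpoint convention and is an assumption about how the released files are packaged:

```python
import torch
from torchvision import datasets

# Check that the ImageNet val folder follows the expected ImageFolder layout.
val_set = datasets.ImageFolder("/path/to/imagenet/val")
print(f"{len(val_set)} images across {len(val_set.classes)} classes")

# Inspect a downloaded checkpoint without building the model.
ckpt = torch.load("RegionViT-S.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumption: weights may be nested under 'model'
n_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, ~{n_params / 1e6:.1f}M parameters")
```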
### Evaluation

To evaluate a pretrained RegionViT-S model:

```
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model regionvit_small_224 --batch-size 256 --data-path /path/to/imagenet --eval --initial_checkpoint /path/to/checkpoint
```

## Object Detection

We perform object detection based on [Detectron2](https://github.com/facebookresearch/detectron2) with some modifications. The modified version can be found at https://github.com/chunfuchen/detectron2. The major difference is the data augmentation pipeline.

### Installation

Follow the installation guide in [INSTALL.md](https://github.com/chunfuchen/detectron2/blob/master/INSTALL.md).

### Data preparation

Follow the Detectron2 instructions to set up the MS COCO dataset. [Link](https://detectron2.readthedocs.io/en/latest/tutorials/builtin_datasets.html)

### Training

Before training, you will need to convert the pretrained model into the Detectron2 format. We provide the script `tools/convert_cls_model_to_d2.py` for the conversion (a conceptual sketch follows below):

```shell script
python3 tools/convert_cls_model_to_d2.py --model /path/to/pretrained/model --ows 7 --nws 7
```

Then, to train RetinaNet with a RegionViT-S backbone on MS COCO with the 1x schedule:

```shell script
python main_detection.py --num-gpus 8 --resume --config-file detection/configs/retinanet_regionvit_FPN_1x.yaml MODEL.BACKBONE.REGIONVIT regionvit_small_224 MODEL.WEIGHTS /path/to/pretrained_model OUTPUT_DIR /path/to/log_folder
```

The model names of the other variants are `regionvit_base_224`, `regionvit_small_w14_224`, etc. The supported models can be found [here](./regionvit/regionvit.py).
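For intuition, the conversion amounts to re-keying the classification state dict into the layout Detectron2 expects for a backbone. A rough sketch under that assumption; the actual mapping (including the `--ows`/`--nws` window-size handling) lives in `tools/convert_cls_model_to_d2.py`, which should be used for real conversions:

```python
import torch

# Illustrative only -- the 'backbone.' prefix is an assumption about
# Detectron2's expected key layout; the provided conversion script is
# the authoritative implementation.
ckpt = torch.load("RegionViT-S.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
d2_state = {f"backbone.{k}": v for k, v in state_dict.items()}
torch.save({"model": d2_state}, "RegionViT-S_d2.pth")
```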
### Model Zoo

We provide models trained on MS COCO with MaskRCNN and RetinaNet. The checkpoints can be found [here](https://github.com/IBM/RegionViT/releases/tag/weights-v0.1).

#### MaskRCNN

| Name | #Params (M) | #FLOPs (G) | box mAP (1x) | mask mAP (1x) | box mAP (3x) | mask mAP (3x) | URL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RegionViT-S | 50.1 | 171.3 | 42.5 | 39.5 | 46.3 | 42.3 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-S.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-S.pth) |
| RegionViT-S+ | 50.9 | 182.9 | 43.5 | 40.4 | 47.3 | 43.4 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-S+.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-S+.pth) |
| RegionViT-S+ (w/ PEG) | 50.9 | 183.0 | 44.2 | 40.8 | 47.6 | 43.4 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-S+peg.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-S+peg.pth) |
| RegionViT-B | 92.2 | 287.9 | 43.5 | 40.1 | 47.2 | 43.0 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-B.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-B.pth) |
| RegionViT-B+ | 93.2 | 307.1 | 44.5 | 41.0 | 48.1 | 43.5 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-B+.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-B+.pth) |
| RegionViT-B+ (w/ PEG) | 93.2 | 307.2 | 45.4 | 41.6 | 48.3 | 43.5 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-B+peg.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-B+peg.pth) |
| RegionViT-B+ (w/ PEG) † | 93.2 | 464.4 | 46.3 | 42.4 | 49.2 | 44.5 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_1x_RegionViT-B+peg_dagger.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/MaskRCNN_3x_RegionViT-B+peg_dagger.pth) |
#### RetinaNet

| Name | #Params (M) | #FLOPs (G) | box mAP (1x) | box mAP (3x) | URL |
| --- | --- | --- | --- | --- | --- |
| RegionViT-S | 40.8 | 192.6 | 42.2 | 45.8 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-S.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-S.pth) |
| RegionViT-S+ | 41.5 | 204.2 | 43.1 | 46.9 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-S+.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-S+.pth) |
| RegionViT-S+ (w/ PEG) | 41.6 | 204.3 | 43.9 | 46.7 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-S+peg.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-S+peg.pth) |
| RegionViT-B | 83.4 | 308.9 | 43.3 | 46.1 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-B.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-B.pth) |
| RegionViT-B+ | 84.4 | 328.1 | 44.2 | 46.9 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-B+.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-B+.pth) |
| RegionViT-B+ (w/ PEG) | 84.5 | 328.2 | 44.6 | 46.9 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-B+peg.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-B+peg.pth) |
| RegionViT-B+ (w/ PEG) † | 84.5 | 506.4 | 46.1 | 48.2 | [1x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_1x_RegionViT-B+peg_dagger.pth) / [3x model](https://github.com/IBM/RegionViT/releases/download/weights-v0.1/RetinaNet_3x_RegionViT-B+peg_dagger.pth) |
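For a quick qualitative check of a released detection checkpoint, Detectron2's `DefaultPredictor` can run single-image inference. A minimal sketch, assuming the repo's custom config keys (e.g. `MODEL.BACKBONE.REGIONVIT`) and the RegionViT backbone have been registered with Detectron2 first; the `add_regionvit_config` hook below is hypothetical, so adapt it to however `main_detection.py` sets up its config:

```python
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Hypothetical hook: the repo's extra keys (e.g. MODEL.BACKBONE.REGIONVIT)
# must be added to cfg before the YAML below can be merged.
# add_regionvit_config(cfg)
cfg.merge_from_file("detection/configs/retinanet_regionvit_FPN_1x.yaml")
cfg.MODEL.WEIGHTS = "RetinaNet_1x_RegionViT-S.pth"
cfg.MODEL.DEVICE = "cuda"  # or "cpu"

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("demo.jpg"))  # BGR image, as Detectron2 expects
print(outputs["instances"].pred_boxes)
```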