# CrossFormer

This repository is the code for our paper [CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention](https://arxiv.org/pdf/2108.00154.pdf) (accepted by **ICLR 2022**).

Authors: [Wenxiao Wang](https://www.wenxiaowang.com), Lu Yao, [Long Chen](https://zjuchenlong.github.io/), Binbin Lin, [Deng Cai](http://www.cad.zju.edu.cn/home/dengcai/), [Xiaofei He](http://www.cad.zju.edu.cn/home/xiaofeihe/), Wei Liu

## Updates

- [x] Mask-RCNN detection/instance segmentation results with the 3x training schedule.
- [x] Cascade Mask-RCNN detection/instance segmentation results with the 3x training schedule.
- [x] The usage of `get_flops.py` in detection and segmentation.
- [x] Uploaded the pretrained CrossFormer-L.

## Introduction

Existing vision transformers fail to build attention among objects/features of different scales (cross-scale attention), yet this ability is very important for visual tasks. **CrossFormer** is a versatile vision transformer that solves this problem. Its core designs are the **C**ross-scale **E**mbedding **L**ayer (**CEL**) and **L**ong-**S**hort **D**istance **A**ttention (**L/SDA**), which work together to enable cross-scale attention.

**CEL** blends every input embedding with multiple-scale features. **L/SDA** splits all embeddings into several groups, and self-attention is only computed within each group (embeddings with the same colored border belong to the same group).

![](./figures/github_pic.png)

Further, we also propose a dynamic position bias (DPB) module, which makes the effective yet inflexible relative position bias applicable to variable image sizes.

Experiments are done on four representative visual tasks, *i.e.*, image classification, object detection, and instance/semantic segmentation. Results show that CrossFormer outperforms existing vision transformers on these tasks, especially on dense prediction tasks (*i.e.*, object detection and instance/semantic segmentation). We think this is because image classification only pays attention to one object and large-scale features, while dense prediction tasks rely more on cross-scale attention.
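To make the cross-scale embedding idea above concrete, here is a minimal PyTorch sketch: several convolutions with different kernel sizes share one stride, and their outputs are concatenated along the channel dimension so that every token blends multiple scales. The kernel sizes, channel split, and class name are illustrative assumptions, not the repository's exact implementation; see the code in this repo for the real CEL.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Illustrative sketch of a cross-scale embedding layer (CEL).

    Each output embedding blends features sampled with several kernel sizes
    (several receptive-field scales) at the same stride, so every token carries
    multi-scale information. Kernel sizes and the even channel split below are
    assumptions for illustration only.
    """

    def __init__(self, in_chans=3, embed_dim=96, kernel_sizes=(4, 8, 16, 32), stride=4):
        super().__init__()
        dims = [embed_dim // len(kernel_sizes)] * len(kernel_sizes)
        dims[0] += embed_dim - sum(dims)  # make the channel split sum to embed_dim
        self.projs = nn.ModuleList([
            nn.Conv2d(in_chans, dim, kernel_size=k, stride=stride, padding=(k - stride) // 2)
            for k, dim in zip(kernel_sizes, dims)
        ])

    def forward(self, x):                          # x: (B, 3, H, W)
        feats = [proj(x) for proj in self.projs]   # each: (B, dim_i, H/stride, W/stride)
        x = torch.cat(feats, dim=1)                # blend all scales along channels
        return x.flatten(2).transpose(1, 2)        # (B, num_tokens, embed_dim)

# e.g. CrossScaleEmbedding()(torch.randn(1, 3, 224, 224)).shape -> (1, 3136, 96)
```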
## Prerequisites

1. Libraries (Python3.6-based)

```bash
pip3 install numpy scipy Pillow pyyaml torch==1.7.0 torchvision==0.8.1 timm==0.3.2
```

2. Dataset: ImageNet

3. Requirements for detection/instance segmentation and semantic segmentation are listed here: [detection/README.md](./detection/README.md) or [segmentation/README.md](./segmentation/README.md)

## Getting Started

### Training

```bash
## There should be two directories under the path_to_imagenet: train and validation

## CrossFormer-T
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-S
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/small_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-B
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/base_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-L
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/large_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output
```

### Testing

```bash
## Take CrossFormer-T as an example
python -u -m torch.distributed.launch --nproc_per_node 1 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --eval --resume path_to_crossformer-t.pth
```

Training scripts for object detection: [detection/README.md](./detection/README.md).

Training scripts for semantic segmentation: [segmentation/README.md](./segmentation/README.md).

## Results

### Image Classification

Models are trained on ImageNet-1K and evaluated on its validation set. The input image size is 224 x 224.

| Architectures | Params | FLOPs | Accuracy | Models |
| ------------- | ------: | -----: | --------: | :---------------- |
| ResNet-50 | 25.6M | 4.1G | 76.2% | - |
| RegNetY-8G | 39.0M | 8.0G | 81.7% | - |
| **CrossFormer-T** | **27.8M** | **2.9G** | **81.5%** | [Google Drive](https://drive.google.com/file/d/1YSkU9enn-ITyrbxLH13zNcBYvWSEidfq/view?usp=sharing)/[BaiduCloud](https://pan.baidu.com/s/1M45YXZgVvp6Ew9DO8UhdlA), key: nkju |
| **CrossFormer-S** | **30.7M** | **4.9G** | **82.5%** | [Google Drive](https://drive.google.com/file/d/1RAkigsgr33va0RZ85S2Shs2BhXYcS6U8/view?usp=sharing)/[BaiduCloud](https://pan.baidu.com/s/1Xf4MXfb_soCnJFBeNDmoQQ), key: fgqj |
| **CrossFormer-B** | **52.0M** | **9.2G** | **83.4%** | [Google Drive](https://drive.google.com/file/d/1bK8biVCi17nz_nkt7rBfio_kywUpllSU/view?usp=sharing)/[BaiduCloud](https://pan.baidu.com/s/1f5dH__UGDXb-HoOPHT5p0A), key: 7md9 |
| **CrossFormer-L** | **92.0M** | **16.1G** | **84.0%** | [Google Drive](https://drive.google.com/file/d/1zRWByVW_KIZ87NgaBkDIm60DAsGJErdG/view?usp=sharing)/[BaiduCloud](https://pan.baidu.com/s/1YJLeHy_cxLBrZLklQBCA_A), key: cc89 |

More results compared with other vision transformers can be seen in the [paper](https://arxiv.org/pdf/2108.00154.pdf).
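The checkpoints linked above are meant to be passed to `main.py` via `--resume` (see Testing). A quick sanity check of a downloaded file can save a failed run; the snippet below is a minimal sketch that assumes a standard PyTorch checkpoint dict, and the `'model'` key it looks for is an assumption that may need adjusting to what `torch.load` actually reports.

```python
import torch

# Inspect a downloaded checkpoint before passing it to main.py via --resume.
# The 'model' key is an assumption; print the top-level keys and adapt as needed.
ckpt = torch.load("path_to_crossformer-t.pth", map_location="cpu")

if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state_dict = ckpt.get("model", ckpt)  # fall back to the raw dict if no 'model' key
else:
    state_dict = ckpt

print(f"{len(state_dict)} parameter tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")
```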
### Object Detection & Instance Segmentation

Models are trained on COCO 2017. Backbones are initialized with weights pre-trained on ImageNet-1K.

| Backbone | Detection Head | Learning Schedule | Params | FLOPs | box AP | mask AP |
| ------------- | ----------------- | -------------------- | ------: | ------: | ------: | ------: |
| ResNet-101 | RetinaNet | 1x | 56.7M | 315.0G | 38.5 | - |
| **CrossFormer-S** | RetinaNet | 1x | **40.8M** | **282.0G** | **44.4** | - |
| **CrossFormer-B** | RetinaNet | 1x | **62.1M** | **389.0G** | **46.2** | - |
| ResNet-101 | Mask-RCNN | 1x | 63.2M | 336.0G | 40.4 | 36.4 |
| **CrossFormer-S** | Mask-RCNN | 1x | **50.2M** | **301.0G** | **45.4** | **41.4** |
| **CrossFormer-B** | Mask-RCNN | 1x | **71.5M** | **407.9G** | **47.2** | **42.7** |
| **CrossFormer-S** | Mask-RCNN | 3x | **50.2M** | **291.1G** | **48.7** | **43.9** |
| **CrossFormer-B** | Mask-RCNN | 3x | **71.5M** | **398.1G** | **49.8** | **44.5** |
| **CrossFormer-S** | Cascade-Mask-RCNN | 3x | **88.0M** | **769.7G** | **52.2** | **45.2** |

More results and pretrained models for object detection: [detection/README.md](./detection/README.md).

### Semantic Segmentation

Models are trained on ADE20K. Backbones are initialized with weights pre-trained on ImageNet-1K.

| Backbone | Segmentation Head | Iterations | Params | FLOPs | IOU | MS IOU |
| ------------- | -------------------- | ----------: | ------: | -------: | ----: | ------: |
| **CrossFormer-S** | FPN | 80K | **34.3M** | **209.8G** | **46.4** | - |
| **CrossFormer-B** | FPN | 80K | **55.6M** | **320.1G** | **48.0** | - |
| **CrossFormer-L** | FPN | 80K | **95.4M** | **482.7G** | **49.1** | - |
| ResNet-101 | UPerNet | 160K | 86.0M | 1029.0G | 44.9 | - |
| **CrossFormer-S** | UPerNet | 160K | **62.3M** | **979.5G** | **47.6** | **48.4** |
| **CrossFormer-B** | UPerNet | 160K | **83.6M** | **1089.7G** | **49.7** | **50.6** |
| **CrossFormer-L** | UPerNet | 160K | **125.5M** | **1257.8G** | **50.4** | **51.4** |

*MS IOU means IOU with multi-scale testing.*

More results and pretrained models for semantic segmentation: [segmentation/README.md](./segmentation/README.md).

## Citing Us

```
@article{wang2021crossformer,
  title={CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention},
  author={Wang, Wenxiao and Yao, Lu and Chen, Long and Cai, Deng and He, Xiaofei and Liu, Wei},
  journal={arXiv preprint arXiv:2108.00154},
  year={2021}
}
```

## Acknowledgement

Part of the code of this repository refers to [Swin Transformer](https://github.com/microsoft/Swin-Transformer).