# oCLIP

This repository is the official implementation for the following paper:

[Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting](https://arxiv.org/abs/2203.03911)
Chuhui Xue, Wenqing Zhang, Yu Hao, Shijian Lu, Philip Torr, Song Bai, ECCV 2022 (Oral)

Part of the code is adapted from [open_clip](https://github.com/mlfoundations/open_clip).

# Models

* English

| Backbone | Pre-train Data | Pre-train Model | Fine-tune Data | Fine-tune Model ([PSENet](https://github.com/whai362/PSENet)) | Precision | Recall | F-score |
|----------|----------------|-----------------|----------------|----------------|-----------|--------|---------|
| ResNet-50 | [SynthText](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONgVq_Q-h28M5NSTg?e=Elvcsj) | [Total-Text](https://github.com/cs-chan/Total-Text-Dataset) | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONkmHtayDoqTHLgkg?e=zu64Iv) | 89.9 | 81.6 | 85.5 |
| ResNet-101 | [SynthText](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONibwNePlvoa49oUg?e=2jjbEa) | [Total-Text](https://github.com/cs-chan/Total-Text-Dataset) | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONm7Jdoi58GXb67gg?e=X0K39L) | 89.9 | 82.2 | 85.9 |
| ResNet-50 | Web Image | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONhJgZBfPf0ymJ3YA?e=6SZR4w) | [Total-Text](https://github.com/cs-chan/Total-Text-Dataset) | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONlCkGsx0zLJkSrbw?e=ng4hSt) | 90.1 | 83.5 | 86.7 |

* Chinese

| Backbone | Pre-train Data | Pre-train Model |
|----------|----------------|-----------------|
| ResNet-50 | [LSVT-Weak Annotation](https://rrc.cvc.uab.es/?ch=16&com=introduction) | [Link](https://1drv.ms/u/s!Al-Eh1QsezDHhONjIta2HhLeSNPhgw?e=ziQaBe) |

# Training oCLIP

## Conda

```Bash
conda create -n oclip python=3.7
conda activate oclip
git clone https://github.com/bytedance/oclip.git
cd oclip
pip install -r requirement.txt
export PYTHONPATH="$PYTHONPATH:$PWD/src"
```

## Data

Download [SynthText](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) and put it in ./data. You may use the provided script to generate the annotations for pre-training:

```Bash
python tools/convert_synthtext_csv.py --data_dir data/SynthText/ --save_dir data/SynthText/
```

* Note that a space character ([space]) is used to represent masked characters in the annotations. For customized datasets, you may modify the code in src/training/data.py and your data annotations accordingly; see the sketch after this note.
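As a rough illustration of the note above, the following sketch (not part of the repository) writes a small pre-training CSV for a hypothetical custom dataset. The column names match the `--csv-img-key filepath` and `--csv-caption-key title` flags used in the training command in the next section; the paths, the `mask_text` helper, and the masking ratio are invented for illustration, and the delimiter and exact caption format expected by `src/training/data.py` should be verified before use.

```Python
# Hypothetical sketch: build a pre-training CSV for a custom dataset.
# Column names follow the --csv-img-key / --csv-caption-key flags used by
# the training script; paths, helper, and masking ratio are assumptions.
import csv
import os
import random

def mask_text(text, mask_ratio=0.3):
    """Replace a random subset of characters with spaces ([space] = masked character)."""
    chars = list(text)
    for i, c in enumerate(chars):
        if c != " " and random.random() < mask_ratio:
            chars[i] = " "
    return "".join(chars)

# Illustrative (image path, transcription) pairs for a hypothetical dataset.
samples = [
    ("data/MyDataset/images/0001.jpg", "STIRLING"),
    ("data/MyDataset/images/0002.jpg", "LODGINGS"),
]

os.makedirs("data/MyDataset", exist_ok=True)
with open("data/MyDataset/train_char.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filepath", "title"])  # keys passed via --csv-img-key / --csv-caption-key
    for filepath, text in samples:
        writer.writerow([filepath, mask_text(text)])
```

The resulting file can then be passed to the training script via `--train-data`.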
## Train

Sample command for training:

```Bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u src/training/main.py \
    --save-frequency 3 \
    --report-to tensorboard \
    --train-data="data/SynthText/train_char.csv" \
    --char-dict-pth="data/SynthText/char_dict" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --warmup 10000 \
    --batch-size=64 \
    --lr=1e-4 \
    --wd=0.1 \
    --epochs=100 \
    --workers=8 \
    --model RN50 \
    --logs='output/RN50_synthtext'
```

# Visualization

We also provide a script for visualizing the attention maps of the pre-trained model.
Download the [pre-trained model](https://entuedu-my.sharepoint.com/:u:/r/personal/xuec0003_e_ntu_edu_sg/Documents/opensource/oCLIP/Pre-trained%20Models/RN50_synthtext.pt?csf=1&web=1&e=uVnGWs) to ./pretrained.

```Bash
python3 tools/visualize_attn.py --model_path pretrained/RN50_synthtext.pt --char_dict_path data/SynthText/char_dict --model_config_file src/training/model_configs/RN50.json --im_fn demo/sample.jpg --text_list "ST LING" "STRLIN " "A GYLL'S" " ODGINGS" --demo_path demo/
```

| Input Image | Image Attention Map | "ST LING" | "STRLIN " | "A GYLL'S" | " ODGINGS" |
|-------------|---------------------|-----------|-----------|------------|------------|
| ![Input Image](demo/sample.jpg) | ![Image Attention Map](demo/im_attn_demo.jpg) | ![Char Attention Map 0](demo/char_attn_demo_0.jpg) | ![Char Attention Map 1](demo/char_attn_demo_1.jpg) | ![Char Attention Map 2](demo/char_attn_demo_2.jpg) | ![Char Attention Map 3](demo/char_attn_demo_3.jpg) |

# Fine-tune in MMOCR

We provide a script for converting the model parameter names so that the pre-trained weights can be used with the dev-1.x branch of [MMOCR](https://github.com/open-mmlab/mmocr/tree/dev-1.x):

```Bash
# First modify model_path and save_path in tools/convert2mmocr.py
python tools/convert2mmocr.py
```

# Citation

```Text
@article{xue2022language,
  title={Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting},
  author={Xue, Chuhui and Zhang, Wenqing and Hao, Yu and Lu, Shijian and Torr, Philip and Bai, Song},
  journal={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}
```