# BUTD_model

- A PyTorch implementation of "[Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_Bottom-Up_and_Top-Down_CVPR_2018_paper.html)" for image captioning.
- SCST training from "[Self-critical Sequence Training for Image Captioning](https://openaccess.thecvf.com/content_cvpr_2017/html/Rennie_Self-Critical_Sequence_Training_CVPR_2017_paper.html)".
- Clear and easy to learn.

## Environment

- Python 3.7
- PyTorch 1.3.1

## Method

### 1. Architecture

![Architecture](./method_figs/Architecture.png)

### 2. Main Process

- Top-Down Attention LSTM input
  ![Formula1](./method_figs/Formula1.png)
- Attend
  ![Formula2](./method_figs/Formula2.png)
- Language LSTM input
  ![Formula3](./method_figs/Formula3.png)

A minimal PyTorch sketch of this two-LSTM decoding step is given in the code sketches at the end of the Usage section.

## Usage

### 1. Preprocessing

Extract image features with ResNet-101 (denoted **grid-based features**) and process the COCO captions data (from the [Karpathy splits](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip)) through `preprocess.py`. You need to adjust its parameters; the `resnet101_file` weights come from [here](https://drive.google.com/drive/folders/0B7fNdx_jAqhtbVYzOURMdDNHSGM). A rough sketch of this extraction is included in the code sketches below.

Image features can also be obtained from [here](https://github.com/peteanderson80/bottom-up-attention) or extracted using the [ezeli/bottom_up_features_extract](https://github.com/ezeli/bottom_up_features_extract) repository (using a fixed 36 features per image, denoted **region-based features**).

This project is not limited to the MSCOCO dataset, but you need to prepare your data in the format expected by `preprocess.py`.

### 2. Training

- First adjust the parameters in `opt.py`:
  - `train_mode`: 'xe' for pre-training, 'rl' for fine-tuning (+SCST).
  - `learning_rate`: 4e-4 for xe, 4e-5 for rl.
  - `resume`: checkpoint to resume training from; required for rl.
  - Other parameters can be modified as needed.
- Run:
  - `python train.py`
- Checkpoints are saved in the `checkpoint` dir and test results in the `result` dir.

A sketch of the SCST objective also appears in the code sketches below.

### 3. Test

- `python test.py -t model.pth -i image.jpg`
- Only applicable to models trained on grid-based features.
- For region-based features, first extract the image features through the [ezeli/bottom_up_features_extract](https://github.com/ezeli/bottom_up_features_extract) repository, then make small modifications to `test.py` to use them.
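### 4. Code sketches (illustrative)

The snippet below is a minimal sketch of one decoding step from the Main Process above (Top-Down Attention LSTM input, attention over features, Language LSTM input). All layer names and dimensions are assumptions for illustration, not the repository's actual identifiers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BUTDDecoderStep(nn.Module):
    """One decoding step of the two-LSTM Bottom-Up Top-Down decoder (sketch)."""

    def __init__(self, emb_dim=1000, feat_dim=2048, hid_dim=1000, att_dim=512):
        super().__init__()
        # Top-Down Attention LSTM: input is [h2_{t-1}; mean feature; word embedding]
        self.att_lstm = nn.LSTMCell(hid_dim + feat_dim + emb_dim, hid_dim)
        # Attention MLP: scores each region from its feature and h1_t
        self.fc_feat = nn.Linear(feat_dim, att_dim)
        self.fc_h = nn.Linear(hid_dim, att_dim)
        self.fc_score = nn.Linear(att_dim, 1)
        # Language LSTM: input is [attended feature; h1_t]
        self.lang_lstm = nn.LSTMCell(feat_dim + hid_dim, hid_dim)

    def forward(self, word_emb, feats, state1, state2):
        # feats: (B, k, feat_dim) grid- or region-based features
        feats_mean = feats.mean(dim=1)
        # Top-Down Attention LSTM input: [h2_{t-1}; v_bar; W_e * Pi_t]
        h1, c1 = self.att_lstm(
            torch.cat([state2[0], feats_mean, word_emb], dim=1), state1)
        # Attend: a_{i,t} = w^T tanh(W_v v_i + W_h h1_t), alpha_t = softmax(a_t)
        scores = self.fc_score(torch.tanh(
            self.fc_feat(feats) + self.fc_h(h1).unsqueeze(1))).squeeze(2)
        alpha = F.softmax(scores, dim=1)
        v_hat = (alpha.unsqueeze(2) * feats).sum(dim=1)
        # Language LSTM input: [v_hat_t; h1_t]
        h2, c2 = self.lang_lstm(torch.cat([v_hat, h1], dim=1), state2)
        return h2, (h1, c1), (h2, c2)
```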
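Grid-based features as described in Preprocessing can be extracted roughly as follows. This sketch uses torchvision's pretrained ResNet-101 as a stand-in for the `resnet101_file` checkpoint, and the input resolution is an assumption.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Drop the average-pool and FC head so the output is the final
# 2048-channel convolutional feature map (the grid of features).
resnet = models.resnet101(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),  # assumed resolution, not the repo's setting
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('image.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    fmap = extractor(img)                # (1, 2048, 14, 14)
# Flatten the spatial grid into k = 14*14 = 196 vectors of size 2048.
feats = fmap.flatten(2).transpose(1, 2)  # (1, 196, 2048)
```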
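For the 'rl' training mode, the self-critical (SCST) objective uses the model's own greedy rollout as the reward baseline. In this sketch, `model.sample`, `model.greedy_decode`, and `cider_score` are assumed interfaces, not the repository's actual API, and padding masks are omitted for brevity.

```python
import torch

def scst_loss(model, feats, gts, cider_score):
    """Self-critical sequence training objective (sketch)."""
    # Sample captions and keep their per-word log-probabilities.
    sample_caps, log_probs = model.sample(feats)      # (B, T), (B, T)
    # Greedy decoding under the current model is the reward baseline.
    with torch.no_grad():
        greedy_caps = model.greedy_decode(feats)
    reward = cider_score(sample_caps, gts)            # (B,) CIDEr per sample
    baseline = cider_score(greedy_caps, gts)          # (B,)
    advantage = (reward - baseline).unsqueeze(1)      # (B, 1)
    # REINFORCE with a greedy baseline: push up the log-probs of captions
    # that beat the greedy rollout, push down those that do worse.
    return -(advantage * log_probs).mean()
```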
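At test time, the decoder step above would be driven by a greedy loop like the one below; `embed`, `classifier`, and the token indices are hypothetical helpers, not code from `test.py`.

```python
import torch

def greedy_caption(decoder, embed, classifier, feats,
                   bos_idx, eos_idx, max_len=20):
    """Greedy decoding with the BUTDDecoderStep sketch above."""
    B, hid = feats.size(0), decoder.lang_lstm.hidden_size
    state1 = (feats.new_zeros(B, hid), feats.new_zeros(B, hid))
    state2 = (feats.new_zeros(B, hid), feats.new_zeros(B, hid))
    word = torch.full((B,), bos_idx, dtype=torch.long, device=feats.device)
    words = []
    for _ in range(max_len):
        h2, state1, state2 = decoder(embed(word), feats, state1, state2)
        word = classifier(h2).argmax(dim=1)  # pick the most likely next word
        words.append(word)
        if (word == eos_idx).all():
            break
    return torch.stack(words, dim=1)  # (B, <=max_len) token ids
```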
## Result

### Evaluation metrics

Evaluation tool: [ezeli/caption_eval](https://github.com/ezeli/caption_eval)

*XE* represents Cross-Entropy loss, and *+SCST* means using reinforcement learning to fine-tune the model (with the CIDEr reward).

|features|training|Bleu-1|Bleu-2|Bleu-3|Bleu-4|METEOR|ROUGE_L|CIDEr|SPICE|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|grid-based|XE|75.4|59.1|45.5|34.8|26.9|55.6|109.3|20.2|
|grid-based|+SCST|78.7|62.5|47.6|35.7|27.2|56.7|119.1|20.7|
|region-based|XE|76.0|59.9|46.4|35.8|27.3|56.2|110.9|20.3|
|region-based|+SCST|79.5|63.6|48.8|36.9|27.8|57.6|123.1|21.4|

### Examples

|![COCO_val2014_000000386164](./method_figs/COCO_val2014_000000386164.jpg)|
|:---:|
|a bunch of wooden knives on a wooden table.|