# self-critical.pytorch **Repository Path**: srwpf/self-critical.pytorch ## Basic Information - **Project Name**: self-critical.pytorch - **Description**: Unofficial pytorch implementation for Self-critical Sequence Training for Image Captioning; py3 version in py3 branch (it actually should be compatible to both py2 and py3) - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2019-10-25 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Self-critical Sequence Training for Image Captioning (+ misc.) This repository includes the unofficial implementation [Self-critical Sequence Training for Image Captioning](https://arxiv.org/abs/1612.00563) and [Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering](https://arxiv.org/abs/1707.07998). The author of SCST helped me a lot when I tried to replicate the result. Great thanks. The att2in2 model can achieve more than 1.20 Cider score on Karpathy's test split (with self-critical training, bottom-up feature, large rnn hidden size, without ensemble) This is based on my [ImageCaptioning.pytorch](https://github.com/ruotianluo/ImageCaptioning.pytorch) repository. The modifications is: - Self critical training. - Bottom up feature support from [ref](https://arxiv.org/abs/1707.07998). (Evaluation on arbitrary images is not supported.) - Ensemble - Multi-GPU training - Add transformer (merged from [Transformer_captioning](https://github.com/ruotianluo/Transformer_Captioning)) ## Requirements Python 2.7 (because there is no [coco-caption](https://github.com/tylin/coco-caption) version for python 3) PyTorch 1.0 (along with torchvision) cider (already been added as a submodule) (**Skip if you are using bottom-up feature**): If you want to use resnet to extract image features, you need to download pretrained resnet model for both training and evaluation. The models can be downloaded from [here](https://drive.google.com/open?id=0B7fNdx_jAqhtbVYzOURMdDNHSGM), and should be placed in `data/imagenet_weights`. ## Pretrained models (using resnet101 feature) Pretrained models are provided [here](https://drive.google.com/open?id=0B7fNdx_jAqhtdE1JRXpmeGJudTg). And the performances of each model will be maintained in this [issue](https://github.com/ruotianluo/neuraltalk2.pytorch/issues/10). If you want to do evaluation only, you can then follow [this section](#generate-image-captions) after downloading the pretrained models (and also the pretrained resnet101). ## Train your own network on COCO/Flickr30k ### Prepare data. We now support both flickr30k and COCO. See details in `data/README.md`. (Note: the later sections assume COCO dataset; it should be trivial to use flickr30k.) ### Start training ```bash $ python train.py --id fc --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30 ``` The train script will dump checkpoints into the folder specified by `--checkpoint_path` (default = `save/`). We only save the best-performing checkpoint on validation and the latest checkpoint to save disk space. To resume training, you can specify `--start_from` option to be the path saving `infos.pkl` and `model.pth` (usually you could just set `--start_from` and `--checkpoint_path` to be the same). If you have tensorflow, the loss histories are automatically dumped into `--checkpoint_path`, and can be visualized using tensorboard. The current command use scheduled sampling, you can also set scheduled_sampling_start to -1 to turn off scheduled sampling. If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to validation cross entropy loss, use `--language_eval 1` option, but don't forget to download the [coco-caption code](https://github.com/tylin/coco-caption) into `coco-caption` directory. For more options, see `opts.py`. **A few notes on training.** To give you an idea, with the default settings one epoch of MS COCO images is about 11000 iterations. After 1 epoch of training results in validation loss ~2.5 and CIDEr score of ~0.68. By iteration 60,000 CIDEr climbs up to about ~0.84 (validation loss at about 2.4 (under scheduled sampling)). ### Train using self critical First you should preprocess the dataset and get the cache for calculating cider score: ``` $ python scripts/prepro_ngrams.py --input_json .../dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train ``` Then, copy the model from the pretrained model using cross entropy. (It's not mandatory to copy the model, just for back-up) ``` $ bash scripts/copy_model.sh fc fc_rl ``` Then ```bash $ python train.py --id fc_rl --caption_model fc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --cached_tokens coco-train-idxs ``` You will see a huge boost on Cider score, : ). **A few notes on training.** Starting self-critical training after 30 epochs, the CIDEr score goes up to 1.05 after 600k iterations (including the 30 epochs pertraining). ### Caption images after training ## Generate image captions ### Evaluate on raw images Now place all your images of interest into a folder, e.g. `blah`, and run the eval script: ```bash $ python eval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10 ``` This tells the `eval` script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing `batch_size`. Use `--num_images -1` to process all images. The eval script will create an `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface: ```bash $ cd vis $ python -m SimpleHTTPServer ``` Now visit `localhost:8000` in your browser and you should see your predicted captions. ### Evaluate on Karpathy's test split ```bash $ python eval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1 ``` The defualt split to evaluate is test. The default inference method is greedy decoding (`--sample_method greedy`), to sample from the posterior, set `--sample_method sample`. **Beam Search**. Beam search can increase the performance of the search for greedy decoding sequence by ~5%. However, this is a little more expensive. To turn on the beam search, use `--beam_size N`, N should be greater than 1. ## Miscellanea **Using cpu**. The code is currently defaultly using gpu; there is even no option for switching. If someone highly needs a cpu model, please open an issue; I can potentially create a cpu checkpoint and modify the eval.py to run the model on cpu. However, there's no point using cpu to train the model. **Train on other dataset**. It should be trivial to port if you can create a file like `dataset_coco.json` for your own dataset. **Live demo**. Not supported now. Welcome pull request. ## For more advanced features: Checkout `ADVANCED.md`. ## Reference If you find this repo useful, please consider citing (no obligation at all): ``` @article{luo2018discriminability, title={Discriminability objective for training descriptive captions}, author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory}, journal={arXiv preprint arXiv:1803.04376}, year={2018} } ``` Of course, please cite the original paper of models you are using (You can find references in the model files). ## Acknowledgements Thanks the original [neuraltalk2](https://github.com/karpathy/neuraltalk2) and awesome PyTorch team.