# transformer-caption

**Repository Path**: srwpf/transformer-caption

## Basic Information

- **Project Name**: transformer-caption
- **Description**: A Bottom-Up and Top-Down Approach for Image Captioning using Transformer
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-10-26
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# caption-transformer

This code implements the work *A Bottom-Up and Top-Down Approach for Image Captioning using Transformer*, accepted at ICVGIP 2018.

### Disclaimer

This code is modified from [tensor2tensor](https://github.com/tensorflow/tensor2tensor) and [show-attend-tell](https://github.com/yunjey/show-attend-and-tell), and uses features obtained from [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention). Please refer to these links for further README information (for example, relating to other models and datasets included in the repo) and appropriate citations for these works.

### Requirements: software

1. python2.7
2. tensor2tensor==1.5.1 (see: [tensor2tensor](https://github.com/tensorflow/tensor2tensor))
3. tensorflow-gpu==1.4.1
4. Python packages you might not have: `Ipython`, `Matplotlib`, `scikit-image`

### Data Setup

Download the [MSCOCO dataset](http://mscoco.org/dataset/#download) and place `train2014.zip`, `val2014.zip`, `test2014.zip`, and `captions_train-val2014.zip` into `data/tmp/t2t_datagen_caption/`. Split the images according to the Karpathy split and place them into `data/tmp/train2014`, `data/tmp/val2014`, and `data/tmp/test2014`. The bottom-up features for the entire COCO dataset are obtained from [bottom-up-attention](https://github.com/peteanderson80/bottom-up-attention); place them in the `data/tmp/bottom_up_fatures` folder.

### Data Generation

```
t2t-datagen --data_dir=data/t2t_data_caption_bottom --tmp_dir=data/tmp/t2t_datagen_caption --problem=image_ms_coco_tokens32k
```

### Training

```
t2t-trainer --data_dir=data/t2t_data_caption_bottom --problems=image_ms_coco_tokens32k --model=transformer --hparams_set=transformer_base_single_gpu --output_dir=data/t2t_train/image_ms_coco_tokens32k/transformer-transformer_base_single_gpu --keep_checkpoint_max=10 --hparams="num_heads=8" --local_eval_frequency=20000 --train_steps=550000
```

### Decoding

```
t2t-decoder --data_dir=data/t2t_data_caption_bottom --problems=image_ms_coco_tokens32k --model=transformer --hparams_set=transformer_base_single_gpu --output_dir=data/t2t_train/image_ms_coco_tokens32k/transformer-transformer_base_single_gpu --decode_hparams="beam_size=1,save_images=True" --decode_to_file=dec --hparams="num_heads=8"
```

Evaluate generated captions with `evaluate/evaluate_generated_captions.py` and visualize attention with `visualization/TransformerVisualizationCaption.py`.
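For reference, the sketch below illustrates how decoded captions are typically scored against the MSCOCO ground-truth annotations using the standard `coco-caption` toolkit (`pycocoevalcap`). This is not the interface of the repo's own `evaluate/evaluate_generated_captions.py`; the annotation path, the results-file name, and the assumed results format (a JSON list of `{"image_id": ..., "caption": ...}` entries) are illustrative assumptions only.

```python
# Illustrative sketch: scoring generated captions with the coco-caption
# toolkit (pycocoevalcap). Paths and the results-file format are assumed
# for illustration; the repo's evaluate_generated_captions.py may differ.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth MSCOCO captions (from captions_train-val2014.zip); path assumed.
annotation_file = "data/tmp/t2t_datagen_caption/annotations/captions_val2014.json"
# Decoded captions, assumed to be a JSON list of
# {"image_id": int, "caption": str} entries.
results_file = "dec.json"

coco = COCO(annotation_file)
coco_res = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_res)
# Restrict evaluation to the images that were actually decoded.
coco_eval.params["image_id"] = coco_res.getImgIds()
coco_eval.evaluate()

# Print BLEU, METEOR, ROUGE-L and CIDEr scores.
for metric, score in coco_eval.eval.items():
    print("%s: %.3f" % (metric, score))
```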