# DynamicCoT
**Repository Path**: ByteDance/DynamicCoT
## Basic Information
- **Project Name**: DynamicCoT
- **Description**: 🔥 [EMNLP 2025] Official open-source repo for Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-11
- **Last Updated**: 2026-01-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## 🚀 News
- **2025.10.09** **arXiv** preprint released.
- **2025.10.09** Code released.
- **2025.08.21** 🎉 DynamicCoT has been accepted to EMNLP 2025.
## 📝 Introduction
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks, which overestimate model capability due to significant overlap between the training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, zero-shot prompting and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities at inference time. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code and datasets will be made publicly available upon acceptance of the paper.
## 🔧 Get Started
#### Installation and Data Preparation
Step 1. Prepare the environment.
```shell
pip3 install -e ".[torch,metrics,deepspeed]"
# transformers==4.52.1 is required for InternVL3; transformers==4.49.0 for the other models
pip3 install "transformers==4.49.0"  # install 4.52.1 instead when running InternVL3
pip3 install vllm==0.7.3
```
Step 2. Prepare the dataset. Download the raw images from [CMKP](https://github.com/yuewang-cuhk/CMKP), then run the preprocessing script:
```shell
python3 data/preprocess_datasets.py /path/to/images data/
```
#### Train model
```shell
bash train_full_sft.sh {/path/to/model} {/path/to/output} --template {template} --run_name {wandb_run_name} --dataset {train_dataset} --per_device_train_batch_size 1 --num_train_epochs {epoch}
```
#### Test model
```shell
# for InternVL3; the source_txt files are located under data/mmkp_source/
bash eval_internvl.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
# for other models
bash eval_full_sft.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
```
## 💡 Method

## 💡 Case Study

## 🙏 Acknowledgement
This project would not be possible without several great open-source codebases. We list some notable examples below.
- [transformers](https://github.com/huggingface/transformers)
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- [InternVL](https://github.com/OpenGVLab/InternVL)
## 📃 Bibtex
If this work is helpful for your research, please consider citing it with the following BibTeX entry.
```bibtex
@article{ma2025dynamiccot,
  title={Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models},
  author={Ma, Qihang and Li, Shengyu and Tang, Jie and Yang, Dingkang and Chen, Shaodong and Zhang, Yingyi and Feng, Chao and Ran, Jiao},
  journal={arXiv preprint arXiv:2510.09358},
  year={2025}
}
```