# DynamicCoT
**Repository Path**: ByteDance/DynamicCoT
## Basic Information
- **Project Name**: DynamicCoT
- **Description**: 🔥 [EMNLP 2025] Official open-source repo for Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-11
- **Last Updated**: 2026-01-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## 🚀 News
- **2025.10.09** **arXiv** preprint released.
- **2025.10.09** Code released.
- **2025.08.21** 🎉 DynamicCoT has been accepted to EMNLP 2025.
## 📝 Introduction
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks, which overestimate model capability due to significant overlap between the training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, zero-shot prompting and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities at inference time. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches. The code and datasets will be made publicly available upon acceptance of the paper.
## 🔧 Get Started
#### Installation and Data Preparation
Step 1. Prepare the environment.
```shell
pip3 install -e ".[torch,metrics,deepspeed]"
# transformers==4.52.1 is required for InternVL3; transformers==4.49.0 for the other models
pip3 install "transformers==4.49.0"  # install 4.52.1 instead when running InternVL3
pip3 install vllm==0.7.3
```
Step 2. Prepare the dataset. Download the raw images from [CMKP](https://github.com/yuewang-cuhk/CMKP), then run the preprocessing script:
```shell
python3 data/preprocess_datasets.py /path/to/images data/
```
#### Train model
```shell
bash train_full_sft.sh {/path/to/model} {/path/to/output} --template {template} --run_name {wandb_run_name} --dataset {train_dataset} --per_device_train_batch_size 1 --num_train_epochs {epoch}
```
#### Test model
```shell
# for InternVL3; the source_txt files are located under data/mmkp_source/
bash eval_internvl.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
# for other models
bash eval_full_sft.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
```
## 💡 Method

## 💡 Case Study

## 🙏 Acknowledgement
This project would not be possible without several great open-source codebases. We list some notable examples below.
- [transformers](https://github.com/huggingface/transformers)
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- [InternVL](https://github.com/OpenGVLab/InternVL)
## 📃 Bibtex
If this work is helpful for your research, please consider citing it with the following BibTeX entry.
```bibtex
@article{ma2025dynamiccot,
  title={Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models},
  author={Ma, Qihang and Li, Shengyu and Tang, Jie and Yang, Dingkang and Chen, Shaodong and Zhang, Yingyi and Feng, Chao and Ran, Jiao},
  journal={arXiv preprint arXiv:2510.09358},
  year={2025}
}
```