# video-SALMONN-2

**Repository Path**: ByteDance/video-SALMONN-2

## Basic Information

- **Project Name**: video-SALMONN-2
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-28
- **Last Updated**: 2026-02-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

🚀🚀 Welcome to the repo of **video-SALMONN 2**!

video-SALMONN 2 is a powerful audio-visual large language model (LLM) that **generates high-quality audio-visual video captions**, developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.
## 🔥 News

- **2026-02-24**: We release the minimal inference code for video-SALMONN 2+ 3B and 7B.
- **2026-02-15**: We release the audio-aligned video-SALMONN 2+ 3B checkpoint.
- **2026-01-28**: We release the audio-aligned model of video-SALMONN 2+ 72B for finetuning larger audio-visual models.
- **2025-12-18**: We release the audio-aligned model of video-SALMONN 2+ 7B for further finetuning.
- **2025-09-26**: A new version (Version-2509) of video-SALMONN 2+ is released, containing minor [code](https://github.com/bytedance/video-SALMONN-2/tree/main/video_SALMONN2_plus) revisions, updates to the [7B model](https://huggingface.co/tsinghua-ee/video-SALMONN-2_plus_7B) and [72B model](https://huggingface.co/tsinghua-ee/video-SALMONN-2_plus_72B), and the addition of a [3B model](https://huggingface.co/tsinghua-ee/video-SALMONN-2_plus_3B). The upgraded video-SALMONN 2+ further enhances audio-visual and visual-only understanding on various benchmarks.
- **2025-07-17**: We release the code and checkpoint of video-SALMONN 2+ at [video-SALMONN 2+](https://github.com/bytedance/video-SALMONN-2/tree/main/video_SALMONN2_plus) (Version-2507). video-SALMONN 2+ achieves SOTA results on the [Video-MME](https://video-mme.github.io/home_page.html) benchmark.
- **2025-07-08**: We release the 7B version of video-SALMONN 2.
- **2025-06-18**: We release the code of video-SALMONN 2.

## ⚡️ Results

We evaluate the models on audio-visual QA benchmarks including Video-MME, WorldSense, AVUT, Video-Holmes, and DailyOmni, and on visual-only benchmarks including MLVU and LVBench. Our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems.

*(Benchmark results figure)*

## 🌈 How to Use

### For video-SALMONN 2+, please refer to [video_SALMONN2_plus](https://github.com/bytedance/video-SALMONN-2/tree/main/video_SALMONN2_plus)

### How to train video-SALMONN 2

1. Prepare the dataset following `scripts/example_sft.json` and `scripts/example_dpo.json`.
2. Download the LLaVA-OneVision model from [Hugging Face](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov).
3. Modify the parameters in `scripts/train_sft.sh` and `scripts/train_dpo.sh`.
4. Run `bash scripts/train_sft.sh` or `bash scripts/train_dpo.sh`.

### How to evaluate a checkpoint

1. Prepare the dataset following `scripts/example_sft.json`.
2. Modify the parameters in `scripts/eval.sh`.
3. Run `bash scripts/eval.sh`.

## 👀 Team

**Team Tsinghua**: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Chao Zhang

**Team ByteDance**: Wei Li, Zejun Ma

## ✨ Citation

If you find video-SALMONN 2 useful, please cite the paper:

```
@article{tang2025video,
  title={{video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models}},
  author={Changli Tang and Yixuan Li and Yudong Yang and Jimin Zhuang and Guangzhi Sun and Wei Li and Zejun Ma and Chao Zhang},
  journal={arXiv preprint arXiv:2506.15220},
  year={2025},
}
```
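Both the training and evaluation recipes above start from a dataset JSON (`scripts/example_sft.json` / `scripts/example_dpo.json`). Since this README does not show the actual schema, below is a hypothetical pre-flight check, useful for catching malformed entries before launching a long training run. The field names `video` and `conversations` are assumptions for illustration only; consult `scripts/example_sft.json` in the repo for the real format.

```python
import json

def load_and_check(path):
    """Load a dataset JSON file and run basic sanity checks.

    NOTE: the field names "video" and "conversations" are assumed
    for illustration; check scripts/example_sft.json for the real schema.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a top-level list of samples")
    for i, sample in enumerate(data):
        for key in ("video", "conversations"):
            if key not in sample:
                raise ValueError(f"{path}: sample {i} is missing '{key}'")
    return data

# Write a tiny hypothetical dataset and validate it.
example = [
    {
        "video": "clips/demo.mp4",
        "conversations": [
            {"from": "human", "value": "Describe the video."},
            {"from": "gpt", "value": "A dog runs on a beach."},
        ],
    }
]
with open("example_sft.json", "w", encoding="utf-8") as f:
    json.dump(example, f)

print(len(load_and_check("example_sft.json")))  # → 1
```

A check like this is cheap relative to a multi-GPU SFT/DPO job, so it is worth running whenever the dataset JSON is regenerated.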