# video-SALMONN-2

**Repository Path**: ByteDance/video-SALMONN-2

## Basic Information

- **Project Name**: video-SALMONN-2
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-28
- **Last Updated**: 2026-02-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

Welcome to the repo of **video-SALMONN 2**!

video-SALMONN 2 is a powerful audio-visual large language model (LLM) that **generates high-quality audio-visual video captions**. It is developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.

## 🔥 News

- **2026-02-24**: We release the minimal inference code for video-SALMONN 2+ 3B and 7B.
- **2026-02-15**: We release the audio-aligned video-SALMONN 2+ 3B checkpoint.
- **2026-01-28**: We release the audio-aligned model of video-SALMONN 2+ 72B for finetuning larger audio-visual models.
- **2025-12-18**: We release the audio-aligned model of video-SALMONN 2+ 7B for further finetuning.
- **2025-09-26**: A new version (Version-2509) of video-SALMONN 2+ is released. It contains a minor [code](https://github.com/bytedance/video-SALMONN-2/tree/main/video_SALMONN2_plus) revision, updates to the [7B model](https://huggingface.co/tsinghua-ee/video-SALMONN-2_plus_7B) and the [72B model](https://huggingface.co/tsinghua-ee/video-SALMONN-2_plus_72B), and a new [3B model](https://huggingface.co/tsinghua-ee/video-SALMONN-2_plus_3B). The upgraded video-SALMONN 2+ further improves audio-visual and visual-only understanding on various benchmarks.
- **2025-07-17**: We release the code and checkpoint of video-SALMONN 2+ (Version-2507) at [video-SALMONN 2+](https://github.com/bytedance/video-SALMONN-2/tree/main/video_SALMONN2_plus). video-SALMONN 2+ achieves SOTA results on the [Video-MME](https://video-mme.github.io/home_page.html) benchmark.
- **2025-07-08**: We release the 7B version of video-SALMONN 2.
- **2025-06-18**: We release the code of video-SALMONN 2.

## ⚡️ Results

We evaluate the models on audio-visual QA benchmarks, including Video-MME, WorldSense, AVUT, Video-Holmes, and DailyOmni, and on visual-only benchmarks, including MLVU and LVBench. Our 3B and 7B models achieve SOTA results among models of comparable scale, while the 72B model surpasses all other open-source systems.