# ml-streambridge
[arXiv:2505.05467](https://arxiv.org/abs/2505.05467)
**StreamBridge** is a simple yet powerful framework that enables offline Video-LLMs to perform effectively in streaming scenarios. It features:
- A **memory buffer** with round-decayed compression for long-context, multi-turn interactions.
- A **decoupled and lightweight activation model** that enables proactive, timely responses without affecting the base model's reasoning capabilities.
- A newly built dataset, **Stream-IT**, tailored for streaming video understanding with interleaved video-text sequences and diverse instructions.
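To picture the round-decayed compression idea, here is a minimal, illustrative sketch — the function name, the per-round `decay` factor, and the token-level truncation are assumptions for illustration only; the actual buffer operates on model features and follows the schedule described in the paper:

```python
import math


def round_decay_compress(rounds, decay=0.5, min_keep=1):
    """Illustrative sketch of round-decayed compression (not the paper's code):
    the newest round is kept in full, and each older round keeps roughly a
    decay**age fraction of its tokens, so old context shrinks geometrically."""
    newest = len(rounds) - 1
    compressed = []
    for idx, tokens in enumerate(rounds):
        age = newest - idx  # 0 for the most recent round
        keep = max(min_keep, math.ceil(len(tokens) * decay ** age))
        compressed.append(tokens[-keep:])  # favor the tail of each round
    return compressed
```

For example, three 8-token rounds with `decay=0.5` are kept at 2, 4, and 8 tokens, oldest to newest, so the total context stays bounded as rounds accumulate.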
> [!IMPORTANT]
> For copyright reasons, we can't release model weights trained on YouTube or other videos that may contain IP-protected content. However, we're open-sourcing the model implementation and the synthetic data used for training.
---
## Install
1. Clone this repository and navigate to the folder:
```bash
git clone https://github.com/apple/ml-streambridge
cd ml-streambridge
```
2. Install package
```bash
conda create -n ml-streambridge python=3.10.14
conda activate ml-streambridge
pip install -e .
pip install flash-attn==2.3.3 --no-build-isolation
```
## Quick-Start Demo
1. Download checkpoints:
TBD due to video copyright reasons.
- Organize as:
```
├── /your/path/to/checkpoints
│   ├── llava-onevision-qwen2-0.5b-ov-hf-seperated
│   ├── activation_0.5_ratio_anet_coin_yc2_s2s_fa_mhego_hacs_cha_et_llava-ov_epoch_5.pth
│   ├── LLaVA-OV-7B-du2e2hjxik
│   ├── Oryx-1.5-7B-jfsvkb3hn8
│   └── Qwen2-VL-7B-jh6p673iyp
```
2. Run a demo
- Update the `your_weight_path` in `demo.py` to match the weight directory above:
```bash
python demo.py  # the activation threshold controls the response frequency
```
- You should see output like:
```
18 seconds: Pour the cooked noodles.
32 seconds: Cut the lemon.
44 seconds: Cut the olives in half.
55 seconds: Chop the parsley.
68 seconds: Squeeze the lemon juice into the measuring cup.
78 seconds: Pound the chicken.
...
```
## Evaluation on OVO-Bench (multi-turn streaming) and VideoMME (single-turn offline)
1. Download the raw videos for OVO-Bench from [🤗 HF](https://huggingface.co/datasets/JoeLeelyf/OVO-Bench) and VideoMME from [🤗 HF](https://huggingface.co/datasets/lmms-lab/Video-MME), then reorganize the folders as follows:
```
├── /your/path/to/ovo_bench
│   ├── videos
│   ├── ovo_bench.json
│   └── ...
├── /your/path/to/videomme
│   ├── videos
│   ├── videomme.json
│   └── ...
```
- We provide OVO-Bench's `ovo_bench.json` and VideoMME's `videomme.json` in `./assets`.
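Before launching the evaluation, it can be worth sanity-checking that every annotated video is actually on disk. The helper below assumes each JSON entry names its file under a `video` key — that key is an assumption, so adjust it to the actual schema of `ovo_bench.json` / `videomme.json`:

```python
import json
import os


def find_missing_videos(anno_path, video_dir, key="video"):
    """Return (num_annotations, missing_files).
    `key` is an assumed field name; adapt it to the real annotation schema."""
    with open(anno_path) as f:
        anns = json.load(f)
    missing = sorted({a[key] for a in anns
                      if not os.path.exists(os.path.join(video_dir, a[key]))})
    return len(anns), missing
```

An empty `missing` list means the folder layout above is complete for that benchmark.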
2. Run evaluation script
- Set `ANNO_PATH` and `VIDEO_PATH` in `scripts/eval.sh` to the OVO-Bench and VideoMME folders downloaded above, then run:
```bash
bash scripts/eval.sh
```
- Evaluate different models by modifying `MODEL` and `CKPT` in the script.
- By default, 8 A100-80G GPUs are used; you can adjust `NUM_GPUS` and `MAX_IMG_TOKEN` to reduce memory usage.
3. Report the results
```bash
python eval/metric_report.py
```
- You should reproduce the results below (see our paper for more details):
| Model Name | OVO-Bench-Real-Time (OCR/ACR/ATR/STU/FPD/OJR/AVG.) | VideoMME (w/o subs) |
|:---------------------------:|:-------------------------------------------:|:------:|
| Qwen2-VL-StreamBridge | 85.24/67.89/75.00/52.25/70.30/72.28/70.49 | 63.0 |
| Oryx-1.5-StreamBridge | 81.21/70.64/70.69/49.44/74.26/68.48/69.12 | 64.2 |
| LLaVA-OV-StreamBridge | 74.50/78.90/72.41/52.81/78.22/68.68/70.89 | 61.0 |
## StreamingQA-120K Dataset
- The raw 1.28 million videos of StreamingQA-120K are sourced from [🤗 WebVid](https://huggingface.co/datasets/WHB139426/Grounded-VideoLLM/tree/main/webvid-703k), [🤗 InternVid](https://huggingface.co/datasets/WHB139426/Grounded-VideoLLM/tree/main/internvid), and [🤗 Panda](https://huggingface.co/datasets/WHB139426/Grounded-VideoLLM/tree/main/panda70m_2m). You can also download them from their official repos: [WebVid-10M](https://huggingface.co/datasets/TempoFunk/webvid-10M), [InternVid-10M](https://huggingface.co/datasets/OpenGVLab/InternVid), [Panda-70M](https://github.com/snap-research/Panda-70M).
- We concatenate videos with high mutual similarity from these three datasets and annotate QA pairs for them. We provide the [similarity-ordered JSON file](https://ml-site.cdn-apple.com/datasets/streambridge/qa_groups.json), and you can dynamically control the grouping size via `GROUP_LEN`:
```python
import json


def load_json(path):
    with open(path) as f:
        data = json.load(f)
    return data


GROUP_LEN = 10
anns = load_json("/your/path/to/qa_groups.json")
groups = [i for i in range(len(anns))]
groups = [groups[i : i + GROUP_LEN] for i in range(0, len(groups), GROUP_LEN)]
grouped_anns = []
for group in groups:
    if len(group) != GROUP_LEN:
        continue
    grouped_anns.append(
        {
            "video_ids": [anns[i]["video_id"] for i in group],
            "video_files": [anns[i]["video_file"] for i in group],
            "captions": [anns[i]["caption"] for i in group],
            "questions": [anns[i]["question"] for i in group],
            "answers": [anns[i]["answer"] for i in group],
            "options": [anns[i]["options"] for i in group],
            "types": [anns[i]["type"] for i in group],
        }
    )
print(grouped_anns[0])
```
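As a quick check of the grouping logic, you can run the same chunking on synthetic annotations instead of the real `qa_groups.json` (the fields below are placeholders): 25 entries with `GROUP_LEN = 10` yield two full groups, and the trailing 5 entries are dropped because partial groups are skipped.

```python
# Synthetic stand-in for qa_groups.json entries (real entries carry more fields).
anns = [{"video_id": f"vid_{i}", "video_file": f"{i}.mp4"} for i in range(25)]

GROUP_LEN = 10
idx = list(range(len(anns)))
chunks = [idx[i : i + GROUP_LEN] for i in range(0, len(idx), GROUP_LEN)]
full = [c for c in chunks if len(c) == GROUP_LEN]
print(len(full), len(chunks))  # 2 full groups out of 3 chunks
```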
## License
This software and accompanying data and models have been released under the
following licenses:
- Code: [Apple Sample Code License (ASCL)](./LICENSE)
- Data: [CC-BY-NC-ND](./LICENSE_DATA) [Deed](https://creativecommons.org/licenses/by-nc-nd/4.0/)
## Citation
If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil:.
```BibTeX
@article{wang2025streambridge,
title={StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant},
author={Wang, Haibo and Feng, Bo and Lai, Zhengfeng and Xu, Mingze and Li, Shiyu and Ge, Weifeng and Dehghan, Afshin and Cao, Meng and Huang, Ping},
journal={arXiv preprint arXiv:2505.05467},
year={2025}
}
```