# Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Haoji Zhang\*, Yiqin Wang\*, Yansong Tang†, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin†‡

\* Equally contributing first authors, †Correspondence, ‡Project Lead

**Work done while interning at ByteDance.**

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/flash-vstream-memory-based-real-time/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=flash-vstream-memory-based-real-time)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/flash-vstream-memory-based-real-time/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=flash-vstream-memory-based-real-time)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/flash-vstream-memory-based-real-time/question-answering-on-next-qa-open-ended)](https://paperswithcode.com/sota/question-answering-on-next-qa-open-ended?p=flash-vstream-memory-based-real-time)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/flash-vstream-memory-based-real-time/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=flash-vstream-memory-based-real-time)

We present **Flash-VStream**, a novel LMM able to process extremely long video streams in real time and respond to user queries simultaneously. We also propose **VStream-QA**, a novel question answering benchmark specifically designed for online video streaming understanding.

## News

- [2024/06/20] 🔥 Flash-VStream is coming! We release the [homepage](https://invinciblewyq.github.io/vstream-page) and [paper](https://arxiv.org/abs/2406.08085v1) of Flash-VStream.

## Contents

- [Install](#install)
- [Model](#model)
- [Preparation](#preparation)
- [Train](#train)
- [Evaluation](#evaluation)
- [Real-time CLI Inference](#real-time-cli-inference)
- [VStream-QA Benchmark](#vstream-qa-benchmark)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
- [License](#license)

## Install

Please follow the instructions below to install the required packages.

1. Clone this repository.

2. Install the package:

```bash
conda create -n vstream python=3.10 -y
conda activate vstream
cd Flash-VStream
pip install --upgrade pip
pip install -e .
```

3. Install additional packages for training:

```bash
pip install ninja
pip install flash-attn --no-build-isolation
```

## Model

We provide our Flash-VStream model weights after Stage 1 and Stage 2 finetuning:

| Model | Weight | Initialized from LLM | Initialized from ViT |
| --- | --- | --- | --- |
| Flash-VStream-7b | [Flash-VStream-7b](https://huggingface.co/IVGSZ/Flash-VStream-7b) | [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
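If you prefer to fetch these weights programmatically, below is a minimal sketch using `huggingface_hub`. This is not an official Flash-VStream script; the `local_dir` targets simply mirror the `ckpt` layout shown in [Structure](#structure) (the Flash-VStream-7b subfolder name is our own choice) and can be changed as needed.

```python
# Minimal sketch (unofficial): download the released checkpoint and the
# backbone weights into a local ckpt/ folder.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="IVGSZ/Flash-VStream-7b", local_dir="ckpt/Flash-VStream-7b")
snapshot_download(repo_id="lmsys/vicuna-7b-v1.5", local_dir="ckpt/vicuna-7b-v1.5")
snapshot_download(repo_id="openai/clip-vit-large-patch14", local_dir="ckpt/clip-vit-large-patch14")
```

The Vicuna and CLIP weights are the same ones referenced in [Pretrained Weights](#pretrained-weights) below.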
## Preparation

### Dataset

**Image VQA Dataset.** Please organize the Image VQA training data following [this guide](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) and the evaluation data following [this guide](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md). Put the pretraining, finetuning, and evaluation data in the `pretrain`, `finetune`, and `eval_video` folders following [Structure](#structure).

**Video VQA Dataset.** Please download the 2.5M subset of [WebVid](https://maxbain.com/webvid-dataset/) and the ActivityNet dataset from the [official website](http://activity-net.org/download.html) or [video-chatgpt](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/docs/train_video_chatgpt.md). If you want to perform evaluation, please also download the corresponding files of [ActivityNet-QA](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/README.md) and [NExT-QA-OE](https://github.com/doc-doc/NExT-QA). You can download [MSVD-QA](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155186668_link_cuhk_edu_hk/EUNEXqg8pctPq3WZPHb4Fd8BYIxHO5qPCnU6aWsrV1O4JQ?e=guynwu) and [MSRVTT-QA](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155186668_link_cuhk_edu_hk/EcEXh1HfTXhLrRnuwHbl15IBJeRop-d50Q90njHmhvLwtA?e=SE24eG) from LLaMA-VID.

**Meta Info.** For meta info of the training data, please download the following files and organize them as in [Structure](#structure):

| Training Stage | Data file name | Size |
| --- | --- | ---: |
| Pretrain | [llava_558k_with_webvid.json](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data) | 254 MB |
| Finetune | [llava_v1_5_mix665k_with_video_chatgpt.json](https://huggingface.co/datasets/YanweiLi/LLaMA-VID-Data) | 860 MB |

For meta info of the evaluation data, please reformat each QA list into a JSON file named `test_qa.json`, placed as shown in [Structure](#structure), with a format like this:

```json
[
    {
        "video_id": "v_1QIUV7WYKXg",
        "question": "is the athlete wearing trousers",
        "id": "v_1QIUV7WYKXg_3",
        "answer": "no",
        "answer_type": 3,
        "duration": 9.88
    },
    {
        "video_id": "v_9eniCub7u60",
        "question": "does the girl in black clothes have long hair",
        "id": "v_9eniCub7u60_2",
        "answer": "yes",
        "answer_type": 3,
        "duration": 19.43
    }
]
```

### Pretrained Weights

We recommend downloading the pretrained weights of [Vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) and [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) and putting them in the `ckpt` folder following [Structure](#structure).

### Feature Extraction

We recommend pre-extracting the ViT features of the training and evaluation data, which significantly accelerates training and evaluation. If you do so, replace `.mp4` with `.safetensors` in each video filename and put the feature files in the `image_features` and `video_features` folders; otherwise, ignore these two folders. Video features are loaded at fps=1 and arranged in time order. Each `.safetensors` file should contain a dict like this (an unofficial extraction sketch is given after the folder structure below):

```python
{
    'feature': torch.Tensor  # shape [256, 1024] for an image, [Length, 256, 1024] for a video
}
```

### Structure

The folder structure should be organized as follows before training.

```
Flash-VStream
├── checkpoints-finetune
├── checkpoints-pretrain
├── ckpt
│   ├── clip-vit-large-patch14
│   ├── vicuna-7b-v1.5
├── data
│   ├── pretrain
│   │   ├── llava_558k_with_webvid.json
│   │   ├── image_features
│   │   ├── images
│   │   ├── video_features
│   │   ├── videos
│   ├── finetune
│   │   ├── llava_v1_5_mix665k_with_video_chatgpt.json
│   │   ├── activitynet
│   │   ├── coco
│   │   ├── gqa
│   │   ├── image_features
│   │   │   ├── coco
│   │   │   ├── gqa
│   │   │   ├── ocr_vqa
│   │   │   ├── textvqa
│   │   │   ├── vg
│   │   ├── ocr_vqa
│   │   ├── textvqa
│   │   ├── vg
│   │   ├── video_features
│   │   │   ├── activitynet
│   ├── eval_video
│   │   ├── ActivityNet-QA
│   │   │   ├── video_features
│   │   │   ├── test_qa.json
│   │   ├── MSRVTT-QA
│   │   │   ├── video_features
│   │   │   ├── test_qa.json
│   │   ├── MSVD-QA
│   │   │   ├── video_features
│   │   │   ├── test_qa.json
│   │   ├── nextqa
│   │   │   ├── video_features
│   │   │   ├── test_qa.json
│   │   ├── vstream
│   │   │   ├── video_features
│   │   │   ├── test_qa.json
│   │   ├── vstream-realtime
│   │   │   ├── video_features
│   │   │   ├── test_qa.json
├── flash_vstream
├── scripts
```
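As referenced in [Feature Extraction](#feature-extraction) above, here is a minimal, unofficial sketch of how per-frame CLIP features could be pre-extracted at 1 fps and written in the expected `.safetensors` layout. The backbone path, the example input/output paths, the OpenCV frame sampling, and the choice of the last hidden layer with the CLS token dropped are all assumptions for illustration; the official training scripts may preprocess frames differently.

```python
# Unofficial sketch: extract per-frame CLIP ViT-L/14 features at ~1 fps and save
# them as {'feature': [Length, 256, 1024]} in a .safetensors file.
import cv2
import torch
from safetensors.torch import save_file
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_PATH = "ckpt/clip-vit-large-patch14"  # path assumed from the Structure above
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = CLIPImageProcessor.from_pretrained(MODEL_PATH)
vit = CLIPVisionModel.from_pretrained(MODEL_PATH).to(device).eval()

def extract_video_feature(video_path: str, out_path: str) -> None:
    # Sample roughly one frame per second of video.
    cap = cv2.VideoCapture(video_path)
    step = max(1, int(round(cap.get(cv2.CAP_PROP_FPS) or 30)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()

    feats = []
    with torch.no_grad():
        for frame in frames:
            pixels = processor(images=frame, return_tensors="pt")["pixel_values"].to(device)
            hidden = vit(pixels).last_hidden_state           # [1, 257, 1024]
            feats.append(hidden[:, 1:, :].squeeze(0).cpu())  # drop CLS -> [256, 1024]

    save_file({"feature": torch.stack(feats)}, out_path)     # [Length, 256, 1024]

# Hypothetical paths, following the folder structure above:
extract_video_feature("data/finetune/activitynet/v_example.mp4",
                      "data/finetune/video_features/activitynet/v_example.safetensors")
```

A saved file can be sanity-checked with `safetensors.torch.load_file`, which should return a dict whose `feature` entry has the shape documented in the Feature Extraction section.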
## Train

Flash-VStream is trained on 8 A100 GPUs with 80 GB memory each. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, always keeping the global batch size the same: `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`. If your GPUs have less than 80 GB of memory, you may try the ZeRO-2 and ZeRO-3 stages.

Please make sure you download and organize the data following [Preparation](#preparation) before training.

Like LLaVA, Flash-VStream has two training stages: pretrain and finetune. Their checkpoints are saved in the `checkpoints-pretrain` and `checkpoints-finetune` folders. The two stages take about 15 hours in total on 8 A100 GPUs.

If you want to train Flash-VStream from the pretrained LLM and evaluate it, please run:

```bash
bash scripts/train_and_eval.sh
```

## Evaluation

Please make sure you download and organize the data following [Preparation](#preparation) before evaluation.

If you want to evaluate a Flash-VStream model, please run:

```bash
bash scripts/eval.sh
```

## Real-time CLI Inference

We provide a real-time CLI inference script, which simulates video stream input by reading the frames of a video file at a fixed frame rate. You can ask any question and get an answer at any timestamp of the video stream. Run the following command and have a try:

```bash
bash scripts/realtime_cli.sh
```

## VStream-QA Benchmark

Please download the VStream-QA Benchmark from [this](https://huggingface.co/datasets/IVGSZ/VStream-QA) repo.

## Citation

If you find this project useful in your research, please consider citing:

```
@article{flashvstream,
    title={Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams},
    author={Haoji Zhang and Yiqin Wang and Yansong Tang and Yong Liu and Jiashi Feng and Jifeng Dai and Xiaojie Jin},
    year={2024},
    eprint={2406.08085},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```

## Acknowledgement

We would like to thank the following repos for their great work:

- This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA).
- This work utilizes LLMs from [Vicuna](https://github.com/lm-sys/FastChat).
- Some code is borrowed from [LLaMA-VID](https://github.com/dvlab-research/LLaMA-VID).
- We perform video-based evaluation following [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT).

## License

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-yellow.svg)](LICENSE)

This project is licensed under the [Apache-2.0 License](LICENSE).