# DeepVideoDiscovery

**Repository Path**: mirrors_microsoft/DeepVideoDiscovery

## Basic Information

- **Project Name**: DeepVideoDiscovery
- **Description**: **Deep Video Discovery (DVD)** is a deep-research style question answering agent designed for understanding extra-long videos.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-04
- **Last Updated**: 2026-02-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

[![arXiv](https://img.shields.io/badge/arXiv-2504.16082-A42C25?style=flat&logo=arXiv&logoColor=A42C25)](https://arxiv.org/abs/2505.18079)
[![License](https://img.shields.io/badge/License-MIT-red.svg)](https://opensource.org/licenses/MIT)
[![Hugging Face Spaces](https://img.shields.io/badge/Spaces-Demo-blueviolet?logo=huggingface&logoColor=white)](https://87fc7dc81d4b38ed01.gradio.live)


This repository contains the official implementation of the paper [Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding](https://arxiv.org/abs/2505.18079), which achieves the state-of-the-art performance by a large margin on multiple long video benchmarks including the challenging [LVBench](https://lvbench.github.io/).

![image](https://github.com/user-attachments/assets/ac1c7f0a-3c10-4c4c-88d1-7bfe0e2010e1)


## Update

- **2025/10/15**: Add a markdown to help to reproduce the LVBench results: [REPRODUCE.md](reproduce/README.md). Email me (xiaoyizhang [/at/] microsoft.com) if your issue gets no response in 24 hours.
- **2025/09/19**: Accepted by NeurIPS 2025 🎉
- **2025/08/04**: Upload captions on benchmarks for reproduction: [LVBench](https://1drv.ms/u/c/f029f6f5a52c17c4/ETR7ogx7YCtBgtDu66a4R14B7RKLZJoz20D4Z5I1KD6HTg?e=404kKg), [LVBench w/ transcripts](https://1drv.ms/u/c/f029f6f5a52c17c4/EcqO2lC_hRxGn-0t0IBNKZcBts3HDCEg8mZo4ltN6kXFUQ?e=XmabUn), [Video-MME](https://1drv.ms/u/c/f029f6f5a52c17c4/EVKjXQnPjeZGi-onOxEMb8UBxqI9NexKzccHuYEe8-0Lig?e=a4SxCU), [LongVideoBench](https://1drv.ms/u/c/f029f6f5a52c17c4/EQp_PABeb3ZIiysjIn-_5gEBbkhtfcBwCM1pel9xl3JHPg?e=TLpQXQ) and [EgoSchema](https://1drv.ms/u/c/f029f6f5a52c17c4/Ec0oEX3tO5pIknRdEqT9LDQB0hbS9vR9fUJaVbRfCQPJKg?e=bszgh6).
- **2025/08/02**: Support auto subtitle in the demo.
- **2025/07/17**: Add gradio demo.
- **2025/07/16**: Add `lite_mode` to enable a lightweight version of the agent that uses only subtitles. Good for Youtube podcast analysis!
- **2025/07/14**: Support OpenAI API and Azure OpenAI API.
- **2025/07/08**: Initial release of the Deep Video Discovery codebase.

## Introduction

**Deep Video Discovery (DVD)** is a deep-research style question answering agent designed for understanding extra-long videos. Leveraging the powerful capabilities of large language models (LLMs), DVD effectively interprets and processes extensive video content to answer complex user queries.


https://github.com/user-attachments/assets/26d4d524-bdf0-48a5-9d33-ce19fa7779fb

DVD Achieves state-of-the-art performance by a large margin on multiple long video benchmarks using OpenAI o3.

![radar](./asset/overview.png)


The core design of DVD includes:

- 🎬 **Treating segmented video clips as exploration environments**
- 🤖 **Autonomous planning and reasoning**, dynamically formulating strategies to solve problems efficiently
- 🛠️ **Selecting appropriate multi-granular tools**, iteratively extracting relevant information from the video environment
- 📝 **Summarizing and reflecting on observations**, to provide comprehensive and accurate answers to user questions


## Installation

1. **Clone the repository:**
   ```bash
   git clone https://github.com/microsoft/deepvideodiscovery.git
   cd DeepVideoDiscovery
   ```

2. **Create a virtual environment and install the dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. (Optional) **Install gradio for demo:**
   ```bash
   pip install gradio
   ```

## Usage

Note: Set up your configuration by updating the variables in  `config.py`.

### Example

The `local_run.py` script provides an example of how to run the Deep Video Discovery agent by providing a youtube url and question about it.

    ```bash
    python local_run.py https://www.youtube.com/watch?v=PQFQ-3d2J-8 "what did the main speaker talk about in the last part of video?"
    ```

## TODO

- [x] Support OpenAI API key configuration.
- [ ] Implement MCP server.
- [ ] Release evaluation trajectory data on long video benchmarks.


## Changes

Compared to the original implementation, we have made the following changes:
- Refactored the code for better readability and maintainability.
- In `global_browse_tool` we leverage the textual description (rather than original video pixels) of multiple video clips to provide global overview of the video content to improve efficency.


## Citation

If you find our work useful, please consider citing:

```bibtex
@article{zhang2025deep,
  title={Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding},
  author={Zhang, Xiaoyi and Jia, Zhaoyang and Guo, Zongyu and Li, Jiahao and Li, Bin and Li, Houqiang and Lu, Yan},
  journal={arXiv preprint arXiv:2505.18079},
  year={2025}
}
```