# BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Zhaoyang Li<sup>1,2</sup>, Dongjun Qian<sup>2</sup>, Kai Su<sup>2*</sup>, Qishuai Diao<sup>2</sup>, Xiangyang Xia<sup>2</sup>, Chang Liu<sup>2</sup>, Wenfei Yang<sup>1</sup>, Tianzhu Zhang<sup>1*</sup>, Zehuan Yuan<sup>2</sup>

<sup>1</sup>University of Science and Technology of China, <sup>2</sup>ByteDance

<sup>*</sup>Corresponding Authors
## 🔥 News

* Nov 08, 2025: 🙏 Special thanks to **Kijai** for adapting [ComfyUI](https://github.com/kijai/ComfyUI-WanVideoWrapper) for BindWeave and providing an [FP8-quantized Hugging Face model](https://huggingface.co/Kijai/WanVideo_comfy_fp8_scaled/tree/main/Bindweave)! Feel free to try them out.
* Nov 04, 2025: 🔥 The BindWeave-Wan-14B model is now available on [Hugging Face](https://huggingface.co/ByteDance/BindWeave).
* Nov 04, 2025: 🔥 Released code for model inference and training.

## 🗓️ Todo List

- [x] Release inference code
- [x] Release checkpoint of BindWeave_Wan_14B
- [x] Release training code of BindWeave

## 🌟 Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer. It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation.

## 🎥 Demo

## Inference

Before running the inference code, you need to download the original Wan2.1 I2V 14B model. This is required because BindWeave depends on its components, such as the VAE and text encoder.

1. **Download the Pre-trained Model:** Use the Hugging Face CLI to download the model weights. The command below places them in the `./pretrained_model/wanx/` directory.

   ```bash
   pip install "huggingface_hub[cli]"
   huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers
   ```

2. **Update the Configuration File:** After the download is complete, update the configuration file at `configs/inference/inference_model_s2v.json` so that the paths for the following components correctly point to the directories you just downloaded (a helper sketch for scripting this edit appears after the results table below):

   * `vae`
   * `tokenizer`
   * `text_encoder`
   * `image_encoder`

3. **Download the BindWeave Model:**

   ```bash
   huggingface-cli download ByteDance/BindWeave --local-dir ./BindWeave_14B
   ```

#### Weight Conversion

After downloading the BindWeave model, convert the transformer weights to the MM format by running the conversion script:

```bash
python convert_ckpt.py \
    --source_path ./BindWeave_14B/ \
    --target_path ./BindWeave_14B/ \
    --mode convert_to_mm
```

#### Run Subject-to-Video Generation

```bash
bash script/inference_s2v.sh
```

You can modify the corresponding paths in `BindWeave/configs/inference/inference_model_s2v.json`, where:

- `BASE_IMG_DIR`: root directory of the reference images.
- `META_PATH`: sample JSON file used during inference.
- `OUT_DIR`: output directory for inference results.

Using the provided sample cases (i.e., the default path configuration), running `bash script/inference_s2v.sh` will produce the following generated results:
| Reference Images | Generated Videos (720P) |
|---|---|
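Step 2 of the inference setup asks you to point the `vae`, `tokenizer`, `text_encoder`, and `image_encoder` entries of `configs/inference/inference_model_s2v.json` at the downloaded Wan2.1 checkout. If you prefer to script that edit, the sketch below is a minimal, hypothetical helper rather than part of the official tooling: it assumes the config stores each component path as a plain string under those key names, and that the Diffusers download uses the standard `vae/`, `tokenizer/`, `text_encoder/`, and `image_encoder/` sub-folders. Adjust the key handling if your copy of the config is structured differently.

```python
#!/usr/bin/env python3
"""Hypothetical helper: rewrite the Wan component paths in the inference config.
The real schema of inference_model_s2v.json may differ; this sketch only handles
plain string values stored under the component key names."""
import json
from pathlib import Path

CONFIG = Path("configs/inference/inference_model_s2v.json")
WAN_DIR = Path("./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers")

# Components named in step 2; the sub-folder names assume the standard
# Diffusers layout of the downloaded checkpoint.
COMPONENTS = {
    "vae": WAN_DIR / "vae",
    "tokenizer": WAN_DIR / "tokenizer",
    "text_encoder": WAN_DIR / "text_encoder",
    "image_encoder": WAN_DIR / "image_encoder",
}

def rewrite(node):
    """Recursively replace string values stored under the known component keys."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in COMPONENTS and isinstance(value, str):
                node[key] = str(COMPONENTS[key])
            else:
                rewrite(value)
    elif isinstance(node, list):
        for item in node:
            rewrite(item)

config = json.loads(CONFIG.read_text())
rewrite(config)
CONFIG.write_text(json.dumps(config, indent=2) + "\n")
print(f"Updated component paths in {CONFIG}")
```

Re-run the helper (or edit the file by hand) whenever you move the pretrained Wan2.1 weights to a different location.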