# BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Zhaoyang Li<sup>1,2</sup>, Dongjun Qian<sup>2</sup>, Kai Su<sup>2*</sup>, Qishuai Diao<sup>2</sup>, Xiangyang Xia<sup>2</sup>, Chang Liu<sup>2</sup>, Wenfei Yang<sup>1</sup>, Tianzhu Zhang<sup>1*</sup>, Zehuan Yuan<sup>2</sup>

<sup>1</sup>University of Science and Technology of China, <sup>2</sup>ByteDance

<sup>*</sup>Corresponding Authors
## 🔥 News

* Nov 08, 2025: 🙏 Special thanks to **Kijai** for adapting [ComfyUI](https://github.com/kijai/ComfyUI-WanVideoWrapper) for BindWeave and providing an [FP8-quantized Hugging Face model](https://huggingface.co/Kijai/WanVideo_comfy_fp8_scaled/tree/main/Bindweave)! Feel free to try them out.
* Nov 04, 2025: 🔥 The BindWeave-Wan-14B model is now available on [Hugging Face](https://huggingface.co/ByteDance/BindWeave).
* Nov 04, 2025: 🔥 Released code for model inference and training.

## 🗓️ Todo List

- [x] Release inference code
- [x] Release checkpoint of BindWeave_Wan_14B
- [x] Release training code of BindWeave

## 🌟 Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer. It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation.

## 🎥 Demo

## Inference

Before running the inference code, you need to download the original Wan2.1 I2V 14B model. This is required because BindWeave depends on its components, such as the VAE and text encoder.

1. **Download the Pre-trained Model:** Use the Hugging Face CLI to download the model weights. The command below places them in the `./pretrained_model/wanx/` directory.

   ```bash
   pip install "huggingface_hub[cli]"
   huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers
   ```

2. **Update the Configuration File:** After the download is complete, update the configuration file at `configs/inference/inference_model_s2v.json` so that the paths for the following components correctly point to the directories you just downloaded (a helper sketch for scripting this edit appears after the results table below):

   * `vae`
   * `tokenizer`
   * `text_encoder`
   * `image_encoder`

3. **Download the BindWeave Model:**

   ```bash
   huggingface-cli download ByteDance/BindWeave --local-dir ./BindWeave_14B
   ```

#### Weight Conversion

After downloading the BindWeave model, convert the transformer weights to the MM format by running the conversion script:

```bash
python convert_ckpt.py \
    --source_path ./BindWeave_14B/ \
    --target_path ./BindWeave_14B/ \
    --mode convert_to_mm
```

#### Run Subject-to-Video Generation

```bash
bash script/inference_s2v.sh
```

You can modify the corresponding paths in `BindWeave/configs/inference/inference_model_s2v.json`, where:

- `BASE_IMG_DIR`: root directory of the reference images.
- `META_PATH`: sample JSON file used during inference.
- `OUT_DIR`: output directory for inference results.

Using the provided sample cases (i.e., the default path configuration), running `bash script/inference_s2v.sh` will produce the following generated results:
| Reference Images | Generated Videos (720P) |
|---|---|
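Step 2 of the inference setup asks you to point the `vae`, `tokenizer`, `text_encoder`, and `image_encoder` entries of `configs/inference/inference_model_s2v.json` at the downloaded Wan2.1 checkout. If you prefer to script that edit, the sketch below is a minimal, hypothetical helper rather than part of the official tooling: it assumes the config stores each component path as a plain string under those key names, and that the Diffusers download uses the standard `vae/`, `tokenizer/`, `text_encoder/`, and `image_encoder/` sub-folders. Adjust the key handling if your copy of the config is structured differently.

```python
#!/usr/bin/env python3
"""Hypothetical helper: rewrite the Wan component paths in the inference config.
The real schema of inference_model_s2v.json may differ; this sketch only handles
plain string values stored under the component key names."""
import json
from pathlib import Path

CONFIG = Path("configs/inference/inference_model_s2v.json")
WAN_DIR = Path("./pretrained_model/wanx/Wan2.1-I2V-14B-720P-Diffusers")

# Components named in step 2; the sub-folder names assume the standard
# Diffusers layout of the downloaded checkpoint.
COMPONENTS = {
    "vae": WAN_DIR / "vae",
    "tokenizer": WAN_DIR / "tokenizer",
    "text_encoder": WAN_DIR / "text_encoder",
    "image_encoder": WAN_DIR / "image_encoder",
}

def rewrite(node):
    """Recursively replace string values stored under the known component keys."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in COMPONENTS and isinstance(value, str):
                node[key] = str(COMPONENTS[key])
            else:
                rewrite(value)
    elif isinstance(node, list):
        for item in node:
            rewrite(item)

config = json.loads(CONFIG.read_text())
rewrite(config)
CONFIG.write_text(json.dumps(config, indent=2) + "\n")
print(f"Updated component paths in {CONFIG}")
```

Re-run the helper (or edit the file by hand) whenever you move the pretrained Wan2.1 weights to a different location.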