# Diffuman4D **Repository Path**: jiumao-admin/Diffuman4D ## Basic Information - **Project Name**: Diffuman4D - **Description**: No description available - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-09-30 - **Last Updated**: 2025-09-30 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README
Diffuman4D enables high-fidelity free-viewpoint rendering of human performances from sparse-view videos.
## Interactive Demo [Click here](https://www.4dv.ai/viewer/diffuman4d_fdvai_dance_colored?showdemo=diffuman4d) to experience immersive 4DGS rendering.
## Quick Start
**1. Install.** For inference and data preprocessing, please install the environment via:
```sh
conda create -n diffuman4d python=3.12
conda activate diffuman4d
# for inference
pip install -r requirements.txt
# for 3D/4D reconstruction and data processing
pip install git+https://github.com/zju3dv/EasyVolcap.git --no-deps
```
**2. Download Example Data.** Please download the [example test data](https://huggingface.co/datasets/krahets/diffuman4d_example) using:
```sh
python scripts/download/download_dataset.py --repo_id "krahets/diffuman4d_example" --types='["images", "fmasks", "skeletons", "cameras"]'
```
The extracted data is structured as `{scene_label}/{data_type}/{camera_label}/{frame_label}{file_ext}`:
```sh
└── 0023_06 # scene label
├── fmasks # foreground masks
│ ├── 00 # camera label
│ │ ├── 000000.png # frame label
│ │ ├── 000001.png
│ │ └── ... (148 more items)
│ └── 01
│ ├── 000000.png
│ ├── 000001.png
│ └── ... (148 more items)
│ └── ... (46 more items)
├── images # rgb images
│ ├── 00
│ └── ... (47 more items)
├── skeletons # skeleton maps
│ ├── 00
│ └── ... (47 more items)
├── sparse_pcd.ply # sparse point cloud
└── transforms.json # cameras in nerfstudio format
```
> [!tip]
> If you want to test the model on more DNA-Rendering scenes, please see the [Dataset section](#processed-dna-rendering-dataset).
**3. (Optional) Download Pretrained Model.** The inference code will attempt to download the model from Hugging Face. If you encounter network issues, please manually download [the model](https://huggingface.co/krahets/Diffuman4D) to `./models/` via:
```sh
hf download krahets/Diffuman4D --local-dir ./models/models--krahets--Diffuman4D
```
**4. Inference.** Run inference with the following code, and the sampling results will be saved in `./output/results/dna_rendering/0023_06`. It is recommended to run `exp=demo_3d` or `exp=demo_4d_tiny` if using a single-GPU server for quicker testing.
```sh
# generate a tiny 4D image grid (4 input cameras * 15 frames -> 44 cameras * 15 frames)
python inference.py exp=demo_4d_tiny data.scene_label=0023_06 data.data_dir=./data/datasets--krahets--diffuman4d_example
# generate a 3D image grid (4 input cameras -> 44 cameras)
python inference.py exp=demo_3d data.scene_label=0023_06 data.data_dir=./data/datasets--krahets--diffuman4d_example
# generate entire 4D image grid (4 input cameras * 150 frames -> 44 cameras * 150 frames)
python inference.py exp=demo_4d data.scene_label=0023_06 data.data_dir=./data/datasets--krahets--diffuman4d_example
```
**5. Reconstruct 3DGS Model.** Please first install [nerfstudio](https://github.com/nerfstudio-project/nerfstudio/), then train the human 3DGS model via:
```sh
ns-train splatfacto --data "./output/results/dna_rendering_tiny/0023_06/transforms.json"
```
**6. Reconstruct 4DGS Model.** Since [LongVolcap](https://zju3dv.github.io/longvolcap/) has not been open-sourced, we will attempt to provide the alternative [4D-Gaussian-Splatting](https://github.com/fudan-zvg/4d-gaussian-splatting) reconstruction scripts.
## Processed DNA-Rendering Dataset
To enable model training, we meticulously process the [DNA-Rendering dataset](https://dna-rendering.github.io/index.html) by recalibrating camera parameters, optimizing image color correction matrices (CCMs), predicting foreground masks, and estimating human skeletons.
To promote future research in the field of human-centric 3D/4D generation, we have open-sourced our re-annotated labels for the DNA-Rendering dataset in [dna_rendering_processed](https://huggingface.co/datasets/krahets/dna_rendering_processed), which includes 1000+ human multi-view video sequences. Each sequence contains 48 cameras, 225 (or 150) frames, totaling 10 million images.
> [!note]
> If you find our method or dataset helpful, please give us a Star ⭐ and [cite our work](#cite). Thank you!
### Dataset Download
Before starting, please install the requirements via:
```sh
pip install -U huggingface_hub datasets pyarrow pandas
pip install git+https://github.com/zju3dv/EasyVolcap.git --no-deps
```
Download re-annotated labels (foreground masks, 2D skeletons, 3D skeletons, camera parameters) for the DNA-Rendering dataset:
1. Please concurrently **(1)** fill out [this form](https://docs.google.com/forms/d/1v-X0bnl5GUO9ewYW5eY2wk-yrwQx4_u2lEWbLIao4VU) and **(2)** request access to the dataset on [this page](https://huggingface.co/datasets/krahets/dna_rendering_processed).
2. Download the dataset using [the script](scripts/download/download_dataset.py) below.
```sh
# Download and extract the entire dataset
python scripts/download/download_dataset.py --out_dir "./data/dna_rendering_processed"
# Download specific scenes and data types
python scripts/download/download_dataset.py --out_dir "./data/dna_rendering_processed" --scenes '["0007_01"]' --types '["fmasks"]'
```
Download the corresponding RGB images:
1. Download the raw data from the [official DNA-Rendering website](https://dna-rendering.github.io/inner-download.html).
2. Extract the RGB images from the archived dataset files using [the script](scripts/download/extract_dnar_images.py) below.
```sh
# Extract images from all scenes in `processed_root`. You may replace `raw_root` with your own path.
# You can also specify scenes by passing `--scenes '["0007_01"]'`
python scripts/download/extract_dnar_images.py --raw_root "./data/dna_rendering_release_data" --processed_root "./data/dna_rendering_processed"
```
The dataset file structure looks like:
```sh
└── 0007_01 # scene label
├── cameras # intermediate camera files
│ ├── ccm # easyvolcap cameras used to correct image color
│ ├── colmap # easyvolcap cameras used to undistort images
│ ├── intri.yml # easyvolcap camera intrinsics
│ └── extri.yml # easyvolcap camera extrinsics
├── fmasks # foreground masks
│ ├── 00 # camera label
│ │ ├── 000000.png # frame label
│ │ └── 000001.png
│ │ └── ... (148 more items)
│ └── 01
│ ├── 000000.png
│ └── 000001.png
│ └── ... (148 more items)
│ └── ... (46 more items)
├── images # rgb images
│ ├── 00
│ │ ├── 000000.webp
│ │ └── ... (149 more items)
│ └── ... (47 more items)
├── poses_2d # 2D projections of poses_3d
│ ├── 00
│ │ ├── 000000.json
│ │ └── ... (149 more items)
│ └── ... (47 more items)
├── poses_3d # 3D poses triangulated from Sapiens 2D poses
│ ├── 000000.json
│ └── ... (149 more items)
├── skeletons # rgb maps drawn from poses_2d
│ ├── 00
│ │ ├── 000000.webp
│ │ └── ... (149 more items)
│ └── ... (47 more items)
├── sparse_pcd.ply # sparse point cloud of the first frame
└── transforms.json # cameras in nerfstudio format
└── ... (1037 more items)
```
> [!tip]
> nerfstudio use the OpenGL/Blender coordinate convention for cameras. If you need the Colmap/OpenCV coordinate convention, please flip the Y and Z axes of the `transform_matrix`. For more details, see the [nerfstudio documentation](https://docs.nerf.studio/quickstart/data_conventions.html).
### Custom Data Processing
1. **Install**. You can use the inference environment to run all data processing scripts except `predict_keypoints.py`. If you want to predict keypoints with your data, please install `sapiens-lite` following the guidance in [lite/README.md](https://github.com/facebookresearch/sapiens/blob/main/lite/README.md) and [POSE_README.md](https://github.com/facebookresearch/sapiens/blob/main/lite/docs/POSE_README.md). Note that sapiens-lite requires pytorch<=2.4.1 (See this [issue](https://github.com/open-mmlab/mmdetection/issues/12008)). It is recommended to create a new conda environment to run it.
2. **Prepare the data**. Organize your data in the following directory structure. Note that:
- The recorded multi-view video data must be time-synchronized (i.e., the set of images under the same frame label is captured at the same moment).
- It is required to add a new element `camera_label` for each frame in `transforms.json` to indicate the corresponding camera.
```sh
{YOUR_DATA_DIR}
├── images # foreground masks
│ ├── 00 # camera label
│ │ ├── 000000.jpg # frame label
│ │ ├── 000001.jpg
│ │ └── ... (m more items)
│ └── 01
│ ├── 000000.jpg
│ ├── 000001.jpg
│ └── ... (m more items)
│ └── ... (n more items)
└── transforms.json # cameras in nerfstudio format
```
3. **Process the data**. Run the following script to preprocess the data, including predicting foreground masks, predicting 2D keypoints using Sapiens, triangulating and projecting keypoints, and drawing skeletons.
```sh
# Run all preprocessing scripts
bash scripts/preprocess/preprocess.sh --data_dir YOUR_DATA_DIR
# Run specific actions
bash scripts/preprocess/preprocess.sh --data_dir YOUR_DATA_DIR --actions triangulate_skeleton,draw_skeleton
```
## Todos
- [x] Release project page and paper.
- [x] Release inference code and models.
- [x] Release processed DNA-Rendering dataset.
- [x] Release custom data preprocessing scripts.
## Cite
```
@inproceedings{jin2025diffuman4d,
title={Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models},
author={Jin, Yudong and Peng, Sida and Wang, Xuan and Xie, Tao and Xu, Zhen and Yang, Yifan and Shen, Yujun and Bao, Hujun and Zhou, Xiaowei},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}
```