# PantoMatrix

**Generating Face and Body Animation from Speech**

PantoMatrix is an open-source research project for generating 3D body and face animation from speech. It works as an API: it takes speech audio as input and outputs body and face motion parameters. You can convert these motion parameters to other formats, such as iPhone ARKit blendshape weights or Vicon skeleton BVH files.
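The outputs are plain NumPy `.npz` archives, so they are easy to inspect or post-process. Below is a minimal sketch; the file path is a placeholder for any result produced by the test scripts in Section 3:

```python
# Minimal sketch: inspect the motion parameters saved by the test scripts.
# The .npz path below is a placeholder; use any file written to --save_folder.
import numpy as np

motion = np.load("./examples/motion/result_motion.npz", allow_pickle=True)
for key in motion.files:
    arr = motion[key]
    # Each entry is a NumPy array, e.g. per-frame SMPLX pose / FLAME expression parameters.
    print(key, getattr(arr, "shape", type(arr)))
```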

## 1. News

*(Animation example)*
Volunteers are welcome to contribute and collaborate on related topics; feel free to submit pull requests! This repo has been mainly maintained by haiyangliu1997@gmail.com in their free time since 2022.

- **[2025/01]** A demo of how to set up inference and training is available on [Colab](https://colab.research.google.com/drive/1MeuZtBv8yUUG9FFeN8UGy78Plk4gzxT4?usp=sharing)!
- **[2025/01]** New inference API, visualization API, evaluation API, and training codebase are available!
- **[2024/07]** [Download SMPLX motion (.npz) files](https://huggingface.co/spaces/H-Liu1997/EMAGE), visualize them with our Blender addon, and retarget them to your avatar!
- **[2024/04]** Thanks to [@camenduru](https://twitter.com/camenduru), a Replicate version of EMAGE is available! You can call EMAGE directly via API!
- **[2024/03]** Thanks to [@sunday9999](https://github.com/sunday9999) for speeding up the inference video rendering from 1000s to 25s!
- **[2024/02]** Thanks to [@wubowen416](https://github.com/wubowen416) for the [scripts for automatic video visualization #83](https://github.com/PantoMatrix/PantoMatrix/issues/83) during inference!
- **[2023/05]** [BEAT_GENEA](https://drive.google.com/file/d/1wYW7eWAYPkYZ7WPOrZ9Z_GIll13-FZfx/view?usp=share_link) is allowed for pretraining in [GENEA2023](https://genea-workshop.github.io/2023/challenge/)! Thanks to GENEA's organizers!
## 2. Models and Tools List

| Model | Paper | Inputs | Outputs** | Language (Train) | Full Body FGD | Weights |
|-------|-------|--------|-----------|------------------|---------------|---------|
| DisCo | ACMMM 2022 | Audio | Upper + Hands | English (Speaker 2) | 2.233 | [Link](https://huggingface.co/H-Liu1997/disco_audio/tree/main) |
| CaMN | ECCV 2022 | Audio | Upper + Hands | English (Speaker 2) | 2.120 | [Link](https://huggingface.co/H-Liu1997/camn_audio/tree/main) |
| EMAGE | CVPR 2024 | Audio | Full Body + Face | English (Speaker 2) | 0.615 | [Link](https://huggingface.co/H-Liu1997/emage_audio/tree/main) |

** Outputs are in SMPLX and FLAME parameters.

| | | | |
|--------|----------------------|--------------------|---------------------|
| Datasets | [BEAT2 (SMPLX+FLAME)](https://huggingface.co/datasets/H-Liu1997/BEAT2) | [BEAT (BVH + ARKit)](https://huggingface.co/datasets/H-Liu1997/BEAT) | [Rendered Skeleton Videos](https://huggingface.co/datasets/H-Liu1997/BEAT_Rendered_Videos) |
| Blender Tools | [Blender Addon](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/smplx_blender_addon_20230921.zip) | [Characters on BEAT](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/BEAT_Avatars.zip) | [Blender Render Scripts](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/rendervideo.zip) |
| SMPLX Tools | [SMPLX-FLAME Model](https://huggingface.co/H-Liu1997/emage_evaltools/blob/main/smplx_models/smplx/SMPLX_NEUTRAL_2020.npz) | [ARKit2FLAME Scripts](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/ARkit_FLAME.zip) | [ARKit2FLAME Weights](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/mat_final.npy) |
| Weights | [FGD on BEAT2](https://huggingface.co/H-Liu1997/emage_evaltools/blob/main/AESKConv_240_100.bin) | [FGD on BEAT](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/beat_weights/ae_300.bin) | [Text Vocab](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/emage_weights/vocab.pkl) |
## 3. Quick Start (Inference)

#### Approach 1: Using Hugging Face Space

Upload your audio and directly download the results from our Hugging Face Space.
*(Animation example)*
#### Approach 2: Local Setup

Clone the repository and set up locally.
A demo of how to set up is available on [Colab](https://colab.research.google.com/drive/1MeuZtBv8yUUG9FFeN8UGy78Plk4gzxT4?usp=sharing).

```bash
git clone https://github.com/PantoMatrix/PantoMatrix.git
cd PantoMatrix/
bash setup.sh
source /content/py39/bin/activate
python test_camn_audio.py --visualization

# if you have trouble installing pytorch3d,
# use --nopytorch3d; this skips rendering the 2D OpenPose-style video
python test_camn_audio.py --visualization --nopytorch3d

# try different models with your data; put your audio in --audio_folder
# DisCo (ACMMM 2022): upper-body motion, with data resampling and content/rhythm disentanglement
python test_disco_audio.py --visualization --audio_folder ./examples/audio --save_folder ./examples/motion

# CaMN (ECCV 2022, BEAT paper): upper-body motion, with a body2hands decoder
python test_camn_audio.py --visualization --audio_folder ./examples/audio --save_folder ./examples/motion

# EMAGE (CVPR 2024): full-body + face animation
python test_emage_audio.py --visualization --audio_folder ./examples/audio --save_folder ./examples/motion
```

#### Approach 3: Call API Directly

```python
# copy the ./models folder into your project folder
from models.camn_audio import CaMNAudioModel

model = CaMNAudioModel.from_pretrained("H-Liu1997/camn_audio")
model.cuda().eval()

import librosa
import numpy as np
import torch

# copy the ./emage_utils folder into your project folder
from emage_utils import beat_format_save

audio_np, sr = librosa.load("/audio_path.wav", sr=model.cfg.audio_sr)
audio = torch.from_numpy(audio_np).float().cuda().unsqueeze(0)
motion_pred = model(audio)["motion_axis_angle"]
motion_pred_np = motion_pred.cpu().numpy()
beat_format_save(motion_pred_np, "/result_motion.npz")
```
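If you have a whole folder of audio files, the same API calls can simply be looped. This is a minimal sketch reusing only the calls shown above, assuming the `models` and `emage_utils` folders are set up as described; the folder paths are placeholders:

```python
# Minimal sketch: run the CaMN audio API over every .wav in a folder.
# Paths are placeholders; only calls from the snippet above are used.
import glob
import os

import librosa
import torch

from models.camn_audio import CaMNAudioModel
from emage_utils import beat_format_save

model = CaMNAudioModel.from_pretrained("H-Liu1997/camn_audio")
model.cuda().eval()

audio_folder = "./examples/audio"
save_folder = "./examples/motion"
os.makedirs(save_folder, exist_ok=True)

with torch.no_grad():
    for wav_path in sorted(glob.glob(os.path.join(audio_folder, "*.wav"))):
        audio_np, sr = librosa.load(wav_path, sr=model.cfg.audio_sr)
        audio = torch.from_numpy(audio_np).float().cuda().unsqueeze(0)
        motion_pred_np = model(audio)["motion_axis_angle"].cpu().numpy()
        out_path = os.path.join(save_folder, os.path.basename(wav_path).replace(".wav", ".npz"))
        beat_format_save(motion_pred_np, out_path)
```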
## 4. Visualization

When you run the test scripts, the `--visualization` flag automatically enables visualization.
Besides, you can also try the visualization approaches below.

#### Approach 1: Blender (Recommended)

Render the output in Blender by downloading the [blender addon](https://huggingface.co/datasets/H-Liu1997/BEAT2_Tools/blob/main/smplx_blender_addon_20230921.zip).

#### Approach 2: 3D Mesh

```python
# render an npz file to a mesh video
from emage_utils import fast_render

fast_render.render_one_sequence_no_gt("/result_motion.npz", "/audio_path.wav", "/result_video.mp4", remove_global=True)
```
*(Example mesh renders: DisCo, CaMN, EMAGE)*
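To batch-render every result in a save folder, the call above can be wrapped in a small loop. The sketch below assumes each motion `.npz` has a matching `.wav` with the same name in the audio folder; adjust the paths to your setup:

```python
# Sketch: batch-render all .npz results to mesh videos with matching audio.
# The folder layout is an assumption; adjust the paths for your own data.
import glob
import os

from emage_utils import fast_render

motion_folder = "./examples/motion"
audio_folder = "./examples/audio"

for npz_path in sorted(glob.glob(os.path.join(motion_folder, "*.npz"))):
    name = os.path.splitext(os.path.basename(npz_path))[0]
    audio_path = os.path.join(audio_folder, name + ".wav")
    video_path = npz_path.replace(".npz", "_mesh.mp4")
    fast_render.render_one_sequence_no_gt(npz_path, audio_path, video_path, remove_global=True)
```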
#### Approach 3: 2D OpenPose-Style Video (Requires PyTorch3D)

```python
import numpy as np
from torchvision.io import write_video

from emage_utils.format_transfer import render2d
from emage_utils import fast_render

motion_dict = np.load(npz_path, allow_pickle=True)

# face
v2d_face = render2d(motion_dict, (512, 512), face_only=True, remove_global=True)
write_video(npz_path.replace(".npz", "_2dface.mp4"), v2d_face.permute(0, 2, 3, 1), fps=30)
fast_render.add_audio_to_video(npz_path.replace(".npz", "_2dface.mp4"), audio_path, npz_path.replace(".npz", "_2dface_audio.mp4"))

# body
v2d_body = render2d(motion_dict, (720, 480), face_only=False, remove_global=True)
write_video(npz_path.replace(".npz", "_2dbody.mp4"), v2d_body.permute(0, 2, 3, 1), fps=30)
fast_render.add_audio_to_video(npz_path.replace(".npz", "_2dbody.mp4"), audio_path, npz_path.replace(".npz", "_2dbody_audio.mp4"))
```
*(Example 2D pose renders: DisCo, CaMN, EMAGE, EMAGE-Face)*

## 5. Evaluation

For academic users, the evaluation code is organized into an evaluation API.

```python
# copy the ./emage_evaltools folder into your folder
from emage_evaltools.metric import FGD, BC, L1div, LVDFace, MSEFace

# init
fgd_evaluator = FGD(download_path="./emage_evaltools/")
bc_evaluator = BC(download_path="./emage_evaltools/", sigma=0.3, order=7)
l1div_evaluator = L1div()
lvd_evaluator = LVDFace()
mse_evaluator = MSEFace()

# example usage
# get_motion_rep_numpy, rc (rotation conversions), t, betas, etc. come from your own
# evaluation script / this repo's training code
for motion_pred in all_motion_pred:
    # BC and L1div require the position representation
    motion_position_pred = get_motion_rep_numpy(motion_pred, device=device, betas=betas)["position"]  # t*55*3
    motion_position_pred = motion_position_pred.reshape(t, -1)
    # ignore the first and last 2s; this may be for the BEAT dataset only
    audio_beat = bc_evaluator.load_audio(test_file["audio_path"], t_start=2 * 16000, t_end=int((t-60)/30*16000))
    motion_beat = bc_evaluator.load_motion(motion_position_pred, t_start=60, t_end=t-60, pose_fps=30, without_file=True)
    bc_evaluator.compute(audio_beat, motion_beat, length=t-120, pose_fps=30)

    l1div_evaluator.compute(motion_position_pred)

    face_position_pred = get_motion_rep_numpy(motion_pred, device=device, expressions=expressions_pred, expression_only=True, betas=betas)["vertices"]  # t, -1
    face_position_gt = get_motion_rep_numpy(motion_gt, device=device, expressions=expressions_gt, expression_only=True, betas=betas)["vertices"]
    lvd_evaluator.compute(face_position_pred, face_position_gt)
    mse_evaluator.compute(face_position_pred, face_position_gt)

    # FGD requires the rotation 6D representation
    motion_gt = torch.from_numpy(motion_gt).to(device).unsqueeze(0)
    motion_pred = torch.from_numpy(motion_pred).to(device).unsqueeze(0)
    motion_gt = rc.axis_angle_to_rotation_6d(motion_gt.reshape(1, t, 55, 3)).reshape(1, t, 55*6)
    motion_pred = rc.axis_angle_to_rotation_6d(motion_pred.reshape(1, t, 55, 3)).reshape(1, t, 55*6)
    fgd_evaluator.update(motion_pred.float(), motion_gt.float())

metrics = {}
metrics["fgd"] = fgd_evaluator.compute()
metrics["bc"] = bc_evaluator.avg()
metrics["l1"] = l1div_evaluator.avg()
metrics["lvd"] = lvd_evaluator.avg()
metrics["mse"] = mse_evaluator.avg()
```

Hyperparameters may vary depending on the dataset.
For example, for the BC metric we use `(sigma, order) = (0.3, 7)` on the BEAT dataset and `(0.5, 7)` on the TalkShow dataset. You may adjust these based on your data.
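Switching these hyperparameters only changes the `BC` constructor call shown above. A short sketch (the download path is whatever you used for the other evaluators):

```python
from emage_evaltools.metric import BC

# BEAT-style setup (as in the snippet above)
bc_evaluator_beat = BC(download_path="./emage_evaltools/", sigma=0.3, order=7)

# TalkShow-style setup: wider sigma, same filter order
bc_evaluator_talkshow = BC(download_path="./emage_evaltools/", sigma=0.5, order=7)
```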
## 6. Training

This new codebase only includes the audio-only version of each model, for better real-world applicability.
For reproducing the audio+text results in the papers, please check and reference the previous codebase linked below.

| Model | Inputs (Paper) | Old Codebase | Inputs (Current Codebase) |
|-------|----------------|--------------|---------------------------|
| DisCo | Audio + Text | [link](https://github.com/PantoMatrix/PantoMatrix/tree/6ca70b9541285b124da2eeedcd80f7c5a54eb111/scripts/DisCo_2022) | Audio |
| CaMN | Audio + Text + Emotion + Facial | [link](https://github.com/PantoMatrix/PantoMatrix/tree/6ca70b9541285b124da2eeedcd80f7c5a54eb111/scripts/BEAT_2022) | Audio |
| EMAGE | Audio + Text | [link](https://github.com/PantoMatrix/PantoMatrix/tree/6ca70b9541285b124da2eeedcd80f7c5a54eb111/scripts/EMAGE_2024) | Audio |

#### Before Start

Environment setup; skip this if you already set it up for inference.

```bash
# if you didn't run the tests, run the four commands below first.
# git clone https://github.com/PantoMatrix/PantoMatrix.git
# cd PantoMatrix/
# bash setup.sh
# source /content/py39/bin/activate

# download BEAT2
sudo apt-get update
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/datasets/H-Liu1997/BEAT2
```

Your folder structure should look like the following for the paths to be correct:

```bash
/your_root/
|-- PantoMatrix
|   |-- BEAT2
|   `-- train_emage_audio.py
```

#### Method 1: Training EMAGE

```bash
# preprocessing: extract the foot contact data
python ./datasets/foot_contact.py

# (todo) train the vqvae

# train the audio2motion model
torchrun --nproc_per_node 1 --nnodes 1 train_emage_audio.py --config ./configs/emage_audio.yaml --evaluation
```

Use these flags as needed:

- `--evaluation`: Calculate the test metrics.
- `--wandb`: Activate logging to WandB.
- `--visualization`: Render test results (slow; disable for efficiency).
- `--test`: Test mode; load the last checkpoint and evaluate.
- `--debug`: Debug mode; iterate over one data point for fast testing.

#### Method 2: Training CaMN

```bash
torchrun --nproc_per_node 1 --nnodes 1 train_camn_audio.py --config ./configs/camn_audio.yaml --evaluation
```

#### Method 3: Training DisCo

```bash
# (optional) extract the cluster information
# python ./datasets/clustering.py

# train audio2motion
torchrun --nproc_per_node 1 --nnodes 1 train_disco_audio.py --config ./configs/disco_audio.yaml --evaluation
```

## Reference

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling (CVPR 2024)

Haiyang Liu*, Zihao Zhu*, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Naoya Iwamoto, Bo Zheng, Michael J. Black

(*Equal Contribution)

Project Page - Paper - Video - Code - Demo - Dataset - Blender Add-On

------------

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis (ECCV 2022)

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, Bo Zheng

Project Page - Paper - Video - Code - Colab Demo - Dataset - Benchmark

------------

DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gesture Synthesis (ACMMM 2022)

Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, Bo Zheng

Project Page - Paper - Video - Code