# Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

[Paper](https://arxiv.org/pdf/2407.05996), [Project Page](https://intuitive-robots.github.io/mdt_policy/), [RSS 2024](https://roboticsconference.org/program/papers/121/)

[Moritz Reuss](https://mbreuss.github.io/moritzreuss/)<sup>1</sup>, [Ömer Erdinç Yağmurlu](https://scholar.google.com/citations?user=I_Mxp5cAAAAJ&hl=en), Fabian Wenzel, [Rudolf Lioutikov](http://rudolf.intuitive-robots.net/)<sup>1</sup>

<sup>1</sup>Intuitive Robots Lab, Karlsruhe Institute of Technology

![MDT Architecture](https://intuitive-robots.github.io/mdt_policy/static/image/mdt-v-figure.png)

This is the official code repository for the paper [Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals](https://openreview.net/pdf?id=Pt6fLfXMRW). Pre-trained models are available [here](https://drive.google.com/drive/folders/13EDBcdYyOV7FsF9Z7Eb0YN8aMTrtsAsi).

## Performance

Results on the [CALVIN](https://github.com/mees/calvin) benchmark (1000 instruction chains):

| Train | Method | 1 | 2 | 3 | 4 | 5 | **Avg. Len.** |
|-------|--------|---|---|---|---|---|---------------|
| **D** | HULC | 82.5% | 66.8% | 52.0% | 39.3% | 27.5% | 2.68±(0.11) |
| | LAD | 88.7% | 69.9% | 54.5% | 42.7% | 32.2% | 2.88±(0.19) |
| | Distill-D | 86.7% | 71.5% | 57.0% | 45.9% | 35.6% | 2.97±(0.04) |
| | MT-ACT | 88.4% | 72.2% | 57.2% | 44.9% | 35.3% | 2.98±(0.05) |
| | **MDT (ours)** | **93.3%** | **82.4%** | **71.5%** | **60.9%** | **51.1%** | **3.59±(0.07)** |
| | **MDT-V (ours)** | **93.9%** | **83.8%** | **73.5%** | **63.9%** | **54.9%** | **3.70±(0.03)*** |
| **ABCD** | HULC | 88.9% | 73.3% | 58.7% | 47.5% | 38.3% | 3.06±(0.07) |
| | Distill-D | 86.3% | 72.7% | 60.1% | 51.2% | 41.7% | 3.16±(0.06) |
| | MT-ACT | 87.1% | 69.8% | 53.4% | 40.0% | 29.3% | 2.80±(0.03) |
| | RoboFlamingo | 96.4% | 89.6% | 82.4% | 74.0% | 66.0% | 4.09±(0.00) |
| | **MDT (ours)** | **97.8%** | **93.8%** | **88.8%** | **83.1%** | **77.0%** | **4.41±(0.03)** |
| | **MDT-V (ours)** | **99.1%** | **96.8%** | **92.8%** | **88.5%** | **83.1%** | **4.60±(0.05)*** |

*: The paper reports 3.72±(0.05) (D) and 4.52±(0.02) (ABCD); performance here is higher due to fixes in the camera-ready code version.
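**Avg. Len.** is the mean number of consecutively completed tasks per five-task instruction chain. Because a rollout stops at the first failure, it equals the sum of the five per-position success rates, which you can verify against any row above. A minimal check in Python, using the MDT D-split row:

```python
# Success rates at chain positions 1-5 for MDT trained on the D split
# (taken from the table above).
success_rates = [0.933, 0.824, 0.715, 0.609, 0.511]

# The expected chain length is E[L] = sum_i P(at least i tasks solved),
# i.e. the sum of the per-position success rates.
avg_len = sum(success_rates)
print(f"Avg. Len. = {avg_len:.2f}")  # -> Avg. Len. = 3.59
```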
## Installation

To begin, clone the repository locally:

```bash
git clone --recurse-submodules git@github.com:intuitive-robots/mdt_policy.git
export MDT_ROOT=$(pwd)/mdt_policy
```

Install the requirements (note: we provide a modified version of pyhash, given the numerous problems that arise when installing it manually):

```bash
cd $MDT_ROOT
conda create -n mdt_env python=3.8
conda activate mdt_env
cd calvin_env/tacto
pip install -e .
cd ..
pip install -e .
cd ..
pip install setuptools==57.5.0
cd pyhash-0.9.3
python setup.py build
python setup.py install
cd ..
```

Next, install the remaining packages:

```bash
pip install -r requirements.txt
```

## Evaluation

### Step 1 - Download CALVIN Datasets

If you want to train or evaluate on the [CALVIN](https://github.com/mees/calvin) dataset, choose a split and download it with:

```bash
cd $MDT_ROOT/dataset
sh download_data.sh D | ABCD
```

#### (Optional) Preprocessing with CALVIN

Since MDT uses action chunking, it needs to load multiple (~10) `episode_{}.npz` files for each sample. Combined with batching, this requires a large amount of disk bandwidth per iteration (usually ~2000 MB/iteration), which can significantly reduce your GPU utilization during training, depending on your hardware. You can therefore use the script `extract_by_key.py` to extract the data into a single file, avoiding opening too many episode files when using the CALVIN dataset.

##### Usage example:

```shell
python preprocess/extract_by_key.py -i /YOUR/PATH/TO/CALVIN/ \
    --in_task all
```

##### Params:

Run this command to see more detailed information:

```shell
python preprocess/extract_by_key.py -h
```

Important params:

* `--in_root`: `/YOUR/PATH/TO/CALVIN/`, e.g. `/data3/geyuan/datasets/CALVIN/`
* `--extract_key`: a key of `dict(episode_xxx.npz)`; default is **'rel_actions'**. The saved file name depends on this key (i.e. `ep_{extract_key}.npy`).

Optional params:

* `--in_task`: default is **'all'**, meaning all task folders (e.g. `task_ABCD_D/`) of CALVIN
* `--in_split`: default is **'all'**, meaning both `training/` and `validation/`
* `--out_dir`: optional; default is **'None'**, which resolves to `{in_root}/{in_task}/{in_split}/extracted/`
* `--force`: whether to overwrite existing extracted data

Thanks to @ygtxr1997 for debugging the GPU utilization issue and providing a merge request.
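Once extraction has finished, the resulting file can be read back directly. Here is a rough sketch (the path is hypothetical and the `allow_pickle` flag is an assumption about the saved format; inspect your own output to confirm):

```python
import numpy as np

# Hypothetical path: by default, extract_by_key.py writes ep_{extract_key}.npy
# into {in_root}/{in_task}/{in_split}/extracted/ -- adjust to your setup.
path = "/YOUR/PATH/TO/CALVIN/task_ABCD_D/training/extracted/ep_rel_actions.npy"

# allow_pickle=True is a defensive assumption in case the file stores an
# object array; drop it if your file is a plain numeric array.
rel_actions = np.load(path, allow_pickle=True)
print(type(rel_actions), getattr(rel_actions, "shape", None))
```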
### Step 2 - Download Pre-trained Models

| Name | Split | 1 | 2 | 3 | 4 | 5 | **Avg. Len.** | Model | Seed | eval `sigma_min` |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| [mdtv-1-abcd](https://drive.google.com/drive/folders/1oHKR2BS5H0NJR20hbmqfppnl8pQGEGmt) | ABCD -> D | 99.6% | 97.5% | 94.0% | 90.2% | 84.7% | **4.66** | MDT-V | 142 | 1.000 |
| [mdtv-2-abcd](https://drive.google.com/drive/folders/1O_Oxl7LNRmnFStU9pPgk_THsPmJeAW3I) | ABCD -> D | 99.0% | 96.2% | 92.4% | 88.3% | 82.5% | **4.58** | MDT-V | 242 | 1.000 |
| [mdtv-3-abcd](https://drive.google.com/drive/folders/1dwCitl3b9I0h45v109hjj8JzBcRUGWYL) | ABCD -> D | 98.9% | 96.7% | 92.1% | 87.1% | 82.1% | **4.57** | MDT-V | 42 | 1.000 |
| [mdtv-1-d](https://drive.google.com/drive/folders/1Sr-VkWglmAr-9sS4MJsbfigTX6JVLBrk) | D -> D | 93.8% | 83.4% | 72.6% | 63.2% | 54.4% | **3.67** | MDT-V | 142 | 0.001 |
| [mdtv-2-d](https://drive.google.com/drive/folders/1mkEfjM2Cdb7OaRIGGvN_UEy_8TGwqluW) | D -> D | 94.0% | 84.0% | 73.3% | 63.5% | 54.2% | **3.69** | MDT-V | 242 | 0.001 |
| [mdtv-3-d](https://drive.google.com/drive/folders/1nsSYEyhR8f-UfzRR0LfLWUkWlLc3qgfI) | D -> D | 93.9% | 84.0% | 74.6% | 65.0% | 56.3% | **3.74** | MDT-V | 42 | 1.000 |

All of the aforementioned models are available [here](https://drive.google.com/drive/folders/13EDBcdYyOV7FsF9Z7Eb0YN8aMTrtsAsi).

### Step 3 - Run

Adjust `conf/mdt_evaluate.conf` according to the model you downloaded. Important keys are:

- `voltron_cache`: set a read/write path to avoid re-downloading Voltron from Hugging Face on every run.
- `num_videos`: the first `n` rollouts of the model will be recorded as GIFs.
- `dataset_path` and `train_folder`: locations of the downloaded dataset and model files.
- `sigma_min`: use the value from the table above for best results with a given model.

Then run:

```bash
python mdt/evaluation/mdt_evaluate.py
```

## Training

To train the MDT model with the maximum number of available GPUs, run:

```bash
python mdt/training.py
```

To replicate the original training results, we recommend using 4 GPUs with a batch size of 128, training for 30 epochs on ABCD (20 epochs on D only). You can find the exact configuration used for each of our pre-trained models in `.hydra/config.yaml` under the respective model's folder.

---

## Acknowledgements

This work is only possible because of the code from the following open-source projects and datasets:

#### CALVIN
Original: [https://github.com/mees/calvin](https://github.com/mees/calvin)
License: [MIT](https://github.com/mees/calvin/blob/main/LICENSE)

#### BESO
Original: [https://github.com/intuitive-robots/beso](https://github.com/intuitive-robots/beso)
License: [MIT](https://github.com/intuitive-robots/beso/blob/main/LICENSE)

#### Voltron
Original: [https://github.com/siddk/voltron-robotics](https://github.com/siddk/voltron-robotics)
License: [MIT](https://github.com/siddk/voltron-robotics/blob/main/LICENSE)

#### OpenAI CLIP
Original: [https://github.com/openai/CLIP](https://github.com/openai/CLIP)
License: [MIT](https://github.com/openai/CLIP/blob/main/LICENSE)

#### HULC
Original: [https://github.com/lukashermann/hulc](https://github.com/lukashermann/hulc)
License: [MIT](https://github.com/lukashermann/hulc/blob/main/LICENSE)

## Citation

If you find the code useful, please cite our work:

```bibtex
@inproceedings{reuss2024multimodal,
  title={Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals},
  author={Moritz Reuss and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Fabian Wenzel and Rudolf Lioutikov},
  booktitle={Robotics: Science and Systems},
  year={2024}
}
```