# VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
[![Paper](https://img.shields.io/badge/Paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2509.09372) [![Hugging Face Collection](https://img.shields.io/badge/Models-fcd022?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/VLA-Adapter) [![Twitter](https://img.shields.io/badge/AK-%23000000.svg?style=for-the-badge&logo=x&logoColor=white)](https://x.com/_akhaliq/status/1966610780838621241) [![WeChat](https://img.shields.io/badge/WeChat--Group-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https://github.com/OpenHelix-Team/VLA-Adapter/issues/1)
### The official implementation of **VLA-Adapter**.

> **๐Ÿ“ Paper: https://arxiv.org/abs/2509.09372**
> **๐ŸŒ Project page: https://vla-adapter.github.io/**
> **๐Ÿค— HuggingFace: https://huggingface.co/VLA-Adapter**
> **Github: https://github.com/OpenHelix-Team/VLA-Adapter**
## :loudspeaker: News!

- **[2025/09/22]** We released our code! An enhanced **Pro** version is also released (it follows the pipeline of the original paper but has an optimized implementation). Everyone is welcome to use it! 🎉
- **[2025/09/13]** Our paper won 🥇 **first place** on the [daily list](https://huggingface.co/papers/date/2025-09-12), 🥈 **second place** on the [weekly list](https://huggingface.co/papers/week/2025-W37), and 🥉 **third place** on the [monthly list](https://huggingface.co/papers/month/2025-09) on Hugging Face! ⭐
- **[2025/09/13]** Our paper was listed among the [Trending Papers](https://huggingface.co/papers/trending) on Hugging Face! ⭐
- **[2025/09/12]** We released the original version of VLA-Adapter for the four LIBERO models on [Hugging Face](https://huggingface.co/VLA-Adapter).
- **[2025/09/11]** We released our paper on [arXiv](https://arxiv.org/abs/2509.09372).
## :black_nib: TODO List

- [x] Release **checkpoints** for reproduction.
- [x] Release the [VLA-Adapter v2 paper](https://arxiv.org/abs/2509.09372).
- [ ] Release a more **powerful version**, **VLA-Adapter++**, together with a detailed **technical report** 📝.
- [ ] Continue to update the code for deployment on various **real-world systems**, including the configuration from our paper, Franka, UR-5, and AGILE Piper.
- [ ] Add compatibility with **various foundation models**, including but not limited to [VPP](https://arxiv.org/abs/2412.14803) and [π0.5](https://arxiv.org/abs/2504.16054).
- [ ] Add **diffusion transformer** and **flow matching** policy networks; the results will be reported in the upcoming VLA-Adapter++ technical report.
- [ ] Add more experiments with a **frozen backbone**.
- [ ] Further expand **generalization**. Work is in progress, so please stay tuned!
- [ ] **RL post-training** is also in progress. Interested researchers are welcome to join us in building this foundation!
- [ ] Explore the **dual-system compatibility** of VLA-Adapter.
## 🌟 Table of Contents

- [:rocket: Quick Start](#rocket-quick-start)
  - [Conda Environment of VLA-Adapter](#conda-environment-of-vla-adapter)
  - [Install Dependencies](#install-dependencies)
- [:pencil: Data Preparation](#pencil-data-preparation)
  - [LIBERO Benchmark](#libero-benchmark)
  - [CALVIN Benchmark](#calvin-benchmark)
  - [:video_game: Our Dependencies](#video_game-our-dependencies)
  - [:pushpin: Benchmark Location](#pushpin-benchmark-location)
- [⚓ VLM backbone](#vlm)
- [:fire: Training for Different Configurations](#fire-training-for-different-configurations)
  => Provides **training configurations** for GPUs ranging from **10GB** to **80GB** of VRAM.
  - [:books: Related File for Training](#books-related-file-for-training)
  - [:ledger: How to Train on Extremely Limited VRAM GPUs](#ledger-how-to-train-on-extremely-limited-vram-gpus)
    => A card with 10GB-12GB *(e.g. NVIDIA GeForce RTX 2080 Ti, 3060, 3080, 4070, 4080, and 5070)*
  - [:ledger: How to Train on Low VRAM GPUs](#ledger-how-to-train-on-low-vram-gpus)
    => A card with 24GB *(e.g. NVIDIA GeForce RTX 3090 and 4090)*
  - [:ledger: How to Train on Larger VRAM GPUs](#ledger-how-to-train-on-larger-vram-gpus)
    => A consumer GPU with 32GB *(e.g. NVIDIA GeForce RTX 5090)* or a professional-grade GPU with 40GB-48GB *(e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000)*
  - [:ledger: How to Train on Sufficient VRAM GPUs](#ledger-how-to-train-on-sufficient-vram-gpus)
    => Professional-grade GPUs with ≥80GB *(e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200)*
- [:mechanical_arm: Inference](#mechanical_arm-inference)
  - [:books: Related File for Inference](#books-related-file-for-inference)
  - [🤗 Checkpoint of VLA-Adapter](#ckpts)
  - [:notebook: How to Eval](#evals)
- [🌈 Success Rate Comparison](#results)
- [📝 Citation](#cite)
- [:heart: Acknowledgment](#heart-acknowledgment)
## :rocket: Quick Start

### Conda Environment of VLA-Adapter

```bash
# Create and activate conda environment
conda create -n vla-adapter python=3.10.16 -y
conda activate vla-adapter
```

### Install Dependencies

```bash
# Install PyTorch
# Use a command specific to your machine: https://pytorch.org/get-started/locally/
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0

# Clone the VLA-Adapter repo and pip install to download dependencies
git clone https://github.com/OpenHelix-Team/VLA-Adapter.git
cd VLA-Adapter
pip install -e .

pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install "flash-attn==2.5.5" --no-build-isolation
# If you run into difficulty, try `pip cache remove flash_attn` first, or download a prebuilt
# wheel from https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5.
# Pick the `.whl` file matching the CUDA version reported by `nvidia-smi`, then run
# `pip install flash_attn-2.5.5+cuXX...whl` to install it.
# We use the `flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` file.
```
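If you want a quick sanity check of the install, the hedged snippet below (our illustration, not part of the official setup) confirms that PyTorch sees a GPU and that FlashAttention 2 imports cleanly:

```bash
# Optional sanity check inside the "vla-adapter" environment
python -c "import torch; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"
```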

## :pencil: Data Preparation

### LIBERO Benchmark

- **(Optional)** Clone and install the [LIBERO repo](https://github.com/Lifelong-Robot-Learning/LIBERO) and required packages:

```bash
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt  # From the VLA-Adapter base dir
```

To download the [LIBERO datasets](https://huggingface.co/datasets/openvla/modified_libero_rlds) that we used in our fine-tuning experiments, run the command below. It downloads the `Spatial`, `Object`, `Goal`, and `Long` datasets in `RLDS` format, i.e., `libero_spatial_no_noops`, `libero_object_no_noops`, `libero_goal_no_noops`, and `libero_10_no_noops` (`"_no_noops"` stands for no no-op actions, i.e., training samples with near-zero actions are filtered out). These datasets require `~10GB` of disk space in total. If needed, see details on how to download the original non-RLDS datasets [here](https://github.com/openvla/openvla?tab=readme-ov-file#libero-setup). You can use these to fine-tune Prismatic-VLMs (built on Qwen2.5-0.5B) or other VLMs.

```bash
git clone git@hf.co:datasets/openvla/modified_libero_rlds
```

🌟 Attention! The dataset folders downloaded this way must have the `modified_` prefix removed so that they match the paths in [:pushpin: Benchmark Location](#pushpin-benchmark-location); see the renaming sketch after the directory tree below.

When using LIBERO, you may get an error message like `AttributeError: 'NoneType' object has no attribute 'eglQueryString'`. You can fix it with:

```bash
sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
```

### CALVIN Benchmark

- **(Optional)**

```bash
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
# Installation of `pyhash` may fail on some machines. If it fails, you can solve it by
# lowering the `setuptools` version: `pip install setuptools==57.5.0`
sh install.sh
```

To download the [CALVIN ABC→D datasets](https://github.com/mees/calvin/tree/main/dataset) that we used in our fine-tuning experiments, run:

```bash
cd $CALVIN_ROOT/dataset
sh download_data.sh ABC
```

If you prefer the RLDS format, you can download it from [here](https://huggingface.co/datasets/zhouhongyi/calvin_abc_rlds). This dataset requires `~50GB` of disk space. When using CALVIN, you may get an error message like `AttributeError: 'NoneType' object has no attribute 'eglQueryString'`. You can fix it with:

```bash
sudo apt-get update
sudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
```

### :video_game: Our Dependencies

- **(including LIBERO and CALVIN)**

At this point, the environment is fully installed. If you want to confirm that the environment is correct, see the `our_envs.txt` file we released.

### :pushpin: Benchmark Location

Place the downloaded datasets in the `/data` folder.
The overall directory structure is as follows:

```
data
├── libero
│   ├── libero_10_no_noops
│   │   └── 1.0.0   (contains some json files and 32 tfrecord files)
│   ├── libero_goal_no_noops
│   │   └── 1.0.0   (contains some json files and 16 tfrecord files)
│   ├── libero_object_no_noops
│   │   └── 1.0.0   (contains some json files and 32 tfrecord files)
│   └── libero_spatial_no_noops
│       └── 1.0.0   (contains some json files and 16 tfrecord files)
├── calvin_abc
│   └── 1.0.0       (contains some json files, 512 train tfrecord files, and 32 valid tfrecord files)
└── other benchmarks ...
```
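As a hedged illustration of the `modified_` renaming note above (the folder names produced by the clone are assumptions and may differ on your machine), the downloaded suites could be moved into `data/libero` like this:

```bash
# Adjust paths to match what the clone actually produced.
mkdir -p data/libero
# Move the suites into place:
mv modified_libero_rlds/libero_*_no_noops data/libero/
# If any folder still carries a "modified_" prefix, strip it:
for d in data/libero/modified_*; do
  [ -e "$d" ] && mv "$d" "${d/modified_/}"
done
```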

## ⚓ VLM backbone

We use the `Prismatic-VLMs` architecture. Since the file is large, please download it from [here](https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b) and put it in the `/pretrained_models` folder. The file structure is:

```
pretrained_models
├── configs
└── prism-qwen25-extra-dinosiglip-224px-0_5b
```
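For example, one way to fetch the backbone into `pretrained_models/` (a hedged sketch; Option 1 assumes `git-lfs` is installed, Option 2 assumes the `huggingface_hub` CLI is available from the dependencies above):

```bash
mkdir -p pretrained_models
cd pretrained_models

# Option 1: clone with git-lfs
git lfs install
git clone https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b

# Option 2: use the Hugging Face CLI
# huggingface-cli download Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b \
#   --local-dir prism-qwen25-extra-dinosiglip-224px-0_5b

cd ..
```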

## :fire: Training for Different Configurations

**We provide different training configurations for different users. Choose the configuration that suits your GPU.**

### :books: Related File for Training

* `vla-scripts/finetune.py`: VLA fine-tuning script

### :ledger: How to Train on Extremely Limited VRAM GPUs

***=> Extremely limited VRAM (a card with 10GB-12GB), e.g. NVIDIA GeForce RTX 2080 Ti, 3060, 3080, 4070, 4080, and 5070.***

>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.*** If your resources are extremely limited, you can set `--batch_size 1` and `--lora_rank 64`; this requires only `9.6GB` of VRAM. Of course, with `batch size = 1` gradient updates are strongly affected by extreme values and the loss converges less stably. In this case, you can use the `grad_accumulation_steps` parameter to simulate a larger effective batch (effective batch size = `batch_size` × `grad_accumulation_steps`): for example, `--batch_size 1` with `--grad_accumulation_steps 8` behaves similarly to `--batch_size 8`, at the cost of slower training. Note that you cannot run the [OpenVLA-OFT](https://github.com/moojink/openvla-oft) model on a `10GB` card, because even with `batch size = 1` it requires `25GB` of VRAM; fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase `--max_steps` to reach the performance reported in the paper.

>***About `vlm_path`.*** The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with `Qwen2.5-0.5B` as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in `/pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b`.

>***About `data_name`.*** Launch the fine-tuning script with the VLA-Adapter configuration below. It runs in the background, and the progress can be followed in the `/logs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, remove `/libero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc` (see the CALVIN launch sketch at the end of this subsection).

>***About `use_pro_version`.*** In addition, we recently released an enhanced `Pro` version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` is `207MB`, and the training speed is virtually unchanged. The `original version` uses nearly `1GB` less VRAM than the `Pro` version, requiring only `8.6GB`. You can choose whether to use the `Pro` version via the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.
```bash
data_name=libero_spatial_no_noops
current_time=$(date +"%Y-%m-%d_%H-%M-%S")  # timestamp used in the run note and log name

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
  --config_file_path pretrained_models/configs \
  --data_root_dir data/libero \
  --dataset_name $data_name \
  --run_root_dir outputs \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --use_lora True \
  --use_fz False \
  --use_minivlm True \
  --image_aug True \
  --num_steps_before_decay 400000 \
  --max_steps 400005 \
  --save_freq 5000 \
  --save_latest_checkpoint_only False \
  --merge_lora_during_training True \
  --batch_size 1 \
  --grad_accumulation_steps 8 \
  --learning_rate 2e-4 \
  --lora_rank 64 \
  --use_pro_version True \
  --wandb_entity "YOUR_WANDB_ENTITY" \
  --wandb_project "$data_name" \
  --run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
  > logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
```

Please note that the trained models will be stored in the `/outputs` folder. Each model takes up nearly `3GB` of disk space, so reserve enough space. We strongly recommend that you get our trained models from [VLA-Adapter Hugging Face](https://huggingface.co/VLA-Adapter) and place them in this folder for inference.
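To make the CALVIN switch concrete, here is a hedged sketch of the corresponding CALVIN ABC→D launch; only `--data_root_dir`, the dataset name, and the run/log labels differ, and all other flag values are assumed to carry over unchanged from the LIBERO command above:

```bash
# Hedged CALVIN ABC→D variant of the launch above; adjust step counts as needed.
data_name=calvin_abc
current_time=$(date +"%Y-%m-%d_%H-%M-%S")

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
  --config_file_path pretrained_models/configs \
  --data_root_dir data \
  --dataset_name $data_name \
  --run_root_dir outputs \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --use_lora True \
  --use_fz False \
  --use_minivlm True \
  --image_aug True \
  --num_steps_before_decay 400000 \
  --max_steps 400005 \
  --save_freq 5000 \
  --save_latest_checkpoint_only False \
  --merge_lora_during_training True \
  --batch_size 1 \
  --grad_accumulation_steps 8 \
  --learning_rate 2e-4 \
  --lora_rank 64 \
  --use_pro_version True \
  --wandb_entity "YOUR_WANDB_ENTITY" \
  --wandb_project "$data_name" \
  --run_id_note VLA-Adapter--calvin_abc--$current_time \
  > logs/VLA-Adapter--calvin_abc--$current_time.log 2>&1 &
```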
### :ledger: How to Train on Low VRAM GPUs

***=> Low VRAM (a card with 24GB), e.g. NVIDIA GeForce RTX 3090 and 4090.***

>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.*** If you have such a device, you can increase the batch size and LoRA rank: `--batch_size 4` and `--lora_rank 64`. This takes only about `20GB` and matches the rank used in our paper. Note that you cannot run the [OpenVLA-OFT](https://github.com/moojink/openvla-oft) model on a `24GB` card, because even with `batch size = 1` it requires `25GB` of VRAM; fortunately, you can use VLA-Adapter. Since the batch size is still small, you can increase `--max_steps` to reach the performance reported in the paper.

>***About `vlm_path`.*** The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with `Qwen2.5-0.5B` as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in `/pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b`.

>***About `data_name`.*** Launch the fine-tuning script with the VLA-Adapter configuration below. It runs in the background, and the progress can be followed in the `/logs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, remove `/libero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`.

>***About `use_pro_version`.*** In addition, we recently released an enhanced `Pro` version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` is `207MB`, and the training speed is virtually unchanged. The `original version` is nearly `1GB` smaller than the `Pro` version (1 batch), requiring only `17.6GB` of VRAM. You can choose whether to use the `Pro` version via the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.

```bash
data_name=libero_spatial_no_noops
current_time=$(date +"%Y-%m-%d_%H-%M-%S")  # timestamp used in the run note and log name

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
  --config_file_path pretrained_models/configs \
  --data_root_dir data/libero \
  --dataset_name $data_name \
  --run_root_dir outputs \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --use_lora True \
  --use_fz False \
  --use_minivlm True \
  --image_aug True \
  --num_steps_before_decay 200000 \
  --max_steps 200005 \
  --save_freq 5000 \
  --save_latest_checkpoint_only False \
  --merge_lora_during_training True \
  --batch_size 4 \
  --grad_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --lora_rank 64 \
  --use_pro_version True \
  --wandb_entity "YOUR_WANDB_ENTITY" \
  --wandb_project "$data_name" \
  --run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
  > logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
```

Please note that the trained models will be stored in the `/outputs` folder. Each model takes up nearly `3GB` of disk space, so reserve enough space. We strongly recommend that you get our trained models from [VLA-Adapter Hugging Face](https://huggingface.co/VLA-Adapter) and place them in this folder for inference.
### :ledger: How to Train on Larger VRAM GPUs

***=> A consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090), or a professional-grade GPU with 40GB-48GB (e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000).***

>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.*** If you have such a device, you can increase the batch size and LoRA rank: `--batch_size 8` and `--lora_rank 64`. This takes only about `29GB`.

>***About `vlm_path`.*** The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with `Qwen2.5-0.5B` as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in `/pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b`.

>***About `data_name`.*** Launch the fine-tuning script with the VLA-Adapter configuration below. It runs in the background, and the progress can be followed in the `/logs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, remove `/libero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`. With this configuration, you can match the results from our paper on the `LIBERO-Object` benchmark (a `99.2%` success rate) in just `8 hours`. The `LIBERO-Spatial` benchmark requires approximately 10 hours of training. The `LIBERO-Long` benchmark takes longer because its tasks are longer and more difficult, requiring more training steps to reach superior performance.

>***About `use_pro_version`.*** In addition, we recently released an enhanced `Pro` version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` is `207MB`, and the training speed is virtually unchanged. The `original version` is nearly `1GB` smaller than the `Pro` version (1 batch). You can choose whether to use the `Pro` version via the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.

```bash
data_name=libero_spatial_no_noops
current_time=$(date +"%Y-%m-%d_%H-%M-%S")  # timestamp used in the run note and log name

CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts/finetune.py \
  --vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
  --config_file_path pretrained_models/configs \
  --data_root_dir data/libero \
  --dataset_name $data_name \
  --run_root_dir outputs \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --use_lora True \
  --use_fz False \
  --use_minivlm True \
  --image_aug True \
  --num_steps_before_decay 200000 \
  --max_steps 200005 \
  --save_freq 5000 \
  --save_latest_checkpoint_only False \
  --merge_lora_during_training True \
  --batch_size 8 \
  --grad_accumulation_steps 2 \
  --learning_rate 2e-4 \
  --lora_rank 64 \
  --use_pro_version True \
  --wandb_entity "YOUR_WANDB_ENTITY" \
  --wandb_project "$data_name" \
  --run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \
  > logs/VLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &
```

Please note that the trained models will be stored in the `/outputs` folder. Each model takes up nearly `3GB` of disk space, so reserve enough space. We strongly recommend that you get our trained models from [VLA-Adapter Hugging Face](https://huggingface.co/VLA-Adapter) and place them in this folder for inference.
### :ledger: How to Train on Sufficient VRAM GPUs

***=> Professional-grade GPUs with ≥80GB (e.g. NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200).***

>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.*** You can train on 1 to 8 GPUs by setting `CUDA_VISIBLE_DEVICES` to the GPU IDs and passing the number of GPUs to `--nproc-per-node`. In our paper, we use 4×H100 GPUs for training. With this configuration, training takes roughly five hours for `Spatial`, less than one hour for `Object`, three hours for `Goal`, and half a day for `Long` on the LIBERO benchmark, and about eight hours for the `CALVIN` benchmark.

>***About `vlm_path`.*** The VLM in VLA-Adapter uses the Prismatic-VLMs architecture, with `Qwen2.5-0.5B` as the LLM backbone. You can download it from https://huggingface.co/Stanford-ILIAD/prism-qwen25-extra-dinosiglip-224px-0_5b and place it in `/pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b`.

>***About `data_name`.*** Launch the fine-tuning script with the VLA-Adapter configuration below. It runs in the background, and the progress can be followed in the `/logs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, remove `/libero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`.

>***About `use_pro_version`.*** In addition, we recently released an enhanced `Pro` version of VLA-Adapter. Its framework remains consistent with the original paper, but the implementation has been enhanced, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` is `207MB`, and the training speed is virtually unchanged. The `original version` is nearly `1GB` smaller than the `Pro` version (1 batch). You can choose whether to use the `Pro` version via the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.

```bash
data_name=libero_spatial_no_noops
current_time=$(date +"%Y-%m-%d_%H-%M-%S")  # timestamp used in the run note and log name

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune.py \
  --vlm_path pretrained_models/prism-qwen25-extra-dinosiglip-224px-0_5b \
  --config_file_path pretrained_models/configs \
  --data_root_dir data/libero \
  --dataset_name $data_name \
  --run_root_dir outputs \
  --use_film False \
  --num_images_in_input 2 \
  --use_proprio True \
  --use_lora True \
  --use_fz False \
  --use_minivlm True \
  --image_aug True \
  --num_steps_before_decay 150000 \
  --max_steps 150005 \
  --save_freq 5000 \
  --save_latest_checkpoint_only False \
  --merge_lora_during_training True \
  --batch_size 16 \
  --grad_accumulation_steps 1 \
  --learning_rate 2e-4 \
  --lora_rank 64 \
  --use_pro_version True \
  --wandb_entity "YOUR_WANDB_ENTITY" \
  --wandb_project "$data_name" \
  --run_id_note VLA-Adapter--spatial--$current_time \
  > logs/VLA-Adapter--spatial--$current_time.log 2>&1 &
```

Please note that the trained models will be stored in the `/outputs` folder. Each model takes up nearly `3GB` of disk space, so reserve enough space. We strongly recommend that you get our trained models from [VLA-Adapter Hugging Face](https://huggingface.co/VLA-Adapter) and place them in this folder for inference.
## :mechanical_arm: Inference

### :books: Related File for Inference

* `experiments/robot/libero/`: LIBERO eval files
  * `run_libero_eval.py`: LIBERO eval script
  * `libero_utils.py`: LIBERO eval utils
* `experiments/robot/`: General eval utils files
  * `openvla_utils.py`: VLA-specific eval utils
  * `robot_utils.py`: Other eval utils
### 🤗 Checkpoint of VLA-Adapter

We fine-tuned `Qwen2.5-0.5B` with our adapter bridge paradigm on four LIBERO task suites independently: `LIBERO-Spatial`, `LIBERO-Object`, `LIBERO-Goal`, and `LIBERO-Long`. The four VLA-Adapter checkpoints for LIBERO are available on Hugging Face:

* [VLA-Adapter/LIBERO-Spatial](https://huggingface.co/VLA-Adapter/LIBERO-Spatial)
* [VLA-Adapter/LIBERO-Object](https://huggingface.co/VLA-Adapter/LIBERO-Object)
* [VLA-Adapter/LIBERO-Goal](https://huggingface.co/VLA-Adapter/LIBERO-Goal)
* [VLA-Adapter/LIBERO-Long](https://huggingface.co/VLA-Adapter/LIBERO-Long)

In addition, we also provide a `Pro` version, trained on `4×H100` GPUs with `--batch_size 16`, `--lora_rank 64`, and `--max_steps 100000`. The Pro checkpoints are:

* [VLA-Adapter/LIBERO-Spatial-Pro](https://huggingface.co/VLA-Adapter/LIBERO-Spatial-Pro) `(97.8 -> 99.6)`
* [VLA-Adapter/LIBERO-Object-Pro](https://huggingface.co/VLA-Adapter/LIBERO-Object-Pro) `(99.2 -> 99.6)`
* [VLA-Adapter/LIBERO-Goal-Pro](https://huggingface.co/VLA-Adapter/LIBERO-Goal-Pro) `(97.2 -> 98.2)`
* [VLA-Adapter/LIBERO-Long-Pro](https://huggingface.co/VLA-Adapter/LIBERO-Long-Pro) `(95.0 -> 96.4)`
* [VLA-Adapter/CALVIN-ABC-Pro](https://huggingface.co/VLA-Adapter/CALVIN-ABC-Pro) `(4.42 -> 4.50)`

These checkpoints need to be placed in the `/outputs` folder; if you train your own models, they will also be stored there. The eval code below loads models from this folder for inference. See the download sketch below.
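For example, a hedged way to fetch one of these checkpoints into `outputs/` (assuming either the `huggingface_hub` CLI or `git-lfs` is available):

```bash
mkdir -p outputs

# Option 1: Hugging Face CLI
huggingface-cli download VLA-Adapter/LIBERO-Spatial-Pro --local-dir outputs/LIBERO-Spatial-Pro

# Option 2: git-lfs clone
# git lfs install
# git clone https://huggingface.co/VLA-Adapter/LIBERO-Spatial-Pro outputs/LIBERO-Spatial-Pro
```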
### :notebook: How to Eval

**We strongly recommend using our open-source `Pro` version of the model, which has stronger performance.**

To start evaluations with one of these checkpoints, run one of the commands below. Each will automatically load the corresponding checkpoint listed above. If you want to use the original version of the model, set `--use_pro_version` to `False` and pass the original checkpoint to `--pretrained_checkpoint`. The inference results are written to the `/eval_logs` folder, and the inference videos are saved to the `/rollouts/vla-adapter` folder.

```bash
# Launch LIBERO-Spatial-Pro evals (runs in the background)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Spatial-Pro \
  --task_suite_name libero_spatial \
  --use_pro_version True \
  > eval_logs/Spatial--chkpt.log 2>&1 &

# Launch LIBERO-Object-Pro evals (runs in the background)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Object-Pro \
  --task_suite_name libero_object \
  --use_pro_version True \
  > eval_logs/Object--chkpt.log 2>&1 &

# Launch LIBERO-Goal-Pro evals (runs in the background)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Goal-Pro \
  --task_suite_name libero_goal \
  --use_pro_version True \
  > eval_logs/Goal--chkpt.log 2>&1 &

# Launch LIBERO-Long-Pro (LIBERO-10) evals (runs in the background)
CUDA_VISIBLE_DEVICES=0 python experiments/robot/libero/run_libero_eval.py \
  --use_proprio True \
  --num_images_in_input 2 \
  --use_film False \
  --pretrained_checkpoint outputs/LIBERO-Long-Pro \
  --task_suite_name libero_10 \
  --use_pro_version True \
  > eval_logs/Long--chkpt.log 2>&1 &

# Launch CALVIN ABC→D-Pro evals (runs in the background)
CUDA_VISIBLE_DEVICES=0 python vla-scripts/evaluate_calvin.py \
  --pretrained_checkpoint outputs/CALVIN-ABC-Pro \
  > eval_logs/CALVIN--ABC.log 2>&1 &
```

If you want to measure the inference **throughput**, you can do so in `run_libero_eval.py`: add `start = time.time()` and `end = time.time()` before and after `lines 334-345` and compute the difference. This difference is the time needed to generate `8 chunks`, which gives you the inference throughput. We measured it multiple times and obtained an average of `0.036s`.
## 🌈 Success Rate Comparison

All our results are obtained by inference on an `H100`. You can find the inference `log` files in the models released on [HF](https://huggingface.co/VLA-Adapter). The evaluation script runs 500 trials by default (10 tasks × 50 episodes each) for LIBERO and 1,000 task sequences for CALVIN. Use the same card for training and inference whenever possible. **Note that results may vary slightly if you use a GPU other than the H100.** This phenomenon is also mentioned in the OpenVLA-OFT README.

### Performance on LIBERO benchmark

**Bold** marks the best result, *italics* the second best, and an asterisk (*) the third best.
| Category | Method | Scale | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| Large-scale | FlowVLA (Zhong et al., 2025) | 8.5B | 93.2 | 95.0 | 91.6 | 72.6 | 88.1 |
| | UnifiedVLA (Wang et al., 2025) | 8.5B | 95.4 | 98.8* | 93.6 | 94.0 | 95.5 |
| | OpenVLA (Kim et al., 2024) | 7B | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 97.6* | 98.4 | *97.9* | 94.5* | 97.1* |
| | UniVLA (Bu et al., 2025) | 7B | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| | CoT-VLA (Zhao et al., 2025) | 7B | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| | WorldVLA (Cen et al., 2025) | 7B | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| | TraceVLA (Zheng et al., 2025) | 7B | 84.6 | 85.2 | 75.1 | 54.1 | 74.8 |
| | MolmoAct (Lee et al., 2025) | 7B | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| | ThinkAct (Huang et al., 2025) | 7B | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| Small-scale | 4D-VLA (Zhang et al., 2025) | 4B | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| | SpatialVLA (Qu et al., 2025) | 4B | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| | π0 (Black et al., 2024) | 3B | 96.8 | 98.8* | 95.8 | 85.2 | 94.2 |
| | π0-FAST (Pertsch et al., 2025) | 3B | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| | NORA (Hung et al., 2025) | 3B | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| | SmolVLA (Shukor et al., 2025) | 2.2B | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| | GR00T N1 (NVIDIA et al., 2025) | 2B | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| Tiny-scale | Seer (Tian et al., 2025) | 0.57B | - | - | - | 78.7 | 78.7 |
| | VLA-OS (Gao et al., 2025) | 0.5B | 87.0 | 96.5 | 92.7 | 66.0 | 85.6 |
| | Diffusion Policy (Chi et al., 2023) | - | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| | VLA-Adapter (Ours) | 0.5B | *97.8* | *99.2* | 97.2* | *95.0* | *97.3* |
| | VLA-Adapter-Pro (Ours) | 0.5B | **99.6** | **99.6** | **98.2** | **96.4** | **98.5** |
### Performance on CALVIN ABC→D benchmark

**Bold** marks the best result, *italics* the second best, and an asterisk (*) the third best.
| Category | Method | Scale | 1 | 2 | 3 | 4 | 5 | Avg. len |
|---|---|---|---|---|---|---|---|---|
| Large-scale | UniVLA (Bu et al., 2025) | 7B | 95.5 | 85.8 | 75.4 | 66.9 | 56.5 | 3.80 |
| | OpenVLA (Kim et al., 2024) | 7B | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| | OpenVLA-OFT (Kim et al., 2025) | 7B | 96.3 | 89.1 | 82.4 | 75.8 | 66.5 | 4.10 |
| | VLAS (Zhao et al., 2025b) | 7B | 87.2 | 64.2 | 40.9 | 28.1 | 19.6 | 2.40 |
| | LCB (Shentu et al., 2024) | 7B | 73.6 | 50.2 | 28.5 | 16.0 | 9.9 | 1.78 |
| | RoboDual (Bu et al., 2024a) | 7B | 94.4 | 82.7 | 72.1 | 62.4 | 54.4 | 3.66 |
| | OpenHelix (Cui et al., 2025) | 7B | 97.1* | 91.4 | 82.8 | 72.6 | 64.1 | 4.08 |
| | ReconVLA (Song et al., 2025c) | 7B | 95.6 | 87.6 | 76.9 | 69.3 | 64.1 | 3.95 |
| Small-scale | DeeR (Yue et al., 2024) | 3B | 86.2 | 70.1 | 51.8 | 41.5 | 30.4 | 2.82 |
| | RoboFlamingo (Li et al., 2024b) | 3B | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.48 |
| | VPP (Hu et al., 2025) | 1.5B | 95.7 | 91.2 | 86.3* | 81.0* | 75.0* | 4.33* |
| | SuSIE (Black et al., 2024) | 1.3B | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| Tiny-scale | Seer-Large (Tian et al., 2025) | 0.57B | 96.3 | 91.6* | 86.1 | 80.3 | 74.0 | 4.28 |
| | MoDE (Reuss et al., 2025) | 0.44B | 96.2 | 88.9 | 81.1 | 71.8 | 63.5 | 4.01 |
| | Seer (Tian et al., 2025) | 0.32B | 94.4 | 87.2 | 79.9 | 72.2 | 64.3 | 3.98 |
| | VLA-Adapter (Ours) | 0.5B | **99.1** | *94.6* | *88.8* | *82.8* | *76.5* | *4.42* |
| | VLA-Adapter-Pro (Ours) | 0.5B | *98.5* | **95.0** | **90.5** | **85.3** | **80.0** | **4.50** |

## ๐Ÿ“ Citation ### ๐Ÿซถ If you feel that this paper, models, or codes are helpful, please cite our paper, thanks for your support of VLA-Adapter! ```bibtex @article{wang2025vlaadapter, author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin}, title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model}, journal={arXiv preprint arXiv:2509.09372}, year={2025} } ``` ## :heart: Acknowledgment We thank [OpenVLA-OFT](https://github.com/moojink/openvla-oft), [MiniVLA](https://github.com/Stanford-ILIAD/openvla-mini), and [RoboDual](https://github.com/OpenDriveLab/RoboDual) for their open-sourced work! ## ๐ŸŒŸ Star History