diff --git a/docs/mindformers/docs/source_en/advanced_development/multi_modal_dev.md b/docs/mindformers/docs/source_en/advanced_development/multi_modal_dev.md deleted file mode 100644 index de695523d8a4326fac0a67b1ec5701041f7f3468..0000000000000000000000000000000000000000 --- a/docs/mindformers/docs/source_en/advanced_development/multi_modal_dev.md +++ /dev/null @@ -1,329 +0,0 @@ -# Multimodal Model Development - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/multi_modal_dev.md) - -Multimodal models refer to artificial intelligence models capable of processing and combining information from different modalities (such as text, images, audio, video, etc.) for learning and inference. Traditional single-modality models typically focus on a single type of data, such as text classification models handling only text data or image recognition models handling only image data. In contrast, multimodal models integrate data from different sources to accomplish more complex tasks, enabling them to understand and generate richer and more comprehensive content. - -This document aims to introduce the multimodal models in MindSpore Transformers, providing detailed steps and examples to guide users in building custom multimodal models and data processing modules using MindSpore Transformers. Additionally, users can follow the document to complete tasks such as model training and inference. - -The unified architecture of multimodal models in **MindSpore Transformers** primarily includes the following components: - -- [Dataset Construction](#dataset-construction) -- [Data Processing Modules](#data-processing-modules) -- [Model Construction](#model-construction) - - [Model Configuration Class](#model-configuration-class) - - [Non-text Modality Processing Module](#non-text-modality-processing-module) - - [Cross-Modal Interaction Module](#cross-modal-interaction-module) - - [Text Generation Module](#text-generation-module) -- [Multimodal Model Practice](#multimodal-model-practice) - -## Dataset Construction - -Before training a multimodal model, it is often necessary to first construct a multimodal dataset. MindSpore Transformers currently provides `dataset` and `dataloader` classes for multimodal data, which users can directly utilize: - -- [BaseMultiModalDataLoader](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/dataset/dataloader/multi_modal_dataloader.py) is the multimodal dataset loading class. It handles the functionality of reading data from a `json` file. -- [ModalToTextSFTDataset](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/dataset/modal_to_text_sft_dataset.py) is the multimodal dataset processing class. It handles multimodal data processing, as well as operations like batch processing and data repetition. For more details on multimodal data processing, refer to the [Data Processing Modules](#data-processing-modules). - -Below is an example of part of the training dataset `json` file for the `CogVLM2-Video` model: - -```json -[{ - "id": "v_p1QGn0IzfW0.mp4", - "conversations": [ - { - "from": "user", - "value": "<|reserved_special_token_3|>/path/VideoChatGPT/convert/v_p1QGn0IzfW0.mp4<|reserved_special_token_4|>What equipment is visible in the gym where the boy is doing his routine?" 
-    },
-    {
-      "from": "assistant",
-      "value": "There is other equipment visible in the gym like a high bar and still rings."
-    }
-  ]
-}]
-```
-
-In the dataset, `<|reserved_special_token_3|>` and `<|reserved_special_token_4|>` are placeholders used to represent video paths in the `CogVLM2-Video` model.
-
-Users can construct custom `json` files as needed. The file format should be a list containing multiple dictionaries, where each dictionary represents a data sample. In each sample, the `id` field denotes the data identifier, and the `conversations` field represents the multi-turn conversation content.
-
-After constructing the `json` file, you can run the following example code to view the data samples from the dataset:
-
-```python
-from mindformers.dataset.dataloader.multi_modal_dataloader import BaseMultiModalDataLoader
-
-# build data loader
-dataset_loader = BaseMultiModalDataLoader(
-    annotation_file = '/path/dataset.json', shuffle=False
-)
-print(dataset_loader[0])
-
-# ([['user', '<|reserved_special_token_3|>/path/VideoChatGPT/convert/v_p1QGn0IzfW0.mp4<|reserved_special_token_4|>What equipment is visible in the gym where the boy is doing his routine?'], ['assistant', 'There is other equipment visible in the gym like a high bar and still rings.']],)
-```
-
-## Data Processing Modules
-
-During the training and inference of multimodal models, a data processing module is required to preprocess the multimodal data. This module is invoked during training in the ModalToTextSFTDataset, and during inference in the [MultiModalToTextPipeline](https://www.mindspore.cn/mindformers/docs/en/dev/pipeline/mindformers.pipeline.MultiModalToTextPipeline.html#mindformers.pipeline.MultiModalToTextPipeline).
-
-Below is a flowchart of the multimodal data processing. The custom modules in the diagram need to be implemented by the user according to their specific requirements, while the other modules can be invoked directly.
-
-![multi_modal.png](./images/multi_modal.png)
-
-Below, using the [CogVLM2-Video model data preprocessing module](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/cogvlm2/cogvlm2_processor.py) as an example, we introduce the functionality of each component of the multimodal data processing module.
-
-1. BaseXModalToTextProcessor is mainly used to receive raw multimodal data for inference and perform preprocessing operations. It also implements post-processing operations for inference results, and users can directly use this class.
-2. BaseXModalToTextTransform is mainly used to process the data returned by `BaseXModalToTextProcessor` or the multimodal dataset into data suitable for inference or training. This class can also be directly used by users.
-3. [ModalContentTransformTemplate](https://www.mindspore.cn/mindformers/docs/en/dev/models/mindformers.models.multi_modal.ModalContentTransformTemplate.html#mindformers.models.multi_modal.ModalContentTransformTemplate) is the abstract class for all modality-specific data construction modules. Since data operations are model-dependent, users need to implement corresponding custom data construction classes based on their needs. In the `CogVLM2-Video` model, the `CogVLM2ContentTransformTemplate` class is implemented to handle both video and text data.
-4. ModalContentBuilder is the abstract class for single-modality data processing.
If the model needs to handle data from multiple modalities, corresponding single-modality data processing classes need to be created during the initialization of the custom data construction class. In the `CogVLM2-Video` model, the `CogVLM2VideoContentBuilder` class is implemented to handle video data, while the general text data processing class `BaseTextContentBuilder` is used to process text data. - -Below is an example of the data preprocessing code for training and inference in the `CogVLM2-Video` model. - -### Model Training Data Processing - -In multimodal model training tasks, data preprocessing configurations are typically written in the `train_dataset` section. The following is an example of the dataset-related configuration in the `CogVLM2-Video` model training configuration file: - -[finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml) - -```yaml -train_dataset: &train_dataset - data_loader: - type: BaseMultiModalDataLoader - annotation_file: "/path/train_data.json" - shuffle: True - modal_to_text_transform: - type: BaseXModalToTextTransform - max_length: 2048 - model_transform_template: - type: CogVLM2ContentTransformTemplate - output_columns: [ "input_ids", "images", "video_context_pos", "position_ids", "labels" ] - signal_type: "chat" - mode: 'train' - pos_pad_length: 2048 - tokenizer: - add_bos_token: False - add_eos_token: False - max_length: 2048 - pad_token: "<|reserved_special_token_0|>" - vocab_file: "/path/tokenizer.model" - type: CogVLM2Tokenizer -``` - -The `annotation_file` is the path to the training data's `json` file. Both `modal_to_text_transform` and `tokenizer` should be similar to those in the `processor` section of the inference configuration. - -```python -from mindformers.tools.register.config import MindFormerConfig -from mindformers.dataset.modal_to_text_sft_dataset import ModalToTextSFTDataset - -# load configs -configs = MindFormerConfig("configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml") -# build dataset -multi_modal_dataset = ModalToTextSFTDataset(**configs.train_dataset) -# iterate dataset -for item in multi_modal_dataset: - print(len(item)) - break -# 5, output 5 columns -``` - -### Model Inference Data Processing - -The data processing module configuration in the `CogVLM2-Video` model inference configuration file is as follows: - -[predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml) - -```yaml -processor: - type: BaseXModalToTextProcessor - model_transform_template: - type: CogVLM2ContentTransformTemplate - output_columns: [ "input_ids", "position_ids", "images", "video_context_pos" ] - vstack_columns: [ "images", "video_context_pos" ] - signal_type: "chat" - pos_pad_length: 2048 - tokenizer: - add_bos_token: False - add_eos_token: False - max_length: 2048 - pad_token: "<|reserved_special_token_0|>" - vocab_file: "/path/tokenizer.model" - type: CogVLM2Tokenizer -``` - -The `vocab_file` is the path to the vocabulary file used, while other parameters are related to the model configuration and can be customized as needed by the user. - -Below is an example code for processing multimodal training data. Unlike the training data, the data processing yields a dictionary containing processed data such as `input_ids`, rather than a list. 
- -```python -from mindformers.tools.register.config import MindFormerConfig -from mindformers.models.multi_modal.base_multi_modal_processor import BaseXModalToTextProcessor -from mindformers.models.cogvlm2.cogvlm2_tokenizer import CogVLM2Tokenizer - -# build processor -configs = MindFormerConfig("configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml") -configs.processor.tokenizer = tokenizer = CogVLM2Tokenizer(**configs.processor.tokenizer) -processor = BaseXModalToTextProcessor(**configs.processor) - -# process data -multi_modal_data = [ - {'video': "/path/video.mp4"}, - {'text': "Please describe this video."} -] - -print(processor(multi_modal_data).keys()) -# dict_keys(['input_ids', 'position_ids', 'images', 'video_context_pos']) -``` - -After implementing the multimodal dataset construction and data processing modules, the data that can be handled by the multimodal model can be obtained. Below, we will introduce how to construct a multimodal large model. - -## Model Construction - -A multimodal large model typically consists of three parts: a non-text modality processing module, a cross-modal interaction module, and a text generation module. The non-text modality processing module is usually a vision model pre-trained on large-scale data, the text generation module is typically a large text generation model, and the cross-modal interaction module usually consists of multiple linear layers. - -### Model Configuration Class - -In MindSpore Transformers, the parameters related to multimodal models are mainly controlled through the model configuration class. Below, we use the `CogVLM2Config` class as an example to explain how to build the model configuration class. -For the specific implementation, refer to [CogVLM2Config](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/cogvlm2/cogvlm2_config.py). - -```python -@MindFormerRegister.register(MindFormerModuleType.CONFIG) -class CogVLM2Config(PretrainedConfig): - def __init__(self, - vision_model: PretrainedConfig, - llm_model: PretrainedConfig, - **kwargs): - super().__init__(**kwargs) - self.vision_model = vision_model - self.llm_model = llm_model -``` - -Parameter Explanation: - -1. `@MindFormerRegister.register(MindFormerModuleType.CONFIG)` is mainly used to register a custom model configuration class. Once registered, the model configuration class can be called by its name in the `yaml` file. -2. `vision_model` and `llm_model` represent the configuration classes for the vision model and text generation model, respectively. They are passed as parameters to the multimodal model configuration class and processed during the class initialization. -3. `PretrainedConfig` is the base class for all model configurations. For more details, refer to [PretrainedConfig](https://www.mindspore.cn/mindformers/docs/en/dev/models/mindformers.models.PretrainedConfig.html#mindformers.models.PretrainedConfig). - -In the configuration file, the model should be configured as follows. -For the specific implementation, refer to [predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml). - -```yaml -model: - model_config: - type: MultiModalConfig - vision_model: - arch: - type: EVAModel - model_config: - type: EVA02Config - image_size: 224 - patch_size: 14 - hidden_size: 1792 - num_hidden_layers: 63 - ... - llm_model: - arch: - type: CogVLM2VideoLM - model_config: - type: LlamaConfig - seq_length: 2048 - hidden_size: 4096 - num_layers: 32 - ... 
- arch: - type: CogVLM2ForCausalLM -``` - -In this configuration file, `EVAModel` and `EVA02Config` are used as the `vision_model` and its configuration class, while `CogVLM2VideoLM` and `LlamaConfig` are used as the `llm_model` and its configuration class. -Together, they form the multimodal model `CogVLM2ForCausalLM`. These classes are all pre-implemented modules in MindSpore Transformers. Below, we will explain how to implement custom modules. - -### Non-Text Modality Processing Module - -MindSpore Transformers provides models like `ViT` and `EVA02` as visual information processing modules. Below, we use the `EVA02` model as an example to explain how to construct a non-text modality processing module. -For more details, refer to [EVAModel](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/eva02/eva.py) and [EVA02Config](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/eva02/eva_config.py). - -```python -from mindformers.tools.register import MindFormerRegister, MindFormerModuleType -from mindformers.models.modeling_utils import PreTrainedModel -from mindformers.models.eva02.eva_config import EVA02Config - -class EVA02PreTrainedModel(PreTrainedModel): - config_class = EVA02Config - base_model_prefix = "eva02" - -@MindFormerRegister.register(MindFormerModuleType.MODELS) -class EVAModel(EVA02PreTrainedModel): - def __init__(self, config=None): - config = config if config else EVA02Config() - super().__init__(config) -``` - -Parameter Explanation: - -1. `@MindFormerRegister.register(MindFormerModuleType.MODELS)` is mainly used to register a custom model class. Once registered, the model class can be called by its name in the `yaml` file. -2. `EVA02PreTrainedModel` inherits from the `PreTrainedModel` class and is mainly used to specify the model configuration class and the prefix for model parameter names. `EVAModel` is the specific implementation of the model, inheriting from the `EVA02PreTrainedModel` class. For more details, refer to the [PreTrainedModel](https://www.mindspore.cn/mindformers/docs/en/dev/models/mindformers.models.PreTrainedModel.html#mindformers.models.PreTrainedModel) API. -3. `EVAModel` mainly processes visual information in the data and feeds the processed visual features into the **cross-modal interaction module**. - -### Cross-Modal Interaction Module - -The text generation module is usually a pre-trained large language model, while the non-text modality processing module is a model pre-trained on large-scale non-text data. The output features from these models differ significantly from those in the text features and cannot be directly input into the text generation module for inference. Therefore, a cross-modal interaction module, matching the text generation module, is needed to process visual features into vectors that can be handled by the text generation module. - -Below, we use the `VisionMLPAdapter` in the `CogVLM2-Video` model as an example to explain the structure and function of the cross-modal interaction module. 
- -```python -class VisionMLPAdapter(nn.Cell): - def __init__(self, vision_grid_size, vision_hidden_size, text_hidden_size, text_intermediate_size, - compute_dtype=ms.float16, param_init_type=ms.float16): - super().__init__() - self.grid_size = vision_grid_size - self.linear_proj = GLU(in_features=vision_hidden_size, - hidden_size=text_hidden_size, - intermediate_size=text_intermediate_size, - compute_dtype=compute_dtype, param_init_type=param_init_type) - self.conv = nn.Conv2d(in_channels=vision_hidden_size, out_channels=vision_hidden_size, - kernel_size=2, stride=2, dtype=param_init_type, has_bias=True).to_float(compute_dtype) -``` - -In the `VisionMLPAdapter`, the output of the `EVAModel` is processed through operations such as Linear and Conv2D to match the same dimensionality as the text features. Here, `vision_hidden_size` and `text_hidden_size` represent the dimensionalities of the visual and text features, respectively. - -### Text Generation Module - -MindSpore Transformers provides large language models such as `Llama2` and `Llama3` as text generation modules, which, together with the non-text modality processing module and cross-modal interaction module, form the multimodal model. - -```python -@MindFormerRegister.register(MindFormerModuleType.MODELS) -class MultiModalForCausalLM(BaseXModalToTextModel): - def __init__(self, config: MultiModalConfig, **kwargs): - super().__init__(config, **kwargs) - self.config = config - self.vision_model = build_network(config.vision_model) - self.llm_model = build_network(config.llm_model) - self.mlp_adapter = VisionMLPAdapter(**kwargs) - - def prepare_inputs_for_generation(self, input_ids, **kwargs): - """Prepare inputs for generation in inference.""" - - def prepare_inputs_for_predict_layout(self, input_ids, **kwargs): - """Prepare inputs for generation in inference.""" - - def set_dynamic_inputs(self, **kwargs): - """Set dynamic inputs for model.""" - - def construct(self, input_ids, **kwargs): - """Model forward.""" -``` - -Parameter Explanation: - -1. `MultiModalForCausalLM`, as the multimodal model class, inherits from the base class `BaseXModalToTextModel`. During the construction of this class, the `build_network` function and the corresponding module configurations are used to initialize the non-text modality processing module `vision_model`, the text generation module `llm_model`, and the cross-modal interaction module `VisionMLPAdapter`. -2. The `prepare_inputs_for_generation` method preprocesses the input data, ensuring that the processed data can be used for model inference through the `construct` method. -3. The `prepare_inputs_for_predict_layout` method constructs data that the model can handle. Its return value corresponds to the input parameters of the `construct` method, and the constructed data allows for model compilation. -4. The `set_dynamic_inputs` method configures dynamic shapes for some input data in the model. -5. The `construct` method is the common interface for all models and serves as the forward execution function for the multimodal model. - -## Multimodal Model Practice - -After implementing the multimodal dataset, data processing modules, and multimodal model construction, you can start model pre-training, fine-tuning, inference, and other tasks by using the model configuration file. This requires creating the corresponding model configuration file. 
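-
-Before launching a full task, the pieces above can be exercised end to end. The following is a minimal sketch that simply re-runs the CogVLM2-Video configuration files and classes shown in the previous sections to check that the custom dataset, data processing modules, and tokenizer fit together; the yaml paths and media paths are placeholders that should be replaced with your own files and registered classes.
-
-```python
-from mindformers.tools.register.config import MindFormerConfig
-from mindformers.dataset.modal_to_text_sft_dataset import ModalToTextSFTDataset
-from mindformers.models.multi_modal.base_multi_modal_processor import BaseXModalToTextProcessor
-from mindformers.models.cogvlm2.cogvlm2_tokenizer import CogVLM2Tokenizer
-
-# training side: the dataset should yield the columns configured in
-# train_dataset.modal_to_text_transform.model_transform_template.output_columns
-train_configs = MindFormerConfig("configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml")
-train_dataset = ModalToTextSFTDataset(**train_configs.train_dataset)
-for columns in train_dataset:
-    print(len(columns))  # expected to match the number of configured output columns
-    break
-
-# inference side: the processor should return exactly the inputs consumed by
-# the model's construct method
-predict_configs = MindFormerConfig("configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml")
-predict_configs.processor.tokenizer = CogVLM2Tokenizer(**predict_configs.processor.tokenizer)
-processor = BaseXModalToTextProcessor(**predict_configs.processor)
-print(processor([{'video': "/path/video.mp4"}, {'text': "Please describe this video."}]).keys())
-# dict_keys(['input_ids', 'position_ids', 'images', 'video_context_pos'])
-```
-
-A mismatch at either step usually points to an inconsistency between the yaml configuration and the custom data construction class.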
- -For specific model configuration files, refer to [predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml) and [finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml), which correspond to model inference and fine-tuning, respectively. For the meaning of specific parameters, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html). - -In the user-defined configuration file, sections such as `model`, `processor`, and `train_dataset` need to correspond to the user's custom **dataset**, **data processing module**, and **multimodal model**. - -After editing the custom configuration file, refer to the [CogVLM2-Video model documentation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/cogvlm2_video.md) to start model [inference](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/cogvlm2_video.md#推理) and [fine-tuning](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/cogvlm2_video.md#微调) tasks. diff --git a/docs/mindformers/docs/source_en/feature/evaluation.md b/docs/mindformers/docs/source_en/feature/evaluation.md index 3040735d5a55410d0c75f891a420c9d2ebcf9137..8fddc7b124500d4ee6b74ceb93203c31eec412ec 100644 --- a/docs/mindformers/docs/source_en/feature/evaluation.md +++ b/docs/mindformers/docs/source_en/feature/evaluation.md @@ -173,368 +173,10 @@ After executing the evaluation command, the evaluation results will be printed o | gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.5034 | ± | 0.0138 | | | | strict-match | 5 | exact_match | ↑ | 0.5011 | ± | 0.0138 | -## VLMEvalKit Evaluation - -### Overview - -[VLMEvalKit](https://github.com/open-compass/VLMEvalKit) -is an open source toolkit designed for large visual language model evaluation, supporting one-click evaluation of large visual language models on various benchmarks, without the need for complicated data preparation, making the evaluation process easier. It supports a variety of graphic multimodal evaluation sets and video multimodal evaluation sets, a variety of API models and open source models based on PyTorch and HF, and customized prompts and evaluation metrics. After adapting MindSpore Transformers based on VLMEvalKit evaluation framework, it supports loading multimodal large models in MindSpore Transformers for evaluation. - -The currently adapted models and supported evaluation datasets are shown in the table below (the remaining models and evaluation datasets are actively being adapted, please pay attention to version updates): - -| Adapted models | Supported evaluation datasets | -|--|---------------------------------------------------| -| cogvlm2-image-llama3-chat | MME, MMBench, COCO Caption, MMMU_DEV_VAL, TextVQA_VAL | -| cogvlm2-video-llama3-chat | MMBench-Video, MVBench | - -### Supported Feature Descriptions - -1. Supports automatic download of evaluation datasets; -2. Generate results with one click. - -### Installation - -#### Downloading the Code and Compiling, Installing Dependency Packages - -1. Download and modify the code: Due to known issues with open source frameworks running MVBench datasets, it is necessary to modify the code by importing patch. Get [eval.patch](https://github.com/user-attachments/files/17956417/eval.patch) and download and place it in the local directory. 
When importing the patch, use the absolute path of the patch. - - Execute the following command: - - ```bash - git clone https://github.com/open-compass/VLMEvalKit.git - cd VLMEvalKit - git checkout 78a8cef3f02f85734d88d534390ef93ecc4b8bed - git apply /path/to/eval.patch - ``` - -2. Install dependency packages - - Find the requirements.txt (VLMEvalKit/requirements.txt) file in the downloaded code and modify it to the following content: - - ```text - gradio==4.40.0 - huggingface_hub==0.24.2 - imageio==2.35.1 - matplotlib==3.9.1 - moviepy==1.0.3 - numpy==1.26.4 - omegaconf==2.3.0 - openai==1.3.5 - opencv-python==4.10.0.84 - openpyxl==3.1.5 - pandas==2.2.2 - peft==0.12.0 - pillow==10.4.0 - portalocker==2.10.1 - protobuf==5.27.2 - python-dotenv==1.0.1 - requests==2.32.3 - rich==13.7.1 - sentencepiece==0.2.0 - setuptools==69.5.1 - sty==1.0.6 - tabulate==0.9.0 - tiktoken==0.7.0 - timeout-decorator==0.5.0 - torch==2.5.1 - tqdm==4.66.4 - transformers==4.43.3 - typing_extensions==4.12.2 - validators==0.33.0 - xlsxwriter==3.2.0 - torchvision==0.20.1 - ``` - - Execute Command: - - ```bash - pip install -r requirements.txt - ``` - -#### Installing FFmpeg - -For Ubuntu systems follow the steps below to install: - -1. Update the system package list and install the system dependency libraries required for compiling FFmpeg and decode. - - ```bash - apt-get update - apt-get -y install autoconf automake build-essential libass-dev libfreetype6-dev libsdl2-dev libtheora-dev libtool libva-dev libvdpau-dev libvorbis-dev libxcb1-dev libxcb-shm0-dev libxcb-xfixes0-dev pkg-config texinfo zlib1g-dev yasm libx264-dev libfdk-aac-dev libmp3lame-dev libopus-dev libvpx-dev - ``` - -2. Download the compressed source code package of FFmpeg4.1.11 from the FFmpeg official website, unzip the source code package and enter the decompressed directory; Configure compilation options for FFmpeg: specify the installation path (absolute path) of FFmpeg, generate shared libraries, enable support for specific codecs, and enable no free and GPL licensed features; Compile and install FFmpeg. - - ```bash - wget --no-check-certificate https://www.ffmpeg.org/releases/ffmpeg-4.1.11.tar.gz - tar -zxvf ffmpeg-4.1.11.tar.gz - cd ffmpeg-4.1.11 - ./configure --prefix=/{path}/ffmpeg-xxx --enable-shared --enable-libx264 --enable-libfdk-aac --enable-libmp3lame --enable-libopus --enable-libvpx --enable-nonfree --enable-gpl - make && make install - ``` - -Install OpenEuler system according to the following steps: - -1. Download the compressed source code package of FFmpeg4.1.11 from the FFmpeg official website, unzip the source code package and enter the decompressed directory; Configure compilation options for FFmpeg: specify the installation path (absolute path) for FFmpeg; Compile and install FFmpeg. - - ```bash - wget --no-check-certificate https://www.ffmpeg.org/releases/ffmpeg-4.1.11.tar.gz - tar -zxvf ffmpeg-4.1.11.tar.gz - cd ffmpeg-4.1.11 - ./configure --enable-shared --disable-x86asm --prefix=/path/to/ffmpeg - make && make install - ``` - -2. Configure environment variables, `FFMPEG-PATH` requires specifying the absolute path for installing FFmpeg so that the system can correctly locate and use FFmpeg and its related libraries. - - ```bash - vi ~/.bashrc - export FFMPEG_PATH=/path/to/ffmpeg/ - export LD_LIBRARY_PATH=$FFMPEG_PATH/lib:$LD_LIBRARY_PATH - source ~/.bashrc - ``` - -#### Installing Decord - -Install Ubuntu system according to the following steps: - -1. 
Pull the Decord code, enter the Decord directory, initialize and update Decord dependencies, and execute the following command: - - ```bash - git clone --recursive -b v0.6.0 https://github.com/dmlc/decord.git - cd decord - ``` - -2. Create and enter the `build` directory, configure the compilation options for Decord, disable CUDA support, enable Release mode (optimize performance), specify the installation path for FFmpeg, and compile the Decord library. Copy the compiled libdecord.so library file to the system library directory and to the `python` directory of `decord`. - - ```bash - mkdir build - cd build - cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release -DFFMPEG_DIR=/{path}/ffmpeg-4.1.11 && make - cp libdecord.so /usr/local/lib/ - cp libdecord.so ../python/decord/libdecord.so - ``` - -3. Go to the python folder in the `decord` directory, install the numpy dependency, and install the python package for Decord. Add the library path (absolute path) of FFmpeg to the environment variable `LD_LIBRARY_PATH` to ensure that the runtime can find the shared library of FFmpeg. - - ```bash - cd /path/to/decord/python - pip install numpy - python setup.py install - export LD_LIBRARY_PATH=/path/to/ffmpeg-4.1.11/lib/:$LD_LIBRARY_PATH - ``` - -4. Execute Python commands to test if the Decord installation is successful. If there are no errors, it means the installation is successful. - - ```bash - python -c "import decord; from decord import VideoReader" - ``` - -For OpenEuler systems follow the steps below to install: - -1. Pull the Decord code and enter the `decord` directory. - - ```bash - git clone --recursive -b v0.6.0 https://github.com/dmlc/decord - cd decord - ``` - -2. Create and enter the build directory, configure the compilation options for Decord, specify the installation path (absolute path) for ffmpeg, and compile the `decord` library; Enter the `python` folder in the `decord` directory, configure environment variables, and specify `PYTHONPATH`; Install the python package for Decord. - - ```bash - mkdir build && cd build - cmake -DFFMPEG_DIR=/path/ffmpeg-4.1.11 .. - make - cd ../python - pwd=$PWD - echo "PYTHONPATH=$PYTHONPATH:$pwd" >> ~/.bashrc - source ~/.bashrc - python3 setup.py install - ``` - -3. Execute python commands to test if the Decord installation is successful. If there are no errors, it means the installation is successful. - - ```bash - python -c "import decord; from decord import VideoReader" - ``` - -### Evaluation - -#### Preparations Before Evaluation - -1. Create a new directory, for example named `model_dir`, to store the model yaml file; -2. Place the model inference yaml configuration file (predict_xxx_. yaml) in the directory created in the previous step. For details, Please refer to the inference content of description documents for each model in the [model library](../introduction/models.md); -3. Configure the yaml file. - - Using [predict_cogvlm2_image_llama3_chat_19b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml) configuration as an example: - - ```yaml - load_checkpoint: "/{path}/model.ckpt" # Specify the path to the weights file - model: - model_config: - use_past: True # Turn on incremental inference - is_dynamic: False # Turn off dynamic shape - - tokenizer: - vocab_file: "/{path}/tokenizer.model" # Specify the tokenizer file path - ``` - - Configure the yaml file. Refer to [configuration description](../feature/configuration.md). -4. 
The MMBench-Video dataset evaluation requires the GPT-4 Turbo model for evaluation and scoring. Prepare the corresponding API Key in advance and put it in the VLMEvalKit/.env file as follows:
-
-   ```text
-   OPENAI_API_KEY=your_apikey
-   ```
-
-5. When the MVBench dataset evaluation starts, if you are prompted to enter a HuggingFace token, follow the prompts to ensure that the subsequent evaluation runs normally.
-
-#### Launching the Evaluation Task
-
-Run the script [run_vlmevalkit.sh](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/benchmarks/run_vlmevalkit.sh) from the root directory of the local MindSpore Transformers code repository.
-
-Execute the following command to launch the evaluation task:
-
-```shell
-#!/bin/bash
-
-source toolkit/benchmarks/run_vlmevalkit.sh \
- --data MMMU_DEV_VAL \
- --model cogvlm2-image-llama3-chat \
- --verbose \
- --work_dir /path/to/cogvlm2-image-eval-result \
- --model_path model_dir
-```
-
-### Evaluation Parameters
-
-| Parameters | Type | Description | Compulsory (Y/N) |
-|-----------------|-----|--------------------------------------------------------------------------------------------------------------------------------------------|------|
-| `--data` | str | Name of the dataset; multiple datasets can be passed in, separated by spaces. | Y |
-| `--model` | str | Name of the model. | Y |
-| `--verbose` | / | Outputs logs from the evaluation run. | N |
-| `--work_dir` | str | Directory for storing evaluation results. By default, evaluation results are stored in the `outputs` folder of the current execution directory. | N |
-| `--model_path` | str | The folder path containing the model configuration file. | Y |
-| `--register_path`| str | The absolute path of the directory where the external code is located, for example, a model directory under the [research](https://gitee.com/mindspore/mindformers/tree/dev/research) directory. | No (required when using external code) |
-
-If the server does not support online downloading of image datasets due to network limitations, you can upload a locally downloaded .tsv dataset file to the ~/LMUData directory on the server for offline evaluation (for example: ~/LMUData/MME.tsv, ~/LMUData/MMBench_DEV_EN.tsv, or ~/LMUData/COCO_VAL.tsv).
-
-### Viewing Evaluation Results
-
-After evaluating in the above way, find the files ending in .json or .csv in the directory where the evaluation results are stored to view the results.
-
-An example evaluation result is shown below, where `Bleu` and `ROUGE_L` are metrics for evaluating translation quality, and `CIDEr` is a metric for evaluating the image captioning task.
-
-```json
-{
-    "Bleu": [
-        15.523950970070652,
-        8.971141548228058,
-        4.702477458554666,
-        2.486860744700995
-    ],
-    "ROUGE_L": 15.575063213115946,
-    "CIDEr": 0.01734615519604295
-}
-```
-
-## Using the VideoBench Dataset for Model Evaluation
-
-### Overview
-
-[Video-Bench](https://github.com/PKU-YuanGroup/Video-Bench/tree/main) is the first comprehensive evaluation benchmark for Video-LLMs, featuring a three-level ability assessment that systematically evaluates models in video-exclusive understanding, prior knowledge incorporation, and video-based decision-making abilities.
-
-### Preparations Before Evaluation
-
-1. Download the dataset
-
-   Download the [Videos of Video-Bench](https://huggingface.co/datasets/LanguageBind/Video-Bench) and place them in the following directory structure after decompression:
-
-   ```text
-   egs/VideoBench/
-   └── Eval_video
-       ├── ActivityNet
-       │   ├── v__2txWbQfJrY.mp4
-       │   ...
-       ├── Driving-decision-making
-       │   ├── 1.mp4
-       │   ...
-       ...
-   ```
-
-2. Download the JSON files
-
-   Download the [Jsons of Video-Bench](https://github.com/PKU-YuanGroup/Video-Bench/tree/main?tab=readme-ov-file) and place them in the following directory structure after decompression:
-
-   ```text
-   egs/Video-Bench/
-   └── Eval_QA
-       ├── Youcook2_QA_new.json and other json files
-       ...
-   ```
-
-3. Download the correct answers to all questions
-
-   Download the [Answers of Video-Bench](https://huggingface.co/spaces/LanguageBind/Video-Bench/resolve/main/file/ANSWER.json).
-
-> Note: The text data of Video-Bench is stored in a path of the form `egs/VideoBench/Eval_QA` (the path should have at least two levels, and the last level must be `Eval_QA`); the video data of Video-Bench is stored in a path of the form `egs/VideoBench/Eval_video` (the path should have at least two levels, and the last level must be `Eval_video`).
-
-### Evaluation
-
-The evaluation script is available at [eval_with_videobench.py](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/benchmarks/eval_with_videobench.py).
-
-#### Executing the Inference Script to Obtain Inference Results
-
-```shell
-python toolkit/benchmarks/eval_with_videobench.py \
---model_path model_path \
---dataset_name dataset_name \
---Eval_QA_root Eval_QA_root \
---Eval_Video_root Eval_Video_root \
---chat_conversation_output_folder output
-```
-
-> The parameter `Eval_QA_root` should be set to the parent directory of `Eval_QA`; the parameter `Eval_Video_root` should be set to the parent directory of `Eval_video`.
-
-**Parameters Description**
-
-| **Parameters** | **Compulsory (Y/N)** | **Description** |
-|-------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------------|
-| `--model_path` | Y | The folder path for storing model-related files, including the model configuration file and the model vocabulary file. |
-| `--dataset_name` | N | Name of the evaluation dataset. Defaults to None, which evaluates all subsets of VideoBench. |
-| `--Eval_QA_root` | Y | Directory storing the JSON files of the VideoBench dataset. |
-| `--Eval_Video_root` | Y | Directory storing the video files of the VideoBench dataset. |
-| `--chat_conversation_output_folder` | N | Directory for the generated result files. By default, they are stored in the Chat_results folder of the current directory. |
-
-After running, dialogue result files are generated in the `chat_conversation_output_folder` directory.
-
-#### Evaluating and Scoring Based on the Generated Results
-
-Video-Bench can evaluate the answers generated by the model using ChatGPT or T5, and finally obtains the scores for the 13 data subsets.
-
-For example, using ChatGPT for evaluation and scoring:
-
-```shell
-python Step2_chatgpt_judge.py \
---model_chat_files_folder ./Chat_results \
---apikey sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
---chatgpt_judge_output_folder ./ChatGPT_Judge
-
-python Step3_merge_into_one_json.py \
---chatgpt_judge_files_folder ./ChatGPT_Judge \
---merge_file ./Video_Bench_Input.json
-```
-
-The scripts used in the above scoring commands are [Step2_chatgpt_judge.py](https://github.com/PKU-YuanGroup/Video-Bench/blob/main/Step2_chatgpt_judge.py) and [Step3_merge_into_one_json.py](https://github.com/PKU-YuanGroup/Video-Bench/blob/main/Step3_merge_into_one_json.py).
-
-Since ChatGPT's responses may contain formatting errors, you may need to run Step2_chatgpt_judge.py multiple times to ensure that every question is validated by ChatGPT.
-
 ## FAQ
 
-1. Use Harness or VLMEvalKit for evaluation, when loading the HuggingFace datasets, report `SSLError`:
+1. When using Harness for evaluation, an `SSLError` is reported while loading the HuggingFace datasets:
 
    Refer to [SSL Error reporting solution](https://stackoverflow.com/questions/71692354/facing-ssl-error-with-huggingface-pretrained-models).
 
-   Note: Turning off SSL verification is risky and may be exposed to MITM. It is only recommended to use it in the test environment or in the connection you fully trust.
-
-2. An `AssertionError` occurs when MVBench dataset is used in VLMEvalKit for evaluation:
-
-   Because the open source framework `VLMEvalKit` has known problems when running `MVBench` dataset. Modify the file by referring to the [issue](https://github.com/open-compass/VLMEvalKit/issues/888) of the open-source framework, or delete the files generated during the evaluation and run the command again (specified by the `--work_dir` parameter, in the `outputs` folder of the current execution directory by default).
\ No newline at end of file
+   Note: Turning off SSL verification is risky and may expose you to man-in-the-middle (MITM) attacks. It is only recommended in a test environment or over connections you fully trust.
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/guide/inference.md b/docs/mindformers/docs/source_en/guide/inference.md
index b252a158d7f0316296e0a359c738540ed087639f..cb9fa02c1668d463108e2ba8da60659c9d40e1fa 100644
--- a/docs/mindformers/docs/source_en/guide/inference.md
+++ b/docs/mindformers/docs/source_en/guide/inference.md
@@ -150,33 +150,6 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
 
 Inference results are viewed in the same way as multi-card inference.
 
-### Multimodal Inference
-
-The following uses the `cogvlm2-llama3-chat-19B` model as an example to describe the process in detail:
-
-Modify the configuration yaml file [predict_cogvlm2_image_llama3_chat_19b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml).
-
-```yaml
-model:
-  model_config:
-    use_past: True                         # Turn on incremental inference
-    is_dynamic: False                      # Turn off dynamic shape
-
-  tokenizer:
-    vocab_file: "/{path}/tokenizer.model"  # Specify the tokenizer file path
-```
-
-Run the inference script.
-
-```shell
-python run_mindformer.py \
- --config configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml \
- --run_mode predict \
 --predict_data "/path/image.jpg" "Please describe this image." \ # input data: the first input is the image path, the second is the text input.
 --modal_type image text \ # modal type of each input: 'image' for the image path, 'text' for the text input.
- --load_checkpoint /{path}/cogvlm2-image-llama3-chat.ckpt -``` - ## More Information For more inference examples of different models, see [the models supported by MindSpore Transformers](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html). diff --git a/docs/mindformers/docs/source_en/index.rst b/docs/mindformers/docs/source_en/index.rst index 589419b5f18b48141c511a69e97047d7af2d80fd..915f7ea73fd359316fd205629931d7c65a45ada6 100644 --- a/docs/mindformers/docs/source_en/index.rst +++ b/docs/mindformers/docs/source_en/index.rst @@ -109,7 +109,6 @@ Advanced developing with MindSpore Transformers - Model Development - `Development Migration `_ - - `Multimodal Model Development `_ Environment Variables ------------------------------------ @@ -179,7 +178,6 @@ FAQ advanced_development/precision_optimization advanced_development/performance_optimization advanced_development/dev_migration - advanced_development/multi_modal_dev advanced_development/api .. toctree:: diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md b/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md deleted file mode 100644 index 49289610aea9e0c5e04cf783fa59fca2c237ef1a..0000000000000000000000000000000000000000 --- a/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md +++ /dev/null @@ -1,331 +0,0 @@ -# 多模态理解模型开发 - -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md) - -多模态理解模型(Multimodal Model)是指能够处理并结合来自不同模态(如文字、图像、音频、视频等)的信息进行学习和推理的人工智能模型。 -传统的单一模态模型通常只关注单一数据类型,如文本分类模型只处理文本数据,图像识别模型只处理图像数据。而多模态理解模型则通过融合不同来源的数据来完成更复杂的任务,从而能够理解和生成更加丰富、全面的内容。 - -本文档旨在介绍MindSpore Transformers中的多模态理解模型,文档提供详细的步骤和示例指导用户使用MindSpore Transformers构建自定义的多模态理解模型和数据处理等模块。此外,用户还可以根据文档内容,完成模型的训练和推理等任务。 - -MindSpore Transformers中多模态理解模型统一架构主要包括如下几个部分的内容: - -- [数据集构建](#数据集构建) -- [数据处理模块](#数据处理模块) -- [模型构建](#模型构建) - - [模型配置类](#模型配置类) - - [非文本模态处理模块](#非文本模态处理模块) - - [跨模态交互模块](#跨模态交互模块) - - [文本生成模块](#文本生成模块) -- [多模态理解模型实践](#多模态理解模型实践) - -## 数据集构建 - -在训练多模态理解模型之前,通常需要先完成多模态数据集的构建,MindSpore Transformers目前提供多模态数据的`dataset`类和`dataloader`类,用户可直接使用: - -- [BaseMultiModalDataLoader](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/dataset/dataloader/multi_modal_dataloader.py)是多模态数据集加载类,主要完成从`json`文件中读取数据的功能; -- [ModalToTextSFTDataset](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/dataset/modal_to_text_sft_dataset.py)是多模态数据集处理类,主要完成多模态数据处理,以及数据集批处理、数据集重复等操作,具体多模态数据处理可参考[数据处理模块](#数据处理模块); - -以下是`Cogvlm2-Video`模型的训练数据集`json`文件部分内容示例: - -```json -[{ - "id": "v_p1QGn0IzfW0.mp4", - "conversations": [ - { - "from": "user", - "value": "<|reserved_special_token_3|>/path/VideoChatGPT/convert/v_p1QGn0IzfW0.mp4<|reserved_special_token_4|>What equipment is visible in the gym where the boy is doing his routine?" - }, - { - "from": "assistant", - "value": "There is other equipment visible in the gym like a high bar and still rings." 
- } - ] -}] -``` - -其中,`<|reserved_special_token_3|>`和`<|reserved_special_token_3|>`是`Cogvlm2-Video`模型中视频路径的标识符。 - -用户可根据需要构造自定义的`json`文件,文件格式为一个包含多个字典的列表,每个字典代表一个数据样本,样本中`id`字段表示数据标识符,`conversations`字段表示多轮对话内容。 - -在构造`json`文件之后,可运行下面的示例代码查看数据集中的数据样本: - -```python -from mindformers.dataset.dataloader.multi_modal_dataloader import BaseMultiModalDataLoader - -# build data loader -dataset_loader = BaseMultiModalDataLoader( - annotation_file = '/path/dataset.json', shuffle=False -) -print(dataset_loader[0]) - -# ([['user', '<|reserved_special_token_3|>/path/VideoChatGPT/convert/v_p1QGn0IzfW0.mp4<|reserved_special_token_4|>What equipment is visible in the gym where the boy is doing his routine?'], ['assistant', 'There is other equipment visible in the gym like a high bar and still rings.']],) -``` - -## 数据处理模块 - -在多模态理解模型的训练和推理过程中,都需要使用数据处理模块实现对多模态数据的预处理,该模块在训练时会在ModalToTextSFTDataset中被调用,推理时则是在[MultiModalToTextPipeline](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/pipeline/mindformers.pipeline.MultiModalToTextPipeline.html#mindformers.pipeline.MultiModalToTextPipeline)中被调用。 - -下图是多模态数据的处理流程图,图中的自定义模块需要用户根据实际需求实现,其他模块直接调用即可。 - -![multi_modal.png](images/multi_modal.png) - -下面以[CogVLm2-Video模型数据预处理模块](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/cogvlm2/cogvlm2_processor.py)为例,介绍多模态数据处理模块中各组成部分的功能。 - -1. BaseXModalToTextProcessor主要用于接收用于推理的多模态原始数据并对进行预处理操作,同时也实现了推理结果后处理操作,该类用户可直接使用; -2. BaseXModalToTextTransform主要用于将`BaseXModalToTextProcessor`或多模态数据集返回的数据分别处理为推理或训练数据,该类用户可直接使用; -3. [ModalContentTransformTemplate](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/models/mindformers.models.multi_modal.ModalContentTransformTemplate.html#mindformers.models.multi_modal.ModalContentTransformTemplate)是所有模态训推数据构建模块的抽象类,由于数据具体操作与模型相关,因此用户需要根据需求实现对应的自定义数据构建类,在`Cogvlm2-Video`模型中实现了`CogVLM2ContentTransformTemplate`类,实现了对视频以及文本数据的处理; -4. 
ModalContentBuilder是所有单模态数据处理的抽象类,如果模型要处理多个模态的数据,就需要在自定义数据构建类初始化时创建多个对应的单模态数据处理类,在`Cogvlm2-Video`模型中实现了`CogVLM2VideoContentBuilder`类用于处理视频数据,并使用通用文本数据处理类`BaseTextContentBuilder`类处理文本数据。 - -下面是`Cogvlm2-Video`模型训练、推理数据预处理的示例代码。 - -### 模型训练数据处理 - -在多模态理解模型训练任务中,数据预处理的配置通常会写在`train_dataset`中,`Cogvlm2-Video`模型训练配置文件中数据集相关配置如下: - -[finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml) - -```yaml -train_dataset: &train_dataset - data_loader: - type: BaseMultiModalDataLoader - annotation_file: "/path/train_data.json" - shuffle: True - modal_to_text_transform: - type: BaseXModalToTextTransform - max_length: 2048 - model_transform_template: - type: CogVLM2ContentTransformTemplate - output_columns: [ "input_ids", "images", "video_context_pos", "position_ids", "labels" ] - signal_type: "chat" - mode: 'train' - pos_pad_length: 2048 - tokenizer: - add_bos_token: False - add_eos_token: False - max_length: 2048 - pad_token: "<|reserved_special_token_0|>" - vocab_file: "/path/tokenizer.model" - type: CogVLM2Tokenizer -``` - -其中,`annotation_file`为训练数据的`json`文件路径,`modal_to_text_transform`与`tokenizer`都应该与推理配置中`processor`中的类似。 - -```python -from mindformers.tools.register.config import MindFormerConfig -from mindformers.dataset.modal_to_text_sft_dataset import ModalToTextSFTDataset - -# load configs -configs = MindFormerConfig("configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml") -# build dataset -multi_modal_dataset = ModalToTextSFTDataset(**configs.train_dataset) -# iterate dataset -for item in multi_modal_dataset: - print(len(item)) - break -# 5, output 5 columns -``` - -### 模型推理数据处理 - -`Cogvlm2-Video`模型推理配置文件中数据处理模块的配置如下: - -[predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml) - -```yaml -processor: - type: BaseXModalToTextProcessor - model_transform_template: - type: CogVLM2ContentTransformTemplate - output_columns: [ "input_ids", "position_ids", "images", "video_context_pos" ] - vstack_columns: [ "images", "video_context_pos" ] - signal_type: "chat" - pos_pad_length: 2048 - tokenizer: - add_bos_token: False - add_eos_token: False - max_length: 2048 - pad_token: "<|reserved_special_token_0|>" - vocab_file: "/path/tokenizer.model" - type: CogVLM2Tokenizer -``` - -其中,`vocab_file`为实际使用词表文件路径,其他参数为模型相关配置,用户可按需进行自定义配置。 - -下面是多模态数训练据处理示例代码,与训练数据不同的是,通过数据处理可以得到一个包含`input_ids`等处理后的数据的字典,而不是一个列表。 - -```python -from mindformers.tools.register.config import MindFormerConfig -from mindformers.models.multi_modal.base_multi_modal_processor import BaseXModalToTextProcessor -from mindformers.models.cogvlm2.cogvlm2_tokenizer import CogVLM2Tokenizer - -# build processor -configs = MindFormerConfig("configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml") -configs.processor.tokenizer = tokenizer = CogVLM2Tokenizer(**configs.processor.tokenizer) -processor = BaseXModalToTextProcessor(**configs.processor) - -# process data -multi_modal_data = [ - {'video': "/path/video.mp4"}, - {'text': "Please describe this video."} -] - -print(processor(multi_modal_data).keys()) -# dict_keys(['input_ids', 'position_ids', 'images', 'video_context_pos']) -``` - -在实现多模态数据集构建以及数据处理模块之后,就可以得到多模态理解模型可以处理的数据,下面将介绍如何构建多模态大模型。 - -## 模型构建 - -多模态大模型通常包括非文本模态处理模块、跨模态交互模块以及文本生成模块三个部分,其中非文本模态处理模块通常为经过大规模数据预训练后的视觉模型, -文本生成模块通常为文本生成大模型,跨模态交互模块通常由多个线性层组成。 - -### 模型配置类 - -MindSpore 
Transformers中多模态理解模型相关参数主要通过模型配置类进行控制,下面以`CogVLM2Config`类为例介绍如何构建模型配置类, -具体实现可参考[CogVLM2Config](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/cogvlm2/cogvlm2_config.py)。 - -```python -@MindFormerRegister.register(MindFormerModuleType.CONFIG) -class CogVLM2Config(PretrainedConfig): - def __init__(self, - vision_model: PretrainedConfig, - llm_model: PretrainedConfig, - **kwargs): - super().__init__(**kwargs) - self.vision_model = vision_model - self.llm_model = llm_model -``` - -参数说明: - -1. `@MindFormerRegister.register(MindFormerModuleType.CONFIG)`主要用于注册自定义的模型配置类,注册后的模型配置类可在`yaml`文件中通过名称进行调用; -2. `vision_model`和`llm_model`分别表示视觉模型以及文本生成模型的配置类,作为多模态理解模型配置类的入参,并在类初始化过程中对其进行处理; -3. `PretrainedConfig`是所有模型配置的基类,具体可参考[PretrainedConfig](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/models/mindformers.models.PretrainedConfig.html#mindformers.models.PretrainedConfig)。 - -在配置文件中,按如下结构对模型进行配置, -具体实现可参考[predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml)。 - -```yaml -model: - model_config: - type: MultiModalConfig - vision_model: - arch: - type: EVAModel - model_config: - type: EVA02Config - image_size: 224 - patch_size: 14 - hidden_size: 1792 - num_hidden_layers: 63 - ... - llm_model: - arch: - type: CogVLM2VideoLM - model_config: - type: LlamaConfig - seq_length: 2048 - hidden_size: 4096 - num_layers: 32 - ... - arch: - type: CogVLM2ForCausalLM -``` - -在该配置文件中,将`EVAModel`、`EVA02Config`作为`vision_model`模型及其配置类,将`CogVLM2VideoLM`、`LlamaConfig`作为`llm_model`模型及其配置类, -由此构成多模态理解模型`CogVLM2ForCausalLM`,这些类都是MindSpore Transformers已实现的模块,下面将介绍如何实现自定义模块。 - -### 非文本模态处理模块 - -MindSpore Transformers提供`ViT`、`EVA02`等模型作为视觉信息处理模块,下面以`EVA02`模型为例介绍如何构建非文本模态处理模块, -具体可参考[EVAModel](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/eva02/eva.py)和[EVA02Config](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/eva02/eva_config.py)。 - -```python -from mindformers.tools.register import MindFormerRegister, MindFormerModuleType -from mindformers.models.modeling_utils import PreTrainedModel -from mindformers.models.eva02.eva_config import EVA02Config - -class EVA02PreTrainedModel(PreTrainedModel): - config_class = EVA02Config - base_model_prefix = "eva02" - -@MindFormerRegister.register(MindFormerModuleType.MODELS) -class EVAModel(EVA02PreTrainedModel): - def __init__(self, config=None): - config = config if config else EVA02Config() - super().__init__(config) -``` - -参数说明: - -1. `@MindFormerRegister.register(MindFormerModuleType.MODELS)`主要用于注册自定义的模型类,注册后的模型类可在`yaml`文件中通过名称进行调用; -2. `EVA02PreTrainedModel`继承自`PreTrainedModel`类,主要用于指定模型配置类以及模型参数名的前缀,`EVAModel`作为模型的具体实现,承自`EVA02PreTrainedModel`类,相关API说明可参考[PreTrainedModel](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/models/mindformers.models.PreTrainedModel.html#mindformers.models.PreTrainedModel); -3. 
`EVAModel`主要对数据中的视觉信息进行处理,将处理后的视觉特征输入**跨模态交互模块**。 - -### 跨模态交互模块 - -文本生成模块通常为经过预训练的大语言模型,而非文本模态处理模块为经过大规模非文本数据预训练后的模型,其输出特征和与文本特征中所包含的信息差异过大,无法直接输入到文本生成模块中进行推理,因此需要构造与文本生成模块相匹配的跨模态交互模块,将视觉特征处理为文本生成模块可处理的向量。 - -下面以`CogVLM2-Video`模型中的`VisionMLPAdapter`为例,介绍跨模态交互模块的结构与功能。 - -```python -class VisionMLPAdapter(nn.Cell): - def __init__(self, vision_grid_size, vision_hidden_size, text_hidden_size, text_intermediate_size, - compute_dtype=ms.float16, param_init_type=ms.float16): - super().__init__() - self.grid_size = vision_grid_size - self.linear_proj = GLU(in_features=vision_hidden_size, - hidden_size=text_hidden_size, - intermediate_size=text_intermediate_size, - compute_dtype=compute_dtype, param_init_type=param_init_type) - self.conv = nn.Conv2d(in_channels=vision_hidden_size, out_channels=vision_hidden_size, - kernel_size=2, stride=2, dtype=param_init_type, has_bias=True).to_float(compute_dtype) -``` - -在`VisionMLPAdapter`中会将`EVAModel`的输出通过Linear、Conv2D等操作处理成与文本特征相同的维度,其中`vision_hidden_size`和`text_hidden_size`分别表示视觉和文本特征维度。 - -### 文本生成模块 - -MindSpore Transformers提供`Llama2`、`Llama3`等语言大模型作为文本生成模块,与非文本模态处理模块、跨模态交互模块共同构成多模态理解模型。 - -```python -@MindFormerRegister.register(MindFormerModuleType.MODELS) -class MultiModalForCausalLM(BaseXModalToTextModel): - def __init__(self, config: MultiModalConfig, **kwargs): - super().__init__(config, **kwargs) - self.config = config - self.vision_model = build_network(config.vision_model) - self.llm_model = build_network(config.llm_model) - self.mlp_adapter = VisionMLPAdapter(**kwargs) - - def prepare_inputs_for_generation(self, input_ids, **kwargs): - """Prepare inputs for generation in inference.""" - - def prepare_inputs_for_predict_layout(self, input_ids, **kwargs): - """Prepare inputs for generation in inference.""" - - def set_dynamic_inputs(self, **kwargs): - """Set dynamic inputs for model.""" - - def construct(self, input_ids, **kwargs): - """Model forward.""" -``` - -参数说明: - -1. `MultiModalForCausalLM`作为多模态理解模型类,继承自基类`BaseXModalToTextModel`,在该类构建过程中通过`build_network`和对应模块的配置,对非文本模态处理模块`vision_model`、文本生成模块`llm_model`以及跨模态交互模块`VisionMLPAdapter`进行初始化; -2. `prepare_inputs_for_generation`方法可以对输入数据进行预处理,要求处理后的数据可通过`construct`方法实现模型推理; -3. `prepare_inputs_for_predict_layout`方法用于构造模型可处理的数据,其返回值与`construct`方法入参对应,通过构造后的数据可实现模型编译; -4. `set_dynamic_inputs`方法可以为模型入参中的部分数据配置动态shape; -5. 
`construct`方法为所有模型通用接口,也是模型前向执行函数。 - -## 多模态理解模型实践 - -在实现多模态数据集、数据处理模块以及多模态理解模型构建之后,就可以通过模型配置文件启动模型预训练、微调、推理等任务,为此需要构建对应的模型配置文件。 - -具体模型配置文件可参考[predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml)和[finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml)分别对应模型推理和微调,其中参数具体含义可查阅[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html)。 - -在用户自定义的配置文件中`model`、`processor`、`train_dataset`等部分内容需要对应用户自定义的**数据集**、**数据处理模块**以及**多模态理解模型**进行设置。 - -编辑自定义的配置文件之后,参考[CogVLM2-Video模型文档](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/cogvlm2_video.md)启动模型[推理](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/cogvlm2_video.md#推理)和[微调](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/cogvlm2_video.md#微调)任务即可。 diff --git a/docs/mindformers/docs/source_zh_cn/feature/evaluation.md b/docs/mindformers/docs/source_zh_cn/feature/evaluation.md index 445ce54fdb402aa87fb023bd9a179c27f7bb1d41..2f969f224595eaebc272fc423069c25c4f01b3a4 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/evaluation.md +++ b/docs/mindformers/docs/source_zh_cn/feature/evaluation.md @@ -173,368 +173,10 @@ Harness评测支持单机单卡、单机多卡、多机多卡场景,每种场 | gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.5034 | ± | 0.0138 | | | | strict-match | 5 | exact_match | ↑ | 0.5011 | ± | 0.0138 | -## VLMEvalKit评测 - -### 基本介绍 - -[VLMEvalKit](https://github.com/open-compass/VLMEvalKit) -是一款专为大型视觉语言模型评测而设计的开源工具包,支持在各种基准测试上对大型视觉语言模型进行一键评估,无需进行繁重的数据准备工作,让评估过程更加简便。它支持多种图文多模态评测集和视频多模态评测集,支持多种API模型以及基于PyTorch和HF的开源模型,支持自定义prompt和评测指标。基于VLMEvalKit评测框架对MindSpore Transformers进行适配后,支持加载MindSpore Transformers中多模态大模型进行评测。 - -目前已适配的模型和支持的评测数据集如下表所示(其余模型和评测数据集正在积极适配中,请关注版本更新): - -| 适配的模型 | 支持的评测任务 | -|--|---------------------------------------------------| -| cogvlm2-image-llama3-chat | MME、MMBench、COCO Caption、MMMU_DEV_VAL、TextVQA_VAL | -| cogvlm2-video-llama3-chat | MMBench-Video、MVBench | - -### 支持特性说明 - -1. 支持自动下载评测数据集; -2. 一键生成评测结果。 - -### 安装 - -#### 下载代码并编译,安装依赖包 - -1. 下载并修改代码:由于开源框架在跑MVBench数据集时存在已知问题,所以需要使用导入patch补丁的方式修改源码。获取[eval.patch](https://github.com/user-attachments/files/17956417/eval.patch),下载放入本地目录中。导入patch时要使用patch文件的绝对路径。 - - 执行以下命令: - - ```bash - git clone https://github.com/open-compass/VLMEvalKit.git - cd VLMEvalKit - git checkout 78a8cef3f02f85734d88d534390ef93ecc4b8bed - git apply /path/to/eval.patch - ``` - -2. 安装依赖包 - - 在下载好的代码中,找到requirements.txt(VLMEvalKit/requirements.txt)文件,修改成如下内容: - - ```text - gradio==4.40.0 - huggingface_hub==0.24.2 - imageio==2.35.1 - matplotlib==3.9.1 - moviepy==1.0.3 - numpy==1.26.4 - omegaconf==2.3.0 - openai==1.3.5 - opencv-python==4.10.0.84 - openpyxl==3.1.5 - pandas==2.2.2 - peft==0.12.0 - pillow==10.4.0 - portalocker==2.10.1 - protobuf==5.27.2 - python-dotenv==1.0.1 - requests==2.32.3 - rich==13.7.1 - sentencepiece==0.2.0 - setuptools==69.5.1 - sty==1.0.6 - tabulate==0.9.0 - tiktoken==0.7.0 - timeout-decorator==0.5.0 - torch==2.5.1 - tqdm==4.66.4 - transformers==4.43.3 - typing_extensions==4.12.2 - validators==0.33.0 - xlsxwriter==3.2.0 - torchvision==0.20.1 - ``` - - 执行命令: - - ```bash - pip install -r requirements.txt - ``` - -#### 安装FFmpeg - -Ubuntu系统按照如下步骤安装: - -1. 
#### Install FFmpeg

On Ubuntu, follow these steps:

1. Update the system package list and install the system libraries required to build FFmpeg.

   ```bash
   apt-get update
   apt-get -y install autoconf automake build-essential libass-dev libfreetype6-dev libsdl2-dev libtheora-dev libtool libva-dev libvdpau-dev libvorbis-dev libxcb1-dev libxcb-shm0-dev libxcb-xfixes0-dev pkg-config texinfo zlib1g-dev yasm libx264-dev libfdk-aac-dev libmp3lame-dev libopus-dev libvpx-dev
   ```

2. Download the FFmpeg 4.1.11 source package from the FFmpeg website, extract it, and enter the extracted directory. Configure the build options: set the FFmpeg installation prefix (an absolute path), build shared libraries, enable support for the required codecs, and enable the non-free and GPL-licensed features. Then build and install FFmpeg.

   ```bash
   wget --no-check-certificate https://www.ffmpeg.org/releases/ffmpeg-4.1.11.tar.gz
   tar -zxvf ffmpeg-4.1.11.tar.gz
   cd ffmpeg-4.1.11
   ./configure --prefix=/{path}/ffmpeg-xxx --enable-shared --enable-libx264 --enable-libfdk-aac --enable-libmp3lame --enable-libopus --enable-libvpx --enable-nonfree --enable-gpl
   make && make install
   ```

On OpenEuler, follow these steps:

1. Download the FFmpeg 4.1.11 source package from the FFmpeg website, extract it, and enter the extracted directory. Configure the build options, setting the FFmpeg installation prefix (an absolute path), then build and install FFmpeg.

   ```bash
   wget --no-check-certificate https://www.ffmpeg.org/releases/ffmpeg-4.1.11.tar.gz
   tar -zxvf ffmpeg-4.1.11.tar.gz
   cd ffmpeg-4.1.11
   ./configure --enable-shared --disable-x86asm --prefix=/path/to/ffmpeg
   make && make install
   ```

2. Configure the environment variables. `FFMPEG_PATH` must point to the absolute path where FFmpeg was installed so that the system can find FFmpeg and its libraries.

   ```bash
   vi ~/.bashrc
   export FFMPEG_PATH=/path/to/ffmpeg/
   export LD_LIBRARY_PATH=$FFMPEG_PATH/lib:$LD_LIBRARY_PATH
   source ~/.bashrc
   ```

#### Install Decord

On Ubuntu, follow these steps:

1. Clone the Decord code and enter the `decord` directory:

   ```bash
   git clone --recursive -b v0.6.0 https://github.com/dmlc/decord.git
   cd decord
   ```

2. Create and enter the `build` directory, configure the Decord build options (disable CUDA support, enable Release mode for performance, and point to the FFmpeg installation path), and build the Decord library. Copy the generated libdecord.so to the system library directory and to the `python` directory of `decord`.

   ```bash
   mkdir build
   cd build
   cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release -DFFMPEG_DIR=/{path}/ffmpeg-4.1.11 && make
   cp libdecord.so /usr/local/lib/
   cp libdecord.so ../python/decord/libdecord.so
   ```

3. Enter the `python` folder inside the `decord` directory, install the `numpy` dependency, and install the Decord Python package. Add the FFmpeg library path (absolute path) to the `LD_LIBRARY_PATH` environment variable so that the FFmpeg shared libraries can be found at runtime.

   ```bash
   cd /path/to/decord/python
   pip install numpy
   python setup.py install
   export LD_LIBRARY_PATH=/path/to/ffmpeg-4.1.11/lib/:$LD_LIBRARY_PATH
   ```

4. Run a Python command to check that Decord is installed correctly; no error means the installation succeeded.

   ```bash
   python -c "import decord; from decord import VideoReader"
   ```

On OpenEuler, follow these steps:

1. Clone the Decord code and enter the `decord` directory.

   ```bash
   git clone --recursive -b v0.6.0 https://github.com/dmlc/decord
   cd decord
   ```

2. Create and enter the `build` directory, configure the Decord build options with the FFmpeg installation path (absolute path), and build the Decord library. Then enter the `python` folder inside the `decord` directory, set the `PYTHONPATH` environment variable, and install the Decord Python package.

   ```bash
   mkdir build && cd build
   cmake -DFFMPEG_DIR=/path/ffmpeg-4.1.11 ..
   make
   cd ../python
   pwd=$PWD
   echo "PYTHONPATH=$PYTHONPATH:$pwd" >> ~/.bashrc
   source ~/.bashrc
   python3 setup.py install
   ```

3. Run a Python command to check that Decord is installed correctly; no error means the installation succeeded. A fuller decoding check follows this list.

   ```bash
   python -c "import decord; from decord import VideoReader"
   ```
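Beyond the import check above, decoding a few frames from a local clip exercises the FFmpeg/Decord integration end to end. This is only an illustration; `sample.mp4` is a placeholder for any short video file on disk.

```python
from decord import VideoReader, cpu

# Placeholder path: point this at any short local video clip.
vr = VideoReader("sample.mp4", ctx=cpu(0))
print("frames:", len(vr), "fps:", vr.get_avg_fps())

# Decode a small batch of evenly spaced frames, as video benchmarks typically do.
indices = list(range(0, len(vr), max(1, len(vr) // 8)))[:8]
frames = vr.get_batch(indices)  # NDArray of shape (n, H, W, 3)
print("decoded batch shape:", frames.shape)
```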
### Evaluation

#### Preparation

1. Create a new directory, for example `model_dir`, to store the model yaml file.
2. Place the model inference yaml configuration file (predict_xxx_.yaml) in the directory created in the previous step. The location of each model's inference yaml configuration file is given in the model file tree of the corresponding model documentation in the [model library](../introduction/models.md).
3. Configure the yaml file.

   Take [predict_cogvlm2_image_llama3_chat_19b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml) as an example:

   ```yaml
   load_checkpoint: "/{path}/model.ckpt"  # path to the checkpoint file
   model:
     model_config:
       use_past: True     # enable incremental inference
       is_dynamic: False  # disable dynamic shape

     tokenizer:
       vocab_file: "/{path}/tokenizer.model"  # path to the tokenizer file
   ```

   For details on configuring the yaml file, refer to the [configuration file description](../feature/configuration.md).
4. Evaluation on the MMBench-Video dataset uses the GPT-4 Turbo model for scoring. Prepare the corresponding API key in advance and put it in the VLMEvalKit/.env file, with the following content:

   ```text
   OPENAI_API_KEY=your_apikey
   ```

5. When evaluation on the MVBench dataset starts, if you are prompted for a HuggingFace token, enter it as prompted so that the subsequent evaluation can run normally.

#### Launching the Evaluation Task

Run the script [run_vlmevalkit.sh](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/benchmarks/run_vlmevalkit.sh) from the root directory of the local MindSpore Transformers repository.

Run the following command to launch the evaluation task:

```shell
#!/bin/bash

source toolkit/benchmarks/run_vlmevalkit.sh \
 --data MMMU_DEV_VAL \
 --model cogvlm2-image-llama3-chat \
 --verbose \
 --work_dir /path/to/cogvlm2-image-eval-result \
 --model_path model_dir
```

### Evaluation Parameters

| Parameter | Type | Description | Required |
|-------------------|-----|------------------------------------------------------------------------------------------------|-----------|
| `--data` | str | Dataset name(s); multiple datasets can be passed, separated by spaces. | Yes |
| `--model` | str | Model name. | Yes |
| `--verbose` | / | Print the logs produced during the evaluation run. | No |
| `--work_dir` | str | Directory for storing evaluation results; defaults to the `outputs` folder under the current working directory. | No |
| `--model_path` | str | Path to the folder containing the configuration file. | Yes |
| `--register_path` | str | Absolute path of the directory containing external code, for example a model directory under [research](https://gitee.com/mindspore/mindformers/tree/dev/research). | No (required for external code) |

If network restrictions prevent the server from downloading image-text datasets online, you can upload locally downloaded dataset files ending in .tsv to the ~/LMUData directory on the server and run the evaluation offline (for example: ~/LMUData/MME.tsv, ~/LMUData/MMBench_DEV_EN.tsv, or ~/LMUData/COCO_VAL.tsv).

### Viewing the Evaluation Results

After evaluating as described above, look for files ending in .json or .csv in the directory where the evaluation results are stored to view the results.

A sample evaluation result is shown below, where `Bleu` and `ROUGE_L` are metrics for translation quality and `CIDEr` is a metric for image captioning tasks.

```json
{
    "Bleu": [
        15.523950970070652,
        8.971141548228058,
        4.702477458554666,
        2.486860744700995
    ],
    "ROUGE_L": 15.575063213115946,
    "CIDEr": 0.01734615519604295
}
```
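When several benchmarks are run in one pass, it can be handy to collect the per-dataset result files programmatically. The sketch below only assumes the `--work_dir` layout from the command above and that each .json result file holds a flat dictionary of metrics like the sample; adjust the path to your own setup.

```python
import json
from pathlib import Path

# Assumed to match the --work_dir passed to run_vlmevalkit.sh.
work_dir = Path("/path/to/cogvlm2-image-eval-result")

for result_file in sorted(work_dir.rglob("*.json")):
    with result_file.open(encoding="utf-8") as f:
        metrics = json.load(f)
    if not isinstance(metrics, dict):
        continue  # skip files that are not flat metric dictionaries
    print(result_file.name)
    for name, value in metrics.items():
        print(f"  {name}: {value}")
```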
## Evaluating Models with the VideoBench Dataset

### Overview

[Video-Bench](https://github.com/PKU-YuanGroup/Video-Bench/tree/main) is the first comprehensive evaluation benchmark for Video-LLMs. It features a three-level capability assessment that systematically evaluates a model's performance in video-exclusive understanding, prior-knowledge incorporation, and video-grounded decision-making.

### Preparation

1. Download the video data.

   Download the [video data of Video-Bench](https://huggingface.co/datasets/LanguageBind/Video-Bench), extract it, and arrange it in the following directory layout:

   ```text
   egs/VideoBench/
   └── Eval_video
       ├── ActivityNet
       │   ├── v__2txWbQfJrY.mp4
       │   ...
       ├── Driving-decision-making
       │   ├── 1.mp4
       │   ...
       ...
   ```

2. Download the text data.

   Download the [text data of Video-Bench](https://github.com/PKU-YuanGroup/Video-Bench/tree/main?tab=readme-ov-file), extract it, and arrange it in the following directory layout:

   ```text
   egs/Video-Bench/
   └── Eval_QA
       ├── json files such as Youcook2_QA_new.json
       ...
   ```

3. Download the correct answers to all questions.

   Download the [answer data of Video-Bench](https://huggingface.co/spaces/LanguageBind/Video-Bench/resolve/main/file/ANSWER.json).

> The Video-Bench text data must be stored under a path of the form "egs/VideoBench/Eval_QA" (at least two directory levels, with the last level named `Eval_QA`); the Video-Bench video data must be stored under a path of the form "egs/VideoBench/Eval_video" (at least two directory levels, with the last level named `Eval_video`).

### Evaluation

The evaluation script is available at [eval_with_videobench.py](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/benchmarks/eval_with_videobench.py).

#### Run the inference script to obtain the inference results

```shell
python toolkit/benchmarks/eval_with_videobench.py \
--model_path model_path \
--dataset_name dataset_name \
--Eval_QA_root Eval_QA_root \
--Eval_Video_root Eval_Video_root \
--chat_conversation_output_folder output
```

> The `Eval_QA_root` parameter should be set to the parent directory of Eval_QA, and the `Eval_Video_root` parameter to the parent directory of Eval_video.

**Parameter description**

| **Parameter** | **Required** | **Description** |
|------------------------------------|---------|--------------------------------------------|
| `--model_path` | Yes | Path to the folder containing the model files, including the model configuration file and the model vocabulary file. |
| `--dataset_name` | No | Name of the evaluation data subset; defaults to None, which evaluates all VideoBench subsets. |
| `--Eval_QA_root` | Yes | Directory containing the json files of the VideoBench dataset. |
| `--Eval_Video_root` | Yes | Directory containing the video files of the VideoBench dataset. |
| `--chat_conversation_output_folder` | No | Directory for the generated result files; defaults to the Chat_results folder under the current directory. |

After the run finishes, conversation result files are generated in the chat_conversation_output_folder directory.
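Before moving on to scoring, it is worth confirming that the run actually produced conversation files. A minimal check is shown below; it assumes the default `./Chat_results` output folder and does not inspect the file contents, whose format is defined by Video-Bench.

```python
from pathlib import Path

# Assumes the default output folder; change if --chat_conversation_output_folder was set.
output_dir = Path("Chat_results")
result_files = sorted(output_dir.glob("*.json"))

print(f"{len(result_files)} conversation result file(s) found")
for path in result_files:
    print(" -", path.name, f"({path.stat().st_size} bytes)")
```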
#### Score the generated results

Video-Bench can evaluate the answers generated by the model using ChatGPT or T5, producing final scores for the 13 data subsets.

For example, to score with ChatGPT:

```shell
python Step2_chatgpt_judge.py \
--model_chat_files_folder ./Chat_results \
--apikey sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
--chatgpt_judge_output_folder ./ChatGPT_Judge

python Step3_merge_into_one_json.py \
--chatgpt_judge_files_folder ./ChatGPT_Judge \
--merge_file ./Video_Bench_Input.json
```

The scripts used in the scoring commands above are [Step2_chatgpt_judge.py](https://github.com/PKU-YuanGroup/Video-Bench/blob/main/Step2_chatgpt_judge.py) and [Step3_merge_into_one_json.py](https://github.com/PKU-YuanGroup/Video-Bench/blob/main/Step3_merge_into_one_json.py).

ChatGPT may treat the answers to some questions as malformed, so Step2_chatgpt_judge.py needs to be run multiple times to make sure that every question is checked by ChatGPT.

 ## FAQ

-1. When running evaluation with Harness or VLMEvalKit, an `SSLError` is raised while loading a HuggingFace dataset:
+1. When running evaluation with Harness, an `SSLError` is raised while loading a HuggingFace dataset:

    Refer to the [SSL Error workaround](https://stackoverflow.com/questions/71692354/facing-ssl-error-with-huggingface-pretrained-models).

-   Note: disabling SSL verification is risky and may expose you to man-in-the-middle (MITM) attacks. It is recommended only in test environments or over connections you fully trust.

-2. When evaluating with the MVBench dataset in VLMEvalKit, an `AssertionError` occurs:

-   The upstream `VLMEvalKit` framework has a known issue when running the `MVBench` dataset. Refer to the upstream [issue](https://github.com/open-compass/VLMEvalKit/issues/888) to modify the source, or delete the files produced during the evaluation (in the directory specified by `--work_dir`, by default the `outputs` folder under the current working directory) and run the evaluation again.
\ No newline at end of file
+   Note: disabling SSL verification is risky and may expose you to man-in-the-middle (MITM) attacks. It is recommended only in test environments or over connections you fully trust.
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/guide/inference.md b/docs/mindformers/docs/source_zh_cn/guide/inference.md
index e415405812fad70292d44d74ce1d59cd56ffc396..2e4a4ee9dfa98114420e55270d3aacd94248e700 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/inference.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/inference.md
@@ -150,33 +150,6 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \

The inference results are viewed in the same way as for multi-card inference.

### Multimodal Inference

Taking the `cogvlm2-llama3-chat-19B` model as an example, an inference task can be launched as follows:

Modify the model configuration file [predict_cogvlm2_image_llama3_chat_19b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml):

```yaml
model:
  model_config:
    use_past: True     # enable incremental inference
    is_dynamic: False  # disable dynamic shape

  tokenizer:
    vocab_file: "/{path}/tokenizer.model"  # path to the tokenizer file
```

Launch the inference script. The first `--predict_data` value is the image path and the second is the text prompt; `--modal_type` gives the modality of each input, 'image' for the image path and 'text' for the text.

```shell
python run_mindformer.py \
 --config configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml \
 --run_mode predict \
 --predict_data "/path/image.jpg" "Please describe this image." \
 --modal_type image text \
 --load_checkpoint /{path}/cogvlm2-image-llama3-chat.ckpt
```

## More Information

For more inference examples of different models, see [Models Supported by MindSpore Transformers](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html).
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst
index 75541a558d98250f1bbe02342bc77e5bc141987a..308840a5968ff8899f4505dc5dd34f22214809d1 100644
--- a/docs/mindformers/docs/source_zh_cn/index.rst
+++ b/docs/mindformers/docs/source_zh_cn/index.rst
@@ -139,7 +139,6 @@ MindSpore Transformers功能特性说明

 - Model Development

   - `Development Migration `_
-  - `Multimodal Understanding Model Development `_

 - Accuracy Comparison

@@ -213,7 +212,6 @@ FAQ

    advanced_development/precision_optimization
    advanced_development/performance_optimization
    advanced_development/dev_migration
-   advanced_development/multi_modal_dev
    advanced_development/accuracy_comparison
    advanced_development/api