From 15f44aa924b8f2fb063c69bdaed497f7c7856f13 Mon Sep 17 00:00:00 2001 From: huan <3174348550@qq.com> Date: Tue, 8 Jul 2025 10:39:46 +0800 Subject: [PATCH] update en files 2.6.0 --- .../model_infer/ms_infer/llm_inference_overview.md | 13 +++++++------ .../source_en/model_infer/ms_infer/model_dev.md | 10 +++++----- .../source_en/model_infer/ms_infer/parallel.md | 10 ++++++---- .../source_en/model_infer/ms_infer/quantization.md | 2 +- .../model_infer/ms_infer/weight_prepare.md | 14 +++++--------- .../source_en/model_infer/ms_infer/weight_split.md | 2 +- 6 files changed, 25 insertions(+), 26 deletions(-) diff --git a/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md b/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md index a3ec3ec54f..2d9b85d123 100644 --- a/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md +++ b/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md @@ -5,6 +5,7 @@ ## Background At the end of 2022, with the release of OpenAI's ChatGPT, a new research direction emerged in the AI domain, that is, large language models based on the transformer structure. These models exhibited capabilities beyond expectations and achieved impressive results in various tests, quickly becoming the research focus of AI. + One significant research direction in large language models is improving their cost-effectiveness in practical applications. - A large language model usually has tens of billions of parameters, and the computation workload for a single model inference process is extremely high and requires massive compute resources. As a result, AI service providers find that the cost of a large language model inference is very high and cannot be effectively applied to real-world scenarios. @@ -145,7 +146,7 @@ config = "/path/to/llama2_7b.yaml" model = AutoModel.from_config(config) ``` -In this code, tokenizer.model is a file downloaded along with the weights from the Hugging Face official website, containing the token mapping table, while config is the model configuration file from MindFormers, which includes the relevant parameters for running the Llama2 model. You can obtain the sample from [predict_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/configs/llama2/predict_llama2_7b.yaml) (Note: Change the CKPT weight path to the actual weight path). For details, see [Llama 2](https://gitee.com/mindspore/mindformers/blob/r1.5.0/docs/model_cards/llama2.md#-18). +In this code, tokenizer.model is a tokenizer file downloaded along with the weights from the Hugging Face official website, containing the token mapping table, while config is the model configuration file from MindFormers, which includes the relevant parameters for running the Llama2 model. You can obtain the sample from [predict_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/r1.5.0/configs/llama2/predict_llama2_7b.yaml) (Note: Change the CKPT weight path to the actual weight path). For details, see [Llama 2](https://gitee.com/mindspore/mindformers/blob/r1.5.0/docs/model_cards/llama2.md#-18). In addition, if you have special requirements for the model or have a deep understanding of deep learning, you can build your own model. For details, see [Model Development](./model_dev.md). @@ -209,11 +210,11 @@ In addition to utilizing the capabilities provided by the MindFormers model suit For large language models with many model parameters, such as Llama2-70B and Qwen2-72B, the parameter scale usually exceeds the memory capacity of a GPU or NPU. 
Therefore, multi-device parallel inference is required. MindSpore large language model inference can shard the original large language model into N parallel models so that they can be executed on multiple devices in parallel. This not only enables inference for super-large models but also enhances performance by leveraging more resources from the multiple devices. The model scripts provided by the MindFormers model suite can be used to shard a model into multi-device models for execution. You can perform the following steps to deploy the model on multiple devices. -- **Weight sharding**: Because the original weight files are too large, when executing on multiple devices, the overall weight needs to be sharded into multiple weights for each device and passed to the model process corresponding to each device. You can use the script in the MindFormers model suite to perform weight sharding. For details, see [Weight Conversion](https://gitee.com/mindspore/mindformers/blob/r1.5.0/docs/feature_cards/Transform_Ckpt.md). +1. **Weight sharding**: Because the original weight files are too large, when executing on multiple devices, the overall weight needs to be sharded into multiple weights for each device and passed to the model process corresponding to each device. You can use the script in the MindFormers model suite to perform weight sharding. For details, see [Weight Conversion](https://gitee.com/mindspore/mindformers/blob/r1.5.0/docs/feature_cards/Transform_Ckpt.md). Here is an example of how to shard the Llama2-7B model for parallel execution on two devices. - - **Generating a target parallel strategy file** When MindSpore performs sharding, you need to specify the sharding mode. The information is stored in the parallel strategy file and can be generated using the [run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/r1.5.0/run_mindformer.py) script. Open the YAML file corresponding to the Llama2-7B model and modify the following configuration: + 1. **Generating a target parallel strategy file** When MindSpore performs sharding, you need to specify the sharding mode. The information is stored in the parallel strategy file and can be generated using the [run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/r1.5.0/run_mindformer.py) script. Open the YAML file corresponding to the Llama2-7B model and modify the following configuration: - Set only_save_strategy to True, indicating that generating parallel sharding strategy files is enabled. @@ -229,7 +230,7 @@ For large language models with many model parameters, such as Llama2-70B and Qwe msrun is a parallel execution tool provided by MindSpore. The input_data parameter can accept any content to ensure that the model process can be executed properly. After the program is executed, the strategy directory is generated in the output directory, that is, the parallel sharding strategy file for two-device parallel inference. - - **Sharding model weight CKPT file**: Call the conversion script to shard and generate the weight CKPT files. For details, see [transform_checkpoint.py](https://gitee.com/mindspore/mindformers/blob/r1.5.0/mindformers/tools/ckpt_transform/transform_checkpoint.py). + 2. **Sharding model weight CKPT file**: Call the conversion script to shard and generate the weight CKPT files. For details, see [transform_checkpoint.py](https://gitee.com/mindspore/mindformers/blob/r1.5.0/mindformers/tools/ckpt_transform/transform_checkpoint.py). 
Run the following command to shard the weight into two-device parallel weights:
@@ -240,7 +241,7 @@ For large language models with many model parameters, such as Llama2-70B and Qwe

Here, src_checkpoint is the path to the source CKPT file. In the example, full sharding is used. Therefore, the source strategy file does not need to be passed. However, the path must point to the CKPT file, not to a directory. dst_checkpoint is the target directory of the sharding result. After the sharding is complete, two subdirectories rank_0 and rank_1 are generated to store the weight CKPT files of different devices. dst_strategy is the path of the strategy file generated in the previous step.

-- **Model adaptation**: When the MindSpore large language model is running on multiple devices, model parallelism is usually used. Therefore, the original model needs to be sharded based on the number of devices. For example, the matrix multiplication of [1024, 4096] and [4096, 2048] can be sharded into two matrix multiplications of [1024, 4096] and [4096, 1024] respectively. Different sharding may bring different parallel computing performance. The MindFormers model provides proven excellent sharding solution for MindSpore model and uses the MindSpore parallel framework for sharding. The following is part of the sharding code in the model:
+2. **Model adaptation**: When the MindSpore large language model is running on multiple devices, model parallelism is usually used. Therefore, the original model needs to be sharded based on the number of devices. For example, the matrix multiplication of [1024, 4096] and [4096, 2048] can be sharded into two matrix multiplications of [1024, 4096] and [4096, 1024], respectively. Different sharding strategies may bring different parallel computing performance. MindFormers provides a proven, well-tuned sharding solution for MindSpore models and uses the MindSpore parallel framework for sharding. The following is part of the sharding code in the model:

    ```python
    if not (_get_parallel_mode() in (ParallelMode.AUTO_PARALLEL,) and _is_sharding_propagation()):
@@ -277,7 +278,7 @@ For large language models with many model parameters, such as Llama2-70B and Qwe

 For details about the network script code, see [llama.py](https://gitee.com/mindspore/mindformers/blob/r1.5.0/mindformers/models/llama/llama.py).

-- **Model inference**: Different from single-device inference, multi-device inference requires multiple processes to be started at the same time for parallel inference. Therefore, compared with directly running a script, multi-device inference requires multiple groups of related processes to be run at a time. The MindSpore framework provides the msrun parallel running tool. The usage method is as follows.
+3. **Model inference**: Different from single-device inference, multi-device inference requires multiple processes to be started at the same time for parallel inference. Therefore, compared with directly running a script, multi-device inference requires multiple groups of related processes to be run at a time. The MindSpore framework provides the msrun parallel running tool. The usage method is as follows.
```shell
msrun --worker_num=2 --local_worker_num=2 run_mindformer.py --config "/path/to/llama2_7b.yaml" --input_data "hello"
diff --git a/tutorials/source_en/model_infer/ms_infer/model_dev.md b/tutorials/source_en/model_infer/ms_infer/model_dev.md
index 764a83859e..475f3b67dd 100644
--- a/tutorials/source_en/model_infer/ms_infer/model_dev.md
+++ b/tutorials/source_en/model_infer/ms_infer/model_dev.md
@@ -4,7 +4,7 @@

## Large Language Model Backbone Network

-Currently, the backbone networks of mainstream large language models are mainly based on the transformer structure. The most important part is the computation of the self-attention mechanism. The following figure uses the Llama2 large language model as an example to describe the backbone network structure.
+Currently, the backbone networks of mainstream large language models are mainly based on the Transformer structure. The most important part is the computation of the self-attention mechanism. The following figure uses the Llama2 large language model as an example to describe the backbone network structure.

![LLAMA network structure](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.6.0/tutorials/source_zh_cn/model_infer/ms_infer/images/llm_llama_network_arch.png)

@@ -12,15 +12,15 @@ The core layer of Llama2 consists of the following parts:

- **Embedding**: converts the index corresponding to each token into a vector to implement feature dispersion. Similar to onehot vectorization, the embedding weight is involved in the training process to better adapt to the context semantics in the language model. The process is implemented by an Embedding operator.

-- **DecodeLayer**: transformer structure, which is the key for the computation of the large language model. Generally, multi-layer computation is performed based on different configurations. Each layer is actually a transformer structure, which is one of the cores of the foundation language model.
+- **DecodeLayer**: Transformer structure, which is the key for the computation of the large language model. Generally, multi-layer computation is performed based on different configurations. Each layer is actually a Transformer structure, which is one of the cores of the foundation language model.

-- **RmsNorm&Linear**: outputs linear normalization layer. After computation of the transformer structure, the result is normalized to the same dimension as the model vocabulary, and the probability distribution of each token is returned.
+- **RmsNorm&Linear**: outputs linear normalization layer. After computation of the Transformer structure, the result is normalized to the same dimension as the model vocabulary, and the probability distribution of each token is returned.

-To build a network with MindSpore large language model inference, you can assemble the operators provided by MindSpore. The following is an example to describe how to build a typical transformer model.
+To build a network with MindSpore large language model inference, you can assemble the operators provided by MindSpore. The following is an example to describe how to build a typical Transformer model.

## TransformerModel

-In a typical transformer model, each layer consists of the normalization, attention, residual connection, and multi-layer perception (MLP). Both attention and MLP meet the requirements of two continuous matrix multiplications.
+In a typical Transformer model, each layer consists of the normalization, attention, residual connection, and multi-layer perceptron (MLP). 
Both attention and MLP meet the requirements of two continuous matrix multiplications. 1. Attention diff --git a/tutorials/source_en/model_infer/ms_infer/parallel.md b/tutorials/source_en/model_infer/ms_infer/parallel.md index 4cd1ca902b..d5146d3e8b 100644 --- a/tutorials/source_en/model_infer/ms_infer/parallel.md +++ b/tutorials/source_en/model_infer/ms_infer/parallel.md @@ -2,8 +2,9 @@ [![](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.6.0/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.6.0/tutorials/source_en/model_infer/ms_infer/parallel.md) -In recent years, with the rapid development of deep learning technologies, especially the emergence of large-scale pre-trained models (such as ChatGPT, LLaMA, and Pangu), the AI field has made significant progress. However, as model sizes continue to expand, the computing resources required by these large models, particularly GPU memory, are growing exponentially. For example, the Pangu model with 71 billion parameters requires approximately 142 GB of GPU memory at half-precision (FP16). In addition, the increasing sequence length of large models places immense pressure on GPU memory. -The constraints of GPU memory not only affect model loading but also limit batch sizes. Smaller batch sizes may lead to decreased inference efficiency, consequently impacting the overall throughput of the system. +In recent years, with the rapid development of deep learning technologies, especially the emergence of large-scale pre-trained models (such as ChatGPT, LLaMA, and Pangu), the AI field has made significant progress. However, as model sizes continue to expand, the computing resources required by these large models, particularly GPU memory, are growing exponentially. For example, the Pangu model with 71 billion parameters requires approximately 142 GB of GPU memory at half-precision (FP16). + +In addition, the increasing sequence length of large models places immense pressure on GPU memory. The constraints of GPU memory not only affect model loading but also limit batch sizes. Smaller batch sizes may lead to decreased inference efficiency, consequently impacting the overall throughput of the system. The pressure on GPU memory makes it challenging for a single device to complete inference tasks within a reasonable time frame, and parallel computing has become a key strategy to address this challenge. @@ -80,14 +81,15 @@ Starting with the original implementation of `nn.Dense` in MindSpore, we can bui `ColumnParallelLinear` class calculates and initializes the sharded weights' shape based on the number of devices used for model parallelism. Column-wise means to shard `out_channels`. During the model's forward pass, MatMul is called to compute the parallel results. Finally, an `AllGather` operation can be optionally performed on the parallel results to obtain the complete output. The MindSpore training and inference integrated framework supports enabling `infer_boost`. This parameter activates the high-performance self-developed operator library within the MindSpore framework. To enable this mode, you need to: - - Set variables. + + 1. Set variables. ```python from mindspore import set_context set_context(jit_config={"jit_level": 'O0', "infer_boost": 'on'}) ``` - - Set system environment variables. + 2. Set system environment variables. 
```bash export ASCEND_HOME_PATH={$ascend_custom_path} diff --git a/tutorials/source_en/model_infer/ms_infer/quantization.md b/tutorials/source_en/model_infer/ms_infer/quantization.md index 429220ab39..1502f9c607 100644 --- a/tutorials/source_en/model_infer/ms_infer/quantization.md +++ b/tutorials/source_en/model_infer/ms_infer/quantization.md @@ -60,7 +60,7 @@ rtn.convert(net) ms.save_checkpoint(net.parameters_dict(), './simplenet_rtn.ckpt') ``` -1. Use [nn.Cell](https://www.mindspore.cn/docs/en/r2.0/api_python/nn/mindspore.nn.Cell.html) to define the network. After training the model, obtain the floating-point weight of the model, and then load the floating-point weight during inference. The above example simplifies the process by directly creating a network and using the initial floating-point weight for quantization. +1. Use [nn.Cell](https://www.mindspore.cn/docs/en/r2.6.0/api_python/nn/mindspore.nn.Cell.html) to define the network. After training the model, obtain the floating-point weight of the model, and then load the floating-point weight during inference. The above example simplifies the process by directly creating a network and using the initial floating-point weight for quantization. 2. Use PTQConfig to set mode to quantization mode, set the backend to Ascend, and perform 8-bit quantization on the weight. For details, see [PTQConfig Configuration Description](#ptqconfig-configuration-description). 3. Use the apply API to convert the network into a pseudo-quantized network and collect statistics on the quantized object based on the configuration in `PTQConfig`. 4. Use the convert API to quantize the pseudo-quantized network in the previous step to obtain the quantized network. diff --git a/tutorials/source_en/model_infer/ms_infer/weight_prepare.md b/tutorials/source_en/model_infer/ms_infer/weight_prepare.md index 7ee4e7eb22..9727f84580 100644 --- a/tutorials/source_en/model_infer/ms_infer/weight_prepare.md +++ b/tutorials/source_en/model_infer/ms_infer/weight_prepare.md @@ -46,11 +46,7 @@ In the preceding information, pytorch_model-00001-of-00002.bin and pytorch_model To convert Hugging Face weight files into MindSpore weight files, perform the following steps: -1. Load the Hugging Face weight files into a list of PyTorch tensors. -2. Convert the PyTorch tensor list into a list of MindSpore tensors. -3. Save the MindSpore tensor list as a MindSpore CKPT weight file. - -- **Install the Python dependency package**: Since the conversion involves both Hugging Face and MindSpore, you need to install the respective Python packages, primarily including transformers, torch, and mindspore. +1. **Install the Python dependency package**: Since the conversion involves both Hugging Face and MindSpore, you need to install the respective Python packages, primarily including transformers, torch, and mindspore. ```shell pip install torch @@ -58,7 +54,7 @@ To convert Hugging Face weight files into MindSpore weight files, perform the fo pip install transformers ``` -- **Load the Hugging Face model**: Use the transformers library to load the Llama2 weight files and model, and retrieve the list of weights which is actually a list of PyTorch tensor objects. +2. **Load the Hugging Face model**: Use the transformers library to load the Llama2 weight files and model, and retrieve the list of weights which is actually a list of PyTorch tensor objects. 
```python import os @@ -74,9 +70,9 @@ To convert Hugging Face weight files into MindSpore weight files, perform the fo print(f"name: {name}") ``` -Executing this python code will load the weights of Llama2 and print out the names of each weight, indicating that the model has been successfully loaded. + Executing this python code will load the weights of Llama2 and print out the names of each weight, indicating that the model has been successfully loaded. -- **Converting torch.Tensor to mindspore.Tensor**: Use NumPy as an intermediary to convert the PyTorch tensor objects into MindSpore tensor objects. In addition to the data, the names of the MindSpore weights differ from those in Hugging Face, so a mapping relationship needs to be recorded. +3. **Converting torch.Tensor to mindspore.Tensor**: Use NumPy as an intermediary to convert the PyTorch tensor objects into MindSpore tensor objects. In addition to the data, the names of the MindSpore weights differ from those in Hugging Face, so a mapping relationship needs to be recorded. - Weight name mapping: Replace the Hugging Face weight names with the MindSpore weight names. @@ -133,7 +129,7 @@ Executing this python code will load the weights of Llama2 and print out the nam print(ckpt_list) ``` -- **Saving a MindSpore CKPT weight file**: Call the MindSpore API to save tensors as a CKPT weight file. +4. **Saving a MindSpore CKPT weight file**: Call the MindSpore API to save tensors as a CKPT weight file. ```python ms_ckpt_path="/path/to/mindspore/ckpt" diff --git a/tutorials/source_en/model_infer/ms_infer/weight_split.md b/tutorials/source_en/model_infer/ms_infer/weight_split.md index 2d482a6991..afbcfa7f79 100644 --- a/tutorials/source_en/model_infer/ms_infer/weight_split.md +++ b/tutorials/source_en/model_infer/ms_infer/weight_split.md @@ -151,6 +151,6 @@ def save_strategy_file(state_dict, strategy_file_name): raise e ``` -After the parallel strategy file of the inference network is obtained, the training weight can be converted into the weight required for inference according to the method of executing distributed checkpoint transformation. +After the parallel strategy file of the inference network is obtained, the training weight can be converted into the weight required for inference according to the distributed checkpoint transformation method. For details about the end-to-end weight sharding code project, see [Weight Sharding](https://gitee.com/mindspore/docs/blob/r2.6.0/docs/sample_code/infer_code/param_split.py). -- Gitee
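As a compact recap of the weight preparation flow covered in weight_prepare.md above, the sketch below strings the three conversion stages together: read each PyTorch tensor, pass it through NumPy into a MindSpore tensor under the mapped name, and save the result with `mindspore.save_checkpoint`. It is a minimal illustration under stated assumptions, not the MindFormers conversion script: the `pt_to_ms_checkpoint` helper, the toy state dict, the output path, and the single name-mapping entry are placeholders for demonstration, and a real Llama2 conversion needs the full Hugging Face-to-MindSpore name mapping shown in the tutorial.

```python
import torch
import mindspore as ms


def pt_to_ms_checkpoint(pt_state_dict, name_map, ckpt_path):
    """Convert a PyTorch state dict to a MindSpore CKPT file (illustrative sketch)."""
    ckpt_list = []
    for pt_name, pt_tensor in pt_state_dict.items():
        # NumPy cannot represent bfloat16, so cast to float32 before leaving PyTorch.
        np_data = pt_tensor.detach().to(torch.float32).cpu().numpy()
        # Rename to the MindSpore-side weight name; keep the original name if unmapped.
        ms_name = name_map.get(pt_name, pt_name)
        # Store as float16, matching the half-precision weights used in the tutorial.
        ckpt_list.append({"name": ms_name, "data": ms.Tensor(np_data, dtype=ms.float16)})
    # save_checkpoint accepts a list of {"name": ..., "data": ...} entries.
    ms.save_checkpoint(ckpt_list, ckpt_path)


if __name__ == "__main__":
    # Toy inputs only: the tensor shape and the single mapping entry are illustrative.
    toy_state_dict = {"model.embed_tokens.weight": torch.randn(8, 4)}
    toy_name_map = {"model.embed_tokens.weight": "model.tok_embeddings.embedding_weight"}
    pt_to_ms_checkpoint(toy_state_dict, toy_name_map, "./toy_llama2.ckpt")
```

Keeping the name mapping as plain data makes the same helper reusable for other checkpoints by swapping in a different mapping table.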