diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
index 89b49353632646b5a740bcecc2f140cad00c86d1..df8bde8f05cb7b6512df92f42a3749d40972c905
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -123,7 +123,7 @@ MindSpore LLM inference with the framework mainly depends on the MindSpore open-
 
 You can also install the Python package adapted to your environment by referring to the official installation document. For details, see `MindSpore Installation `_.
 
-MindSpore inference mainly runs on the Ascend AI Processor environment. You need to install the corresponding Ascend development environment. For details, see the following:
+MindSpore inference mainly runs on the Ascend AI Processor environment. You need to install the corresponding Ascend development environment. For details, see `CANN Software Installation `_:
 
 .. code:: shell
 
@@ -157,6 +157,7 @@ After the download is complete, the following file tree structure should be disp
     ls
 
     |- config.json
+    |- generation_config.json
     |- LICENSE
     |- merges.txt
     |- model-00001-of-00004.safetensors
@@ -418,7 +419,7 @@ The MindSpore LLM supports the following quantization technologies to improve th
 
 To quantize a model using golden-stick, perform the following steps:
 
-1. **Weight quantization**: Use a quantization algorithm to convert the model weight data from float16 to int8.
+1. **Model quantization**: Use a quantization algorithm to convert the model data type from a high-bit type (e.g., float16) to a low-bit type (e.g., int8 or int4).
 
 2. **Model inference**: Load the standard model, quantize the model network (by inserting corresponding quantization operators), load the quantized weight, and call the model inference.
 
diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md b/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md
index ee68bae89dadddca47c52ad1ae0ccc0505bccd39..e86477f43172293d14cc5db248bce2139b7dec23
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_network_develop.md
@@ -606,7 +606,7 @@ class Qwen2Model(nn.Cell):
 
 ### KVCacheManager
 
-Since KVCache is usually used to optimize LLMs, to use KVCache with FlashAttention and lashPagedAttention provided by MindSpore, some parameters need to be specified additionally, including:
+Since KVCache is usually used to optimize LLMs, to use KVCache with FlashAttention and PagedAttention provided by MindSpore, some parameters need to be specified additionally, including:
 
 - **k_cache & v_cache**: The kv_cache object can be considered as a cache table, which is used to store the keys and values in the previous iteration. In the next iteration, these values can be directly read, avoiding repeated computation of the keys and values of the first *n* words, thereby improving performance.
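The k_cache & v_cache bullet above describes KVCache as a cache table that keeps the keys and values of already-processed tokens so they are not recomputed on the next step. The following minimal, framework-agnostic Python sketch illustrates that idea only; the class and method names are hypothetical and are not the MindSpore KVCacheManager API.

```python
import numpy as np


class SimpleKVCache:
    """Toy per-layer cache table, for illustration only (not the MindSpore API)."""

    def __init__(self, num_layers, max_seq_len, num_heads, head_dim, dtype=np.float16):
        shape = (max_seq_len, num_heads, head_dim)
        # One pre-allocated K buffer and one V buffer per layer.
        self.k_cache = [np.zeros(shape, dtype=dtype) for _ in range(num_layers)]
        self.v_cache = [np.zeros(shape, dtype=dtype) for _ in range(num_layers)]
        self.lens = [0] * num_layers  # number of tokens already cached per layer

    def update(self, layer_id, new_k, new_v):
        """Append the keys/values of newly computed tokens behind the cached prefix."""
        start = self.lens[layer_id]
        n = new_k.shape[0]
        self.k_cache[layer_id][start:start + n] = new_k
        self.v_cache[layer_id][start:start + n] = new_v
        self.lens[layer_id] = start + n

    def get(self, layer_id):
        """Read back all cached keys/values so earlier tokens are never recomputed."""
        end = self.lens[layer_id]
        return self.k_cache[layer_id][:end], self.v_cache[layer_id][:end]
```

At each decoding step a layer first appends the new tokens' keys and values and then reads the whole cached sequence back for attention, which is exactly the reuse the bullet describes.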
diff --git a/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md b/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md
index c5757040bea62b824561935dbea5de628450203b..6d674715e9cbb1e7aeda1724a72baa256ed58734
--- a/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md
+++ b/tutorials/source_en/model_infer/ms_infer/ms_infer_parallel_infer.md
@@ -12,7 +12,7 @@ The pressure on graphics memory makes it challenging for a single device to comp
 
 Before performing model sharding and parallelism, you need to analyze the parallelism based on the model structure to determine which layers can be parallelized and how to divide the model to achieve better performance acceleration. To achieve better acceleration, the parallelized part needs to be computed separately, minimizing the impact on other parts. The following uses the Qwen2 model structure as an example to analyze the parallelism of the main network structure:
 
-- **Embedding**: The embedding layer is actually a gather operation and can be parallelized properly regardless of the sharding dimension (**hidden_dim** or **num_embeddings**). Because **all_reduce** (reducing overheads of data arrangement) can be better performed based on **num_embedding**, sharding is performed based on the **num_embeddings** dimension.
+- **Embedding**: The embedding layer is actually a gather operation and can be parallelized properly regardless of the sharding dimension (**hidden_dim** or **num_embeddings**). Because **all_reduce** (reducing overheads of data arrangement) can be better performed based on **num_embeddings**, sharding is performed based on the **num_embeddings** dimension.
 
 - **Attention**: The Qwen2 model uses the attention computation method of GQA, that is, multiple independent attention computations. Therefore, the query, key, and value can be parallelized separately by column. However, the number of shards must be exactly divided by the number of attention heads.
 
@@ -739,7 +739,7 @@ class Qwen2MLP(nn.Cell):
             param_dtype=config.param_dtype,
             bias=False
         )
--        self.qgate_proj = Qwen2Linear(
+-        self.gate_proj = Qwen2Linear(
 +        self.gate_proj = Qwen2ColParallelLinear(
             input_size=config.hidden_size,
             output_size=config.intermediate_size,
@@ -923,7 +923,7 @@ class Qwen2ForCausalLM(nn.Cell):
         for path in glob(weight_path + "/*.safetensors"):
             weight_dict.update(ms.load_checkpoint(path, format="safetensors"))
 
--        ms.load_param_into_net(self, weight_dict, strict_load=False)
+-        load_param_into_net(self, weight_dict, strict_load=False)
 +        param_dict = self.parameters_dict()
 +
 +        for (name, weight) in weight_dict.items():
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
index 4b423724ce0f460d21c0e1aea6f4ee6b757da0e0..e275f3cb8b6506bc5bcc52de3c9759df36fbe7e4
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_model_infer.rst
@@ -123,7 +123,7 @@ MindSpore大语言模型带框架推理主要依赖MindSpore开源软件,用
 
 同时,用户也可以参考官方安装文档来安装自己环境适配的Python包,具体见 `MindSpore安装 `_。
 
-由于MindSpore推理主要支持Ascend芯片环境上运行,还需要安装相应的Ascend开发环境,具体可以参考:
+由于MindSpore推理主要支持Ascend芯片环境上运行,还需要安装相应的Ascend开发环境,具体可以参考 `快速安装CANN `_ :
 
 .. code:: shell
 
@@ -157,6 +157,7 @@ MindSpore大语言模型带框架推理主要依赖MindSpore开源软件,用
     ls
 
     |- config.json
+    |- generation_config.json
     |- LICENSE
     |- merges.txt
     |- model-00001-of-00004.safetensors
@@ -401,7 +402,7 @@ MindSpore大语言模型带框架推理主要依赖MindSpore开源软件,用
 
 2048]矩阵乘法,可以切分成2个[1024, 4096]和[4096, 1024]的矩阵乘法。而不同的切分可能带来不同的并行计算性能。对于Qwen、LLAMA这类大语言模型而言,其切分主要包含在Attention中query、key、value这些数据的linear操作上。
 
-2. **权重适配**:除了模型结构的并行化改造外,由于模型计算中的权重也被切分了,因此在模型加载的时候,相关的权重也要进行切分,以尽量减少不必要权重加载占用显存。对于大语言模型而言,主要的权重都集中在embbeding和linear两个网络层中,因此权重加载的适配主要涉及这两个模块改造。
+2. **权重适配**:除了模型结构的并行化改造外,由于模型计算中的权重也被切分了,因此在模型加载的时候,相关的权重也要进行切分,以尽量减少不必要权重加载占用显存。对于大语言模型而言,主要的权重都集中在embedding和linear两个网络层中,因此权重加载的适配主要涉及这两个模块改造。
 
 3. **模型推理**:和单卡推理不同,多卡推理需要同时启动多个进程来并行进行推理,因此在启动模型推理时,相比于直接运行脚本,多卡推理需要一次运行多组相关进程。MindSpore框架为用户提供了msrun的并行运行工具,具体使用方法可以参考 `构建可并行的大语言模型网络 <./ms_infer_parallel_infer.md>`_。
 
@@ -416,9 +417,9 @@ MindSpore大语言模型支持以下量化技术,来提升模型推理性能
 
 - **KVCache量化**:在大语言模型推理场景下,除了模型权重以外,KVCache也占用了大量显存,因此对KVCache进行量化,降低其显存消耗,也能够有效提升整体的吞吐量。MindSpore大语言模型支持对KVCache做float16到int8的量化,通过FA和PA适配,将量化和反量化融合到算子内部,降低量化带来的开销,实现整体吞吐量提升。
 
-使用golen-stick进行模型量化主要分为以下两步:
+使用golden-stick进行模型量化主要分为以下两步:
 
-1. **权重量化**:利用量化算法,将模型的权重数据从float16转化成int8数据。
+1. **模型量化**:利用量化算法,将模型的数据类型从高bit类型(如float16)转化成低bit类型(如int8或int4)。
 
 2. **模型推理**:加载标准模型,将模型网络进行量化改造(插入相应量化算子),加载量化后的权重,调用模型推理。
 
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md
index af67261a948dca8ccae318d1c5389e0dcd3ab153..c064ebcf71a528b055258fe435f3a82f88474cf8
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_network_develop.md
@@ -606,7 +606,7 @@ class Qwen2Model(nn.Cell):
 
 ### KVCacheManager
 
-由于大语言模型通常会使用KVCache优化,MindSpore提供的FlashAttention和lashPagedAttention需要和KVCache配合使用,需要额外传入一些参数,其中主要包括:
+由于大语言模型通常会使用KVCache优化,MindSpore提供的FlashAttention和PagedAttention需要和KVCache配合使用,需要额外传入一些参数,其中主要包括:
 
 - **k_cache & v_cache**:kv_cache对象可以理解为是一个缓存表,用于保存上一次迭代中的key和value值。在下一次迭代时,可以直接读取这些值,从而避免重复计算前n个词的key和value,以提升性能。
 
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
index b942851a18ada553ee3f822e133742ce8a6b0be8..9e0e769c39d1a2ce165870105eef471a02e1f4b6
--- a/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/ms_infer_parallel_infer.md
@@ -12,7 +12,7 @@
 
 在对模型进行并行切分前,需要先根据模型的结构特征来进行并行分析,确认网络中哪些层可以并行,以及如何切分能够获得比较好的性能加速。为了要能够获得好的加速效果,并行切分的部分就需要尽可能的独立计算互不影响。以Qwen2模型结构为例,我们对其主要的网络结构进行并行分析:
 
-- **Embedding**:Embedding层实际上是一个gather操作,不管是按hidden_dim还是num_embeddings维度切分,都可以比较好地进行并行计算。由于按照num_embedding可以更好地进行all_reduce(减少数据排布的开销),此处我们按照num_embeddings维度进行切分。
+- **Embedding**:Embedding层实际上是一个gather操作,不管是按hidden_dim还是num_embeddings维度切分,都可以比较好地进行并行计算。由于按照num_embeddings可以更好地进行all_reduce(减少数据排布的开销),此处我们按照num_embeddings维度进行切分。
 
 - **Attention**:Qwen2模型使用了GQA的Attention计算方法,即有多个独立的Attention计算,因此我们可以按照列维度将query、key、value切分开来单独计算,但是需要保证切分能够被Attention的head数整除。
 
@@ -751,7 +751,7 @@ class Qwen2MLP(nn.Cell):
             param_dtype=config.param_dtype,
             bias=False
         )
--        self.qgate_proj = Qwen2Linear(
+-        self.gate_proj = Qwen2Linear(
 +        self.gate_proj = Qwen2ColParallelLinear(
             input_size=config.hidden_size,
             output_size=config.intermediate_size,
@@ -937,7 +937,7 @@ class Qwen2ForCausalLM(nn.Cell):
         for path in glob(weight_path + "/*.safetensors"):
             weight_dict.update(ms.load_checkpoint(path, format="safetensors"))
 
--        ms.load_param_into_net(self, weight_dict, strict_load=False)
+-        load_param_into_net(self, weight_dict, strict_load=False)
 +        param_dict = self.parameters_dict()
 +
 +        for (name, weight) in weight_dict.items():