diff --git a/MindIE/MultiModal/CogVideoX/README.md b/MindIE/MultiModal/CogVideoX/README.md index d87e92f0b84e241b7d354f7500bcc85e853e2067..fe95f0bf41f6e80169ae922b822c26fe1beceb93 100644 --- a/MindIE/MultiModal/CogVideoX/README.md +++ b/MindIE/MultiModal/CogVideoX/README.md @@ -19,8 +19,9 @@ hardwares: ### 1.1 获取CANN&MindIE安装包&环境准备 - 设备支持: Atlas 800I A2 (64G) / Atlas 800T A2设备:CogVideoX-5b支持1、2、4、8卡推理,CogVideoX-2b支持1、2、4卡推理 -- [Atlas 800I A2/Atlas 800T A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) -- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) +- [Atlas 800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) +- [Atlas 800T A2](https://www.hiascend.com/developer/download/community/result?module=pt+cann&product=4&model=26) +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -50,7 +51,7 @@ chmod +x ./Ascend-mindie_${version}_linux-${arch}.run cd /usr/local/Ascend/mindie && source set_env.sh # 方式二:指定路径安装 -./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} +./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} # 设置环境变量 cd ${AieInstallPath}/mindie && source set_env.sh ``` diff --git a/MindIE/MultiModal/CogView3-Plus-3B/README.md b/MindIE/MultiModal/CogView3-Plus-3B/README.md index 913fde80786592c473930a193531b7acb8a13232..0d669616112dc1bb32e64c0a0802a6dabab258b9 100644 --- a/MindIE/MultiModal/CogView3-Plus-3B/README.md +++ b/MindIE/MultiModal/CogView3-Plus-3B/README.md @@ -4,14 +4,15 @@ | 配套 | 版本 | 环境准备指导 | | ----- | ----- |-----| - | Python | 3.10.12 | - | + | Python | 3.10 / 3.11 | - | | torch | 2.1.0 | - | ### 1.1 获取CANN&MindIE安装包&环境准备 - 设备支持 Atlas 800I A2/Atlas 800T A2设备:支持的卡数为1 -- [Atlas 800I A2/Atlas 800T A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) -- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) +- [Atlas 800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) +- [Atlas 800T A2](https://www.hiascend.com/developer/download/community/result?module=pt+cann&product=4&model=26) +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -41,7 +42,7 @@ chmod +x ./Ascend-mindie_${version}_linux-${arch}.run cd /usr/local/Ascend/mindie && source set_env.sh # 方式二:指定路径安装 -./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath} +./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} # 设置环境变量 cd ${AieInstallPath}/mindie && source set_env.sh ``` @@ -81,62 +82,6 @@ pip install -r requirements.txt https://huggingface.co/THUDM/CogView3-Plus-3B/tree/main ``` -- 在main路径下,修改model_index.json文件,更改结果如下: -```shell -{ - "_class_name": "CogView3PlusPipeline", - "_diffusers_version": "0.31.0.dev0", - "scheduler": [ - "cogview3plus", - "CogVideoXDDIMScheduler" - ], - "text_encoder": [ - "transformers", - "T5EncoderModel" - ], - "tokenizer": [ - "transformers", - "T5Tokenizer" - ], - "transformer": [ - "cogview3plus", - "CogView3PlusTransformer2DModel" - ], - "vae": [ - "diffusers", - "AutoencoderKL" - ] -} -``` - -- 
在main/transformer路径下,修改config.json文件,更改结果如下: -```shell -{ - "_class_name": "CogView3PlusTransformer2DModel", - "_diffusers_version": "0.31.0.dev0", - "attention_head_dim": 40, - "condition_dim": 256, - "in_channels": 16, - "num_attention_heads": 64, - "num_layers": 30, - "out_channels": 16, - "patch_size": 2, - "pooled_projection_dim": 1536, - "pos_embed_max_size": 128, - "sample_size": 128, - "text_embed_dim": 4096, - "time_embed_dim": 512, - "use_cache": true, - "cache_interval": 2, - "cache_start": 1, - "num_cache_layer": 11, - "cache_start_steps": 10, - "useagb": false, - "pab": 2, - "total_step": 50 -} -``` - #### 2. 各模型的配置文件、权重文件的层级样例如下所示: ```commandline |----main @@ -162,19 +107,7 @@ cd cogview3 path="/data/CogView3B" ``` -#### 3. 有损加速算法选择 - -修改权重文件CogView3B/transformer/config.json中的`use_cache`和`useagb`参数,对应算法关系如下: - -| 算法类型 | use_cache | useagb | -| :------: |:----:|:----:| -| 不使用加速算法 | false | false | -| DiT Cache | true | false | -| AGB Cache | false | true | - -**注意**:在32G的服务器上,可开启DiT Cache算法,开启AGB Cache算法可能会报显存不足的错误,因为AGB算法对显存要求更高。在64G机器上,两种Cache算法皆可开启。 - -#### 4. 执行命令,进行推理: +#### 3. 执行命令,进行推理: ```shell python inference_cogview3plus.py \ --model_path ${path} \ @@ -183,7 +116,8 @@ python inference_cogview3plus.py \ --height 1024 \ --num_inference_steps 50 \ --dtype bf16 \ - --device_id 0 + --device_id 0 \ + --cache_algorithm attention ``` 参数说明: - model_path:权重路径,包含scheduler、text_encoder、tokenizer、transformer、vae,5个模型的配置文件及权重。 @@ -193,7 +127,9 @@ python inference_cogview3plus.py \ - num_inference_steps:推理迭代步数。 - dtype: 数据类型。目前只支持bf16。 - device_id:推理设备ID。 +- cache_algorithm:默认为None,可选择attention,即使用AGBCache算法,注意是有损的加速算法。 +**注意**:在32G的服务器上,开启cache算法可能会报显存不足的错误;在64G机器上,可正常开启cache算法。 **注意**:本仓库模型,是对开源模型进行优化。用户在使用时,应对开源代码函数的变量范围,类型进行校验,避免出现变量超出范围、除零等操作。 @@ -208,19 +144,7 @@ cd cogview3 path="/data/CogView3B" ``` -#### 3. 有损加速算法选择 - -修改权重文件CogView3B/transformer/config.json中的`use_cache`和`useagb`参数,对应算法关系如下: - -| 算法类型 | use_cache | useagb | -| :------: |:----:|:----:| -| 不使用加速算法 | false | false | -| DiT Cache | true | false | -| AGB Cache | false | true | - -**注意**:在32G的服务器上,batch_size需要等于1,否则会报显存不足的错误;在64G机器上,batch_size可为2,可开启Cache算法。 - -#### 4. 执行命令,进行推理: +#### 3. 
执行命令,进行推理: ```shell python inference_cogview3plus.py \ --model_path ${path} \ @@ -230,7 +154,8 @@ python inference_cogview3plus.py \ --num_inference_steps 50 \ --dtype bf16 \ --batch_size 2 \ - --device_id 0 + --device_id 0 \ + --cache_algorithm attention ``` 参数说明: - model_path:权重路径,包含scheduler、text_encoder、tokenizer、transformer、vae,5个模型的配置文件及权重。 @@ -241,6 +166,9 @@ python inference_cogview3plus.py \ - dtype: 数据类型。目前只支持bf16。 - batch_size: 推理时的batch_size。 - device_id:推理设备ID。 +- cache_algorithm:默认为None,可选择attention,即使用AGBCache算法,注意是有损的加速算法。 + +**注意**:在32G的服务器上,batch_size需要等于1,否则会报显存不足的错误;在64G机器上,batch_size可为2,可开启cache算法。 ### 3.4 精度验证 @@ -277,7 +205,8 @@ python3 inference_cogview3plus.py \ --width 1024 \ --batch_size 1 \ --seed 42 \ - --device_id 0 + --device_id 0 \ + --cache_algorithm attention ``` 参数说明: - model_path:权重路径,包含scheduler、text_encoder、tokenizer、transformer、vae,5个模型的配置文件及权重。 @@ -291,6 +220,7 @@ python3 inference_cogview3plus.py \ - batch_size:模型batch size。 - seed:随机种子。 - device_id:推理设备ID。 +- cache_algorithm:默认为None,可选择attention,即使用AGBCache算法,注意是有损的加速算法。 执行完成后在`./results_PartiPrompts`目录下生成推理图片,在当前目录生成一个`image_info_PartiPrompts.json`文件,记录着图片和prompt的对应关系,并在终端显示推理时间。 @@ -307,7 +237,8 @@ python3 inference_cogview3plus.py \ --width 1024 \ --batch_size 1 \ --seed 42 \ - --device_id 0 + --device_id 0 \ + --cache_algorithm attention ``` 参数说明: - model_path:权重路径,包含scheduler、text_encoder、tokenizer、transformer、vae,5个模型的配置文件及权重。 @@ -321,6 +252,7 @@ python3 inference_cogview3plus.py \ - batch_size:模型batch size。 - seed:随机种子。 - device_id:推理设备ID。 +- cache_algorithm:默认为None,可选择attention,即使用AGBCache算法,注意是有损的加速算法。 执行完成后在`./results_hpsv2`目录下生成推理图片,在当前目录生成一个`image_info_hpsv2.json`文件,记录着图片和prompt的对应关系,并在终端显示推理时间。 @@ -377,8 +309,7 @@ python3 hpsv2_score.py \ | 硬件形态 | 迭代次数 | 加速算法 | 平均耗时 | CLIP_score | HPSV2_score | | :------: |:----:|:----:|:----:|:----:|:----:| | Atlas 800T A2 (8*64G) 单卡 | 50 | 无 | 27.588s | 0.367 | 0.2879729 | -| Atlas 800T A2 (8*64G) 单卡 | 50 | DiT Cache | 23.639s | 0.367 | 0.2878573 | -| Atlas 800T A2 (8*64G) 单卡 | 50 | AGB | 17.219s | 0.367 | 0.2879835 | +| Atlas 800T A2 (8*64G) 单卡 | 50 | AGBCache | 17.219s | 0.367 | 0.2879835 | ## 四、优化指南 diff --git a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/__init__.py b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/__init__.py index 602ad432a01cefe3824b16f4ee90ce56f18c3aab..4d25f1e889c0d9b1c4ff83c21c30acf8bcaacce8 100644 --- a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/__init__.py +++ b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/__init__.py @@ -1,3 +1,2 @@ from .normalization import CogView3PlusAdaLayerNormZeroTextImage, AdaLayerNormContinuous -from .embeddings import CogView3CombinedTimestepSizeEmbeddings, CogView3PlusPatchEmbed -from .linear import QKVLinear \ No newline at end of file +from .embeddings import CogView3CombinedTimestepSizeEmbeddings, CogView3PlusPatchEmbed \ No newline at end of file diff --git a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/linear.py b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/linear.py deleted file mode 100644 index d242d17c2e83b7d27d86f1132a736951963b71bf..0000000000000000000000000000000000000000 --- a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/layers/linear.py +++ /dev/null @@ -1,48 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - - -import torch -import torch.nn as nn - - -class QKVLinear(nn.Module): - def __init__(self, attention_dim, hidden_size, qkv_bias=True, device=None, dtype=None): - super(QKVLinear, self).__init__() - self.attention_dim = attention_dim - self.hidden_size = hidden_size - self.qkv_bias = qkv_bias - - factory_kwargs = {"device": device, "dtype": dtype} - - self.weight = nn.Parameter(torch.empty([self.attention_dim, 3 * self.hidden_size], **factory_kwargs)) - if self.qkv_bias: - self.bias = nn.Parameter(torch.empty([3 * self.hidden_size], **factory_kwargs)) - - def forward(self, hidden_states): - - if not self.qkv_bias: - qkv = torch.matmul(hidden_states, self.weight) - else: - qkv = torch.addmm( - self.bias, - hidden_states.view(hidden_states.size(0) * hidden_states.size(1), hidden_states.size(2)), - self.weight, - beta=1, - alpha=1 - ) - - return qkv \ No newline at end of file diff --git a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/attention_processor.py b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/attention_processor.py index eb29215831e757fe0a6eddc7040449f45148a514..dbef9116287b1a41e4c22bacf9697f99a52b2d7a 100644 --- a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/attention_processor.py +++ b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/attention_processor.py @@ -23,7 +23,7 @@ import torch_npu from diffusers.utils import logging from diffusers.utils.torch_utils import maybe_allow_in_graph -from ..layers import QKVLinear +from mindiesd.layers.linear import QKVLinear logger = logging.get_logger(__name__) # pylint: disable=invalid-name @@ -304,11 +304,12 @@ class CogVideoXAttnProcessor2_0: attention_mask = attention_mask.view(batch_size, attn.heads, -1, attention_mask.shape[-1]) B, S, _ = hidden_states.shape - qkv = attn.to_qkv(hidden_states) - inner_dim = qkv.shape[-1] // 3 + query, key, value = attn.to_qkv(hidden_states) + inner_dim = key.shape[-1] head_dim = inner_dim // attn.heads - qkv_shape = (B, S, 3, attn.heads, head_dim) - query, key, value = qkv.view(qkv_shape).permute(2, 0, 3, 1, 4).contiguous().unbind(0) + query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2) + key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2) + value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2) if attn.norm_q is not None: query = attn.norm_q(query) diff --git a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/transformer_cogview3plus.py b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/transformer_cogview3plus.py index e61259874834a0664ab52c17e2f1826e3355a8ee..f1abc4981c01299ab14567068105da142c690e18 100644 --- a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/transformer_cogview3plus.py +++ b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/models/transformer_cogview3plus.py @@ -39,25 +39,9 @@ class CogView3PlusTransformerBlock(nn.Module): dim: int = 2560, num_attention_heads: int = 64, attention_head_dim: int = 40, - time_embed_dim: int = 512, - useagb: bool = True, - pab: int = 2, - total_step: int = 50 + time_embed_dim: int = 512 ): super().__init__() - self.useagb = useagb - 
self.pab = pab - self.total_step = total_step - - self.attn_count = 0 - self.last_attn_x_image = None - self.last_attn_x_prompt = None - self.attn_alpha_image = 0 - self.attn_alpha_prompt = 0 - self.last_attn_image = None - self.last_attn_prompt = None - self.last_ff_image = None - self.last_ff_prompt = None self.norm1 = CogView3PlusAdaLayerNormZeroTextImage(embedding_dim=time_embed_dim, dim=dim) @@ -77,6 +61,7 @@ class CogView3PlusTransformerBlock(nn.Module): self.norm2_context = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-5) self.ff = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate") + self.cache = None def forward( self, @@ -85,128 +70,46 @@ class CogView3PlusTransformerBlock(nn.Module): emb: torch.Tensor, ) -> torch.Tensor: text_seq_length = encoder_hidden_states.size(1) - - if self.useagb: - if self.attn_count > 0: - diff_x_image = hidden_states - self.last_attn_x_image - diff_x_prompt = encoder_hidden_states - self.last_attn_x_prompt - - self.last_attn_x_image = hidden_states - self.last_attn_x_prompt = encoder_hidden_states - - lower_bound = int(self.total_step / 5) - 0.5 - upper_bound = self.total_step - 1.5 - if (self.attn_count % self.pab != 0) and (lower_bound < self.attn_count < upper_bound): - broadcast_attn = 1 - else: - broadcast_attn = 0 - - if broadcast_attn == 1: - attn_hidden_states = self.last_attn_image + self.attn_alpha_image * diff_x_image - attn_encoder_hidden_states = self.last_attn_prompt + self.attn_alpha_prompt * diff_x_prompt - else: - # norm & modulate - norm_hidden_states, chunk_params = self.norm1(hidden_states, encoder_hidden_states, emb) - - gate_msa = chunk_params.gate_msa - shift_mlp = chunk_params.shift_mlp - scale_mlp = chunk_params.scale_mlp - gate_mlp = chunk_params.gate_mlp - norm_encoder_hidden_states = chunk_params.context - c_gate_msa = chunk_params.c_gate_msa - c_shift_mlp = chunk_params.c_shift_mlp - c_scale_mlp = chunk_params.c_scale_mlp - c_gate_mlp = chunk_params.c_gate_mlp - - # attention - attn_hidden_states, attn_encoder_hidden_states = self.attn1( - hidden_states=norm_hidden_states, encoder_hidden_states=norm_encoder_hidden_states - ) - - attn_hidden_states = gate_msa.unsqueeze(1) * attn_hidden_states - attn_encoder_hidden_states = c_gate_msa.unsqueeze(1) * attn_encoder_hidden_states - - # calculate alpha - if lower_bound < self.attn_count < upper_bound: - diff_image = attn_hidden_states - self.last_attn_image - diff_prompt = attn_encoder_hidden_states - self.last_attn_prompt - - self.attn_alpha_image = ((diff_x_image / 100) * (diff_image / 100)).sum() / \ - ((diff_x_image / 100) ** 2).sum() - self.attn_alpha_prompt = ((diff_x_prompt / 100) * (diff_prompt / 100)).sum() / \ - ((diff_x_prompt / 100) ** 2).sum() - else: - self.attn_alpha_image = 0 - self.attn_alpha_prompt = 0 - - self.last_attn_image = attn_hidden_states - self.last_attn_prompt = attn_encoder_hidden_states - - hidden_states = hidden_states + attn_hidden_states - encoder_hidden_states = encoder_hidden_states + attn_encoder_hidden_states - - if broadcast_attn == 1: - hidden_states = hidden_states + self.last_ff_image - encoder_hidden_states = encoder_hidden_states + self.last_ff_prompt - else: - # norm & modulate - norm_hidden_states = self.norm2(hidden_states) - norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None] - - norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states) - norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + \ - c_shift_mlp[:, None] - - # 
feed-forward - norm_hidden_states = torch.cat([norm_encoder_hidden_states, norm_hidden_states], dim=1) - ff_output = self.ff(norm_hidden_states) - - ff_image = gate_mlp.unsqueeze(1) * ff_output[:, text_seq_length:] - ff_prompt = c_gate_mlp.unsqueeze(1) * ff_output[:, :text_seq_length] - - hidden_states = hidden_states + ff_image - encoder_hidden_states = encoder_hidden_states + ff_prompt - self.last_ff_image = ff_image - self.last_ff_prompt = ff_prompt - - # 更新self.attn_count - self.attn_count = (self.attn_count + 1) % self.total_step - else: - # norm & modulate - norm_hidden_states, chunk_params = self.norm1(hidden_states, encoder_hidden_states, emb) - - gate_msa = chunk_params.gate_msa - shift_mlp = chunk_params.shift_mlp - scale_mlp = chunk_params.scale_mlp - gate_mlp = chunk_params.gate_mlp - norm_encoder_hidden_states = chunk_params.context - c_gate_msa = chunk_params.c_gate_msa - c_shift_mlp = chunk_params.c_shift_mlp - c_scale_mlp = chunk_params.c_scale_mlp - c_gate_mlp = chunk_params.c_gate_mlp - - # attention + # norm & modulate + norm_hidden_states, chunk_params = self.norm1(hidden_states, encoder_hidden_states, emb) + + gate_msa = chunk_params.gate_msa + shift_mlp = chunk_params.shift_mlp + scale_mlp = chunk_params.scale_mlp + gate_mlp = chunk_params.gate_mlp + norm_encoder_hidden_states = chunk_params.context + c_gate_msa = chunk_params.c_gate_msa + c_shift_mlp = chunk_params.c_shift_mlp + c_scale_mlp = chunk_params.c_scale_mlp + c_gate_mlp = chunk_params.c_gate_mlp + + # attention + if self.cache is None: attn_hidden_states, attn_encoder_hidden_states = self.attn1( hidden_states=norm_hidden_states, encoder_hidden_states=norm_encoder_hidden_states ) + else: + attn_hidden_states, attn_encoder_hidden_states = self.cache.apply(self.attn1.forward, + hidden_states=norm_hidden_states, encoder_hidden_states=norm_encoder_hidden_states + ) - hidden_states = hidden_states + gate_msa.unsqueeze(1) * attn_hidden_states - encoder_hidden_states = encoder_hidden_states + c_gate_msa.unsqueeze(1) * attn_encoder_hidden_states + hidden_states = hidden_states + gate_msa.unsqueeze(1) * attn_hidden_states + encoder_hidden_states = encoder_hidden_states + c_gate_msa.unsqueeze(1) * attn_encoder_hidden_states - # norm & modulate - norm_hidden_states = self.norm2(hidden_states) - norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None] + # norm & modulate + norm_hidden_states = self.norm2(hidden_states) + norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None] - norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states) - norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None] + norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states) + norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None] - # feed-forward - norm_hidden_states = torch.cat([norm_encoder_hidden_states, norm_hidden_states], dim=1) - ff_output = self.ff(norm_hidden_states) + # feed-forward + norm_hidden_states = torch.cat([norm_encoder_hidden_states, norm_hidden_states], dim=1) + ff_output = self.ff(norm_hidden_states) - hidden_states = hidden_states + gate_mlp.unsqueeze(1) * ff_output[:, text_seq_length:] - encoder_hidden_states = encoder_hidden_states + c_gate_mlp.unsqueeze(1) * ff_output[:, :text_seq_length] + hidden_states = hidden_states + gate_mlp.unsqueeze(1) * ff_output[:, text_seq_length:] + encoder_hidden_states = encoder_hidden_states + 
c_gate_mlp.unsqueeze(1) * ff_output[:, :text_seq_length] if hidden_states.dtype == torch.float16: hidden_states = hidden_states.clip(-65504, 65504) @@ -232,14 +135,7 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin): time_embed_dim: int = 512, condition_dim: int = 256, pos_embed_max_size: int = 128, - use_cache: bool = False, - cache_interval: int = 2, - cache_start: int = 1, - num_cache_layer: int = 11, - cache_start_steps: int = 10, - useagb: bool = True, - pab: int = 2, - total_step: int = 50, + sample_size: int = 128, ): super().__init__() self.out_channels = out_channels @@ -272,9 +168,6 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin): num_attention_heads=num_attention_heads, attention_head_dim=attention_head_dim, time_embed_dim=time_embed_dim, - useagb=useagb, - pab=pab, - total_step=total_step ) for _ in range(num_layers) ] @@ -297,14 +190,6 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin): self.v_weight_cache = None self.v_bias_cache = None - self.use_cache = use_cache - self.cache_interval = cache_interval - self.cache_start = cache_start - self.num_cache_layer = num_cache_layer - self.cache_start_steps = cache_start_steps - - self.delta_cache = None - self.delta_encoder_cache = None @property def attn_processors(self) -> Dict[str, AttentionProcessor]: @@ -358,14 +243,30 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin): def forward( self, - states, + hidden_states: torch.Tensor, + encoder_hidden_states: torch.Tensor, timestep: torch.LongTensor, original_size: torch.Tensor, target_size: torch.Tensor, crop_coords: torch.Tensor, + return_dict: bool = True, ) -> Union[torch.Tensor, Transformer2DModelOutput]: - hidden_states = states[0] - encoder_hidden_states = states[1] + """ + The [`CogView3PlusTransformer2DModel`] forward method. + + Args: + hidden_states (`torch.Tensor`): + Input `hidden_states` of shape `(batch size, channel, height, width)`. + encoder_hidden_states (`torch.Tensor`): + Conditional embeddings (embeddings computed from the input conditions such as prompts) of shape + `(batch_size, sequence_len, text_embed_dim)` + timestep (`torch.LongTensor`): + Used to indicate denoising step. + + Returns: + `torch.Tensor` or [`~models.transformer_2d.Transformer2DModelOutput`]: + The denoised latents using provided inputs as conditioning. 
+ """ height, width = hidden_states.shape[-2:] text_seq_length = encoder_hidden_states.shape[1] @@ -377,7 +278,28 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin): encoder_hidden_states = hidden_states[:, :text_seq_length] hidden_states = hidden_states[:, text_seq_length:] - hidden_states, encoder_hidden_states = self._forward_blocks(hidden_states, encoder_hidden_states, emb, states[2]) + for _, block in enumerate(self.transformer_blocks): + if self.training and self.gradient_checkpointing: + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs) + + return custom_forward + + ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {} + hidden_states, encoder_hidden_states = torch.utils.checkpoint.checkpoint( + create_custom_forward(block), + hidden_states, + encoder_hidden_states, + emb, + **ckpt_kwargs, + ) + else: + hidden_states, encoder_hidden_states = block( + hidden_states=hidden_states, + encoder_hidden_states=encoder_hidden_states, + emb=emb, + ) hidden_states = self.norm_out(hidden_states, emb) hidden_states = self.proj_out(hidden_states) # (batch_size, height*width, patch_size*patch_size*out_channels) @@ -395,66 +317,10 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin): shape=(hidden_states.shape[0], self.out_channels, height * patch_size, width * patch_size) ) - return Transformer2DModelOutput(sample=output) - - # forward blocks in range [start_idx, end_idx), then return input and output - def _forward_blocks_range(self, hidden_states, encoder_hidden_states, emb, start_idx, end_idx, **kwargs): - for _, block in enumerate(self.transformer_blocks[start_idx: end_idx]): - hidden_states, encoder_hidden_states = block( - hidden_states=hidden_states, - encoder_hidden_states=encoder_hidden_states, - emb=emb, - ) - - return hidden_states, encoder_hidden_states - - def _forward_blocks(self, hidden_states, encoder_hidden_states, emb, t_idx): - num_blocks = len(self.transformer_blocks) - - if not self.use_cache or (t_idx < self.cache_start_steps): - hidden_states, encoder_hidden_states = self._forward_blocks_range( - hidden_states, - encoder_hidden_states, - emb, - 0, - num_blocks - ) - else: - # infer [0, cache_start) - hidden_states, encoder_hidden_states = self._forward_blocks_range( - hidden_states, - encoder_hidden_states, - emb, - 0, - self.cache_start - ) - # infer [cache_start, cache_end) - cache_end = np.minimum(self.cache_start + self.num_cache_layer, num_blocks) - hidden_states_before_cache = hidden_states.clone() - encoder_hidden_states_before_cache = encoder_hidden_states.clone() - if t_idx % self.cache_interval == (self.cache_start_steps % self.cache_interval): - hidden_states, encoder_hidden_states = self._forward_blocks_range( - hidden_states, - encoder_hidden_states, - emb, - self.cache_start, - cache_end - ) - self.delta_cache = hidden_states - hidden_states_before_cache - self.delta_encoder_cache = encoder_hidden_states - encoder_hidden_states_before_cache - else: - hidden_states = hidden_states_before_cache + self.delta_cache - encoder_hidden_states = encoder_hidden_states_before_cache + self.delta_encoder_cache - # infer [cache_end, num_blocks) - hidden_states, encoder_hidden_states = self._forward_blocks_range( - hidden_states, - encoder_hidden_states, - emb, - cache_end, - num_blocks - ) + if not return_dict: + return (output,) - return hidden_states, encoder_hidden_states + return Transformer2DModelOutput(sample=output) def load_weights(self, state_dict, 
shard=False): with torch.no_grad(): diff --git a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/pipeline/pipeline_cogview3plus.py b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/pipeline/pipeline_cogview3plus.py index 3ea10a212a898b8b8cb9638defa0518cefa955ed..01877a9e6eb8189e659a22120a7c003fa8bc3ec2 100644 --- a/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/pipeline/pipeline_cogview3plus.py +++ b/MindIE/MultiModal/CogView3-Plus-3B/cogview3plus/pipeline/pipeline_cogview3plus.py @@ -309,11 +309,13 @@ class CogView3PlusPipeline(DiffusionPipeline): # predict noise model_output noise_pred = self.transformer( - states=(latent_model_input, prompt_embeds, i), + hidden_states=latent_model_input, + encoder_hidden_states=prompt_embeds, timestep=timestep, original_size=original_size, target_size=target_size, crop_coords=crops_coords_top_left, + return_dict=False, )[0] noise_pred = noise_pred.float() diff --git a/MindIE/MultiModal/CogView3-Plus-3B/inference_cogview3plus.py b/MindIE/MultiModal/CogView3-Plus-3B/inference_cogview3plus.py index 0052f511f5e3694ca422b96d3fdd76209fd2808b..a9912983cbbb0187f44008188d747a5b2edaad15 100644 --- a/MindIE/MultiModal/CogView3-Plus-3B/inference_cogview3plus.py +++ b/MindIE/MultiModal/CogView3-Plus-3B/inference_cogview3plus.py @@ -23,8 +23,9 @@ import json import torch -from cogview3plus import CogView3PlusPipeline, set_random_seed +from cogview3plus import CogView3PlusPipeline, set_random_seed, CogView3PlusTransformer2DModel from cogview3plus.utils.file_utils import standardize_path +from mindiesd import CacheAgent, CacheConfig logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @@ -190,6 +191,7 @@ def parse_arguments(): parser.add_argument("--dtype", type=str, default="bf16", help="bf16 or fp16") parser.add_argument("--seed", type=int, default=None, help="Random seed") parser.add_argument("--device_id", type=int, default=0, help="NPU device id") + parser.add_argument('--cache_algorithm', type=str, default="None", help="The type of optimization algorithm") return parser.parse_args() @@ -206,7 +208,27 @@ def infer(args): # Load the pre-trained model with the specified precision args.model_path = standardize_path(args.model_path) - pipe = CogView3PlusPipeline.from_pretrained(args.model_path, torch_dtype=dtype).to("npu") + pipe = CogView3PlusPipeline.from_pretrained(args.model_path, torch_dtype=dtype) + transformer = CogView3PlusTransformer2DModel.from_pretrained(os.path.join(args.model_path, 'transformer'), torch_dtype=dtype) + pipe.transformer = transformer + pipe = pipe.to("npu") + + # attention cache + if args.cache_algorithm == "attention": + steps_count = args.num_inference_steps + blocks_count = pipe.transformer.config.num_layers + config = CacheConfig( + method="attention_cache", + blocks_count=blocks_count, + steps_count=steps_count, + step_start=15, + step_end=47, + step_interval=5 + ) + agent = CacheAgent(config) + pipe.transformer.use_cache = True + for block in pipe.transformer.transformer_blocks: + block.cache = agent use_time = 0 prompt_loader = PromptLoader(args.prompt_file, diff --git a/MindIE/MultiModal/Flux.1-DEV/README.md b/MindIE/MultiModal/Flux.1-DEV/README.md index 5569bab5cdc0e6a4bd5cd3a4818795d0d275d682..a182483f1fb598605bf20269bef83f43a6337345 100644 --- a/MindIE/MultiModal/Flux.1-DEV/README.md +++ b/MindIE/MultiModal/Flux.1-DEV/README.md @@ -11,7 +11,7 @@ - 设备支持: Atlas 800I A2推理设备:支持的卡数为1或2 - [Atlas 800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) -- 
[环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -46,7 +46,7 @@ chmod +x ./Ascend-mindie_${version}_linux-${arch}.run cd /usr/local/Ascend/mindie && source set_env.sh # 方式二:指定路径安装 -./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath} +./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} # 设置环境变量 cd ${AieInstallPath}/mindie && source set_env.sh ``` diff --git a/MindIE/MultiModal/HunyuanDiT/README.md b/MindIE/MultiModal/HunyuanDiT/README.md index b99301a884712ed01bbc359c1e5db29db6cf3ec3..d91a96e37a38976ac7889b6103708b31aee5f713 100644 --- a/MindIE/MultiModal/HunyuanDiT/README.md +++ b/MindIE/MultiModal/HunyuanDiT/README.md @@ -11,7 +11,7 @@ - 设备支持: Atlas 800I A2推理设备:支持的卡数为1 - [Atlas 800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) -- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -41,7 +41,7 @@ chmod +x ./Ascend-mindie_${version}_linux-${arch}.run cd /usr/local/Ascend/mindie && source set_env.sh # 方式二:指定路径安装 -./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath} +./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} # 设置环境变量 cd ${AieInstallPath}/mindie && source set_env.sh ``` diff --git a/MindIE/MultiModal/HunyuanVideo/README.md b/MindIE/MultiModal/HunyuanVideo/README.md index 0e8f13a159586da2c5c7b02dc28aa069d11718c3..29f998ebcdcfc0ba172bf5e72c534fb327f21152 100644 --- a/MindIE/MultiModal/HunyuanVideo/README.md +++ b/MindIE/MultiModal/HunyuanVideo/README.md @@ -9,9 +9,8 @@ ### 1.1 获取CANN&MindIE安装包&环境准备 - 设备支持 -Atlas 800I A2(8\*64G)推理设备:当前支持的卡数:1、2、3、4、6、8。 -Atlas 800I A3(16\*64G)推理设备:当前支持的卡数:1、2、3、4、6、8、16。 -- [Atlas 800I A2(8*64G)环境准备指导](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) +- [Atlas 800I A2(8*64G)](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32)推理设备:当前支持的卡数:1、2、3、4、6、8、16 +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -249,7 +248,7 @@ torchrun --nproc_per_node=8 sample_video.py \ #### 3.5.2 算法优化 -一、使用attentioncache +使用attentioncache 执行命令: ```shell export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" @@ -299,52 +298,6 @@ torchrun --nproc_per_node=8 sample_video.py \ - ulysses-degree:ulysses并行使用的卡数 - ring-degree: ring并行使用的卡数 -### 3.6 16卡性能测试 -仅支持Atlas 800I A3 -执行命令: -```shell -export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" -export TASK_QUEUE_ENABLE=2 -export CPU_AFFINITY_CONF=1 -export TOKENIZERS_PARALLELISM=false -export ALGO=0 -torchrun --nproc_per_node=16 sample_video.py \ - --model-base HunyuanVideo \ - --dit-weight HunyuanVideo/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt \ - --vae-path HunyuanVideo/hunyuan-video-t2v-720p/vae \ - --text-encoder-path HunyuanVideo/text_encoder \ - --text-encoder-2-path HunyuanVideo/clip-vit-large-patch14 \ - --model-resolution "720p" \ - --video-size 720 1280 \ - --video-length 129 
\ - --infer-steps 50 \ - --prompt "A cat walks on the grass, realistic style." \ - --seed 42 \ - --flow-reverse \ - --ulysses-degree 8 \ - --ring-degree 2 \ - --vae-parallel \ - --save-path ./results -``` -参数说明: -- ALGO: 为0表示默认FA算子;设置为1表示使用高性能FA算子 -- nproc_per_node: 并行推理的总卡数。 -- model-base: 权重路径,包含vae、text_encoder、Tokenizer、Transformer和Scheduler五个模型的配置文件及权重。 -- dit-weight: dit的权重路径 -- vae-path: VAE的权重路径 -- text-encoder-path: text_encoder的权重路径 -- text-encoder-2-path: text_encoder_2的权重路径 -- model-resolution: 分辨率 -- video-size: 生成视频的高和宽 -- video-length: 总帧数 -- infer-steps: 推理步数 -- prompt: 文本提示词 -- seed: 随机种子 -- vae-parallel: vae部分使能并行,目前只支持8卡、16卡并行时使用 -- save-path: 生成的视频的保存路径 -- flow-reverse:是否进行反向采样 -- ulysses-degree:ulysses并行使用的卡数 -- ring-degree: ring并行使用的卡数 ## 精度指标 我们使用prompts.txt测试了seed42-46五组种子的视频,并测试了vbench并取平均值,6个指标如下: diff --git a/MindIE/MultiModal/Janus-Pro/README.md b/MindIE/MultiModal/Janus-Pro/README.md index 52d242d018e3b103081ec8684925a42739a90217..aa36610f05b2e705e96030345b705c816e9134d0 100644 --- a/MindIE/MultiModal/Janus-Pro/README.md +++ b/MindIE/MultiModal/Janus-Pro/README.md @@ -25,7 +25,7 @@ Atlas 800I A2推理设备:支持的卡数最小为1 Atlas 300I Duo推理卡:支持的卡数最小为1 Atlas 300 V:支持的卡数最小为1 - [Atlas 800I A2/Atlas 300I Duo/Atlas 300 V](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann) -- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -55,7 +55,7 @@ chmod +x ./Ascend-mindie_${version}_linux-${arch}.run cd /usr/local/Ascend/mindie && source set_env.sh # 方式二:指定路径安装 -./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath} +./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} # 设置环境变量 cd ${AieInstallPath}/mindie && source set_env.sh ``` diff --git a/MindIE/MultiModal/OpenSora-1.2/README.md b/MindIE/MultiModal/OpenSora-1.2/README.md index a439a60331bb934f297f893274c505e04ec47fe2..7b28a836379c1d4837ec5f02aaf4d65ca4476ee8 100644 --- a/MindIE/MultiModal/OpenSora-1.2/README.md +++ b/MindIE/MultiModal/OpenSora-1.2/README.md @@ -11,7 +11,7 @@ - 设备支持: Atlas 800I A2推理设备:支持的卡数最小为1 - [Atlas 800I A2](https://www.hiascend.com/developer/download/community/result?module=pt+ie+cann&product=4&model=32) -- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha002/softwareinst/instg/instg_0001.html) +- [环境准备指导](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha001/softwareinst/instg/instg_0003.html) ### 1.2 CANN安装 ```shell @@ -50,7 +50,7 @@ chmod +x ./Ascend-mindie_${version}_linux-${arch}.run cd /usr/local/Ascend/mindie && source set_env.sh # 方式二:指定路径安装 -./Ascend-mindie_${version}_linux-${arch}.run --install-path=${AieInstallPath} +./Ascend-mindie_${version}_linux-${arch}.run --install --install-path=${AieInstallPath} # 设置环境变量 cd ${AieInstallPath}/mindie && source set_env.sh ```
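
---

For reference, the `--cache_algorithm attention` path that this patch introduces for CogView3-Plus-3B reduces to wiring one shared `mindiesd` `CacheAgent` into every transformer block, which the block then consumes through `self.cache.apply(self.attn1.forward, ...)`. The sketch below mirrors the hunks in `inference_cogview3plus.py` above; the weight path, prompt, output filename, and the final diffusers-style `pipe(...)` call are illustrative assumptions and not part of the patch itself.

```python
# Minimal sketch of the new AGBCache (attention cache) wiring, assuming the
# CogView3B weights are laid out as in the README and mindiesd is installed.
import torch
from cogview3plus import CogView3PlusPipeline, CogView3PlusTransformer2DModel
from mindiesd import CacheAgent, CacheConfig

model_path = "/data/CogView3B"  # assumption: local weight directory from the README example

pipe = CogView3PlusPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.transformer = CogView3PlusTransformer2DModel.from_pretrained(
    f"{model_path}/transformer", torch_dtype=torch.bfloat16
)
pipe = pipe.to("npu")

# --cache_algorithm attention: one CacheAgent shared by all blocks; the values
# below (step_start/step_end/step_interval) are the ones set in the patch for a
# 50-step schedule.
config = CacheConfig(
    method="attention_cache",
    blocks_count=pipe.transformer.config.num_layers,
    steps_count=50,  # num_inference_steps
    step_start=15,
    step_end=47,
    step_interval=5,
)
agent = CacheAgent(config)
pipe.transformer.use_cache = True
for block in pipe.transformer.transformer_blocks:
    block.cache = agent  # consumed as self.cache.apply(self.attn1.forward, ...)

# Illustrative generation call (argument names assume the diffusers-style
# __call__ of CogView3PlusPipeline; see inference_cogview3plus.py for the
# exact invocation used in the repo).
image = pipe(
    prompt="A photo of a red panda reading a book",
    height=1024,
    width=1024,
    num_inference_steps=50,
).images[0]
image.save("cogview3_agbcache.png")
```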