From 21c40d47d2f131b5da1fdeda555f82be659bbd5c Mon Sep 17 00:00:00 2001 From: zhaoranzhi <1641621634@qq.com> Date: Wed, 25 May 2022 13:17:09 +0800 Subject: [PATCH 1/4] =?UTF-8?q?RefineNet=E6=A8=A1=E5=9E=8B=E7=A6=BB?= =?UTF-8?q?=E7=BA=BF=E6=8E=A8=E7=90=86?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../cv/segmentation/RefineNet/README.md | 75 +++++++++++-------- .../contrib/cv/segmentation/RefineNet/env.sh | 6 -- .../RefineNet/test/eval_acc_perf.sh | 2 +- .../cv/segmentation/RefineNet/test/parse.py | 2 +- .../cv/segmentation/RefineNet/test/pth2om.sh | 6 +- 5 files changed, 49 insertions(+), 42 deletions(-) delete mode 100644 ACL_PyTorch/contrib/cv/segmentation/RefineNet/env.sh diff --git a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/README.md b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/README.md index bc041e0d57..2d1ee6c365 100644 --- a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/README.md +++ b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/README.md @@ -56,7 +56,7 @@ commit_id: 8f25c076016e61a835551493aae303e81cf36c53 ### 2.1 深度学习框架 ``` -CANN 5.0.1 +CANN 5.1.RC1 pytorch >= 1.5.0 torchvision >= 0.6.0 onnx >= 1.7.0 @@ -65,10 +65,10 @@ onnx >= 1.7.0 ### 2.2 python第三方库 ``` -numpy == 1.21.2 -Pillow == 8.3.1 -opencv-python == 3.4.4.19 -albumentations == 0.4.5 +numpy == 1.21.6 +Pillow == 9.1.0 +opencv-python == 4.1.0.25 +albumentations == 1.1.0 densetorch == 0.0.2 ``` @@ -106,7 +106,16 @@ cd .. **说明:** >注意目前ATC支持的onnx算子版本为11 -4.执行pth2onnx脚本,生成onnx模型文件 + +4.通过将densetorch包下载到本地安装 + +```bash +git clone https://github.com/drsleep/densetorch.git +cd densetorch +pip install -e . 
+``` + +5.执行pth2onnx脚本,生成onnx模型文件 ```bash python3.7 RefineNet_pth2onnx.py --input-file model/RefineNet_910.pth.tar --output-file model/RefineNet_910.onnx @@ -117,12 +126,12 @@ python3.7 RefineNet_pth2onnx.py --input-file model/RefineNet_910.pth.tar --outpu 1.设置环境变量 ``` -source env.sh +source /usr/local/Ascend/ascend-toolkit/set_env.sh ``` -2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考CANN 5.0.1 开发辅助工具指南 (推理) 01 +2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考CANN 5.1.RC1 开发辅助工具指南 (推理) 01 ```BASH -atc --framework=5 --model=model/RefineNet_910.onnx --output=model/RefineNet_910_bs1 --input_format=NCHW --input_shape="input:1,3,500,500" --log=debug --soc_version=Ascend310 +atc --framework=5 --model=model/RefineNet_910.onnx --output=model/RefineNet_910_bs1 --input_format=NCHW --input_shape="input:1,3,500,500" --log=debug --soc_version=Ascend710 ``` ## 4 数据集预处理 @@ -164,17 +173,18 @@ python3.7 get_info.py bin prepare_dataset ./refinenet_prep_bin.info 500 500 ### 5.1 benchmark工具概述 -benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考CANN 5.0.1 推理benchmark工具用户指南 01 +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考CANN 5.1.RC1 推理benchmark工具用户指南 01 ### 5.2 离线推理 1.设置环境变量 ``` -source env.sh +source /usr/local/Ascend/ascend-toolkit/set_env.sh ``` 2.执行离线推理 ```bash ./benchmark.x86_64 -model_type=vision -device_id=0 -batch_size=1 -om_path=model/RefineNet_910_bs1.om \ -input_text_path=./refinenet_prep_bin.info -input_width=500 -input_height=500 -output_binary=False -useDvpp=False ``` + 输出结果默认保存在当前目录result/dumpOutput_deviceX(X为对应的device_id),每个输入对应一个_X.bin文件的输出。 ## 6 精度对比 @@ -189,13 +199,13 @@ source env.sh 调用RefineNet_postprocess.py脚本推理结果与语义分割真值进行比对,可以获得IoU精度数据。 ```bash ulimit -n 10240 -python3.7 RefineNet_postprocess.py --val-dir /opt/npu/VOCdevkit/VOC2012 --result-dir result/dumpOutput_device0 +python3.7 RefineNet_postprocess.py --val-dir /opt/npu --result-dir result/dumpOutput_device0 
``` 第一个为真值所在目录,第二个为benchmark输出目录。 查看输出结果: ``` -miou: 0.786147 +miou: 0.786359 ``` 经过对bs1与bs16的om测试,本模型batch1的精度与batch16的精度没有差别,精度数据均如上。 @@ -212,28 +222,22 @@ light-weight-refinenet开源代码仓库给出的精度是~76%,但使用的是 - **[npu性能数据](#71-npu性能数据)** ### 7.1 npu性能数据 -benchmark工具在整个数据集上推理时也会统计性能数据,但是推理整个数据集较慢,如果这么测性能那么整个推理期间需要确保独占device,使用npu-smi info可以查看device是否空闲。也可以使用benchmark纯推理功能测得性能数据,但是由于随机数不能模拟数据分布,纯推理功能测的有些模型性能数据可能不太准,benchmark纯推理功能测性能仅为快速获取大概的性能数据以便调试优化使用,模型的性能以使用benchmark工具在整个数据集上推理得到bs1与bs16的性能数据为准,对于使用benchmark工具测试的batch4,8,32的性能数据在README.md中如下作记录即可。 +benchmark工具在整个数据集上推理时也会统计性能数据,但是推理整个数据集较慢,如果这么测性能那么整个推理期间需要确保独占device,使用npu-smi info可以查看device是否空闲。也可以使用benchmark纯推理功能测得性能数据,但是由于随机数不能模拟数据分布,纯推理功能测的有些模型性能数据可能不太准,benchmark纯推理功能测性能仅为快速获取大概的性能数据以便调试优化使用,模型的性能以使用benchmark工具在整个数据集上推理得到bs1与bs16的性能数据为准,对于使用benchmark工具测试的batch4,8,32,64的性能数据在README.md中如下作记录即可。 + 1.benchmark工具在整个数据集上推理获得性能数据 batch1的性能,benchmark工具在整个数据集上推理后生成result/perf_vision_batchsize_1_device_0.txt: ``` -[e2e] throughputRate: 11.0237, latency: 131444 -[data read] throughputRate: 22.5381, moduleLatency: 44.3693 -[preprocess] throughputRate: 19.9412, moduleLatency: 50.1475 -[infer] throughputRate: 12.9423, Interface throughputRate: 14.4054, moduleLatency: 76.9069 -[post] throughputRate: 11.2069, moduleLatency: 89.231 +[inference] throughputRate: 60.5672, Interface throughputRate: 91.434, moduleLatency: 16.1848 ``` -Interface throughputRate: 14.4054,14.4054x4=57.6216既是batch1 310单卡吞吐率 +Interface throughputRate: 91.434是batch1 710单卡吞吐率 batch16的性能,benchmark工具在整个数据集上推理后生成result/perf_vision_batchsize_16_device_1.txt: ``` -[e2e] throughputRate: 10.9856, latency: 131900 -[data read] throughputRate: 22.1641, moduleLatency: 45.1179 -[preprocess] throughputRate: 19.98, moduleLatency: 50.05 -[infer] throughputRate: 12.6673, Interface throughputRate: 13.9533, moduleLatency: 78.5275 -[post] throughputRate: 0.696184, moduleLatency: 1436.4 +[inference] throughputRate: 56.6365, Interface throughputRate: 77.7378, moduleLatency: 17.4284 
``` -Interface throughputRate: 13.9533,13.9533x4=55.8132既是batch16 310单卡吞吐率 +Interface throughputRate: 77.7378是batch16 710单卡吞吐率 + 2.npu纯推理性能 @@ -244,7 +248,7 @@ batch1的性能,执行20次纯推理取均值,统计吞吐率与其倒数时 [INFO] PureInfer result saved in ./result/PureInfer_perf_of_RefineNet_910_bs1_in_device_0.txt -----------------PureInfer Performance Summary------------------ -[INFO] ave_throughputRate: 14.4137samples/s, ave_latency: 69.5644ms +[INFO] ave_throughputRate: 92.2434samples/s, ave_latency: 11.171ms ---------------------------------------------------------------- ``` @@ -255,7 +259,7 @@ batch4的性能,执行20次纯推理取均值,统计吞吐率与其倒数时 [INFO] PureInfer result saved in ./result/PureInfer_perf_of_RefineNet_910_bs4_in_device_0.txt -----------------PureInfer Performance Summary------------------ -[INFO] ave_throughputRate: 14.136samples/s, ave_latency: 70.7773ms +[INFO] ave_throughputRate: 88.0568samples/s, ave_latency: 11.4838ms ---------------------------------------------------------------- ``` @@ -266,7 +270,7 @@ batch8的性能,执行20次纯推理取均值,统计吞吐率与其倒数时 [INFO] PureInfer result saved in ./result/PureInfer_perf_of_RefineNet_910_bs8_in_device_0.txt -----------------PureInfer Performance Summary------------------ -[INFO] ave_throughputRate: 13.9813samples/s, ave_latency: 71.5408ms +[INFO] ave_throughputRate: 84.4253samples/s, ave_latency: 11.8672ms ---------------------------------------------------------------- ``` @@ -277,7 +281,7 @@ batch16的性能,执行20次纯推理取均值,统计吞吐率与其倒数 [INFO] PureInfer result saved in ./result/PureInfer_perf_of_RefineNet_910_bs16_in_device_0.txt -----------------PureInfer Performance Summary------------------ -[INFO] ave_throughputRate: 14.0079samples/s, ave_latency: 71.3959ms +[INFO] ave_throughputRate: 80.3815samples/s, ave_latency: 12.4635ms ---------------------------------------------------------------- ``` @@ -288,7 +292,16 @@ batch32的性能,执行20次纯推理取均值,统计吞吐率与其倒数 [INFO] PureInfer result saved in ./result/PureInfer_perf_of_RefineNet_910_bs32_in_device_0.txt -----------------PureInfer 
Performance Summary------------------ -[INFO] ave_throughputRate: 14.0264samples/s, ave_latency: 71.3015ms +[INFO] ave_throughputRate: 79.9769samples/s, ave_latency: 12.5081ms +---------------------------------------------------------------- +``` + +batch64的性能,执行20次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 + +```bash +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_RefineNet_910_bs64_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 79.1814samples/s, ave_latency: 12.6375ms ---------------------------------------------------------------- ``` diff --git a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/env.sh b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/env.sh deleted file mode 100644 index 52554cfca2..0000000000 --- a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/env.sh +++ /dev/null @@ -1,6 +0,0 @@ -export install_path=/usr/local/Ascend/ascend-toolkit/latest -export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH -export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH -export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH -export ASCEND_OPP_PATH=${install_path}/opp -export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest diff --git a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/eval_acc_perf.sh b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/eval_acc_perf.sh index 01b9ee5d0a..1624a486d0 100644 --- a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/eval_acc_perf.sh +++ b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/eval_acc_perf.sh @@ -22,7 +22,7 @@ if [ $? != 0 ]; then echo "fail!" 
exit -1 fi -source env.sh +source /usr/local/Ascend/ascend-toolkit/set_env.sh rm -rf result/dumpOutput_device0 ./benchmark.${arch} -model_type=vision -device_id=0 -batch_size=1 -om_path=model/RefineNet_910_bs1.om -input_text_path=./refinenet_prep_bin.info -input_width=500 -input_height=500 -output_binary=False -useDvpp=False if [ $? != 0 ]; then diff --git a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/parse.py b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/parse.py index 27eae0d0ac..e1e95a32a1 100644 --- a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/parse.py +++ b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/parse.py @@ -29,4 +29,4 @@ if __name__ == '__main__': content = f.read() txt_data_list = [i.strip() for i in re.findall(r':(.*?),', content.replace('\n', ',') + ',')] fps = float(txt_data_list[7].replace('samples/s', '')) * 4 - print('310 bs{} fps:{}'.format(result_txt.split('_')[3], fps)) \ No newline at end of file + print('710 bs{} fps:{}'.format(result_txt.split('_')[3], fps)) \ No newline at end of file diff --git a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/pth2om.sh b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/pth2om.sh index 4fbb604602..1a40fd8326 100644 --- a/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/pth2om.sh +++ b/ACL_PyTorch/contrib/cv/segmentation/RefineNet/test/pth2om.sh @@ -2,10 +2,10 @@ rm -rf model/RefineNet_910.onnx python RefineNet_pth2onnx.py --input-file model/RefineNet_910.pth.tar --output-file model/RefineNet_910.onnx -source env.sh +source /usr/local/Ascend/ascend-toolkit/set_env.sh rm -rf model/RefineNet_910_bs1.om model/RefineNet_910_bs16.om -atc --framework=5 --model=model/RefineNet_910.onnx --output=model/RefineNet_910_bs1 --input_format=NCHW --input_shape="input:1,3,500,500" --log=debug --soc_version=Ascend310 -atc --framework=5 --model=model/RefineNet_910.onnx --output=model/RefineNet_910_bs16 --input_format=NCHW --input_shape="input:16,3,500,500" --log=debug 
--soc_version=Ascend310 +atc --framework=5 --model=model/RefineNet_910.onnx --output=model/RefineNet_910_bs1 --input_format=NCHW --input_shape="input:1,3,500,500" --log=debug --soc_version=Ascend710 +atc --framework=5 --model=model/RefineNet_910.onnx --output=model/RefineNet_910_bs16 --input_format=NCHW --input_shape="input:16,3,500,500" --log=debug --soc_version=Ascend710 if [ -f "model/RefineNet_910_bs1.om" ] && [ -f "model/RefineNet_910_bs16.om" ]; then echo "success" else -- Gitee From 7c5cd91546fb73fc4a0691e9d583baa86fc5eb8c Mon Sep 17 00:00:00 2001 From: zhaoranzhi <1641621634@qq.com> Date: Wed, 25 May 2022 15:36:52 +0800 Subject: [PATCH 2/4] =?UTF-8?q?GAN=E6=A8=A1=E5=9E=8Bpt1.8=E8=BF=81?= =?UTF-8?q?=E7=A7=BB=E5=87=BA=E9=A2=98=E6=8F=90=E4=BA=A4?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- PyTorch/contrib/cv/others/GAN_Pytorch/main.py | 67 ++++----- .../cv/others/GAN_Pytorch/test/env_npu.sh | 25 ++-- .../others/GAN_Pytorch/test/train_eval_8p.sh | 84 +++++++++++- .../GAN_Pytorch/test/train_finetune_1p.sh | 86 +++++++++++- .../others/GAN_Pytorch/test/train_full_1p.sh | 128 +++++++++++++++++- .../others/GAN_Pytorch/test/train_full_8p.sh | 127 ++++++++++++++++- .../GAN_Pytorch/test/train_performance_1p.sh | 126 ++++++++++++++++- .../GAN_Pytorch/test/train_performance_8p.sh | 125 ++++++++++++++++- 8 files changed, 686 insertions(+), 82 deletions(-) diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/main.py b/PyTorch/contrib/cv/others/GAN_Pytorch/main.py index 87b96cdccb..e39c6099d3 100644 --- a/PyTorch/contrib/cv/others/GAN_Pytorch/main.py +++ b/PyTorch/contrib/cv/others/GAN_Pytorch/main.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. 
# ============================================================================ - +import torch_npu import argparse import os import sys @@ -62,7 +62,7 @@ def train_one_epoch(generator, discriminator, optimizer_G, optimizer_D, adversar fake = Variable(Tensor(imgs.size(0), 1).fill_(0.0), requires_grad=False) # Configure input - real_imgs = Variable(imgs.type(Tensor)).to(device) + real_imgs = Variable(imgs.type(torch.Tensor)).to(device) # ----------------- # Train Generator @@ -106,7 +106,7 @@ def train_one_epoch(generator, discriminator, optimizer_G, optimizer_D, adversar d_loss.backward() optimizer_D.step() batch_time.update(time.time() - start_time) - if args.n_epochs == 1: + if args.n_epochs == 1 and args.is_master_node: print( "[Epoch %d] [step %d] [D loss: %f] [G loss: %f]" % (epoch, i, D_loss.avg, G_loss.avg) @@ -117,7 +117,7 @@ def train_one_epoch(generator, discriminator, optimizer_G, optimizer_D, adversar if args.is_master_node: print( "[Epoch %d] [D loss: %f] [G loss: %f] FPS:%.3f" - % (epoch, D_loss.avg,G_loss.avg,args.batch_size*args.gpus/batch_time.avg) + % (epoch, D_loss.avg, G_loss.avg, args.batch_size * args.gpus / batch_time.avg) ) LOSS_G.append(G_loss.avg) LOSS_D.append(D_loss.avg) @@ -135,43 +135,30 @@ def main(args): if amp is None: raise RuntimeError("Failed to import apex. 
Please install apex from https://www.github.com/nvidia/apex " "to enable mixed-precision training.") - # if args.output_dir: - # os.mkdir(args.output_dir) - + + device = torch.device(f'npu:{args.local_rank}') # npu + torch.npu.set_device(f'npu:{args.local_rank}') + print('device_id=', args.local_rank) if args.distributed: + torch.distributed.init_process_group(backend='hccl', world_size=args.gpus, rank=args.local_rank) - mp.spawn(main_worker, nprocs=args.gpus, - args=(args,)) - else: - main_worker(args.gpus, args) + args.is_master_node = not args.distributed or args.local_rank == 0 -def main_worker(nprocs, args): - local_rank = 0 - if args.distributed: - torch.distributed.init_process_group(backend="hccl", - init_method='env://', - world_size=args.nodes * args.gpus, - rank=nprocs) - local_rank = torch.distributed.get_rank() - args.is_master_node = not args.distributed or local_rank == 0 if args.is_master_node: print(args) - args.device_id = args.device_id + local_rank - print('device_id=', args.device_id) - device = torch.device(f'npu:{args.device_id}') # npu - torch.npu.set_device(device) # for npu - print("Downloading dataset...") + print("Preparing dataset...") + # Configure data loader - os.makedirs("../data/mnist", exist_ok=True) train_dataset = datasets.MNIST( - "../../data/mnist", + args.data_path, train=True, download=True, transform=transforms.Compose( [transforms.Resize(args.img_size), transforms.ToTensor(), transforms.Normalize([0.5], [0.5])] )) - - print("Creating dataloader") + + if args.is_master_node: + print("Creating dataloader") if args.distributed: train_sampler = torch.utils.data.distributed.DistributedSampler( @@ -185,12 +172,11 @@ def main_worker(nprocs, args): if args.is_master_node: print("Creating model") - # create model + Tensor = torch.npu.FloatTensor LOSS_G=[] LOSS_D=[] - os.makedirs("../output", exist_ok=True) - os.chdir("../output") + generator = Generator() discriminator = Discriminator() if args.pretrained: @@ -233,10 +219,11 @@ def 
main_worker(nprocs, args): opt_level='O1', loss_scale=128,combine_grad=True) if args.distributed: - generator = DDP(generator, device_ids=[local_rank], broadcast_buffers=False) - discriminator = DDP(discriminator, device_ids=[local_rank], broadcast_buffers=False) + generator = DDP(generator, device_ids=[args.local_rank], broadcast_buffers=False) + discriminator = DDP(discriminator, device_ids=[args.local_rank], broadcast_buffers=False) if args.test_only : + os.makedirs("test_images/image", exist_ok=True) Tensor = torch.npu.FloatTensor generator = Generator().npu() checkpoint = torch.load(r'./checkpoint.pth.tar', map_location='cpu') @@ -268,7 +255,7 @@ def main_worker(nprocs, args): # Generate a batch of images gen_imgs = generator(z) - save_image(gen_imgs.data[:25], "image/%d.png" % i, nrow=5, normalize=True) + save_image(gen_imgs.data[:25], "test_images/image/%d.png" % i, nrow=5, normalize=True) print("Generate done!") return @@ -357,9 +344,9 @@ def parse_args(): parser.add_argument("--latent_dim", type=int, default=100, help="dimensionality of the latent space") parser.add_argument("--img_size", type=int, default=28, help="size of each image dimension") parser.add_argument("--channels", type=int, default=1, help="number of image channels") - parser.add_argument("--gpus", type=int, default=8, help="num of gpus of per node") + parser.add_argument("--gpus", type=int, default=1, help="number of NPUs per node") parser.add_argument("--nodes", type=int, default=1) - parser.add_argument('--device_id', default=0, type=int, help='device id') + parser.add_argument('--local_rank', default=0, type=int, help='local rank, also used as the NPU device id') parser.add_argument("--test_only", type=int, default=None, help="only generate images") parser.add_argument('--start_epoch', default=0, type=int, metavar='N', help='manual epoch number (useful on restarts)') @@ -368,6 +355,9 @@ def parse_args(): parser.add_argument('--pretrained', dest='pretrained', action='store_true', help='use pre-trained model') + # dataset path +
parser.add_argument('--data_path', default='../data/mnist', + help='the path of the dataset') parser.add_argument('--distributed', action='store_true', help='Use multi-processing distributed training to launch ' 'N processes per node, which has N GPUs. This is the ' @@ -381,6 +371,9 @@ def parse_args(): parser.add_argument('--apex', default=False, action='store_true', help='use apex to train the model') args = parser.parse_args() + + args.gpus = int(os.environ['WORLD_SIZE']) if 'WORLD_SIZE' in os.environ else 1 + return args if __name__ == '__main__': args = parse_args() diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/env_npu.sh b/PyTorch/contrib/cv/others/GAN_Pytorch/test/env_npu.sh index 4740fafdcc..1e746bffeb 100644 --- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/env_npu.sh +++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/env_npu.sh @@ -34,23 +34,24 @@ ${install_path}/driver/tools/msnpureport -g error -d 5 ${install_path}/driver/tools/msnpureport -g error -d 6 ${install_path}/driver/tools/msnpureport -g error -d 7 -#将Host日志输出到串口,0-关闭/1-开启 +# Print host logs to the serial port, 0-off/1-on export ASCEND_SLOG_PRINT_TO_STDOUT=0 -#设置默认日志级别,0-debug/1-info/2-warning/3-error -export ASCEND_GLOBAL_LOG_LEVEL=3 -#设置Event日志开启标志,0-关闭/1-开启 +# Default log level, 0-debug/1-info/2-warning/3-error +export ASCEND_GLOBAL_LOG_LEVEL=3 +# Event log switch, 0-off/1-on export ASCEND_GLOBAL_EVENT_ENABLE=0 -#设置是否开启taskque,0-关闭/1-开启 -export TASK_QUEUE_ENABLE=1 -#设置是否开启PTCopy,0-关闭/1-开启 +# Enable taskque, 0-off/1-on +export TASK_QUEUE_ENABLE=0 +# Enable PTCopy, 0-off/1-on export PTCOPY_ENABLE=1 -#设置是否开启combined标志,0-关闭/1-开启 -export COMBINED_ENABLE=0 -#设置特殊场景是否需要重新编译,不需要修改 +# Enable the combined flag, 0-off/1-on +export COMBINED_ENABLE=1 +# Whether special scenarios need recompilation; do not modify export DYNAMIC_OP="ADD#MUL" -#HCCL白名单开关,1-关闭/0-开启 +# HCCL whitelist switch, 1-off/0-on export HCCL_WHITELIST_DISABLE=1 -export HCCL_IF_IP=$(hostname -I |awk '{print $1}') +# The HCCL default connect timeout of 120s is short; set it to 1800s to match the PyTorch default +export HCCL_CONNECT_TIMEOUT=1800 ulimit -SHn 512000 diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_eval_8p.sh
b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_eval_8p.sh index 744956a7d1..3c9e578f43 100644 --- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_eval_8p.sh +++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_eval_8p.sh @@ -1,14 +1,88 @@ #!/bin/bash -source env_npu.sh currentDir=$(cd "$(dirname "$0")";pwd)/.. -nohup python3 ${currentDir}/main.py \ - --gpus 8\ +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +#export RANK_SIZE=8 + +# 数据集路径,保持为空,不需要修改 +data_path="" + +#网络名称,同目录名称,需要模型审视修改 +Network="GAN" + +#训练batch_size,,需要模型审视修改 +batch_size=64 + + + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --device_id* ]];then + device_id=`echo ${para#*=}` + elif [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + fi +done + +#校验是否传入data_path,不需要修改 +if [[ $data_path == "" ]];then + echo "[Error] para \"data_path\" must be confing" + exit 1 +fi + +# 校验是否指定了device_id,分动态分配device_id与手动指定device_id,此处不需要修改 +if [ $ASCEND_DEVICE_ID ];then + echo "device id is ${ASCEND_DEVICE_ID}" +elif [ ${device_id} ];then + export ASCEND_DEVICE_ID=${device_id} + echo "device id is ${ASCEND_DEVICE_ID}" +else + export ASCEND_DEVICE_ID=0,1,2,3,4,5,6,7 + echo "device id is ${ASCEND_DEVICE_ID}" +fi + +#训练开始时间,不需要修改 +start_time=$(date +%s) +echo "start_time: ${start_time}" + +#进入训练脚本目录,需要模型审视修改 +cd $cur_path/ + +#创建DeviceID输出目录,不需要修改 +if [ -d ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} ];then + rm -rf ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} + mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt +else + mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt +fi + +#非平台场景时source 环境变量 +check_etp_flag=`env | grep etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${cur_path}/test/env_npu.sh +fi + +#执行训练脚本,以下传参不需要修改,其他需要模型审视修改 +python3.7 -u -m torch.distributed.launch --nproc_per_node=8 ${currentDir}/main.py \ --distributed \ --lr 0.0008 \ --batch_size 128 \ --n_epochs 200 \ --workers 0 \ --apex \ - 
--device_id 0 \ - --test_only 1 & + --test_only 1 \ + --data_path ${data_path} > ${cur_path}/output/train_eval_8p.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +echo "end_time: ${end_time}" +e2e_time=$(( $end_time - $start_time )) + + diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_finetune_1p.sh b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_finetune_1p.sh index 58b97a99a6..c2e690aa7d 100644 --- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_finetune_1p.sh +++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_finetune_1p.sh @@ -1,14 +1,88 @@ #!/bin/bash - -source env_npu.sh currentDir=$(cd "$(dirname "$0")";pwd)/.. -nohup python3 ${currentDir}/main.py \ - --gpus 1\ +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +#export RANK_SIZE=8 + +# 数据集路径,保持为空,不需要修改 +data_path="" + +#网络名称,同目录名称,需要模型审视修改 +Network="GAN" + +#训练batch_size,,需要模型审视修改 +batch_size=64 + + + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --device_id* ]];then + device_id=`echo ${para#*=}` + elif [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + fi +done + +#校验是否传入data_path,不需要修改 +if [[ $data_path == "" ]];then + echo "[Error] para \"data_path\" must be confing" + exit 1 +fi + +# 校验是否指定了device_id,分动态分配device_id与手动指定device_id,此处不需要修改 +if [ $ASCEND_DEVICE_ID ];then + echo "device id is ${ASCEND_DEVICE_ID}" +elif [ ${device_id} ];then + export ASCEND_DEVICE_ID=${device_id} + echo "device id is ${ASCEND_DEVICE_ID}" +else + export ASCEND_DEVICE_ID=0,1,2,3,4,5,6,7 + echo "device id is ${ASCEND_DEVICE_ID}" +fi + +#训练开始时间,不需要修改 +start_time=$(date +%s) +echo "start_time: ${start_time}" + +#进入训练脚本目录,需要模型审视修改 +cd $cur_path/ + +#创建DeviceID输出目录,不需要修改 +if [ -d ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} ];then + rm -rf ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} + mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt +else + mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt +fi + +#非平台场景时source 环境变量 +check_etp_flag=`env | grep 
etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${cur_path}/test/env_npu.sh +fi + +#执行训练脚本,以下传参不需要修改,其他需要模型审视修改 +python3 -u ${currentDir}/main.py \ --lr 0.0002 \ --batch_size 64 \ --n_epochs 100 \ --workers 0 \ --apex \ - --device_id 0 \ - --pretrained & + --local_rank 0 \ + --pretrained \ + --data_path ${data_path} > ${cur_path}/output/train_tune_1p.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +echo "end_time: ${end_time}" +e2e_time=$(( $end_time - $start_time )) + + diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_1p.sh b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_1p.sh index eb288785d0..1fd471a4e0 100644 --- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_1p.sh +++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_1p.sh @@ -1,12 +1,128 @@ #!/bin/bash -source env_npu.sh currentDir=$(cd "$(dirname "$0")";pwd)/.. -nohup python3 ${currentDir}/main.py \ - --gpus 1\ +#当前路径,不需要修改 +cur_path=`pwd` + +#集合通信参数,不需要修改 +export RANK_SIZE=1 + +# 数据集路径,保持为空,不需要修改 +data_path="" + +#网络名称,同目录名称,需要模型审视修改 +Network="GAN" + +#训练batch_size,,需要模型审视修改 +batch_size=128 + + + +#参数校验,不需要修改 +for para in $* +do + if [[ $para == --device_id* ]];then + device_id=`echo ${para#*=}` + elif [[ $para == --data_path* ]];then + data_path=`echo ${para#*=}` + fi +done + +#校验是否传入data_path,不需要修改 +if [[ $data_path == "" ]];then + echo "[Error] para \"data_path\" must be confing" + exit 1 +fi + +# 校验是否指定了device_id,分动态分配device_id与手动指定device_id,此处不需要修改 +if [ $ASCEND_DEVICE_ID ];then + echo "device id is ${ASCEND_DEVICE_ID}" +elif [ ${device_id} ];then + export ASCEND_DEVICE_ID=${device_id} + echo "device id is ${ASCEND_DEVICE_ID}" +else + export ASCEND_DEVICE_ID=0,1,2,3,4,5,6,7 + echo "device id is ${ASCEND_DEVICE_ID}" +fi + +#训练开始时间,不需要修改 +start_time=$(date +%s) +echo "start_time: ${start_time}" + +#进入训练脚本目录,需要模型审视修改 +cd $cur_path/ + +#创建DeviceID输出目录,不需要修改 +if [ -d 
${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} ];then + rm -rf ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} + mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt +else + mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt +fi + +#非平台场景时source 环境变量 +check_etp_flag=`env | grep etp_running_flag` +etp_flag=`echo ${check_etp_flag#*=}` +if [ x"${etp_flag}" != x"true" ];then + source ${cur_path}/test/env_npu.sh +fi + +#执行训练脚本,以下传参不需要修改,其他需要模型审视修改 +python3 -u ${currentDir}/main.py \ --lr 0.0002 \ - --batch_size 64 \ + --batch_size ${batch_size} \ --n_epochs 200 \ - --workers 0 \ + --workers 16 \ --apex \ - --device_id 0 & + --local_rank 6 \ + --data_path ${data_path} > ${cur_path}/output/train_full_1p.log 2>&1 & + +wait + +#训练结束时间,不需要修改 +end_time=$(date +%s) +echo "end_time: ${end_time}" +e2e_time=$(( $end_time - $start_time )) + +#最后一个迭代FPS值 +FPS=`grep -a 'FPS:' ${cur_path}/output/train_full_1p.log|awk -F "FPS:" '{print $NF}'|awk 'END {print}'` + +#最后一个迭代loss值 +loss=`grep -a 'D loss:' ${cur_path}/output/train_full_1p.log | awk -F "D loss:" '{print $NF}'| awk 'END {print}' | awk -F "]" '{print $1}'` + +#打印,不需要修改 +echo "ActualFPS : $FPS" +echo "ActualLoss : ${loss}" +echo "E2E Training Duration sec : $e2e_time" + +#稳定性精度看护结果汇总 +#训练用例信息,不需要修改 +BatchSize=${batch_size} +DeviceType=`uname -m` +CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc' + +##获取性能数据,不需要修改 +#单迭代训练时长 +TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'` + +#从train_$ASCEND_DEVICE_ID.log提取Loss到train_${CaseName}_loss.txt中,需要模型审视修改 +grep -a 'D loss:' ${cur_path}/output/train_full_1p.log | awk -F "D loss:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt + +grep -a 'FPS:' ${cur_path}/output/train_full_1p.log | awk -F "FPS:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_FPS.txt + +#关键信息打印到${CaseName}.log中,不需要修改 +echo "Network = 
${Network}" > $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "RankSize = ${RANK_SIZE}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "BatchSize = ${BatchSize}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "DeviceType = ${DeviceType}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "CaseName = ${CaseName}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualFPS = ${FPS}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "TrainingTime = ${TrainingTime}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "ActualLoss = ${loss}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log +echo "E2ETrainingTime = ${e2e_time}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log + + + + + + diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_8p.sh b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_8p.sh index 8c42ea3fa3..ad9102476e 100644 --- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_8p.sh +++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_full_8p.sh @@ -1,13 +1,128 @@ #!/bin/bash -source env_npu.sh currentDir=$(cd "$(dirname "$0")";pwd)/.. 
-nohup python3 ${currentDir}/main.py \
-    --gpus 8\
+# Current path; no need to modify
+cur_path=`pwd`
+
+# Collective communication parameter; no need to modify
+export RANK_SIZE=8
+
+# Dataset path; leave empty, no need to modify
+data_path=""
+
+# Network name, same as the directory name; review and modify per model
+Network="GAN"
+
+# Training batch_size; review and modify per model
+batch_size=128
+
+
+
+# Argument parsing; no need to modify
+for para in $*
+do
+    if [[ $para == --device_id* ]];then
+        device_id=`echo ${para#*=}`
+    elif [[ $para == --data_path* ]];then
+        data_path=`echo ${para#*=}`
+    fi
+done
+
+# Check that data_path was passed in; no need to modify
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be confing"
+    exit 1
+fi
+
+# Check whether device_id is specified (dynamically assigned or manually specified); no need to modify
+if [ $ASCEND_DEVICE_ID ];then
+    echo "device id is ${ASCEND_DEVICE_ID}"
+elif [ ${device_id} ];then
+    export ASCEND_DEVICE_ID=${device_id}
+    echo "device id is ${ASCEND_DEVICE_ID}"
+else
+    export ASCEND_DEVICE_ID=0,1,2,3,4,5,6,7
+    echo "device id is ${ASCEND_DEVICE_ID}"
+fi
+
+# Training start time; no need to modify
+start_time=$(date +%s)
+echo "start_time: ${start_time}"
+
+# Enter the training script directory; review and modify per model
+cd $cur_path/
+
+# Create the DeviceID output directory; no need to modify
+if [ -d ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} ];then
+    rm -rf ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID}
+    mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt
+else
+    mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt
+fi
+
+# Source the environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+    source ${cur_path}/test/env_npu.sh
+fi
+
+# Run the training script; the arguments below need no modification, others should be reviewed per model
+python3.7 -u -m torch.distributed.launch --nproc_per_node=8 ${currentDir}/main.py \
     --distributed \
     --lr 0.0008 \
-    --batch_size 128 \
+    --batch_size ${batch_size} \
     --n_epochs 200 \
-    --workers 0 \
+    --workers 16 \
     --apex \
-    --device_id 0 &
+    --data_path ${data_path} > ${cur_path}/output/train_full_8p.log 2>&1 &
+
+wait
+
+# Training end time; no need to modify
+end_time=$(date +%s)
+echo "end_time: ${end_time}"
+e2e_time=$(( $end_time - $start_time ))
+
+# FPS of the last iteration
+FPS=`grep -a 'FPS:' ${cur_path}/output/train_full_8p.log|awk -F "FPS:" '{print $NF}'|awk 'END {print}'`
+
+# Loss of the last iteration
+loss=`grep -a 'D loss:' ${cur_path}/output/train_full_8p.log | awk -F "D loss:" '{print $NF}'| awk 'END {print}' | awk -F "]" '{print $1}'`
+
+# Print results; no need to modify
+echo "ActualFPS : $FPS"
+echo "ActualLoss : ${loss}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Summary of stability and accuracy monitoring results
+# Training case information; no need to modify
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no need to modify
+# Training time per iteration
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+# Extract Loss from train_$ASCEND_DEVICE_ID.log into train_${CaseName}_loss.txt; review and modify per model
+grep -a 'D loss:' ${cur_path}/output/train_full_8p.log | awk -F "D loss:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt
+
+grep -a 'FPS:' ${cur_path}/output/train_full_8p.log | awk -F "FPS:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_FPS.txt
+
+# Write key information to ${CaseName}.log; no need to modify
+echo "Network = ${Network}" > $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${FPS}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualLoss = ${loss}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+
+
+
+
+
+
diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_1p.sh b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_1p.sh
index 7794753d3d..da5a3e61a8 100644
--- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_1p.sh
+++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_1p.sh
@@ -1,12 +1,128 @@
 #!/bin/bash
-source env_npu.sh
 currentDir=$(cd "$(dirname "$0")";pwd)/..
-nohup python3 ${currentDir}/main.py \
-    --gpus 1\
+# Current path; no need to modify
+cur_path=`pwd`
+
+# Collective communication parameter; no need to modify
+export RANK_SIZE=1
+
+# Dataset path; leave empty, no need to modify
+data_path=""
+
+# Network name, same as the directory name; review and modify per model
+Network="GAN"
+
+# Training batch_size; review and modify per model
+batch_size=64
+
+
+
+# Argument parsing; no need to modify
+for para in $*
+do
+    if [[ $para == --device_id* ]];then
+        device_id=`echo ${para#*=}`
+    elif [[ $para == --data_path* ]];then
+        data_path=`echo ${para#*=}`
+    fi
+done
+
+# Check that data_path was passed in; no need to modify
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be confing"
+    exit 1
+fi
+
+# Check whether device_id is specified (dynamically assigned or manually specified); no need to modify
+if [ $ASCEND_DEVICE_ID ];then
+    echo "device id is ${ASCEND_DEVICE_ID}"
+elif [ ${device_id} ];then
+    export ASCEND_DEVICE_ID=${device_id}
+    echo "device id is ${ASCEND_DEVICE_ID}"
+else
+    export ASCEND_DEVICE_ID=0,1,2,3,4,5,6,7
+    echo "device id is ${ASCEND_DEVICE_ID}"
+fi
+
+# Training start time; no need to modify
+start_time=$(date +%s)
+echo "start_time: ${start_time}"
+
+# Enter the training script directory; review and modify per model
+cd $cur_path/
+
+# Create the DeviceID output directory; no need to modify
+if [ -d ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} ];then
+    rm -rf ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID}
+    mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt
+else
+    mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt
+fi
+
+# Source the environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+    source ${cur_path}/test/env_npu.sh
+fi
+
+# Run the training script; the arguments below need no modification, others should be reviewed per model
+python3 -u ${currentDir}/main.py \
     --lr 0.0002 \
-    --batch_size 64 \
+    --batch_size ${batch_size} \
     --n_epochs 1 \
     --workers 16 \
     --apex \
-    --device_id 0 &
+    --local_rank 0 \
+    --data_path ${data_path} > ${cur_path}/output/train_perf_1p.log 2>&1 &
+
+wait
+
+# Training end time; no need to modify
+end_time=$(date +%s)
+echo "end_time: ${end_time}"
+e2e_time=$(( $end_time - $start_time ))
+
+# FPS of the last iteration
+FPS=`grep -a 'FPS:' ${cur_path}/output/train_perf_1p.log|awk -F "FPS:" '{print $NF}'|awk 'END {print}'`
+
+# Loss of the last iteration
+loss=`grep -a 'D loss:' ${cur_path}/output/train_perf_1p.log | awk -F "D loss:" '{print $NF}'| awk 'END {print}' | awk -F "]" '{print $1}'`
+
+# Print results; no need to modify
+echo "ActualFPS : $FPS"
+echo "ActualLoss : ${loss}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Summary of stability and accuracy monitoring results
+# Training case information; no need to modify
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no need to modify
+# Training time per iteration
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+# Extract Loss from train_$ASCEND_DEVICE_ID.log into train_${CaseName}_loss.txt; review and modify per model
+grep -a 'D loss:' ${cur_path}/output/train_perf_1p.log | awk -F "D loss:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt
+
+grep -a 'FPS:' ${cur_path}/output/train_perf_1p.log | awk -F "FPS:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_FPS.txt
+
+# Write key information to ${CaseName}.log; no need to modify
+echo "Network = ${Network}" > $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${FPS}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualLoss = ${loss}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+
+
+
+
+
+
diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_8p.sh b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_8p.sh
index 092fdb26e7..eaca822eda 100644
--- a/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_8p.sh
+++ b/PyTorch/contrib/cv/others/GAN_Pytorch/test/train_performance_8p.sh
@@ -1,13 +1,128 @@
 #!/bin/bash
-source env_npu.sh
 currentDir=$(cd "$(dirname "$0")";pwd)/..
-nohup python3 ${currentDir}/main.py \
-    --gpus 8\
+# Current path; no need to modify
+cur_path=`pwd`
+
+# Collective communication parameter; no need to modify
+export RANK_SIZE=8
+
+# Dataset path; leave empty, no need to modify
+data_path=""
+
+# Network name, same as the directory name; review and modify per model
+Network="GAN"
+
+# Training batch_size; review and modify per model
+batch_size=128
+
+
+
+# Argument parsing; no need to modify
+for para in $*
+do
+    if [[ $para == --device_id* ]];then
+        device_id=`echo ${para#*=}`
+    elif [[ $para == --data_path* ]];then
+        data_path=`echo ${para#*=}`
+    fi
+done
+
+# Check that data_path was passed in; no need to modify
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be confing"
+    exit 1
+fi
+
+# Check whether device_id is specified (dynamically assigned or manually specified); no need to modify
+if [ $ASCEND_DEVICE_ID ];then
+    echo "device id is ${ASCEND_DEVICE_ID}"
+elif [ ${device_id} ];then
+    export ASCEND_DEVICE_ID=${device_id}
+    echo "device id is ${ASCEND_DEVICE_ID}"
+else
+    export ASCEND_DEVICE_ID=0,1,2,3,4,5,6,7
+    echo "device id is ${ASCEND_DEVICE_ID}"
+fi
+
+# Training start time; no need to modify
+start_time=$(date +%s)
+echo "start_time: ${start_time}"
+
+# Enter the training script directory; review and modify per model
+cd $cur_path/
+
+# Create the DeviceID output directory; no need to modify
+if [ -d ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID} ];then
+    rm -rf ${cur_path}/output/${Network}/${ASCEND_DEVICE_ID}
+    mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt
+else
+    mkdir -p ${cur_path}/output/${Network}/$ASCEND_DEVICE_ID/ckpt
+fi
+
+# Source the environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+    source ${cur_path}/test/env_npu.sh
+fi
+
+# Run the training script; the arguments below need no modification, others should be reviewed per model
+python3.7 -u -m torch.distributed.launch --nproc_per_node=8 ${currentDir}/main.py \
     --distributed \
     --lr 0.0008 \
-    --batch_size 128 \
+    --batch_size ${batch_size} \
     --n_epochs 1 \
     --workers 16 \
     --apex \
-    --device_id 0 &
+    --data_path ${data_path} > ${cur_path}/output/train_perf_8p.log 2>&1 &
+
+wait
+
+# Training end time; no need to modify
+end_time=$(date +%s)
+echo "end_time: ${end_time}"
+e2e_time=$(( $end_time - $start_time ))
+
+# FPS of the last iteration
+FPS=`grep -a 'FPS:' ${cur_path}/output/train_perf_8p.log|awk -F "FPS:" '{print $NF}'|awk 'END {print}'`
+
+# Loss of the last iteration
+loss=`grep -a 'D loss:' ${cur_path}/output/train_perf_8p.log | awk -F "D loss:" '{print $NF}'| awk 'END {print}' | awk -F "]" '{print $1}'`
+
+# Print results; no need to modify
+echo "ActualFPS : $FPS"
+echo "ActualLoss : ${loss}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Summary of stability and accuracy monitoring results
+# Training case information; no need to modify
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no need to modify
+# Training time per iteration
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+# Extract Loss from train_$ASCEND_DEVICE_ID.log into train_${CaseName}_loss.txt; review and modify per model
+grep -a 'D loss:' ${cur_path}/output/train_perf_8p.log | awk -F "D loss:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_loss.txt
+
+grep -a 'FPS:' ${cur_path}/output/train_perf_8p.log | awk -F "FPS:" '{print $NF}' | awk -F "]" '{print $1}' >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/train_${CaseName}_FPS.txt
+
+# Write key information to ${CaseName}.log; no need to modify
+echo "Network = ${Network}" > $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${FPS}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualLoss = ${loss}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> $cur_path/output/${Network}/$ASCEND_DEVICE_ID/${CaseName}.log
+
+
+
+
+
+
--
Gitee

From 4a4e8fbf85a956492d46af4f7205dee795e01df7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=B5=B5=E7=84=B6=E4=B9=8B?= <1641621634@qq.com>
Date: Wed, 25 May 2022 07:41:57 +0000
Subject: [PATCH 3/4] update PyTorch/contrib/cv/others/GAN_Pytorch/README.md.

---
 PyTorch/contrib/cv/others/GAN_Pytorch/README.md | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/README.md b/PyTorch/contrib/cv/others/GAN_Pytorch/README.md
index 96151c3e21..d0eddb9ebf 100644
--- a/PyTorch/contrib/cv/others/GAN_Pytorch/README.md
+++ b/PyTorch/contrib/cv/others/GAN_Pytorch/README.md
@@ -11,7 +11,7 @@ url=https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/g
 - Install PyTorch ([pytorch.org](http://pytorch.org))
 - `pip install -r requirements.txt`
-- The MNIST Dataset can be downloaded from the links below.Move the datasets to directory ./data .
+- The MNIST Dataset can be downloaded from the links below.
 - Train Set : [Download Mnist](https://wwr.lanzoui.com/iSBOeu43dkf)

 ## Training #
@@ -19,19 +19,19 @@ To train a model, change the working directory to `./test`,then run:

 ```bash
 # 1p train perf
-bash train_performance_1p.sh
+bash train_performance_1p.sh --data_path=data/mnist

 # 8p train perf
-bash train_performance_8p.sh
+bash train_performance_8p.sh --data_path=data/mnist

 # 8p train full
-bash train_full_8p.sh
+bash train_full_8p.sh --data_path=data/mnist

 # 8p eval
-bash train_eval_8p.sh
+bash train_eval_8p.sh --data_path=data/mnist

 # finetuning
-bash train_finetune_1p.sh
+bash train_finetune_1p.sh --data_path=data/mnist
 ```

 After running,you can see the results in `./output`
@@ -39,8 +39,9 @@ After running,you can see the results in `./output`

 | Acc@1 | FPS | Npu_nums | Epochs | AMP_Type |
 | :------: | :------: | :------: | :------: | :------: |
-| - | 997 | 1 | 200 | O1 |
-| - | 11795 | 8 | 200 | O1 |
+| - | 515.439 | 1 | 200 | O1 |
+| - | 15275.049 | 8 | 200 | O1 |
+

--
Gitee

From 4715e10fcd3992fd51f0797272c018caa431f1a0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E8=B5=B5=E7=84=B6=E4=B9=8B?= <1641621634@qq.com>
Date: Wed, 25 May 2022 07:47:43 +0000
Subject: [PATCH 4/4] update PyTorch/contrib/cv/others/GAN_Pytorch/README.md.

---
 PyTorch/contrib/cv/others/GAN_Pytorch/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/PyTorch/contrib/cv/others/GAN_Pytorch/README.md b/PyTorch/contrib/cv/others/GAN_Pytorch/README.md
index d0eddb9ebf..eb86e0de01 100644
--- a/PyTorch/contrib/cv/others/GAN_Pytorch/README.md
+++ b/PyTorch/contrib/cv/others/GAN_Pytorch/README.md
@@ -15,7 +15,7 @@ url=https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/g
 - Train Set : [Download Mnist](https://wwr.lanzoui.com/iSBOeu43dkf)

 ## Training #
-To train a model, run:
+To train a model, run:

 ```bash
 # 1p train perf
--
Gitee
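
Review note on the test scripts in this series: each script parses `--data_path=` and `--device_id=` with an unquoted `for para in $*` loop and derives `TrainingTime` as `batch_size * 1000 / FPS`, which fails inside `awk` when no `FPS:` line was logged. The following is only a sketch of how those two steps might be hardened, not what the patch does; `get_arg` and `step_time_ms` are hypothetical helper names that do not exist in the repository.

```shell
#!/bin/bash
# Hypothetical helpers (assumptions for illustration, not part of the patch).

# Print the value of a --key=value argument; return 1 if it is absent.
get_arg() {
    local key="$1"; shift
    local para
    for para in "$@"; do          # "$@" keeps arguments with spaces intact
        case "$para" in
            --"$key"=*) printf '%s\n' "${para#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Per-iteration time in milliseconds: batch_size * 1000 / FPS.
# Prints "NA" instead of failing when FPS is empty or not positive.
step_time_ms() {
    local batch_size="$1" fps="$2"
    if [ -z "$fps" ]; then
        echo "NA"
        return 1
    fi
    awk -v b="$batch_size" -v f="$fps" \
        'BEGIN { if (f + 0 > 0) printf "%.2f\n", b * 1000 / f; else print "NA" }'
}

data_path=$(get_arg data_path --device_id=0 --data_path=data/mnist)
echo "data_path = ${data_path}"
echo "TrainingTime = $(step_time_ms 128 15275.049)"
```

Passing the values through `awk -v` rather than splicing them into the program text keeps an empty `FPS` from becoming an `awk` syntax error; whether an `NA` placeholder suits the downstream log consumers is an open question for the authors.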