diff --git "a/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\350\256\255\347\273\203.md" "b/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\350\256\255\347\273\203.md" index 54a541bd99e3338c38d2d1e607b27056c1335748..353c0b87150bd36e843c3f6fc3191344e2216650 100644 --- "a/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\350\256\255\347\273\203.md" +++ "b/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\350\256\255\347\273\203.md" @@ -201,8 +201,43 @@ print(prof.key_averages().table(sort_by="self_cpu_time_total")) > **实际操作中推荐优先使用 ```opt_level='O2', loss_scale=128.0``` 的配置进行amp.initialize** - 若源代码为单P训练,则需要修改为8P训练 1. 初始化group + ```python + dist.init_process_group(backend=args.dist_backend, world_size=args.world_size, rank=args.rank) + ``` + - 参数说明: + - dist_backend:GPU环境默认值nccl, NPU环境默认值hccl + - world_size:取值1,2,4,8。8p训练参数配置为8 + - rank:进程id,每个进程对应一个id + 2. 添加ddp + ```python + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]) + ``` + - 8p训练中,args.rank和args.gpu值相等 + 3. 添加DistributedSampler + ```python + if is_distributed == 0: + self.train_sampler = None + else: + self.train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) + ``` + - 一般只需要对训练数据进行DistributedSampler,load数据时也要同步适配 + ```python + self.train_dataloader = data.DataLoader(dataset=train_dataset, + batch_size=cfg["train"]["train_batch_size"], + shuffle=(self.train_sampler is None) , + num_workers=workers, + pin_memory=False, + sampler=self.train_sampler , + drop_last=False) + ``` + - 同时在train epoch循环过程中,需要添加set_epoch步骤 + ```python + for epoch in range(self.epochs): + if self.dataparallel: + self.train_sampler.set_epoch(epoch) + ``` - GPU需要保存的数据 >![](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/raw/master/public_sys-resources/icon-note.gif) **说明:**