diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/README.md b/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/README.osc.md
similarity index 96%
rename from PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/README.md
rename to PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/README.osc.md
index fca3b25fe89f9d769d93dc0330effa293487e2ad..1bd5da688c2498d4d2773b790d6e75893605261c 100644
--- a/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/README.md
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/README.osc.md
@@ -151,6 +151,7 @@ bash test/train_grpo_performance_16p.sh --model_path=./models/xxx --dataset_path
 # FAQ
 * Using the --adam_offload argument may cause long stalls; the fix is to delete the torch_extensions cache files, see this [issue](https://github.com/deepspeedai/DeepSpeed/issues/2816#issuecomment-1450095538).
+* On Atlas 200T A2 Box16 machines, if the cards in use span planes, enable the environment variable `export HCCL_INTRA_ROCE_ENABLE=1` so that communication between cards goes over the RoCE loop.
 
 # Public network address notes
 
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/setup.py b/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/setup.py
index 78f255874489c5269d10608af5e77af2c50da75e..44425d3c9c97779ccc1e3f0e89c1a8be63e0e3bd 100644
--- a/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/setup.py
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.6.2_for_PyTorch/setup.py
@@ -19,7 +19,7 @@ def _fetch_requirements(path):
 
 
 def _fetch_readme():
-    with open("README.md", encoding="utf-8") as f:
+    with open("README.osc.md", encoding="utf-8") as f:
         return f.read()
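
Note on the new FAQ entry: the variable only takes effect if it is present in the environment of the training processes. A minimal sketch of setting it from a Python launcher instead of the shell is shown here. The helper name `enable_intra_roce` is hypothetical and not part of this repository, and how the resulting environment is passed to the training script is left as an assumption.

```python
import os


def enable_intra_roce(env=None):
    """Return a copy of the environment with HCCL_INTRA_ROCE_ENABLE set to "1".

    Hypothetical helper for illustration only. `env` defaults to the current
    process environment; the input mapping is copied, not mutated.
    """
    new_env = dict(os.environ if env is None else env)
    # Same setting as `export HCCL_INTRA_ROCE_ENABLE=1` in the FAQ entry.
    new_env["HCCL_INTRA_ROCE_ENABLE"] = "1"
    return new_env


if __name__ == "__main__":
    env = enable_intra_roce()
    # The mapping could then be handed to a child process, for example via
    # subprocess.run(<training command>, env=env); the command itself is not
    # filled in here because the diff only shows it in truncated form.
    print(env["HCCL_INTRA_ROCE_ENABLE"])
```

Building a copy rather than writing to `os.environ` keeps the setting scoped to the launched job, which may matter on hosts where other jobs use cards within a single plane.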