diff --git a/tutorials/source_en/advanced/mixed_precision.md b/tutorials/source_en/advanced/mixed_precision.md
index 2867c31136d7e72941cd5f5a6abec98c5305bdd4..e192466f888283c33e604af1ed98a8fe8532f589 100644
--- a/tutorials/source_en/advanced/mixed_precision.md
+++ b/tutorials/source_en/advanced/mixed_precision.md
@@ -24,7 +24,7 @@ As shown in the figure, the storage space of FP16 is half that of FP32, and the
 
 But the use of FP16 also poses a number of problems:
 
-- Data overflow: The valid data representation range for FP16 is $[6.10\times10^{-5}, 65504]$ and for FP32 is $[1.4\times10^{-45}, 1.7\times10^{38}]$. It can be seen that the effective range of FP16 is much narrower than that of FP32, and using FP16 to replace FP32 will result in overflow and underflow. In deep learning, the gradient (first-order derivative) of the weights in the network model needs to be calculated, so the gradient will be even smaller than the weight value and often prone to underflow.
+- Data overflow: The valid data representation range for FP16 is $[5.9\times10^{-8}, 65504]$ and for FP32 is $[1.4\times10^{-45}, 1.7\times10^{38}]$. It can be seen that the effective range of FP16 is much narrower than that of FP32, and using FP16 to replace FP32 will result in overflow and underflow. In deep learning, the gradient (first-order derivative) of the weights in the network model needs to be calculated, so the gradient will be even smaller than the weight value and often prone to underflow.
 - Rounding error: Rounding Error is when the backward gradient of the network model is small, which is generally represented by FP32. But the conversion to FP16 will be smaller than the minimum interval in the current interval and will lead to data overflow. If `0.00006666666` can be expressed normally in FP32, it will be expressed as `0.000067` after conversion to FP16, and the numbers that do not meet the minimum interval of FP16 will be forced to be rounded.
 
 Therefore, the solution of the FP16 introduction problem needs to be considered while using mixed precision to obtain training speedup and memory savings. Loss Scale, a solution to the FP16 type data overflow problem, expands the loss by a certain number of times when calculating the loss value loss. According to the chain rule, the gradient is expanded accordingly and then scaled down by a corresponding multiple when the optimizer updates the weights, thus avoiding data underflow.
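For reference, the corrected lower bound in this hunk matches the FP16 limits NumPy reports; the quick check below is illustrative only (it assumes NumPy 1.22+ for `smallest_subnormal`):

```python
import numpy as np

# Quick sanity check of the FP16 limits quoted in the patched sentence.
# `smallest_subnormal` requires NumPy >= 1.22.
fp16 = np.finfo(np.float16)

print(float(fp16.max))                 # 65504.0: largest finite FP16 value
print(float(fp16.tiny))                # ~6.10e-05: smallest normal FP16 value (the old figure)
print(float(fp16.smallest_subnormal))  # ~5.96e-08: smallest positive FP16 value (the new figure)

# Anything smaller underflows to zero, which is why tiny gradients are at risk:
print(np.float16(1e-8))                # 0.0
```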
diff --git a/tutorials/source_zh_cn/advanced/mixed_precision.ipynb b/tutorials/source_zh_cn/advanced/mixed_precision.ipynb
index c24af517b25239e097ce092a94b3b36ff5e7897c..5ac26d0e0c860a35b21786b56e7ac43d575228a3 100644
--- a/tutorials/source_zh_cn/advanced/mixed_precision.ipynb
+++ b/tutorials/source_zh_cn/advanced/mixed_precision.ipynb
@@ -49,7 +49,7 @@
     "\n",
     "但是使用FP16同样会带来一些问题:\n",
     "\n",
-    "- 数据溢出:FP16的有效数据表示范围为 $[6.10\\times10^{-5}, 65504]$,FP32的有效数据表示范围为 $[1.4\\times10^{-45}, 1.7\\times10^{38}]$。可见FP16相比FP32的有效范围要窄很多,使用FP16替换FP32会出现上溢(Overflow)和下溢(Underflow)的情况。而在深度学习中,需要计算网络模型中权重的梯度(一阶导数),因此梯度会比权重值更加小,往往容易出现下溢情况。\n",
+    "- 数据溢出:FP16的有效数据表示范围为 $[5.9\\times10^{-8}, 65504]$,FP32的有效数据表示范围为 $[1.4\\times10^{-45}, 1.7\\times10^{38}]$。可见FP16相比FP32的有效范围要窄很多,使用FP16替换FP32会出现上溢(Overflow)和下溢(Underflow)的情况。而在深度学习中,需要计算网络模型中权重的梯度(一阶导数),因此梯度会比权重值更加小,往往容易出现下溢情况。\n",
     "- 舍入误差:Rounding Error是指当网络模型的反向梯度很小,一般FP32能够表示,但是转换到FP16会小于当前区间内的最小间隔,会导致数据溢出。如`0.00006666666`在FP32中能正常表示,转换到FP16后会表示成为`0.000067`,不满足FP16最小间隔的数会强制舍入。\n",
     "\n",
     "因此,在使用混合精度获得训练加速和内存节省的同时,需要考虑FP16引入问题的解决。Loss Scale损失缩放,FP16类型数据下溢问题的解决方案,其主要思想是在计算损失值loss的时候,将loss扩大一定的倍数。根据链式法则,梯度也会相应扩大,然后在优化器更新权重时再缩小相应的倍数,从而避免了数据下溢。"
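The second hunk applies the same bound correction to the Chinese notebook. As a reference for the loss-scaling idea described in the surrounding paragraph, here is a minimal NumPy sketch; the gradient value and the scale factor of 1024 are made-up illustrations, not values from the tutorial:

```python
import numpy as np

# Toy illustration of loss scaling: scaling the loss scales every gradient
# by the same factor (chain rule), so a small gradient survives the cast to
# FP16 and is divided back before the optimizer applies the update.
grad_fp32 = np.float32(2e-8)        # a gradient too small for FP16 (made-up value)
scale = np.float32(1024.0)          # illustrative loss-scale factor

print(np.float16(grad_fp32))        # 0.0 -> underflows when cast directly

scaled_fp16 = np.float16(grad_fp32 * scale)
print(scaled_fp16)                  # ~2.05e-05, representable in FP16

recovered = np.float32(scaled_fp16) / scale
print(recovered)                    # ~2e-08, original magnitude restored for the update
```

In practice the factor is either fixed in advance (static loss scaling) or raised and lowered automatically depending on whether overflow is detected (dynamic loss scaling).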