diff --git a/docs/federated/docs/source_en/communication_compression.md b/docs/federated/docs/source_en/communication_compression.md
index 9990e79567b52bb3116ee4d3daf46e2c63b99fff..a37669c9b96e4702b9e5e6ccaaf62f032d2ac14a 100644
--- a/docs/federated/docs/source_en/communication_compression.md
+++ b/docs/federated/docs/source_en/communication_compression.md
@@ -2,7 +2,7 @@
 
-During the device-side federated learning training process, the traffic volume affects the user experience of the device-side (user traffic, communication latency, number of FL-Client participants) and is limited by the cloud-side performance constraints (memory, bandwidth, CPU usage). To improve user experience and reduce performance bottlenecks, MindSpore federated learning framework provides traffic compression for upload and download in device-cloud federated scenarios.
+During the horizontal device-side federated learning training process, the traffic volume affects the user experience of the device-side (user traffic, communication latency, number of FL-Client participants) and is limited by the cloud-side performance constraints (memory, bandwidth, CPU usage). To improve user experience and reduce performance bottlenecks, MindSpore federated learning framework provides traffic compression for upload and download in device-cloud federated scenarios.
 
 ## Compression Method
 
diff --git a/docs/federated/docs/source_en/data_join.md b/docs/federated/docs/source_en/data_join.md
new file mode 100644
index 0000000000000000000000000000000000000000..248db955106a13fc65f92209c1db5bbf7dcf6822
--- /dev/null
+++ b/docs/federated/docs/source_en/data_join.md
@@ -0,0 +1,245 @@
+# Vertical Federated Learning Data Access
+
+
+
+Unlike horizontal federated learning, the two participants (leader and follower) in vertical federated learning must share the same sample space for training or inference. Therefore, before the two parties initiate training or inference, they must collaboratively find the intersection of their data. Each party reads its own raw data and extracts the ID (a unique identifier for each data record, with no duplicates) of every record for intersection, i.e., for finding the IDs common to both parties. Then, both parties obtain features or labels from the raw data based on the intersected IDs. Finally, each party exports a persistent file and reads the data in a consistent order before subsequent training or inference.
+
+## Overall Process
+
+Data access can be divided into two parts: data export and data reading.
+
+### Exporting Data
+
+The MindSpore Federated vertical federated learning data export process framework is shown in Figure 1:
+
+![](https://gitee.com/mindspore/docs/blob/master/docs/federated/docs/source_en/images/data_join_en.png)
+
+Fig. 1 Vertical Federated Learning Data Export Process Framework Diagram
+
+In the data export process, the Leader Worker and Follower Worker are the two participants in vertical federated learning. The Leader Worker is resident and keeps listening for the Follower Worker, which can enter the data access process at any moment.
+
+After the Leader Worker receives a registration request from the Follower Worker, it verifies the registration content. If the registration succeeds, the task-related hyperparameters (PSI-related hyperparameters, bucketing rules, ID field names, etc.) are sent to the Follower Worker.
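+
+Conceptually, the bucketing and per-bucket intersection steps described next can be pictured with a minimal sketch (hypothetical helper names; a plain set intersection stands in for the actual PSI protocol, which computes the same result without revealing either party's non-intersecting IDs):
+
+```python
+import hashlib
+
+def bucket_ids(ids, bucket_num):
+    """Assign each ID to a bucket with a deterministic hash, so both
+    parties independently place the same ID into the same bucket."""
+    buckets = [set() for _ in range(bucket_num)]
+    for oaid in ids:
+        index = int(hashlib.sha256(oaid.encode()).hexdigest(), 16) % bucket_num
+        buckets[index].add(oaid)
+    return buckets
+
+leader_buckets = bucket_ids(["id_a", "id_b", "id_c"], bucket_num=2)
+follower_buckets = bucket_ids(["id_b", "id_c", "id_d"], bucket_num=2)
+
+# The union of the per-bucket intersections is the set of IDs
+# whose corresponding data both parties extract and export.
+intersection = set()
+for lb, fb in zip(leader_buckets, follower_buckets):
+    intersection |= lb & fb
+print(sorted(intersection))  # ['id_b', 'id_c']
+```
+
+This also shows why the bucketing rules are part of the registration handshake: both parties must apply the same deterministic rule for the per-bucket intersections to line up.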
+
+The Leader Worker and Follower Worker then read their respective raw data, extract the ID list from it, and perform bucketing.
+
+For each bucket, the Leader Worker and Follower Worker run the private set intersection (PSI) method to obtain the ID intersection of the two parties.
+
+Finally, the two parties extract the corresponding data from the raw data based on the ID intersection and export it to files in MindRecord format.
+
+### Reading Data
+
+Vertical federated learning requires that, for each batch of training or inference, the values and order of the data IDs are the same for both participants. MindSpore Federated ensures that the data is read in the same order by having both parties use the same random seed and sort the exported file sets lexicographically when reading their respective data.
+
+## An Example for Quick Experience
+
+### Sample Data Preparation
+
+To use the data access method, the raw data needs to be prepared first. The user can use the [random data generation script](https://gitee.com/mindspore/federated/blob/master/tests/st/data_join/generate_random_data.py) to generate forged data for each participant as a sample.
+
+```shell
+python generate_random_data.py \
+    --seed=0 \
+    --total_output_path=vfl/datasets/total_data.csv \
+    --leader_output_path=vfl/datasets/leader_data_*.csv \
+    --follower_output_path=vfl/datasets/follower_data_*.csv \
+    --leader_file_num=4 \
+    --follower_file_num=2 \
+    --leader_data_num=300 \
+    --follower_data_num=200 \
+    --overlap_num=100 \
+    --id_len=20 \
+    --feature_num=30
+```
+
+The user can set the hyperparameters according to the actual situation:
+
+| Hyperparameter names | Hyperparameter description |
+| -------------------- | ------------------------------------------------------------ |
+| seed | Random seed. int type. |
+| total_output_path | The output path of all data. str type. |
+| leader_output_path | The export path of the leader data. If the configuration includes `*`, the `*` will be replaced with serial numbers 0, 1, 2, ... in order when exporting multiple files. str type. |
+| follower_output_path | The export path of the follower data. If the configuration includes `*`, the `*` will be replaced with serial numbers 0, 1, 2, ... in order when exporting multiple files. str type. |
+| leader_file_num | The number of output files for the leader data. int type. |
+| follower_file_num | The number of output files for the follower data. int type. |
+| leader_data_num | The total number of leader data records. int type. |
+| follower_data_num | The total number of follower data records. int type. |
+| overlap_num | The total number of records that overlap between the leader and follower data. int type. |
+| id_len | The data ID is a string; this hyperparameter is the length of that string. int type. |
+| feature_num | The number of columns of the exported data. int type. |
+
+Multiple csv files are generated after running the data preparation:
+
+```text
+follower_data_0.csv
+follower_data_1.csv
+intersection_data.csv
+leader_data_0.csv
+leader_data_1.csv
+leader_data_2.csv
+leader_data_3.csv
+```
+
+### Sample of Data Export
+
+Users can use the [script for finding the data intersection](https://gitee.com/mindspore/federated/blob/master/tests/st/data_join/run_data_join.py) to find the data intersection of the two parties and export it to files in MindRecord format. The user needs to start the Leader and Follower processes separately.
+
+Start Leader:
+
+```shell
+python run_data_join.py \
+    --role=leader \
+    --worker_config_path=vfl/leader.yaml \
+    --schema_path=vfl/leader_schema.yaml \
+    --server_address="127.0.0.1:9027" \
+    --peer_server_address="127.0.0.1:9028"
+```
+
+Start Follower:
+
+```shell
+python run_data_join.py \
+    --role=follower \
+    --worker_config_path=vfl/follower.yaml \
+    --schema_path=vfl/follower_schema.yaml \
+    --server_address="127.0.0.1:9028" \
+    --peer_server_address="127.0.0.1:9027"
+```
+
+The user can set the hyperparameters according to the actual situation.
+
+| Hyperparameter names | Hyperparameter description |
+| ------------------- | ------------------------------------------------------- |
+| role | The role type of the worker. str type. One of: "leader", "follower". |
+| worker_config_path | The path of the hyperparameter file to be configured for intersection. str type. |
+| schema_path | The path of the hyperparameter file to be configured for export. str type. |
+| server_address | The local IP and port address. str type. |
+| peer_server_address | The IP and port address of the peer. str type. |
+
+In the above sample, worker_config_path can refer to the corresponding configurations in [leader.yaml](https://gitee.com/mindspore/federated/tree/master/tests/st/data_join/vfl/leader.yaml) and [follower.yaml](https://gitee.com/mindspore/federated/tree/master/tests/st/data_join/vfl/follower.yaml). The meaning of each configuration is as follows.
+
+| Hyperparameter names | Hyperparameter description |
+| --------------------------------- | ------------------------------------------------------------ |
+| main_table_files | The path of the raw data. A single file path, multiple file paths, or a data directory path can be configured. list or str type. |
+| output_dir | The directory path of the exported MindRecord-related files. str type. |
+| primary_key (not required for Follower) | The name of the data ID column. str type. |
+| bucket_num (not required for Follower) | The number of buckets used when intersecting and exporting. int type. |
+| store_type | The raw data storage type. str type. |
+| shard_num (not required for Follower) | The number of files exported per bucket. int type. |
+| join_type (not required for Follower) | The intersection-finding algorithm. str type. |
+| thread_num | The number of threads used when computing with the PSI intersection algorithm. int type. |
+
+In the above sample, schema_path can refer to the corresponding configurations in [leader_schema.yaml](https://gitee.com/mindspore/federated/tree/master/tests/st/data_join/vfl/leader_schema.yaml) and [follower_schema.yaml](https://gitee.com/mindspore/federated/tree/master/tests/st/data_join/vfl/follower_schema.yaml). The user needs to provide the column names and types of the data to be exported in this file.
+
+Running the data export generates multiple MindRecord-related files:
+
+```text
+mindrecord_0
+mindrecord_0.db
+mindrecord_1
+mindrecord_1.db
+mindrecord_2
+mindrecord_2.db
+mindrecord_3
+mindrecord_3.db
+mindrecord_4
+mindrecord_4.db
+```
+
+### Sample of Data Reading
+
+The user can use the [data reading script](https://gitee.com/mindspore/federated/blob/master/tests/st/data_join/load_joined_data.py) to read the data after intersection.
+
+```shell
+python load_joined_data.py \
+    --seed=0 \
+    --input_dir=vfl/output/leader/ \
+    --shuffle=True
+```
+
+The user can set the hyperparameters according to the actual situation.
+
+| Hyperparameter names | Hyperparameter description |
+| --------- | ----------------------------------------- |
+| seed | Random seed. int type. |
+| input_dir | The directory of the input MindRecord-related files. str type. |
+| shuffle | Whether the data order needs to be shuffled. bool type. |
+
+If the intersection result is correct, the OAID order of the records is identical when the two parties each read their data, while the values of the other columns in each record may differ. The intersection data printed after running the data reading:
+
+```text
+Leader data export results:
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'uMbgxIMMwWhMGrVMVtM7')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'IwoGP08kWVtT4WHL2PLu')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'MSRe6mURtxgyEgWzDn0b')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'y7X0WcMKnTLrhxVcWfGF')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'DicKRIVvbOYSiv63TvcL')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'TCHgtynOhH3z11QYemsH')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'OWmhgIfC3k8UTteGUhni')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'NTV3qEYXBHqKBWyHGc7s')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'wuinSeN1bzYgXy4XmSlR')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'SSsCU0Pb46XGzUIa3Erg')}
+……
+
+Follower data export results:
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'uMbgxIMMwWhMGrVMVtM7')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'IwoGP08kWVtT4WHL2PLu')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'MSRe6mURtxgyEgWzDn0b')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'y7X0WcMKnTLrhxVcWfGF')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'DicKRIVvbOYSiv63TvcL')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'TCHgtynOhH3z11QYemsH')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'OWmhgIfC3k8UTteGUhni')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'NTV3qEYXBHqKBWyHGc7s')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'wuinSeN1bzYgXy4XmSlR')}
+{……, 'oaid': Tensor(shape=[], dtype=String, value= 'SSsCU0Pb46XGzUIa3Erg')}
+……
+```
+
+## An Example for Advanced Experience
+
+For detailed API documentation of the following code, see the [Data Access Documentation](https://gitee.com/mindspore/federated/blob/master/docs/api/api_python/data_join.rst).
+
+### Data Export
+
+The user can implement data export by using the encapsulated interface in the following way:
+
+```python
+from mindspore_federated.data_join import FLDataWorker
+
+
+if __name__ == '__main__':
+    worker = FLDataWorker(role="leader",
+                          worker_config_path="vfl/leader.yaml",
+                          data_schema_path="vfl/leader_schema.yaml",
+                          server_address="127.0.0.1:6969",
+                          peer_server_address="127.0.0.1:9696"
+                          )
+    worker.export()
+```
+
+### Data Reading
+
+The user can implement data reading by using the encapsulated interface in the following way:
+
+```python
+from mindspore_federated.data_join import load_mindrecord
+
+
+if __name__ == "__main__":
+    dataset = load_mindrecord(input_dir="vfl/output/leader/", shuffle=True, seed=0)
+```
+
+### Data Communication
+
+Users can use the encapsulated communication interface for data communication. The federated learning communicator is started as follows, and its send() and receive() methods can be called to send and receive data. The communicator is already encapsulated in the FLDataWorker class, so for data access the user only needs to use FLDataWorker.
+ +```python +from mindspore_federated import VerticalFederatedCommunicator, ServerConfig + + +if __name__ == "__main__": + http_server_config = ServerConfig(server_name='serverB', server_address='10.113.216.44:6667') + remote_server_config = ServerConfig(server_name='serverA', server_address='10.113.216.44:6666') + vertical_communicator = VerticalFederatedCommunicator(http_server_config=http_server_config, + remote_server_config=remote_server_config) + vertical_communicator.launch() +``` \ No newline at end of file diff --git a/docs/federated/docs/source_en/index.rst b/docs/federated/docs/source_en/index.rst index da85f400b58536c3f8f9560d77a3c1e9c76c78d3..1c427b5b3cb17d35598b05fbe4395c529c5438b9 100644 --- a/docs/federated/docs/source_en/index.rst +++ b/docs/federated/docs/source_en/index.rst @@ -88,6 +88,8 @@ Common Application Scenarios :maxdepth: 1 :caption: Vertical Application + data_join + .. toctree:: :maxdepth: 1 :caption: Security and Privacy diff --git a/docs/federated/docs/source_zh_cn/data_join.md b/docs/federated/docs/source_zh_cn/data_join.md index d3033ace13ba1dc860dadd46aae83182816ff10f..bd24de8db1f441dede86d95ff031f1bd6c0bb4bd 100644 --- a/docs/federated/docs/source_zh_cn/data_join.md +++ b/docs/federated/docs/source_zh_cn/data_join.md @@ -16,13 +16,13 @@ MindSpore Federated纵向联邦学习数据导出流程框架如图1所示: 图 1 纵向联邦学习数据接入流程框架图 -在数据导出流程中,Leader Worker和 Follower Worker为纵向联邦学习的两个参与方。Leader Worker常驻并保持对Follower的监听,Follower Worker可以在任意时刻进入数据接入流程中。 +在数据导出流程中,Leader Worker和 Follower Worker为纵向联邦学习的两个参与方。Leader Worker常驻并保持对Follower Worker的监听,Follower Worker可以在任意时刻进入数据接入流程中。 在Leader Worker收到 Follower Worker的注册请求后,会对注册内容进行校验。若注册成功,则给Follower Worker发送任务相关的超参(PSI 相关超参、分桶规则、ID字段名称等)。 然后Leader Worker 和 Follower Worker 分别读取各自的原始数据,再从各自的原始数据中提取出 ID 列表并实现分桶。 -Leader Worker和 Follower Worker的每个桶都启动隐私求交方法获得两方的ID交集。 +Leader Worker 和 Follower Worker 的每个桶都启动隐私求交方法获得两方的ID交集。 最后,两方根据ID交集提取原始数据中相应的数据并导出成MindRecord格式的文件。 @@ -34,7 +34,7 @@ Leader Worker和 Follower Worker的每个桶都启动隐私求交方法获得两 ### 数据准备样例 -若要使用数据接入方法,首先需要准备好需要原始数据。用户可以使用[随机数据生成脚本](https://gitee.com/mindspore/federated/blob/master/tests/st/data_join/generate_random_data.py)生成出各参与方的伪造数据作为样例。 +若要使用数据接入方法,首先需要准备好原始数据。用户可以使用[随机数据生成脚本](https://gitee.com/mindspore/federated/blob/master/tests/st/data_join/generate_random_data.py)生成出各参与方的伪造数据作为样例。 ```python python generate_random_data.py \ diff --git a/tutorials/experts/source_en/index.rst b/tutorials/experts/source_en/index.rst index d7dbc205405d33db9137940797e517577b7a05ae..800a2caea94d70cbb6bde31910edbaf381f280e6 100644 --- a/tutorials/experts/source_en/index.rst +++ b/tutorials/experts/source_en/index.rst @@ -33,7 +33,6 @@ For Experts :maxdepth: 1 :caption: Model Training Optimization - others/mixed_precision others/gradient_accumulation others/adaptive_summation others/dimention_reduce_training diff --git a/tutorials/experts/source_en/others/mixed_precision.md b/tutorials/experts/source_en/others/mixed_precision.md deleted file mode 100644 index 6ec03cf7868c488bd478f1854ecc607412b6fb72..0000000000000000000000000000000000000000 --- a/tutorials/experts/source_en/others/mixed_precision.md +++ /dev/null @@ -1,533 +0,0 @@ -# Enabling Mixed Precision - - - -## Overview - -Generally, when a neural network model is trained, the default data type is FP32. In recent years, to accelerate training time, reduce memory occupied during network training, and store a trained model with same precision, more and more mixed-precision training methods are proposed in the industry. 
The mixed-precision training herein means that both single precision (FP32) and half precision (FP16) are used in a training process. - -## Floating-point Data Type - -Floating-point data types include double-precision (FP64), single-precision (FP32), and half-precision (FP16). In a training process of a neural network model, an FP32 data type is generally used by default to indicate a network model weight and other parameters. The following is a brief introduction to floating-point data types. - -According to [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754), floating-point data types are classified into double-precision (FP64), single-precision (FP32), and half-precision (FP16). Each type is represented by three different bits. FP64 indicates a data type that uses 8 bytes (64 bits in total) for encoding and storage. FP32 indicates a data type that uses 4 bytes (32 bits in total) and FP16 indicates a data type that uses 2 bytes (16 bits in total). As shown in the following figure: - -![fp16_vs_FP32](./images/fp16_vs_fp32.png) - -As shown in the figure, the storage space of FP16 is half that of FP32, and the storage space of FP32 is half that of FP64. It consists of three parts: - -- The highest bit indicates the sign bit. -- The middle bits indicate exponent bits. -- The low bits indicate fraction bits. - -FP16 is used as an example. The first sign bit sign indicates a positive or negative sign, and the next five bits indicate an exponent. All 0s and all 1s have special uses, so the binary range is 00001~11110. The last 10 bits indicate a fraction. Suppose `S` denotes the decimal value of sign bit, `E` denotes the decimal value of exponent, and `fraction` denotes the decimal value of fraction. The formula is as follows: - -$$x=(-1)^{S}\times2^{E-15}\times(1+\frac{fraction}{1024})$$ - -Similarly, suppose `M` is score value, the true value of a formatted FP32 is as follows: - -$$x=(-1)^{S}\times2^{E-127}\times(1.M)$$ - -The true value of a formatted FP64 is as follows: - -$$x=(-1)^{S}\times2^{E-1023}\times(1.M)$$ - -The maximum value that can be represented by FP16 is 0 11110 1111111111, which is calculated as follows: - -$$(-1)^0\times2^{30-15}\times1.1111111111 = 1.1111111111(b)\times2^15 = 1.9990234375(d)\times2^15 = 65504$$ - -where `b` indicates binary value, and `d` indicates decimal value. - -The minimum value that can be represented by FP16 is 0 00001 0000000000, which is calculated as follows: - -$$ (-1)^{1}\times2^{1-15}=2^{-14}=6.104×10^{-5}=-65504$$ - -Therefore, the maximum value range of FP16 is [-65504, 65504], and the precision range is $2^{-24}$. If the value is beyond this range, the value is set to 0. - -## FP16 Training Issues - -Why do we need mixed-precision? Compared with FP32, FP16 has the following advantages: - -- Reduced memory usage: The bit width of FP16 is half of that of FP32. Therefore, the memory occupied by parameters such as the weight is also half of the original memory. The saved memory can be used to store larger network models or train more data. -- Higher communication efficiency: For distributed training, especially the large-scale model training, the communication overhead restricts the overall performance. A smaller communication bit width means that the communication performance can be improved, the waiting time can be reduced, and the data flow can be accelerated. 
-- Higher computing efficiency: On special AI acceleration chips, such as Huawei Ascend 910 and 310 series, or GPUs of the NVIDIA VOLTA architecture, the computing performance of FP16 is faster than that of FP32. - -However, using FP16 also brings some problems, the most important of which are precision overflow and rounding error. - -- Data overflow: Data overflow is easliy to understand. The valid data range of FP16 is $[6.10\times10^{-5}, 65504]$, and that of FP32 is $[1.4\times10^{-45}, 1.7\times10^{38}]$. We can see that the valid range of FP16 is much narrower than that of FP32. When FP16 is used to replace FP32, overflow and underflow occur. In deep learning, a gradient (a first-order derivative) of a weight in a network model needs to be calculated. Therefore, the gradient is smaller than the weight value, and underflow often occurs. -- Rounding error: Rounding error instruction is when the backward gradient of a network model is small, FP32 is usually used. However, when it is converted to FP16, the interval is smaller than the minimum interval, causing data overflow. For example, 0.00006666666 can be properly represented in FP32, but it will be represented as 0.000067 in FP16. The number that does not meet the minimum interval requirement of FP16 will be forcibly rounded off. - -## Mixed-precision Computing Process - -The following figure shows the typical computation process of mixed precision in MindSpore. - -![mix precision](./images/mix_precision_fp16.png) - -1. Parameters are stored in FP32 format. -2. During the forward computation, if an FP16 operator is involved, the operator input and parameters need to be cast from FP32 to FP16. -3. The Loss layer is set to FP32. -4. During backward computation, the value is multiplied by Loss Scale to avoid underflow due to a small gradient. -5. The FP16 parameter is used for gradient computation, and the result is cast back to FP32. -6. Then, the value is divided by Loss scale to restore the multiplied gradient. -7. The optimizer checks whether the gradient overflows. If yes, the optimizer skips the update. If no, the optimizer uses FP32 to update the original parameters. - -This document describes the computation process by using examples of automatic and manual mixed precision. - -## Loss Scale - -Loss Scale is mainly used in the process of mixed-precision training. - -In the process of mixed precision training, the FP16 type is used instead of the FP32 type for data storage, so as to achieve the effect of reducing memory and improving the computing speed. However, because the FP16 type is much smaller than the range represented by the FP32 type, data underflow occurs when parameters (such as gradients) become very small during training. The Loss Scale is proposed to solve the underflow of FP16 type data. - -The main idea is to enlarge the loss by a certain multiple when calculating the loss. Due to the existence of the chain rule, the gradient also expands accordingly, and then the corresponding multiple is reduced when the optimizer updates the weight, thus avoiding the situation of data underflow without affecting the calculation result. 
- -There are two ways of implementing Loss Scale in MindSpore, users can either use the functional programming writeup and manually call the `scale` and `unscale` methods of `StaticLossScaler` or `DynamicLossScaler` to scale the loss or gradient during training; or they can configure the loss or gradient based on the `Model` interface and configure the mixed precision `amp_level` and the Loss Scale method `loss_scale_manager` as `FixedLossScaleManager` or `DynamicLossScaleManager` when building the model by using `Model`. - -First, let's take a look at why mixing accuracy is needed. The advantages of using FP16 to train a neural network are: - -- **Reduce memory occupation**: The bit width of FP16 is half that of FP32, so the memory occupied by parameters such as weights is also half of the original, and the saved memory can be used to put a larger network model or use more data for training. -- **Accelerate communication efficiency**: For distributed training, especially in the process of large model training, the overhead of communication restricts the overall performance of network model training, and the less bit width of communication means that communication performance can be improved. Waiting time is reduced, and data circulation can be accelerated. -- **Higher computing effciency**: On special AI-accelerated chips such as Huawei's Ascend 910 and 310 series, or GPUs of the Titan V and Tesla V100 of the NVIDIA VOLTA architecture, the performance of performing operations using FP16 is faster than that of the FP32. - -But using FP16 also brings some problems, the most important of which are precision overflow and rounding error, and Loss Scale is to solve the precision overflow and proposed. - -As shown in the figure, if only FP32 training is used, the model converges better, but if mixed-precision training is used, there will be a situation where the network model cannot converge. The reason is that the value of the gradient is too small, and using the FP16 representation will cause the problem of underflow under the data, resulting in the model not converging, as shown in the gray part of the figure. Loss Scale needs to be introduced. - -![loss_scale1](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/loss_scale1.png) - -The following is in the network model training stage, a layer of activation function gradient distribution, of which 68% of the network model activation parameter bit 0. Another 4% of the accuracy in the $2^{-32}, 2^{-20}$ interval, directly use FP16 to represent the data inside, which truncates the underflow data. All gradient values will become 0. - -![loss_scale2](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/loss_scale2.png) - -In order to solve the problem of ladder overflowing over small data, the forward calculated Loss value is amplified, that is, the parameters of FP32 are multiplied by a factor coefficient, and the possible overflowing decimal data is moved forward and panned to the data range that FP16 can represent. According to the chain differentiation law, amplifying the Loss acts on each gradient of backpropagation, which is more efficient than amplifying on each gradient. 
- -![loss_scale3](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/loss_scale3.png) - -Loss amplification needs to be achieved in combination with mixing accuracy, and its main main ideas are: - -- **Scale up stage**: After the network model forward calculation, the resulting loss change value DLoss is increased by a factor of $2^K$ before the repercussion propagation. -- **Scale down stage**: After backpropagation, the weight gradient is reduced by $2^K$, and the FP32 value is restored for storage. - -**Dynamic Loss Scale**: The loss scale mentioned above is to use a default value to scale the loss value, in order to make full use of the dynamic range of FP16, you can better mitigate the rounding error, and try to use a relatively large magnification. To summarize the dynamic loss scaling algorithm, it is to reduce the loss scale whenever the gradient overflows, and intermittently try to increase the loss scale, so as to achieve the use of the highest loss scale factor without causing overflow, and better restore accuracy. - -The dynamic loss scale algorithm is as follows: - -1. The algorithm of dynamic loss scaling starts with a relatively high scaling factor (such as $2^{24}$), then starts training and checks whether the number overflows in the iteration (Infs/Nans); -2. If there is no gradient overflow, the scale factor is not adjusted and the iteration continues; if the gradient overflow is detected, the scale factor is halved and the gradient update is reconfirmed until the parameter does not appear in the overflow range; -3. In the later stages of training, the loss has become stable and convergent, and the amplitude of the gradient update is often small, which can allow a higher loss scaling factor to prevent data underflow again. -4. Therefore, the dynamic loss scaling algorithm attempts to increase the loss scaling by the F multiple every N (N=2000) iterations, and then performs step 2 to check for overflow. - -## Using Mixed Precision and Loss Scale in MindSpore - -MindSpore provides two ways of using mixed precision and loss scale. - -- Use functional programming: use `auto_mixed_precision` for automatic mixing accuracy, `all_finite` for overflow judgments, and `StaticLossScaler` and `DynamicLossScaler` for manual scaling of gradients and losses. - -- Using the training interface `Model`: configure the input `amp_level` to set the execution policy for mixed precision and the input `loss_scale_manager` to `FixedLossScaleManager` or `DynamicLossScaleManager` to implement loss scaling. - -## Using a Functional Programming for Mixed Precision and Loss Scale - -MindSpore provides a functional interface for mixed precision scenarios. Users can use `auto_mixed_precision` for automatic mixed precision, `all_finite` for overflow judgments during training, and `StaticLossScaler` and `DynamicLossScaler` to manually perform gradient and loss scaling. - -Common uses of LossScaler under functional. - -First import the relevant libraries and define a LeNet5 network: - -```python -import numpy as np -import mindspore.nn as nn -from mindspore.train import Accuracy -import mindspore as ms -from mindspore.common.initializer import Normal -from mindspore import dataset as ds - - -class LeNet5(nn.Cell): - """ - Lenet network - - Args: - num_class (int): Number of classes. Default: 10. - num_channel (int): Number of channels. Default: 1. 
- - Returns: - Tensor, output tensor - """ - - def __init__(self, num_class=10, num_channel=1): - super(LeNet5, self).__init__() - self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid') - self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid') - self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02)) - self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02)) - self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02)) - self.relu = nn.ReLU() - self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2) - self.flatten = nn.Flatten() - - def construct(self, x): - x = self.max_pool2d(self.relu(self.conv1(x))) - x = self.max_pool2d(self.relu(self.conv2(x))) - x = self.flatten(x) - x = self.relu(self.fc1(x)) - x = self.relu(self.fc2(x)) - x = self.fc3(x) - return x -``` - -Perform auto mixed precision on the network. - -`auto_mixed_precision` implements the meanings of automatic mixed precision configuration as follows: - -- 'O0': keep FP32. -- 'O1': cast as FP16 by whitelist. -- 'O2': keep FP32 by blacklist and the rest cast as FP16. -- 'O3': fully cast to FP16. - -> The current black and white list is Cell granularity. - -```python -from mindspore import amp -from mindspore import ops - -net = LeNet5(10) -amp.auto_mixed_precision(net, 'O1') -``` - -Instantiate the LossScaler and manually scale up the loss value when defining the forward network. - -```python -loss_fn = nn.BCELoss(reduction='mean') -opt = nn.Adam(generator.trainable_params(), learning_rate=0.01) - -# Define LossScaler -loss_scaler = amp.DynamicLossScaler(scale_value=2**10, scale_factor=2, scale_window=50) - -def net_forward(data, label): - out = net(data) - loss_value = loss_fn(out, label) - # scale up the loss value - scaled_loss = loss_scaler.scale(loss_value) - return scaled_loss, out -``` - -Reverse acquisition of gradients. - -```python -grad_fn = ops.value_and_grad(net_forward, None, net.trainable_params()) -``` - -Define the training step: calculate the current gradient value and recover the loss. Use `all_finite` to determine whether there is a gradient underflow problem, if there is no overflow, recover the gradient and update the network weights; if there is overflow, skip this step. - -```python -@ms_function -def train_step(x, y): - (loss_value, _), grads = grad_fn(x, y) - loss_value = loss_scaler.unscale(loss_value) - - is_finite = amp.all_finite(grads) - if is_finite: - grads = loss_scaler.unscale(grads) - loss_value = ops.depend(loss_value, opt(grads)) - loss_scaler.adjust(is_finite) - return loss_value -``` - -Execute training. - -```python -epochs = 5 -for epoch in range(epochs): - for data, label in datasets: - loss = train_step(data, label) -``` - -## Mixed-precision and Loss Scale by Using the Training Interface `Model` - -### Mixed-Precision - -The `Model` interface provides the input `amp_level` to achieve automatic mixed precision, or the user can set the operator involved in the Cell to FP16 via `to_float(ms.float16)` to achieve manual mixed precision. - -#### Automatic Mixed-Precision - -To use the automatic mixed-precision, you need to call the `Model` API to transfer the network to be trained and optimizer as the input. This API converts the network model operators into FP16 operators. - -> Due to precision problems, the `BatchNorm` operator and operators involved in loss still use FP32. - -The specific implementation steps for using the `Model` interface are: - -1. Introduce the MindSpore model API `Model`. - -2. 
Define a network: This step is the same as that for defining a common network (no new configuration is required). - -3. Create a dataset: For this step, refer to [Data Processing](https://www.mindspore.cn/tutorials/en/master/advanced/dataset.html). - -4. Use the `Model` API to encapsulate the network model, optimizer, and loss function, and set the `amp_level` parameter. For details, see [MindSpore API](https://www.mindspore.cn/docs/en/master/api_python/train/mindspore.train.Model.html#mindspore.train.Model). In this step, MindSpore automatically selects an appropriate operator to convert FP32 to FP16. - -The following is a basic code example. First, import the required libraries and declarations. - -```python -import numpy as np -import mindspore.nn as nn -from mindspore.train import Accuracy, Model -import mindspore as ms -from mindspore.common.initializer import Normal -from mindspore import dataset as ds - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_context(device_target="CPU") -``` - -Create a virtual random dataset for data input of the sample model. - -```python -# create dataset -def get_data(num, img_size=(1, 32, 32), num_classes=10, is_onehot=True): - for _ in range(num): - img = np.random.randn(*img_size) - target = np.random.randint(0, num_classes) - target_ret = np.array([target]).astype(np.float32) - if is_onehot: - target_onehot = np.zeros(shape=(num_classes,)) - target_onehot[target] = 1 - target_ret = target_onehot.astype(np.float32) - yield img.astype(np.float32), target_ret - -def create_dataset(num_data=1024, batch_size=32, repeat_size=1): - input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data','label']) - input_data = input_data.batch(batch_size, drop_remainder=True) - input_data = input_data.repeat(repeat_size) - return input_data -``` - -Taking the LeNet5 as an example, set the `amp_level` parameter and use the `Model` API to encapsulate the network model, optimizer, and loss function. - -```python -ds_train = create_dataset() - -# Initialize network -network = LeNet5(10) - -# Define Loss and Optimizer -net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean") -net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9) -# Set amp level -model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O3") - -# Run training -model.train(epoch=10, train_dataset=ds_train) -``` - -#### Manual Mixed-Precision - -MindSpore also supports manual mixed-precision. (Manual mixed-precision is not recommended unless you want to customize special networks and features.) - -Assume that only one Conv layer on the network uses FP16 for computation and other layers use FP32. - -> The mixed-precision is configured in the unit of Cell. The default type of a Cell is FP32. - -The following are the implementation steps of manual mixed-precision: - -1. Define the network: This step is similar with the Step 2 in the automatic mixed-precision. -2. Configure the mixed-precision: Use `to_float(mstype.float16)` to set the operators involved in the Cell to FP16. -3. Use `TrainOneStepCell` to encapsulate the network model and optimizer. - -The following is a basic code example. First, import the required libraries and declarations. 
- -```python -import numpy as np - -import mindspore.nn as nn -from mindspore.train import Accuracy, Model -import mindspore as ms -from mindspore.common.initializer import Normal -from mindspore import dataset as ds -import mindspore.ops as ops - -ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU") -``` - -After initializing the network model, declare that the Conv1 layer in LeNet5 is computed by using FP16, i.e. `network.conv1.to_float(mstype.float16)`. - -```python -ds_train = create_dataset() -network = LeNet5(10) -net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean") -net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9) -network.conv1.to_float(ms.float16) -model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2") -model.train(epoch=2, train_dataset=ds_train) -``` - -> When mixed-precision is used, the backward network can be generated only by the automatic differential function, not by user-defined inverse networks. Otherwise, MindSpore may generate exception information indicating that the data format does not match. - -### Loss scale - -The following two APIs in MindSpore that use the loss scaling algorithm are described separately [FixedLossScaleManager](https://www.mindspore.cn/docs/en/master/api_python/amp/mindspore.amp.FixedLossScaleManager.html#mindspore.amp.FixedLossScaleManager) and [DynamicLossScaleManager](https://www.mindspore.cn/docs/en/master/api_python/amp/mindspore.amp.DynamicLossScaleManager.html#mindspore.amp.DynamicLossScaleManager). - -#### FixedLossScaleManager - -`FixedLossScaleManager` does not change the size of the scale when scaling, and the value of the scale is controlled by the input parameter loss_scale, which can be specified by the user. The default value is taken if it is not specified. - -Another parameter of `FixedLossScaleManager` is `drop_overflow_update`, which controls whether parameters are updated in the event of an overflow. - -In general, the LossScale function does not need to be used with the optimizer, but when using `FixedLossScaleManager`, if `drop_overflow_update` is False, the optimizer needs to set the value of `loss_scale` and the value of `loss_scale` should be the same as that of `FixedLossScaleManager`. - -The detailed use of `FixedLossScaleManager` is as follows: - -Import the necessary libraries and declare execution using graph mode. - -```python -import numpy as np -import mindspore as ms -import mindspore.nn as nn -from mindspore import amp -from mindspore.train import Accuracy, Model -from mindspore.common.initializer import Normal -from mindspore import dataset as ds - -ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU") -``` - -Define the network model by using LeNet5 as an example; define the dataset and the interfaces commonly used in the training process. - -```python -ds_train = create_dataset() -# Initialize network -network = LeNet5(10) -# Define Loss and Optimizer -net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean") -``` - -Use Loss Scale API to act in optimizers and models. 
- -```python -# Define Loss Scale, optimizer and model -#1) Drop the parameter update if there is an overflow -loss_scale_manager = amp.FixedLossScaleManager() -net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9) -model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O0", loss_scale_manager=loss_scale_manager) - -#2) Execute parameter update even if overflow occurs -loss_scale = 1024.0 -loss_scale_manager = amp.FixedLossScaleManager(loss_scale, False) -net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9, loss_scale=loss_scale) -model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O0", loss_scale_manager=loss_scale_manager) - -# Run training -model.train(epoch=10, train_dataset=ds_train, callbacks=[ms.LossMonitor()]) -``` - -#### LossScale and Optimizer - -As mentioned earlier, the optimizer needs to be used together when using `FixedLossScaleManager` and `drop_overflow_update` is False. - -This is due to the fact that when configured in this way, the division between the gradient and the `loss_scale` coefficient is performed in the optimizer. The optimizer setting is the same `loss_scale` as `FixedLossScaleManager` and the training result is correct. - -> Subsequent MindSpore will optimize the use of overflow detection in different scenarios, and gradually remove the `loss_scale` parameter in the optimizer, so that there is no need to configure the `loss_scale` parameter of the optimizer. - -It should be noted that some of the optimizers provided by MindSpore, such as `AdamWeightDecay`, do not provide the `loss_scale` parameter. If you use `FixedLossScaleManager` and the `drop_overflow_update` is configured as False, and the division between the gradient and the `loss_scale` is not performed in the optimizer, you need to customize the `TrainOneStepCell` and divide the gradient by `loss_scale` in it so that the final calculation is correct, as defined as follows: - -```python -import mindspore as ms -from mindspore import nn, ops - -grad_scale = ops.MultitypeFuncGraph("grad_scale") - -@grad_scale.register("Tensor", "Tensor") -def gradient_scale(scale, grad): - return grad * ops.cast(scale, ops.dtype(grad)) - -class CustomTrainOneStepCell(nn.TrainOneStepCell): - def __init__(self, network, optimizer, sens=1.0): - super(CustomTrainOneStepCell, self).__init__(network, optimizer, sens) - self.hyper_map = ops.HyperMap() - self.reciprocal_sense = ms.Tensor(1 / sens, ms.float32) - - def scale_grad(self, gradients): - gradients = self.hyper_map(ops.partial(grad_scale, self.reciprocal_sense), gradients) - return gradients - - def construct(self, *inputs): - loss = self.network(*inputs) - sens = ops.fill(loss.dtype, loss.shape, self.sens) - # calculate gradients, the sens will equal to the loss_scale - grads = self.grad(self.network, self.weights)(*inputs, sens) - # gradients / loss_scale - grads = self.scale_grad(grads) - # reduce gradients in distributed scenarios - grads = self.grad_reducer(grads) - loss = ops.depend(loss, self.optimizer(grads)) - return loss -``` - -- network: The network participating in the training, which contains the computational logic of the forward network and the loss function, input data and labels, and output loss function values. -- optimizer: The used optimizer. -- sens: Parameters are used to receive a user-specified `loss_scale` and the gradient value is magnified by a factor of `loss_scale` during training. 
-- scale_grad function: Used for division between the gradient and the `loss_scale` coefficient to restore the gradient. -- construct function: Referring to `nn. TrainOneStepCell`, defines the computational logic for `construct` and calls `scale_grad` after acquiring the gradient. - -After customizing `TrainOneStepCell`, the training network needs to be manually built, which is as follows: - -```python -import mindspore as ms -from mindspore import nn -from mindspore.train import Model - -network = LeNet5(10) - -# Define Loss and Optimizer -net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean") -net_opt = nn.AdamWeightDecay(network.trainable_params(), learning_rate=0.01) - -# Define LossScaleManager -loss_scale = 1024.0 -loss_scale_manager = ms.FixedLossScaleManager(loss_scale, False) - -# Build train network -net_with_loss = nn.WithLossCell(network, net_loss) -net_with_train = CustomTrainOneStepCell(net_with_loss, net_opt, loss_scale) -``` - -After building the training network, it can be run directly or via Model: - -```python -epochs = 2 - -#1) Execute net_with_train -ds_train = create_dataset() - -for epoch in range(epochs): - for d in ds_train.create_dict_iterator(): - result = net_with_train(d["data"], d["label"]) - -#2) Define Model and run -model = Model(net_with_train) - -ds_train = create_dataset() - -model.train(epoch=epochs, train_dataset=ds_train) -``` - -When training with `Model` in this scenario, the `loss_scale_manager` and `amp_level` do not need to be configured, as the `CustomTrainOneStepCell` already includes mixed-precision calculation logic. - -#### DynamicLossScaleManager - -`DynamicLossScaleManager` can dynamically change the size of the scale during training, keeping the scale as large as possible without overflow. - -`DynamicLossScaleManager` first sets scale to an initial value, which is controlled by the input init_loss_scale. - -During training, if no overflow occurs, after updating the parameters scale_window times, an attempt is made to expand the value of the scale, and if an overflow occurs, the parameter update is skipped and the value of the scale is reduced, and the scale_factor is to control the number of steps that are expanded or reduced. scale_window controls the maximum number of consecutive update steps when no overflow occurs. - -The detailed use is as follows and we only need to define LossScale in `FixedLossScaleManager` sample. The part code of the optimizer and model changes as the following code: - -```python -# Define Loss Scale, optimizer and model -scale_factor = 4 -scale_window = 3000 -loss_scale_manager = ms.DynamicLossScaleManager(scale_factor, scale_window) -net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9) -model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O0", loss_scale_manager=loss_scale_manager) -``` - -> The pictures are cited from [automatic-mixed-precision](https://developer.nvidia.com/automatic-mixed-precision). diff --git a/tutorials/source_en/advanced/mixed_precision.md b/tutorials/source_en/advanced/mixed_precision.md index f66cc791c74f36a903d7ba5bbb323ad4f9173e4c..2e8eb7e483ada3226f043b029252bf517674cedb 100644 --- a/tutorials/source_en/advanced/mixed_precision.md +++ b/tutorials/source_en/advanced/mixed_precision.md @@ -1,3 +1,555 @@ -# Mixed Precision +# Enabling Mixed Precision - \ No newline at end of file + + +## Overview + +Generally, when a neural network model is trained, the default data type is FP32. 
In recent years, to accelerate training, reduce the memory occupied during network training, and store a trained model with the same precision, more and more mixed-precision training methods have been proposed in the industry. Mixed-precision training here means that both single precision (FP32) and half precision (FP16) are used in the training process.
+
+## Floating-point Data Type
+
+Floating-point data types include double-precision (FP64), single-precision (FP32), and half-precision (FP16). In the training process of a neural network model, the FP32 data type is generally used by default to represent the network model weights and other parameters. The following is a brief introduction to floating-point data types.
+
+According to [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754), floating-point data types are classified into double-precision (FP64), single-precision (FP32), and half-precision (FP16). Each type is represented by a different number of bits. FP64 is a data type that uses 8 bytes (64 bits in total) for encoding and storage, FP32 uses 4 bytes (32 bits in total), and FP16 uses 2 bytes (16 bits in total). As shown in the following figure:
+
+![fp16_vs_FP32](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/fp16_vs_fp32.png)
+
+As shown in the figure, the storage space of FP16 is half that of FP32, and the storage space of FP32 is half that of FP64. Each format consists of three parts:
+
+- The highest bit indicates the sign bit.
+- The middle bits indicate the exponent bits.
+- The low bits indicate the fraction bits.
+
+FP16 is used as an example. The first bit is the sign bit, indicating positive or negative; the next five bits indicate the exponent. All 0s and all 1s in the exponent have special uses, so the binary exponent range is 00001~11110. The last 10 bits indicate the fraction. Suppose `S` denotes the value of the sign bit, `E` denotes the decimal value of the exponent, and `fraction` denotes the decimal value of the fraction. The formula is as follows:
+
+$$x=(-1)^{S}\times2^{E-15}\times(1+\frac{fraction}{1024})$$
+
+Similarly, suppose `M` is the fraction value, the true value of a formatted FP32 is as follows:
+
+$$x=(-1)^{S}\times2^{E-127}\times(1.M)$$
+
+The true value of a formatted FP64 is as follows:
+
+$$x=(-1)^{S}\times2^{E-1023}\times(1.M)$$
+
+The maximum value that can be represented by FP16 is 0 11110 1111111111, which is calculated as follows:
+
+$$(-1)^0\times2^{30-15}\times1.1111111111 = 1.1111111111(b)\times2^{15} = 1.9990234375(d)\times2^{15} = 65504$$
+
+where `b` indicates a binary value, and `d` indicates a decimal value.
+
+The minimum positive normal value that can be represented by FP16 is 0 00001 0000000000, which is calculated as follows:
+
+$$(-1)^{0}\times2^{1-15}=2^{-14}\approx6.104\times10^{-5}$$
+
+Therefore, the representable range of FP16 is [-65504, 65504], and the smallest positive subnormal value is $2^{-24}$. Values below the subnormal minimum underflow to 0.
+
+## FP16 Training Issues
+
+Why do we need mixed precision? Compared with FP32, FP16 has the following advantages:
+
+- Reduced memory usage: The bit width of FP16 is half that of FP32, so the memory occupied by parameters such as weights is also halved. The saved memory can be used to store larger network models or train with more data.
+- Higher communication efficiency: For distributed training, especially large-scale model training, the communication overhead restricts the overall performance.
A smaller communication bit width means that communication performance can be improved, waiting time can be reduced, and data flow can be accelerated.
+- Higher computing efficiency: On special AI acceleration chips, such as the Huawei Ascend 910 and 310 series, or GPUs of the NVIDIA VOLTA architecture, FP16 computation is faster than FP32.
+
+However, using FP16 also brings some problems, the most important of which are precision overflow and rounding error.
+
+- Data overflow: Data overflow is easy to understand. The valid data range of FP16 is $[6.10\times10^{-5}, 65504]$, and that of FP32 is $[1.4\times10^{-45}, 1.7\times10^{38}]$. We can see that the valid range of FP16 is much narrower than that of FP32. When FP16 is used to replace FP32, overflow and underflow occur. In deep learning, the gradient (a first-order derivative) of each weight in the network model needs to be calculated. Gradients are usually smaller than the corresponding weight values, so underflow often occurs.
+- Rounding error: When the backward gradient of a network model is small, it can usually still be represented in FP32, but after conversion to FP16 it falls below the minimum representable interval and is forcibly rounded. For example, 0.00006666666 can be properly represented in FP32, but it becomes 0.000067 in FP16: any number that does not meet the minimum interval requirement of FP16 is rounded off.
+
+## Mixed-precision Computing Process
+
+The following figure shows the typical computation process of mixed precision in MindSpore.
+
+![mix precision](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/mix_precision_fp16.png)
+
+1. Parameters are stored in FP32 format.
+2. During the forward computation, if an FP16 operator is involved, the operator input and parameters need to be cast from FP32 to FP16.
+3. The Loss layer is set to FP32.
+4. During backward computation, the value is multiplied by the Loss Scale to avoid underflow due to small gradients.
+5. FP16 parameters are used for gradient computation, and the result is cast back to FP32.
+6. Then, the value is divided by the Loss Scale to restore the amplified gradient.
+7. The optimizer checks whether the gradient overflows. If yes, the optimizer skips the update. If no, the optimizer uses FP32 to update the original parameters.
+
+This document describes the computation process by using examples of automatic and manual mixed precision.
+
+## Loss Scale
+
+Loss Scale is mainly used in the process of mixed-precision training.
+
+In mixed-precision training, the FP16 type is used instead of FP32 for data storage, so as to reduce memory usage and improve computing speed. However, because the range representable by FP16 is much smaller than that of FP32, data underflow occurs when parameters (such as gradients) become very small during training. Loss Scale was proposed to solve the underflow of FP16 data.
+
+The main idea is to enlarge the loss by a certain multiple when calculating it. By the chain rule, the gradients are enlarged accordingly, and the optimizer divides them by the same multiple before updating the weights, thus avoiding data underflow without affecting the calculation result.
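+
+As a minimal numeric illustration of this idea (a sketch using NumPy's float16 to stand in for device FP16, not MindSpore API code):
+
+```python
+import numpy as np
+
+scale = np.float32(2.0 ** 12)
+true_grad = np.float32(2.0 ** -26)      # below FP16's smallest subnormal (2^-24)
+
+print(np.float16(true_grad))            # 0.0: the gradient underflows in FP16
+scaled = np.float16(true_grad * scale)  # 2^-14 = 6.104e-05, representable in FP16
+restored = np.float32(scaled) / scale   # divide by the same multiple before the update
+print(restored == true_grad)            # True: the value survives the FP16 round trip
+```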
There are two ways of implementing Loss Scale in MindSpore. Users can either use the functional programming style and manually call the `scale` and `unscale` methods of `StaticLossScaler` or `DynamicLossScaler` to scale the loss or gradients during training, or build the model with the `Model` interface and configure the mixed-precision level `amp_level` together with the Loss Scale manager `loss_scale_manager` as `FixedLossScaleManager` or `DynamicLossScaleManager`.
+
+First, let's take a look at why mixed precision is needed. The advantages of using FP16 to train a neural network are:
+
+- **Reduced memory occupation**: The bit width of FP16 is half that of FP32, so the memory occupied by parameters such as weights is also halved, and the saved memory can be used for a larger network model or more training data.
+- **Accelerated communication efficiency**: For distributed training, especially large model training, the communication overhead restricts the overall performance of network model training. A smaller communication bit width means that communication performance can be improved, waiting time reduced, and data flow accelerated.
+- **Higher computing efficiency**: On special AI-accelerated chips such as Huawei's Ascend 910 and 310 series, or the Titan V and Tesla V100 GPUs of the NVIDIA VOLTA architecture, FP16 operations are faster than FP32.
+
+But using FP16 also brings some problems, the most important of which are precision overflow and rounding error, and Loss Scale was proposed to solve them.
+
+As shown in the figure below, if only FP32 training is used, the model converges well, but with mixed-precision training the network model may fail to converge. The reason is that the gradient values are too small, and the FP16 representation causes data underflow, so the model does not converge, as shown in the gray part of the figure. Therefore, Loss Scale needs to be introduced.
+
+![loss_scale1](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/loss_scale1.png)
+
+The following shows the gradient distribution of one layer's activations during network model training: 68% of the activation gradients are 0, and another 4% fall in the interval $[2^{-32}, 2^{-20}]$. Representing them directly in FP16 truncates the underflowed data, and all such gradient values become 0.
+
+![loss_scale2](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/loss_scale2.png)
+
+To solve the problem of gradient underflow for very small values, the loss value computed in the forward pass is amplified; that is, the FP32 loss is multiplied by a factor, which shifts the potentially underflowing small values into the range that FP16 can represent. According to the chain rule, amplifying the loss acts on every gradient of backpropagation, which is more efficient than amplifying each gradient individually.
+
+![loss_scale3](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/tutorials/experts/source_zh_cn/others/images/loss_scale3.png)
+
+Loss amplification needs to be combined with mixed precision, and its main ideas are:
+
+- **Scale up stage**: After the forward computation of the network model, the resulting loss value is multiplied by a factor of $2^K$ before backpropagation.
+- **Scale down stage**: After backpropagation, the weight gradients are divided by $2^K$ and restored to FP32 values for storage.
+
+**Dynamic Loss Scale**: The loss scale described above uses a fixed default value to scale the loss. To make full use of the dynamic range of FP16 and better mitigate rounding error, a relatively large scaling factor should be used. In summary, the dynamic loss scaling algorithm reduces the loss scale whenever the gradients overflow and intermittently tries to increase it, so that the highest loss scale factor that does not cause overflow is used and accuracy is better preserved.
+
+The dynamic loss scale algorithm is as follows:
+
+1. The dynamic loss scaling algorithm starts with a relatively high scaling factor (such as $2^{24}$), then starts training and checks whether the numbers overflow in each iteration (Infs/NaNs);
+2. If there is no gradient overflow, the scale factor is not adjusted and the iteration continues; if gradient overflow is detected, the scale factor is halved and the check is repeated until the parameters no longer overflow;
+3. In the later stages of training, the loss has stabilized and converged, and the magnitude of the gradient updates is often small, which allows a higher loss scaling factor to prevent data underflow again;
+4. Therefore, the dynamic loss scaling algorithm attempts to increase the loss scale by a factor of F every N iterations (e.g., N=2000), and then performs step 2 to check for overflow.
+
+## Using Mixed Precision and Loss Scale in MindSpore
+
+MindSpore provides two ways of using mixed precision and loss scale.
+
+- Use functional programming: use `auto_mixed_precision` for automatic mixed precision, `all_finite` for overflow checks, and `StaticLossScaler` and `DynamicLossScaler` for manual scaling of gradients and losses.
+
+- Use the training interface `Model`: configure the input `amp_level` to set the execution policy for mixed precision, and set the input `loss_scale_manager` to `FixedLossScaleManager` or `DynamicLossScaleManager` to implement loss scaling.
+
+## Using Functional Programming for Mixed Precision and Loss Scale
+
+MindSpore provides a functional interface for mixed precision scenarios. Users can use `auto_mixed_precision` for automatic mixed precision, `all_finite` for overflow checks during training, and `StaticLossScaler` and `DynamicLossScaler` to manually perform gradient and loss scaling.
+
+The following shows common uses of LossScaler with the functional interfaces.
+
+First import the relevant libraries and define a LeNet5 network:
+
+```python
+import numpy as np
+import mindspore.nn as nn
+from mindspore.train import Accuracy
+import mindspore as ms
+from mindspore.common.initializer import Normal
+from mindspore import dataset as ds
+from mindspore.amp import auto_mixed_precision, DynamicLossScaler, all_finite
+from mindspore import ms_function, ops
+
+
+class LeNet5(nn.Cell):
+    """
+    Lenet network
+
+    Args:
+        num_class (int): Number of classes. Default: 10.
+        num_channel (int): Number of channels. Default: 1.
+
+    Returns:
+        Tensor, output tensor
+    """
+
+    def __init__(self, num_class=10, num_channel=1):
+        super(LeNet5, self).__init__()
+        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')
+        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
+        self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))
+        self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))
+        self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))
+        self.relu = nn.ReLU()
+        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
+        self.flatten = nn.Flatten()
+
+    def construct(self, x):
+        x = self.max_pool2d(self.relu(self.conv1(x)))
+        x = self.max_pool2d(self.relu(self.conv2(x)))
+        x = self.flatten(x)
+        x = self.relu(self.fc1(x))
+        x = self.relu(self.fc2(x))
+        x = self.fc3(x)
+        return x
+```
+
+Apply automatic mixed precision to the network.
+
+`auto_mixed_precision` supports the following automatic mixed precision levels:
+
+- 'O0': keep FP32.
+- 'O1': cast to FP16 according to the whitelist.
+- 'O2': keep FP32 according to the blacklist and cast the rest to FP16.
+- 'O3': cast fully to FP16.
+
+> The current blacklist and whitelist operate at Cell granularity.
+
+```python
+net = LeNet5(10)
+auto_mixed_precision(net, 'O1')
+```
+
+Instantiate the LossScaler and manually scale up the loss value when defining the forward network.
+
+```python
+loss_fn = nn.SoftmaxCrossEntropyWithLogits(reduction='mean')
+opt = nn.Adam(net.trainable_params(), learning_rate=0.01)
+
+# Define LossScaler
+loss_scaler = DynamicLossScaler(scale_value=2**10, scale_factor=2, scale_window=50)
+
+def net_forward(data, label):
+    out = net(data)
+    loss_value = loss_fn(out, label)
+    # scale up the loss value
+    scaled_loss = loss_scaler.scale(loss_value)
+    return scaled_loss, out
+```
+
+Obtain the gradient function for backpropagation.
+
+```python
+grad_fn = ops.value_and_grad(net_forward, None, net.trainable_params())
+```
+
+Define the training step: calculate the current gradients and restore the loss. Use `all_finite` to check whether the gradients overflow. If there is no overflow, unscale the gradients and update the network weights; if there is an overflow, skip the update.
+
+```python
+@ms_function
+def train_step(x, y):
+    (loss_value, _), grads = grad_fn(x, y)
+    # restore the loss value that was scaled up in net_forward
+    loss_value = loss_scaler.unscale(loss_value)
+
+    # check for overflow (Inf/NaN) in the gradients
+    is_finite = all_finite(grads)
+    if is_finite:
+        grads = loss_scaler.unscale(grads)
+        loss_value = ops.depend(loss_value, opt(grads))
+    # adjust the loss scale according to the overflow status
+    loss_scaler.adjust(is_finite)
+    return loss_value
+```
+
+Then a virtual random dataset is created for the data input of the sample model.
+
+```python
+# create dataset
+def get_data(num, img_size=(1, 32, 32), num_classes=10, is_onehot=True):
+    for _ in range(num):
+        img = np.random.randn(*img_size)
+        target = np.random.randint(0, num_classes)
+        target_ret = np.array([target]).astype(np.float32)
+        if is_onehot:
+            target_onehot = np.zeros(shape=(num_classes,))
+            target_onehot[target] = 1
+            target_ret = target_onehot.astype(np.float32)
+        yield img.astype(np.float32), target_ret
+
+def create_dataset(num_data=1024, batch_size=32, repeat_size=1):
+    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data', 'label'])
+    input_data = input_data.batch(batch_size, drop_remainder=True)
+    input_data = input_data.repeat(repeat_size)
+    return input_data
+```
+
+Execute the training.
+
+```python
+datasets = create_dataset()
+epochs = 5
+for epoch in range(epochs):
+    for data, label in datasets:
+        loss = train_step(data, label)
+```
+
+## Mixed Precision and Loss Scale by Using the Training Interface `Model`
+
+### Mixed-Precision
+
+The `Model` interface provides the input `amp_level` to achieve automatic mixed precision, or the user can set the operators involved in a Cell to FP16 via `to_float(ms.float16)` to achieve manual mixed precision.
+
+#### Automatic Mixed-Precision
+
+To use automatic mixed precision, call the `Model` API and pass in the network to be trained and the optimizer. This API converts the network model operators into FP16 operators.
+
+> Due to precision problems, the `BatchNorm` operator and the operators involved in the loss still use FP32.
+
+The specific implementation steps for using the `Model` interface are:
+
+1. Introduce the MindSpore model API `Model`.
+
+2. Define a network: This step is the same as that for defining a common network (no new configuration is required).
+
+3. Create a dataset: For this step, refer to [Data Processing](https://www.mindspore.cn/tutorials/en/master/advanced/dataset.html).
+
+4. Use the `Model` API to encapsulate the network model, optimizer, and loss function, and set the `amp_level` parameter. For details, see [MindSpore API](https://www.mindspore.cn/docs/en/master/api_python/train/mindspore.train.Model.html#mindspore.train.Model). In this step, MindSpore automatically selects appropriate operators and converts FP32 to FP16.
+
+The following is a basic code example. First, import the required libraries and make the necessary declarations.
+
+```python
+import numpy as np
+import mindspore.nn as nn
+from mindspore.train import Accuracy, Model
+import mindspore as ms
+from mindspore.common.initializer import Normal
+from mindspore import dataset as ds
+
+ms.set_context(mode=ms.GRAPH_MODE)
+ms.set_context(device_target="CPU")
+```
+
+Create a virtual random dataset for the data input of the sample model.
+
+```python
+# create dataset
+def get_data(num, img_size=(1, 32, 32), num_classes=10, is_onehot=True):
+    for _ in range(num):
+        img = np.random.randn(*img_size)
+        target = np.random.randint(0, num_classes)
+        target_ret = np.array([target]).astype(np.float32)
+        if is_onehot:
+            target_onehot = np.zeros(shape=(num_classes,))
+            target_onehot[target] = 1
+            target_ret = target_onehot.astype(np.float32)
+        yield img.astype(np.float32), target_ret
+
+def create_dataset(num_data=1024, batch_size=32, repeat_size=1):
+    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data', 'label'])
+    input_data = input_data.batch(batch_size, drop_remainder=True)
+    input_data = input_data.repeat(repeat_size)
+    return input_data
+```
+
+Taking LeNet5 as an example, set the `amp_level` parameter and use the `Model` API to encapsulate the network model, optimizer, and loss function.
+
+```python
+ds_train = create_dataset()
+
+# Initialize network
+network = LeNet5(10)
+
+# Define Loss and Optimizer
+net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
+net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9)
+# Set amp level
+model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O3")
+
+# Run training
+model.train(epoch=10, train_dataset=ds_train)
+```
+
+#### Manual Mixed-Precision
+
+MindSpore also supports manual mixed-precision. (Manual mixed-precision is not recommended unless you want to customize special networks and features.)
+
+Assume that only one Conv layer on the network uses FP16 for computation and the other layers use FP32.
+
+> Mixed precision is configured at the Cell level. The default compute type of a Cell is FP32.
+
+The following are the implementation steps of manual mixed-precision:
+
+1. Define the network: This step is similar to Step 2 of the automatic mixed-precision.
+2. Configure the mixed-precision: Use `to_float(ms.float16)` to set the operators involved in the Cell to FP16.
+3. Use `TrainOneStepCell` to encapsulate the network model and optimizer.
+
+The following is a basic code example. First, import the required libraries and make the necessary declarations.
+
+```python
+import numpy as np
+
+import mindspore.nn as nn
+from mindspore.train import Accuracy, Model
+import mindspore as ms
+from mindspore.common.initializer import Normal
+from mindspore import dataset as ds
+import mindspore.ops as ops
+
+ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")
+```
+
+After initializing the network model, declare that the Conv1 layer in LeNet5 is computed in FP16, i.e. `network.conv1.to_float(ms.float16)`.
+
+```python
+ds_train = create_dataset()
+network = LeNet5(10)
+net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
+net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9)
+network.conv1.to_float(ms.float16)
+model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2")
+model.train(epoch=2, train_dataset=ds_train)
+```
+
+> When mixed-precision is used, the backward network can be generated only by the automatic differentiation function, not by user-defined backward networks. Otherwise, MindSpore may generate an exception indicating that the data format does not match.
+
+### Loss Scale
+
+The following describes the two MindSpore APIs that implement the loss scaling algorithm: [FixedLossScaleManager](https://www.mindspore.cn/docs/en/master/api_python/amp/mindspore.amp.FixedLossScaleManager.html#mindspore.amp.FixedLossScaleManager) and [DynamicLossScaleManager](https://www.mindspore.cn/docs/en/master/api_python/amp/mindspore.amp.DynamicLossScaleManager.html#mindspore.amp.DynamicLossScaleManager).
+
+#### FixedLossScaleManager
+
+`FixedLossScaleManager` keeps the loss scale fixed during training. The scale value is controlled by the input parameter `loss_scale` and can be specified by the user; the default value is used if it is not specified.
+
+Another parameter of `FixedLossScaleManager` is `drop_overflow_update`, which controls whether parameters are updated in the event of an overflow.
+
+In general, loss scaling does not need to involve the optimizer, but when `FixedLossScaleManager` is used with `drop_overflow_update` set to False, the optimizer needs to set its `loss_scale` value, and it should be the same as that of `FixedLossScaleManager`.
+
+The detailed use of `FixedLossScaleManager` is as follows:
+
+Import the necessary libraries and declare execution in graph mode.
+
+```python
+import numpy as np
+import mindspore as ms
+import mindspore.nn as nn
+from mindspore import amp
+from mindspore.train import Accuracy, Model, LossMonitor
+from mindspore.common.initializer import Normal
+from mindspore import dataset as ds
+
+ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU")
+```
+
+Define the network model by using LeNet5 as an example; define the dataset and the interfaces commonly used in the training process.
+
+```python
+ds_train = create_dataset()
+# Initialize network
+network = LeNet5(10)
+# Define Loss and Optimizer
+net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
+```
+
+Apply the Loss Scale API to the optimizer and model.
+
+```python
+# Define Loss Scale, optimizer and model
+#1) Drop the parameter update if there is an overflow
+loss_scale_manager = amp.FixedLossScaleManager()
+net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9)
+model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O0", loss_scale_manager=loss_scale_manager)
+
+#2) Execute parameter update even if overflow occurs
+loss_scale = 1024.0
+loss_scale_manager = amp.FixedLossScaleManager(loss_scale, False)
+net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9, loss_scale=loss_scale)
+model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O0", loss_scale_manager=loss_scale_manager)
+
+# Run training
+model.train(epoch=10, train_dataset=ds_train, callbacks=[LossMonitor()])
+```
+
+#### LossScale and Optimizer
+
+As mentioned earlier, when `FixedLossScaleManager` is used with `drop_overflow_update` set to False, the optimizer must cooperate in the loss scaling.
+
+This is because, in this configuration, the division of the gradients by the `loss_scale` coefficient is performed inside the optimizer. As long as the optimizer is given the same `loss_scale` as `FixedLossScaleManager`, the training result is correct.
+
+> Future versions of MindSpore will optimize overflow detection for different scenarios and gradually remove the `loss_scale` parameter from the optimizer, after which it will no longer be necessary to configure it.
+
+It should be noted that some of the optimizers provided by MindSpore, such as `AdamWeightDecay`, do not provide the `loss_scale` parameter.
If you use `FixedLossScaleManager` with `drop_overflow_update` configured as False, and the division of the gradients by `loss_scale` is not performed in the optimizer, you need to customize `TrainOneStepCell` and divide the gradients by `loss_scale` inside it so that the final calculation is correct, as defined below:
+
+```python
+import mindspore as ms
+from mindspore.train import Model
+from mindspore import nn, ops
+
+grad_scale = ops.MultitypeFuncGraph("grad_scale")
+
+@grad_scale.register("Tensor", "Tensor")
+def gradient_scale(scale, grad):
+    # multiply each gradient by the given (reciprocal) scale coefficient
+    return grad * ops.cast(scale, ops.dtype(grad))
+
+class CustomTrainOneStepCell(nn.TrainOneStepCell):
+    def __init__(self, network, optimizer, sens=1.0):
+        super(CustomTrainOneStepCell, self).__init__(network, optimizer, sens)
+        self.hyper_map = ops.HyperMap()
+        self.reciprocal_sense = ms.Tensor(1 / sens, ms.float32)
+
+    def scale_grad(self, gradients):
+        # divide the gradients by loss_scale (multiply by its reciprocal)
+        gradients = self.hyper_map(ops.partial(grad_scale, self.reciprocal_sense), gradients)
+        return gradients
+
+    def construct(self, *inputs):
+        loss = self.network(*inputs)
+        sens = ops.fill(loss.dtype, loss.shape, self.sens)
+        # calculate gradients; sens is the initial gradient and equals loss_scale
+        grads = self.grad(self.network, self.weights)(*inputs, sens)
+        # gradients / loss_scale
+        grads = self.scale_grad(grads)
+        # reduce gradients in distributed scenarios
+        grads = self.grad_reducer(grads)
+        loss = ops.depend(loss, self.optimizer(grads))
+        return loss
+```
+
+- network: The network participating in the training, which contains the computational logic of the forward network and the loss function; it takes data and labels as input and outputs the loss value.
+- optimizer: The optimizer used.
+- sens: Receives the user-specified `loss_scale`; the gradient values are magnified by a factor of `loss_scale` during training.
+- scale_grad function: Divides the gradients by the `loss_scale` coefficient to restore them.
+- construct function: Defines the computational logic of `construct` with reference to `nn.TrainOneStepCell`, and calls `scale_grad` after acquiring the gradients.
+
+After customizing `TrainOneStepCell`, the training network needs to be built manually, as follows:
+
+```python
+from mindspore import nn
+
+network = LeNet5(10)
+
+# Define Loss and Optimizer
+net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
+net_opt = nn.AdamWeightDecay(network.trainable_params(), learning_rate=0.01)
+
+# Define the loss scale coefficient
+loss_scale = 1024.0
+
+# Build train network
+net_with_loss = nn.WithLossCell(network, net_loss)
+net_with_train = CustomTrainOneStepCell(net_with_loss, net_opt, loss_scale)
+```
+
+After building the training network, it can be run directly or via `Model`:
+
+```python
+epochs = 2
+
+#1) Execute net_with_train
+ds_train = create_dataset()
+
+for epoch in range(epochs):
+    for d in ds_train.create_dict_iterator():
+        result = net_with_train(d["data"], d["label"])
+
+#2) Define Model and run
+model = Model(net_with_train)
+
+ds_train = create_dataset()
+
+model.train(epoch=epochs, train_dataset=ds_train)
+```
+
+When training with `Model` in this scenario, `loss_scale_manager` and `amp_level` do not need to be configured, as `CustomTrainOneStepCell` already includes the loss scaling logic.
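+
+The next section introduces `DynamicLossScaleManager`, which adjusts the scale automatically. As a preview, the following minimal sketch (plain Python; the class name, defaults, and logic are illustrative only, not MindSpore's implementation) makes the dynamic adjustment rule from the algorithm described earlier concrete:
+
+```python
+class DynamicScaleSketch:
+    """Hypothetical helper: halve the scale on overflow, and multiply it by
+    `scale_factor` after `scale_window` consecutive overflow-free steps."""
+
+    def __init__(self, init_scale=2.0 ** 24, scale_factor=2.0, scale_window=2000):
+        self.scale = init_scale
+        self.scale_factor = scale_factor
+        self.scale_window = scale_window
+        self.good_steps = 0  # consecutive steps without overflow
+
+    def adjust(self, grads_are_finite):
+        if grads_are_finite:
+            self.good_steps += 1
+            if self.good_steps >= self.scale_window:
+                self.scale *= self.scale_factor  # try a larger scale
+                self.good_steps = 0
+        else:
+            # overflow: the caller skips the parameter update; back off the scale
+            self.scale = max(self.scale / self.scale_factor, 1.0)
+            self.good_steps = 0
+
+
+scaler = DynamicScaleSketch()
+scaler.adjust(False)              # overflow: scale drops from 2**24 to 2**23
+for _ in range(2000):             # 2000 clean steps: scale grows back to 2**24
+    scaler.adjust(True)
+print(scaler.scale == 2.0 ** 24)  # True
+```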
+
+#### DynamicLossScaleManager
+
+`DynamicLossScaleManager` can dynamically change the size of the scale during training, keeping the scale as large as possible without causing overflow.
+
+`DynamicLossScaleManager` first sets the scale to an initial value, which is controlled by the input `init_loss_scale`.
+
+During training, if no overflow occurs for `scale_window` consecutive parameter updates, an attempt is made to increase the value of the scale; if an overflow occurs, the parameter update is skipped and the value of the scale is reduced. `scale_factor` controls the factor by which the scale is increased or decreased, and `scale_window` controls the maximum number of consecutive overflow-free update steps before an increase is attempted.
+
+The detailed usage is as follows. We only need to change the LossScale definition in the `FixedLossScaleManager` sample; the optimizer and model part of the code changes to the following:
+
+```python
+# Define Loss Scale, optimizer and model
+scale_factor = 4
+scale_window = 3000
+loss_scale_manager = amp.DynamicLossScaleManager(scale_factor=scale_factor, scale_window=scale_window)
+net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9)
+model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O0", loss_scale_manager=loss_scale_manager)
+```
+
+> The pictures are cited from [automatic-mixed-precision](https://developer.nvidia.com/automatic-mixed-precision).
diff --git a/tutorials/source_zh_cn/advanced/mixed_precision.ipynb b/tutorials/source_zh_cn/advanced/mixed_precision.ipynb
index 05fadabb6325b41f472c2f2f84f6e75c4c8638d6..8f7e6ef8e579dfec7299499151965fdac87b6902 100644
--- a/tutorials/source_zh_cn/advanced/mixed_precision.ipynb
+++ b/tutorials/source_zh_cn/advanced/mixed_precision.ipynb
@@ -773,7 +773,7 @@
     "- scale_grad函数:用于梯度与`loss_scale`系数之间的除法运算,还原梯度。\n",
     "- construct函数:参照`nn.TrainOneStepCell`定义`construct`的计算逻辑,并在获取梯度后调用`scale_grad`。\n",
     "\n",
-    "自定义`TrainOneStepCell`后,需要手动构建训练网络,如下:"
+    "自定义`TrainOneStepCell`后,需要手动构建训练网络,如下:"
    ]
   },
   {
@@ -886,7 +886,7 @@
     "collapsed": false
    },
    "source": [
-    "> 图片引用自[automatic-mixed-precision](https://developer.nvidia.com/automatic-mixed-precision)"
+    "> 图片引用自[automatic-mixed-precision](https://developer.nvidia.com/automatic-mixed-precision)。"
    ]
   }
 ],