diff --git a/tutorials/experts/source_en/data_engine/auto_augmentation.md b/tutorials/experts/source_en/data_engine/auto_augmentation.md new file mode 100644 index 0000000000000000000000000000000000000000..c39910abf0f27cee3c14dca8ae539c4f319e0df3 --- /dev/null +++ b/tutorials/experts/source_en/data_engine/auto_augmentation.md @@ -0,0 +1,142 @@ +# Auto Augmentation + +`Ascend` `GPU` `CPU` `Data Preparation` + + + +## Overview + +MindSpore not only allows you to customize data augmentation, but also provides an auto augmentation method to automatically perform data augmentation on images based on specific policies. + +Auto augmentation can be implemented based on probability or callback parameters. + +> For a complete example, see [Application of Auto Augmentation](https://www.mindspore.cn/docs/programming_guide/en/master/enable_auto_augmentation.html). + +## Probability Based Auto Augmentation + +MindSpore provides a series of probability-based auto augmentation APIs. You can randomly select and combine various data augmentation operations to make data augmentation more flexible. + +For details about APIs, see [MindSpore API](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.transforms.html). + +### RandomApply + +The API receives a data augmentation operation list `transforms` and executes the data augmentation operations in the list in sequence at a certain probability or executes none of them. The default probability is 0.5. + +In the following code example, the `RandomCrop` and `RandomColorAdjust` operations are executed in sequence with a probability of 0.5 or none of them are executed. + +```python +import mindspore.dataset.vision.c_transforms as c_vision +from mindspore.dataset.transforms.c_transforms import RandomApply + +rand_apply_list = RandomApply([c_vision.RandomCrop(512), c_vision.RandomColorAdjust()]) +``` + +### RandomChoice + +The API receives a data augmentation operation list `transforms` and randomly selects a data augmentation operation to perform. + +In the following code example, an operation is selected from `CenterCrop` and `RandomCrop` for execution with equal probability. + +```python +import mindspore.dataset.vision.c_transforms as c_vision +from mindspore.dataset.transforms.c_transforms import RandomChoice + +rand_choice = RandomChoice([c_vision.CenterCrop(512), c_vision.RandomCrop(512)]) +``` + +### RandomSelectSubpolicy + +The API receives a preset policy list, including a series of sub-policy combinations. Each sub-policy consists of several data augmentation operations executed in sequence and their execution probabilities. + +First, a sub-policy is randomly selected for each image with equal probability, and then operations are performed according to the probability sequence in the sub-policy. + +In the following code example, two sub-policies are preset. Sub-policy 1 contains the `RandomRotation`, `RandomVerticalFlip`, and `RandomColorAdjust` operations, whose probabilities are 0.5, 1.0, and 0.8, respectively. Sub-policy 2 contains the `RandomRotation` and `RandomColorAdjust` operations, with the probabilities of 1.0 and 0.2, respectively. 
+ +```python +import mindspore.dataset.vision.c_transforms as c_vision +from mindspore.dataset.vision.c_transforms import RandomSelectSubpolicy + +policy_list = [ + [(c_vision.RandomRotation((45, 45)), 0.5), (c_vision.RandomVerticalFlip(), 1.0), (c_vision.RandomColorAdjust(), 0.8)], + [(c_vision.RandomRotation((90, 90)), 1.0), (c_vision.RandomColorAdjust(), 0.2)] + ] +policy = RandomSelectSubpolicy(policy_list) +``` + +## Callback Parameter based Auto Augmentation + +The `sync_wait` API of MindSpore supports dynamic adjustment of the data augmentation policy by batch or epoch granularity during training. You can set blocking conditions to trigger specific data augmentation operations. + +`sync_wait` blocks the entire data processing pipeline until `sync_update` triggers the customized `callback` function. The two APIs must be used together. Their descriptions are as follows: + +- sync_wait(condition_name, num_batch=1, callback=None) + + This API adds a blocking condition `condition_name` to a dataset. When `sync_update` is called, the specified `callback` function is executed. + +- sync_update(condition_name, num_batch=None, data=None) + + This API releases the block corresponding to `condition_name` and triggers the specified `callback` function for `data`. + +The following demonstrates the use of automatic data augmentation based on callback parameters. + +1. Customize the `Augment` class where `preprocess` is a custom data augmentation function and `update` is a callback function for updating the data augmentation policy. + + ```python + import mindspore.dataset as ds + import numpy as np + + class Augment: + def __init__(self): + self.ep_num = 0 + self.step_num = 0 + + def preprocess(self, input_): + return np.array((input_ + self.step_num ** self.ep_num - 1), ) + + def update(self, data): + self.ep_num = data['ep_num'] + self.step_num = data['step_num'] + ``` + +2. The data processing pipeline calls back the custom data augmentation policy update function `update`, and then performs the data augmentation operation defined in `preprocess` based on the updated policy in the `map` operation. + + ```python + arr = list(range(1, 4)) + dataset = ds.NumpySlicesDataset(arr, shuffle=False) + aug = Augment() + dataset = dataset.sync_wait(condition_name="policy", callback=aug.update) + dataset = dataset.map(operations=[aug.preprocess]) + ``` + +3. Call `sync_update` in each step to update the data augmentation policy. 
+
+    ```python
+    epochs = 5
+    itr = dataset.create_tuple_iterator(num_epochs=epochs)
+    step_num = 0
+    for ep_num in range(epochs):
+        for data in itr:
+            print("epoch: {}, step:{}, data :{}".format(ep_num, step_num, data))
+            step_num += 1
+        dataset.sync_update(condition_name="policy", data={'ep_num': ep_num, 'step_num': step_num})
+    ```
+
+    The output is as follows:
+
+    ```text
+    epoch: 0, step:0, data :[Tensor(shape=[], dtype=Int64, value= 1)]
+    epoch: 0, step:1, data :[Tensor(shape=[], dtype=Int64, value= 2)]
+    epoch: 0, step:2, data :[Tensor(shape=[], dtype=Int64, value= 3)]
+    epoch: 1, step:3, data :[Tensor(shape=[], dtype=Int64, value= 1)]
+    epoch: 1, step:4, data :[Tensor(shape=[], dtype=Int64, value= 5)]
+    epoch: 1, step:5, data :[Tensor(shape=[], dtype=Int64, value= 7)]
+    epoch: 2, step:6, data :[Tensor(shape=[], dtype=Int64, value= 6)]
+    epoch: 2, step:7, data :[Tensor(shape=[], dtype=Int64, value= 50)]
+    epoch: 2, step:8, data :[Tensor(shape=[], dtype=Int64, value= 66)]
+    epoch: 3, step:9, data :[Tensor(shape=[], dtype=Int64, value= 81)]
+    epoch: 3, step:10, data :[Tensor(shape=[], dtype=Int64, value= 1001)]
+    epoch: 3, step:11, data :[Tensor(shape=[], dtype=Int64, value= 1333)]
+    epoch: 4, step:12, data :[Tensor(shape=[], dtype=Int64, value= 1728)]
+    epoch: 4, step:13, data :[Tensor(shape=[], dtype=Int64, value= 28562)]
+    epoch: 4, step:14, data :[Tensor(shape=[], dtype=Int64, value= 38418)]
+    ```
diff --git a/tutorials/experts/source_en/data_engine/cache.md b/tutorials/experts/source_en/data_engine/cache.md
new file mode 100644
index 0000000000000000000000000000000000000000..f5b8de378219b814897a13cbb8d27c5c5ee5d700
--- /dev/null
+++ b/tutorials/experts/source_en/data_engine/cache.md
@@ -0,0 +1,510 @@
+# Single-Node Tensor Cache
+
+`Ascend` `GPU` `CPU` `Data Preparation`
+
+
+
+## Overview
+
+If you need to repeatedly access remote datasets or load datasets from disks, you can use the single-node cache operator to cache datasets in the local memory to accelerate dataset loading.
+
+The cache operator depends on the cache server started on the current node. Functioning as a daemon process independent of the training script, the cache server is mainly used to manage cached data, including storing, querying, and loading data, and writing data into the cache when the cache is not hit.
+
+If the memory space is insufficient to cache the entire dataset, you can configure the cache operator to spill the remaining data to disks.
+
+Currently, the cache service supports only single-node cache. That is, the client and server are deployed on the same machine. This service can be used in the following scenarios:
+
+- Cache the loaded original dataset.
+
+    You can use the cache in the dataset loading operator. The loaded data is stored in the cache server. If the same data is required subsequently, it can be loaded directly from the cache server, avoiding repeated loading from the disk.
+
+    ![cache on leaf pipeline](./images/cache_dataset.png)
+
+- Cache the data processed by augmentation.
+
+    You can also use the cache in the `map` operator. The data processed by augmentation (such as image cropping or resizing) is cached directly, avoiding repeated augmentation operations and reducing unnecessary computations.
+
+    ![cache on map pipeline](./images/cache_processed_data.png)
+
+    > You are advised to cache image data in `decode` + `resize` + `cache` mode. The data processed by `decode` can be directly cached only in single-node single-device mode.
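+
+A minimal sketch of these two injection points is shown below; it assumes a cache session has already been created with `cache_admin -g` (the session ID here is only a placeholder) and that a local CIFAR-10 copy exists under `./cifar-10-batches-bin/train`. The step-by-step workflow is described in the following sections.
+
+```python
+import mindspore.dataset as ds
+import mindspore.dataset.vision.c_transforms as c_vision
+
+# A cache instance bound to an existing cache session (placeholder session ID).
+dataset_cache = ds.DatasetCache(session_id=1456416665, size=0, spilling=False)
+
+# Injection point 1: cache the dataset right after it is loaded.
+data = ds.Cifar10Dataset(dataset_dir="./cifar-10-batches-bin/train", shuffle=False, cache=dataset_cache)
+
+# Injection point 2: cache the output of a map operation instead, so the
+# augmented tensors are what get stored on the cache server.
+map_cache = ds.DatasetCache(session_id=1456416665, size=0, spilling=False)
+data = ds.Cifar10Dataset(dataset_dir="./cifar-10-batches-bin/train", shuffle=False)
+data = data.map(operations=c_vision.Rescale(1.0 / 255.0, 0.0), input_columns=["image"], cache=map_cache)
+```
+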
+ +> For a complete example, see [Application of Single-Node Tensor Cache](https://www.mindspore.cn/docs/programming_guide/en/master/enable_cache.html). + +## Basic Cache Usage + +1. Configure the environment. + + Before using the cache service, you need to install MindSpore and set related environment variables. The Conda environment is used as an example. The setting method is as follows: + + ```text + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore/lib + export PATH=$PATH:{path_to_conda}/envs/{your_env_name}/bin + ``` + + You can also set the environment with the following code. + + ```python + import os + import sys + import mindspore + + python_path = "/".join(sys.executable.split("/")[:-1]) + mindspore_path = "/".join(mindspore.__file__.split("/")[:-1]) + mindspore_lib_path = os.path.join(mindspore_path, "lib") + + if 'PATH' not in os.environ: + os.environ['PATH'] = python_path + elif python_path not in os.environ['PATH']: + os.environ['PATH'] += ":" + python_path + print(os.environ['PATH']) + + os.environ['LD_LIBRARY_PATH'] = "{}:{}:{}".format(mindspore_path, mindspore_lib_path, mindspore_lib_path.split("python3.7")[0]) + print(os.environ['LD_LIBRARY_PATH']) + ``` + + > When the cache is used, the server memory may be insufficient. Therefore, you are advised to increase the swap memory space of the server to more than 100 GB before using the cache. For details about how to increase the swap memory space on Ubuntu, EulerOS, or CentOS, see [related tutorials](https://help.ubuntu.com/community/SwapFaq#How_do_I_add_a_swap_file.3F). + +2. Start the cache server. + + Before using the single-node cache service, run the following command to start the cache server: + + ```bash + cache_admin --start + ``` + + If the following information is displayed, the cache server is started successfully: + + ```text + Cache server startup completed successfully! + The cache server daemon has been created as process id 10394 and is listening on port 50052 + + Recommendation: + Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup + ``` + + `cache_admin` supports the following commands and options: + - `--start`: starts the cache server. The following options are supported: + - `--workers` or `-w`: specifies the number of worker threads on the cache server. By default, the number of worker threads is half of the number of CPUs. This parameter relies on the NUMA architecture of the server. The value will be adjusted automatically by the server if it's not a multiple of number of NUMA nodes in the machine. + - `--spilldir` or `-s`: specifies the disk file path for storing remaining data when the cached data size exceeds the memory space. The default value is '' (which means disabling spilling). + - `--hostname` or `-h`: specifies the IP address of the cache server. The default value is 127.0.0.1. + - `--port` or `-p`: specifies the port number of the cache server. The default value is 50052. + - `--loglevel` or `-l`: sets the log level. The default value is 1 (WARNING). If this option is set to 0 (INFO), excessive logs will be generated, resulting in performance deterioration. + - `--stop`: stops the cache server. + - `--generate_session` or `-g`: generates a cache session. + - `--destroy_session` or `-d`: deletes a cache session. 
+ - `--list_sessions`: displays the list of currently cached sessions and their details. + - `--server_info`:displays the configuration parameters and active session list of current server. + - `--help`: displays the help information. + + In the preceding options, you can use `-h` and `-p` to specify a server. Users can also set environment variables `MS_CACHE_HOST` and `MS_CACHE_PORT` to specify it. If hostname and port are not set, operations are performed on the server with the IP address 127.0.0.1 and port number 50052 by default. + + You can run the `ps -ef|grep cache_server` command to check whether the server is started and query server parameters. + + You can also run the `cache_admin --server_info` command to get the full list of configuration of cache server. + + ```text + $ cache_admin --server_info + Cache Server Configuration: + ---------------------------------------- + config name value + ---------------------------------------- + hostname 127.0.0.1 + port 50052 + number of workers 16 + log level 1 + spill dir None + ---------------------------------------- + Active sessions: + No active sessions. + ``` + + Where, the table of Cache Server Configuration lists five detailed configuration information. Active sessions shows the list of active session ID in current server if any. + + Cache server generates log files with filename "cache_server.\.\.log.\.\.\". Note that there might be masses of DEBUG logs printed to the screen when `GLOG_v=0` is set. + + > - To enable data spilling, you need to use `-s` to set spilling path when starting cache server. Otherwise, this feature is default to be disabled and it will bring up a memory-only cache server. + +3. Create a cache session. + + If no cache session exists on the cache server, a cache session needs to be created to obtain the cache session ID. + + ```text + $ cache_admin -g + Session created for server on port 50052: 1456416665 + ``` + + In the preceding command, 1456416665 is the cache session ID allocated by the server with port number 50052. + + You can run the `cache_admin --list_sessions` command to view all cache sessions on the current server. + + ```text + $ cache_admin --list_sessions + Listing sessions for server on port 50052 + + Session Cache Id Mem cached Disk cached Avg cache size Numa hit + 1456416665 n/a n/a n/a n/a n/a + ``` + + Output parameter description: + - `Session`: specifies the cache session ID. + - `Cache Id`: specifies the ID of the cache instance in the current cache session. `n/a` indicates that no cache instance is created. + - `Mem cached`: specifies the cached data volume in the memory. + - `Disk cached`: specifies the cached data volume in the disk. + - `Avg cache size`: specifies the average size of each line of data in the current cache. + - `Numa hit`: specifies the number of NUMA hits. A larger value indicates better time performance. + +4. Create a cache instance. + + In the Python training script, use the `DatasetCache` API to define a cache instance named `test_cache`, and specify the `session_id` parameter to a cache session ID created in the previous step. + + ```python + import mindspore.dataset as ds + + test_cache = ds.DatasetCache(session_id=1456416665, size=0, spilling=False) + ``` + + `DatasetCache` supports the following parameters: + - `session_id`: specifies the cache session ID, which can be created and obtained by running the `cache_admin -g` command. + - `size`: specifies the maximum memory space occupied by the cache. The unit is MB. 
For example, if the cache space is 512 GB, set `size` to `524288`. The default value is 0.
+    - `spilling`: determines whether to spill the remaining data to disks when the memory space exceeds the upper limit. The default value is False.
+    - `hostname`: specifies the IP address for connecting to the cache server. The default value is 127.0.0.1.
+    - `port`: specifies the port number for connecting to the cache server. The default value is 50052.
+    - `num_connections`: specifies the number of established TCP/IP connections. The default value is 12.
+    - `prefetch_size`: specifies the number of prefetched rows. The default value is 20.
+
+    > - In actual use, you are advised to run the `cache_admin -g` command to obtain a cache session ID from the cache server and use it as the `session_id` parameter, to prevent errors caused by a nonexistent cache session.
+    > - `size=0` indicates that the memory space used by the cache is not limited manually, but is automatically controlled by the cache server according to the system's total memory resources, and the cache server's memory usage is limited to within 80% of the total system memory.
+    > - Users can also manually set `size` to a proper value based on the idle memory of the machine. Before setting the `size` parameter, make sure to check the available system memory and the size of the dataset to be loaded. If the memory used by the cache server or the dataset size exceeds the available system memory, the server may break down, restart, or automatically shut down, or the training process may fail.
+    > - `spilling=True` indicates that the remaining data is written to disks when the memory space is insufficient. Therefore, ensure that you have the write permission on the configured disk path and that the disk space is sufficient to store the remaining cache data. Note that if no spilling path is set when the cache server starts, setting `spilling=True` will raise an error when calling the API.
+    > - `spilling=False` indicates that no data is written once the configured memory space is used up on the cache server.
+    > - If a dataset that does not support random access (such as `TFRecordDataset`) is used to load data and the cache service is enabled, ensure that the entire dataset is stored locally. In this scenario, if the local memory space is insufficient to store all data, spilling must be enabled to spill data to disks.
+    > - `num_connections` and `prefetch_size` are internal performance tuning parameters. Generally, you do not need to set these two parameters.
+
+5. Insert a cache instance.
+
+    Currently, the cache service can be used to cache both original datasets and datasets processed by augmentation. The following example shows both usage methods.
+
+    Note that you need to create a cache instance for each of the two examples according to step 4, and use the created `test_cache` as the `cache` parameter in the dataset loading operator or map operator.
+
+    The CIFAR-10 dataset is used in the following two examples. Before running the sample, download and store the CIFAR-10 dataset by referring to [Loading Dataset](https://www.mindspore.cn/docs/programming_guide/en/master/dataset_loading.html#cifar-10-100).
+
+    ```text
+    ./datasets/cifar-10-batches-bin
+    ├── readme.html
+    ├── test
+    │   └── test_batch.bin
+    └── train
+        ├── batches.meta.txt
+        ├── data_batch_1.bin
+        ├── data_batch_2.bin
+        ├── data_batch_3.bin
+        ├── data_batch_4.bin
+        └── data_batch_5.bin
+    ```
+
+    ```python
+    import os
+    import requests
+    import tarfile
+    import zipfile
+    import shutil
+
+    requests.packages.urllib3.disable_warnings()
+
+    def download_dataset(url, target_path):
+        """ download and unzip the dataset """
+        if not os.path.exists(target_path):
+            os.makedirs(target_path)
+        download_file = url.split("/")[-1]
+        if not os.path.exists(download_file):
+            res = requests.get(url, stream=True, verify=False)
+            if download_file.split(".")[-1] not in ["tgz", "zip", "tar", "gz"]:
+                download_file = os.path.join(target_path, download_file)
+            with open(download_file, "wb") as f:
+                for chunk in res.iter_content(chunk_size=512):
+                    if chunk:
+                        f.write(chunk)
+        if download_file.endswith("zip"):
+            z = zipfile.ZipFile(download_file, "r")
+            z.extractall(path=target_path)
+            z.close()
+        if download_file.endswith(".tar.gz") or download_file.endswith(".tar") or download_file.endswith(".tgz"):
+            t = tarfile.open(download_file)
+            names = t.getnames()
+            for name in names:
+                t.extract(name, target_path)
+            t.close()
+        print("The {} file is downloaded and saved in the path {} after processing".format(os.path.basename(url), target_path))
+
+    download_dataset("https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz", "./datasets")
+    test_path = "./datasets/cifar-10-batches-bin/test"
+    train_path = "./datasets/cifar-10-batches-bin/train"
+    os.makedirs(test_path, exist_ok=True)
+    os.makedirs(train_path, exist_ok=True)
+    if not os.path.exists(os.path.join(test_path, "test_batch.bin")):
+        shutil.move("./datasets/cifar-10-batches-bin/test_batch.bin", test_path)
+    [shutil.move("./datasets/cifar-10-batches-bin/"+i, train_path) for i in os.listdir("./datasets/cifar-10-batches-bin/") if os.path.isfile("./datasets/cifar-10-batches-bin/"+i) and not i.endswith(".html") and not os.path.exists(os.path.join(train_path, i))]
+    ```
+
+    - Cache the original loaded dataset.
+
+        ```python
+        dataset_dir = "./datasets/cifar-10-batches-bin/train"
+
+        # apply cache to dataset
+        data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=4, shuffle=False, num_parallel_workers=1, cache=test_cache)
+
+        num_iter = 0
+        for item in data.create_dict_iterator(num_epochs=1):  # each data is a dictionary
+            # in this example, each dictionary has a key "image"
+            print("{} image shape: {}".format(num_iter, item["image"].shape))
+            num_iter += 1
+        ```
+
+        The output is as follows:
+
+        ```text
+        0 image shape: (32, 32, 3)
+        1 image shape: (32, 32, 3)
+        2 image shape: (32, 32, 3)
+        3 image shape: (32, 32, 3)
+        ```
+
+        You can run the `cache_admin --list_sessions` command to check whether there are four data records in the current session. If yes, the data is successfully cached.
+
+        ```text
+        $ cache_admin --list_sessions
+        Listing sessions for server on port 50052
+
+             Session    Cache Id  Mem cached  Disk cached  Avg cache size  Numa hit
+          1456416665   821590605           4          n/a            3226         4
+        ```
+
+    - Cache the data processed by augmentation.
+ + ```python + import mindspore.dataset.vision.c_transforms as c_vision + + dataset_dir = "cifar-10-batches-bin/" + + # apply cache to dataset + data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=5, shuffle=False, num_parallel_workers=1) + + # apply cache to map + rescale_op = c_vision.Rescale(1.0 / 255.0, -1.0) + data = data.map(input_columns=["image"], operations=rescale_op, cache=test_cache) + + num_iter = 0 + for item in data.create_dict_iterator(num_epochs=1): # each data is a dictionary + # in this example, each dictionary has a keys "image" + print("{} image shape: {}".format(num_iter, item["image"].shape)) + num_iter += 1 + ``` + + The output is as follows: + + ```text + 0 image shape: (32, 32, 3) + 1 image shape: (32, 32, 3) + 2 image shape: (32, 32, 3) + 3 image shape: (32, 32, 3) + 4 image shape: (32, 32, 3) + ``` + + You can run the `cache_admin --list_sessions` command to check whether there are five data records in the current session. If yes, the data is successfully cached. + + ```text + $ cache_admin --list_sessions + Listing sessions for server on port 50052 + + Session Cache Id Mem cached Disk cached Avg cache size Numa hit + 1456416665 3618046178 5 n/a 12442 5 + ``` + +6. Destroy the cache session. + + After the training is complete, you can destroy the current cache and release the memory. + + ```text + $ cache_admin --destroy_session 1456416665 + Drop session successfully for server on port 50052 + ``` + + The preceding command is used to destroy the cache with the session ID 1456416665 on the server with the port number 50052. + + If you choose not to destroy the cache, the cached data still exists in the cache session. You can use the cache when starting the training script next time. + +7. Stop the cache server. + + After using the cache server, you can stop it. This operation will destroy all cache sessions on the current server and release the memory. + + ```text + $ cache_admin --stop + Cache server on port 50052 has been stopped successfully. + ``` + + The preceding command is used to shut down the server with the port number 50052. + + If you choose not to shut down the server, the cache sessions on the server will be retained for future use. During the next training, you can create a cache session or reuse the existing cache. + +## Cache Sharing + +During the single-node multi-device distributed training, the cache operator allows multiple same training scripts to share the same cache and read and write data from the cache. + +1. Start the cache server. + + ```text + $ cache_admin --start + Cache server startup completed successfully! + The cache server daemon has been created as process id 39337 and listening on port 50052 + + Recommendation: + Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup + ``` + +2. Create a cache session. + + Create the shell script `cache.sh` for starting Python training and run the following command to generate a cache session ID: + + ```bash + #!/bin/bash + # This shell script will launch parallel pipelines + + # get path to dataset directory + if [ $# != 1 ] + then + echo "Usage: sh cache.sh DATASET_PATH" + exit 1 + fi + dataset_path=$1 + + # generate a session id that these parallel pipelines can share + result=$(cache_admin -g 2>&1) + rc=$? + if [ $rc -ne 0 ]; then + echo "some error" + exit 1 + fi + + # grab the session id from the result string + session_id=$(echo $result | awk '{print $NF}') + ``` + +3. 
Pass the cache session ID to the training script. + + Continue to write the shell script and add the following command to pass `session_id` and other parameters when the Python training is started: + + ```bash + # make the session_id available to the python scripts + num_devices=4 + + for p in $(seq 0 $((${num_devices}-1))); do + python my_training_script.py --num_devices "$num_devices" --device "$p" --session_id $session_id --dataset_path $dataset_path + done + ``` + + > Complete sample code: [cache.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/cache.sh) + +4. Create and apply a cache instance. + + CIFAR-10 dataset is used in the following example. Before running the sample, download and store the CIFAR-10 dataset by referring to [Loading Dataset](https://www.mindspore.cn/docs/programming_guide/en/master/dataset_loading.html#cifar-10-100). The directory structure is as follows: + + ```text + ├─cache.sh + ├─my_training_script.py + └─cifar-10-batches-bin + ├── batches.meta.txt + ├── data_batch_1.bin + ├── data_batch_2.bin + ├── data_batch_3.bin + ├── data_batch_4.bin + ├── data_batch_5.bin + ├── readme.html + └── test_batch.bin + ``` + + Create and write the Python script `my_training_script.py`. Use the following code to receive `session_id` and pass it as a parameter when defining a cache instance. + + ```python + import argparse + import mindspore.dataset as ds + + parser = argparse.ArgumentParser(description='Cache Example') + parser.add_argument('--num_devices', type=int, default=1, help='Device num.') + parser.add_argument('--device', type=int, default=0, help='Device id.') + parser.add_argument('--session_id', type=int, default=1, help='Session id.') + parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path') + args_opt = parser.parse_args() + + # apply cache to dataset + test_cache = ds.DatasetCache(session_id=args_opt.session_id, size=0, spilling=False) + dataset = ds.Cifar10Dataset(dataset_dir=args_opt.dataset_path, num_samples=4, shuffle=False, num_parallel_workers=1, + num_shards=args_opt.num_devices, shard_id=args_opt.device, cache=test_cache) + num_iter = 0 + for _ in dataset.create_dict_iterator(): + num_iter += 1 + print("Got {} samples on device {}".format(num_iter, args_opt.device)) + ``` + + > Complete sample code: [my_training_script.py](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/my_training_script.py) + +5. Execute the training script. + + Execute the shell script `cache.sh` to enable distributed training. + + ```text + $ sh cache.sh cifar-10-batches-bin/ + Got 4 samples on device 0 + Got 4 samples on device 1 + Got 4 samples on device 2 + Got 4 samples on device 3 + ``` + + You can run the `cache_admin --list_sessions` command to check whether only one group of data exists in the current session. If yes, cache sharing is successful. + + ```text + $ cache_admin --list_sessions + Listing sessions for server on port 50052 + + Session Cache Id Mem cached Disk cached Avg cache size Numa hit + 3392558708 821590605 16 n/a 3227 16 + ``` + +6. Destroy the cache session. + + After the training is complete, you can destroy the current cache and release the memory. + + ```text + $ cache_admin --destroy_session 3392558708 + Drop session successfully for server on port 50052 + ``` + +7. Stop the cache server. + + After using the cache server, you can stop it. + + ```text + $ cache_admin --stop + Cache server on port 50052 has been stopped successfully. 
+    ```
+
+## Limitations
+
+- Currently, dataset classes such as `GraphDataset`, `GeneratorDataset`, `PaddedDataset`, and `NumpySlicesDataset` do not support cache. `GeneratorDataset`, `PaddedDataset`, and `NumpySlicesDataset` belong to `GeneratorOp`, so their error message is displayed as "There is currently no support for GeneratorOp under cache."
+- Data processed by `batch`, `concat`, `filter`, `repeat`, `skip`, `split`, `take`, and `zip` does not support cache.
+- Data processed by random data augmentation operations (such as `RandomCrop`) does not support cache.
+- The same cache instance cannot be nested in different locations of the same pipeline.
+
+## Cache Performance Tuning
+
+The cache service can significantly improve performance in the following scenarios:
+
+- Cache the data processed by augmentation, especially when the data processing pipeline contains high-complexity operations such as decode. In this scenario, you do not need to repeat the data augmentation operations in each epoch, which saves a lot of time.
+- Use the cache service during simple network training and inference. Compared with complex networks, simple networks require less training time, so the relative time saving brought by the cache is more significant in this scenario.
+
+However, the cache may not bring benefits in the following scenarios:
+
+- The system memory is insufficient or the cache is not hit, resulting in poor cache service time performance. You can check whether the available system memory is sufficient and set a proper cache size before using the cache.
+- Too much cache spilling will deteriorate the time performance. Therefore, try not to spill the cache to disks when datasets that support random access (such as `ImageFolderDataset`) are used for data loading.
+- Using the cache for NLP networks such as BERT usually does not bring performance gains, because in NLP scenarios there are generally no high-complexity data augmentation operations such as decode.
+- There is an expected startup overhead when using the cache with non-mappable datasets such as `TFRecordDataset`. According to the current design, all rows must be cached to the cache server before the first training epoch, so the first epoch can take longer than in the non-cache case.
diff --git a/tutorials/experts/source_en/data_engine/eager.md b/tutorials/experts/source_en/data_engine/eager.md
new file mode 100644
index 0000000000000000000000000000000000000000..c8e32f089bfb43ac3fe82c25745511ec686b1536
--- /dev/null
+++ b/tutorials/experts/source_en/data_engine/eager.md
@@ -0,0 +1,188 @@
+# Lightweight Data Processing
+
+`Ascend` `GPU` `CPU` `Data Preparation`
+
+
+
+When resources permit, data transformations are generally executed in data pipeline mode to pursue higher performance. That is, users define a `map` operator that applies the augmentations within the data pipeline. As shown in the figure below, the `map` operator contains three transformations: `Resize`, `Crop`, and `HWC2CHW`. When the pipeline starts, the `map` operator applies these transformations to the data in sequence.
+
+![pipelinemode1](./images/pipeline_mode_en.jpeg)
+
+Although the data pipeline can process input data quickly, defining a pipeline adds boilerplate code, and sometimes users just want to focus on the transformations themselves and apply them to small-scale data. In this case, a full data pipeline is not necessary.
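+
+For comparison, a pipeline-mode version of the three transformations in the figure might look like the following minimal sketch (the image-folder path and the resize/crop sizes are placeholders):
+
+```python
+import mindspore.dataset as ds
+import mindspore.dataset.vision.c_transforms as c_vision
+
+# Pipeline mode: the transformations are bound to a dataset through the map operator.
+dataset = ds.ImageFolderDataset("/path/to/image_folder_directory")
+transforms_list = [c_vision.Decode(),
+                   c_vision.Resize((256, 256)),
+                   c_vision.CenterCrop(224),
+                   c_vision.HWC2CHW()]
+dataset = dataset.map(operations=transforms_list, input_columns="image")
+
+# Data is only produced when the pipeline is iterated.
+for item in dataset.create_dict_iterator(num_epochs=1):
+    print(item["image"].shape)
+```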
+
+Therefore, MindSpore provides a lightweight data processing approach for executing these data augmentations, called `Eager mode`.
+
+In `Eager mode`, data augmentations do not rely on the `map` operator; instead, they are invoked directly as callable objects. The code is simpler because the results are obtained immediately. Eager mode is recommended for lightweight scenarios such as small augmentation experiments and model inference.
+
+![eagermode1](./images/eager_mode_en.jpeg)
+
+MindSpore currently supports executing various data augmentations in `Eager mode`, as shown below. For more details, please refer to the API documentation.
+
+- [vision module](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.vision.html)
+
+    - Submodule c_transforms, image augmentation operators based on OpenCV.
+    - Submodule py_transforms, image augmentation operators based on Pillow.
+
+- [text module](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.text.html#mindspore-dataset-text-transforms)
+
+    - Submodule transforms, text processing operators.
+
+- [transforms module](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.transforms.html)
+
+    - Submodule c_transforms, general-purpose data augmentation operators based on C++.
+    - Submodule py_transforms, general-purpose data augmentation operators based on Python.
+
+Note: All data augmentation operators described in [Image Processing and Enhancement](https://www.mindspore.cn/docs/programming_guide/en/master/augmentation.html) and [Text Processing and Enhancement](https://www.mindspore.cn/docs/programming_guide/en/master/tokenizer.html) can be executed in Eager mode.
+
+## Example
+
+The following example introduces how to execute data augmentations in `Eager mode`.
+
+> To use `Eager mode`, just treat each data augmentation as an executable function and call it directly.
+
+### Data Preparation
+
+Download the image and save it to the specified location.
+
+```python
+import os
+import requests
+
+requests.packages.urllib3.disable_warnings()
+
+def download_dataset(dataset_url, path):
+    filename = dataset_url.split("/")[-1]
+    save_path = os.path.join(path, filename)
+    if os.path.exists(save_path):
+        return
+    if not os.path.exists(path):
+        os.makedirs(path)
+    res = requests.get(dataset_url, stream=True, verify=False)
+    with open(save_path, "wb") as f:
+        for chunk in res.iter_content(chunk_size=512):
+            if chunk:
+                f.write(chunk)
+    print("The {} file is downloaded and saved in the path {} after processing".format(os.path.basename(dataset_url), path))
+
+download_dataset("https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/banana.jpg", ".")
+```
+
+### vision
+
+This example mixes `c_transforms` and `py_transforms` operators from the `vision` module to transform a given image.
+
+You only need to focus on which data augmentations to use, without writing any data pipeline code.
+
+In Eager mode, the vision operators accept `numpy.ndarray` or `PIL.Image` data as input.
+
+```python
+import numpy as np
+from PIL import Image
+import matplotlib.pyplot as plt
+import mindspore.dataset.vision.c_transforms as C
+import mindspore.dataset.vision.py_transforms as P
+
+img_ori = Image.open("banana.jpg").convert("RGB")
+print("Image.type: {}, Image.shape: {}".format(type(img_ori), img_ori.size))
+
+# Define a Resize op from c_transforms and execute it immediately
+op1 = C.Resize(size=(320))
+img = op1(img_ori)
+print("Image.type: {}, Image.shape: {}".format(type(img), img.shape))
+
+# Define a CenterCrop op from c_transforms and execute it immediately
+op2 = C.CenterCrop((280, 280))
+img = op2(img)
+print("Image.type: {}, Image.shape: {}".format(type(img), img.shape))
+
+# Define a Pad op from py_transforms and execute it immediately
+# Before calling Pad, you need to call ToPIL()
+op3 = P.ToPIL()
+op4 = P.Pad(40)
+img = op4(op3(img))
+print("Image.type: {}, Image.shape: {}".format(type(img), img.size))
+
+# Show the result
+plt.subplot(1, 2, 1)
+plt.imshow(img_ori)
+plt.title("original image")
+plt.subplot(1, 2, 2)
+plt.imshow(img)
+plt.title("transformed image")
+plt.show()
+```
+
+The output is as follows:
+
+```text
+Image.type: <class 'PIL.Image.Image'>, Image.shape: (356, 200)
+Image.type: <class 'numpy.ndarray'>, Image.shape: (320, 570, 3)
+Image.type: <class 'numpy.ndarray'>, Image.shape: (280, 280, 3)
+Image.type: <class 'PIL.Image.Image'>, Image.shape: (360, 360)
+```
+
+The following shows the processed image.
+
+![eager_mode](./images/eager_mode.png)
+
+Augmentation operators that can be run in Eager mode are listed in the following API documentation: [mindspore.dataset.transforms](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.transforms.html), [mindspore.dataset.vision](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.vision.html), [mindspore.dataset.text.transforms](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.text.html#mindspore-dataset-text-transforms).
+
+### text
+
+This example will transform the given text using the `transforms` operators in the `text` module.
+
+In Eager mode, the text operators accept a string or `numpy.ndarray` data as input.
+
+```python
+import mindspore.dataset.text.transforms as text
+from mindspore import dtype as mstype
+
+# Define a WhitespaceTokenizer op and execute it immediately
+txt = "Welcome to Beijing !"
+txt = text.WhitespaceTokenizer()(txt)
+print("Tokenize result: {}".format(txt))
+
+# Define a ToNumber op and execute it immediately
+txt = ["123456"]
+to_number = text.ToNumber(mstype.int32)
+txt = to_number(txt)
+print("ToNumber result: {}, type: {}".format(txt, type(txt[0])))
+```
+
+```text
+Tokenize result: ['Welcome' 'to' 'Beijing' '!']
+ToNumber result: [123456], type: <class 'numpy.int32'>
+```
+
+### transforms
+
+This example will transform the given data using the `c_transforms` operators in the `transforms` module.
+
+In Eager mode, the general transforms operators accept `numpy.ndarray` data as input.
+
+```python
+import numpy as np
+import mindspore.dataset.transforms.c_transforms as trans
+
+# Define a Fill op and execute it immediately
+data = np.array([1, 2, 3, 4, 5])
+fill = trans.Fill(0)
+data = fill(data)
+print("Fill result: ", data)
+
+# Define a OneHot op and execute it immediately
+label = np.array(2)
+onehot = trans.OneHot(num_classes=5)
+label = onehot(label)
+print("OneHot result: ", label)
+```
+
+```text
+Fill result: [0 0 0 0 0]
+OneHot result: [0 0 1 0 0]
+```
diff --git a/tutorials/experts/source_en/data_engine/enable_auto_augmentation.md b/tutorials/experts/source_en/data_engine/enable_auto_augmentation.md
new file mode 100644
index 0000000000000000000000000000000000000000..097f3299fa39caf3dc6acbea85a776e75de4eb79
--- /dev/null
+++ b/tutorials/experts/source_en/data_engine/enable_auto_augmentation.md
@@ -0,0 +1,232 @@
+# Application of Auto Augmentation
+
+`Ascend` `GPU` `CPU` `Data Preparation`
+
+
+
+## Overview
+
+Auto Augmentation [1] finds a suitable image augmentation scheme for a specific dataset by searching through a series of image augmentation sub-policies. The `c_transforms` module of MindSpore provides various C++ operators that are used in Auto Augmentation. Users can also customize functions or operators to implement Auto Augmentation. For more details about the MindSpore operators, see the [API document](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.vision.html).
+
+The mapping between MindSpore operators and Auto Augmentation operators is as follows:
+
+| Auto Augmentation Operators | MindSpore Operators | Introduction |
+| :------: | :------ | ------ |
+| shearX | RandomAffine | Horizontal shear |
+| shearY | RandomAffine | Vertical shear |
+| translateX | RandomAffine | Horizontal translation |
+| translateY | RandomAffine | Vertical translation |
+| rotate | RandomRotation | Rotational transformation |
+| color | RandomColor | Color transformation |
+| posterize | RandomPosterize | Reduce the number of bits per color channel |
+| solarize | RandomSolarize | Invert all pixels within the specified threshold range |
+| contrast | RandomColorAdjust | Contrast adjustment |
+| sharpness | RandomSharpness | Sharpness adjustment |
+| brightness | RandomColorAdjust | Brightness adjustment |
+| autocontrast | AutoContrast | Maximize image contrast |
+| equalize | Equalize | Equalize image histogram |
+| invert | Invert | Image inversion |
+
+## Auto Augmentation on ImageNet
+
+This tutorial uses the implementation of Auto Augmentation on the ImageNet dataset as an example.
+
+The data augmentation policy for the ImageNet dataset contains 25 sub-policies, and each sub-policy contains two transformations. A combination of sub-policies is randomly selected for each image in a batch, and each transformation in the sub-policy is executed based on a preset probability.
+
+Users can use the `RandomSelectSubpolicy` interface of the `c_transforms` module in MindSpore to implement Auto Augmentation. The standard data augmentation method in ImageNet classification training includes the following steps:
+
+- `RandomCropDecodeResize`: Randomly crop, then decode and resize.
+
+- `RandomHorizontalFlip`: Randomly flip horizontally.
+
+- `Normalize`: Normalize the data.
+
+- `HWC2CHW`: Change the image channel order.
+
+Add the Auto Augmentation transformation after `RandomCropDecodeResize` as follows:
+
+1. Import related modules.
+ + ```python + import matplotlib.pyplot as plt + + import mindspore.dataset as ds + import mindspore.dataset.transforms.c_transforms as c_transforms + import mindspore.dataset.vision.c_transforms as c_vision + from mindspore import dtype as mstype + ``` + +2. Define the mapping from the MindSpore operators to the Auto Augmentation operators. + + ```python + # define Auto Augmentation operators + PARAMETER_MAX = 10 + + def float_parameter(level, maxval): + return float(level) * maxval / PARAMETER_MAX + + def int_parameter(level, maxval): + return int(level * maxval / PARAMETER_MAX) + + def shear_x(level): + v = float_parameter(level, 0.3) + return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, shear=(-v,-v)), c_vision.RandomAffine(degrees=0, shear=(v, v))]) + + def shear_y(level): + v = float_parameter(level, 0.3) + return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, shear=(0, 0, -v,-v)), c_vision.RandomAffine(degrees=0, shear=(0, 0, v, v))]) + + def translate_x(level): + v = float_parameter(level, 150 / 331) + return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(-v,-v)), c_vision.RandomAffine(degrees=0, translate=(v, v))]) + + def translate_y(level): + v = float_parameter(level, 150 / 331) + return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(0, 0, -v,-v)), c_vision.RandomAffine(degrees=0, translate=(0, 0, v, v))]) + + def color_impl(level): + v = float_parameter(level, 1.8) + 0.1 + return c_vision.RandomColor(degrees=(v, v)) + + def rotate_impl(level): + v = int_parameter(level, 30) + return c_transforms.RandomChoice([c_vision.RandomRotation(degrees=(-v, -v)), c_vision.RandomRotation(degrees=(v, v))]) + + def solarize_impl(level): + level = int_parameter(level, 256) + v = 256 - level + return c_vision.RandomSolarize(threshold=(0, v)) + + def posterize_impl(level): + level = int_parameter(level, 4) + v = 4 - level + return c_vision.RandomPosterize(bits=(v, v)) + + def contrast_impl(level): + v = float_parameter(level, 1.8) + 0.1 + return c_vision.RandomColorAdjust(contrast=(v, v)) + + def autocontrast_impl(level): + return c_vision.AutoContrast() + + def sharpness_impl(level): + v = float_parameter(level, 1.8) + 0.1 + return c_vision.RandomSharpness(degrees=(v, v)) + + def brightness_impl(level): + v = float_parameter(level, 1.8) + 0.1 + return c_vision.RandomColorAdjust(brightness=(v, v)) + ``` + +3. Define the Auto Augmentation policy for the ImageNet dataset. 
+ + ```python + # define the Auto Augmentation policy + imagenet_policy = [ + [(posterize_impl(8), 0.4), (rotate_impl(9), 0.6)], + [(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)], + [(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)], + [(posterize_impl(7), 0.6), (posterize_impl(6), 0.6)], + [(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)], + + [(c_vision.Equalize(), 0.4), (rotate_impl(8), 0.8)], + [(solarize_impl(3), 0.6), (c_vision.Equalize(), 0.6)], + [(posterize_impl(5), 0.8), (c_vision.Equalize(), 1.0)], + [(rotate_impl(3), 0.2), (solarize_impl(8), 0.6)], + [(c_vision.Equalize(), 0.6), (posterize_impl(6), 0.4)], + + [(rotate_impl(8), 0.8), (color_impl(0), 0.4)], + [(rotate_impl(9), 0.4), (c_vision.Equalize(), 0.6)], + [(c_vision.Equalize(), 0.0), (c_vision.Equalize(), 0.8)], + [(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)], + [(color_impl(4), 0.6), (contrast_impl(8), 1.0)], + + [(rotate_impl(8), 0.8), (color_impl(2), 1.0)], + [(color_impl(8), 0.8), (solarize_impl(7), 0.8)], + [(sharpness_impl(7), 0.4), (c_vision.Invert(), 0.6)], + [(shear_x(5), 0.6), (c_vision.Equalize(), 1.0)], + [(color_impl(0), 0.4), (c_vision.Equalize(), 0.6)], + + [(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)], + [(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)], + [(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)], + [(color_impl(4), 0.6), (contrast_impl(8), 1.0)], + [(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)], + ] + ``` + +4. Add Auto Augmentation transformations after the `RandomCropDecodeResize` operation. + + ```python + def create_dataset(dataset_path, do_train, repeat_num=1, batch_size=32, shuffle=True, num_samples=5, target="Ascend"): + # create a train or eval imagenet2012 dataset for ResNet-50 + dataset = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, + shuffle=shuffle, num_samples=num_samples) + + image_size = 224 + mean = [0.485 * 255, 0.456 * 255, 0.406 * 255] + std = [0.229 * 255, 0.224 * 255, 0.225 * 255] + + # define map operations + if do_train: + trans = [ + c_vision.RandomCropDecodeResize(image_size, scale=(0.08, 1.0), ratio=(0.75, 1.333)), + ] + + post_trans = [ + c_vision.RandomHorizontalFlip(prob=0.5), + ] + else: + trans = [ + c_vision.Decode(), + c_vision.Resize(256), + c_vision.CenterCrop(image_size), + c_vision.Normalize(mean=mean, std=std), + c_vision.HWC2CHW() + ] + dataset = dataset.map(operations=trans, input_columns="image") + if do_train: + dataset = dataset.map(operations=c_vision.RandomSelectSubpolicy(imagenet_policy), input_columns=["image"]) + dataset = dataset.map(operations=post_trans, input_columns="image") + type_cast_op = c_transforms.TypeCast(mstype.int32) + dataset = dataset.map(operations=type_cast_op, input_columns="label") + # apply the batch operation + dataset = dataset.batch(batch_size, drop_remainder=True) + # apply the repeat operation + dataset = dataset.repeat(repeat_num) + + return dataset + ``` + +5. Verify the effects of Auto Augmentation. + + ```python + # Define the path to image folder directory. This directory needs to contain sub-directories which contain the images. 
+ DATA_DIR = "/path/to/image_folder_directory" + dataset = create_dataset(dataset_path=DATA_DIR, do_train=True, batch_size=5, shuffle=False, num_samples=5) + + epochs = 5 + itr = dataset.create_dict_iterator() + fig=plt.figure(figsize=(8, 8)) + columns = 5 + rows = 5 + + step_num = 0 + for ep_num in range(epochs): + for data in itr: + step_num += 1 + for index in range(rows): + fig.add_subplot(rows, columns, ep_num * rows + index + 1) + plt.imshow(data['image'].asnumpy()[index]) + plt.show() + ``` + + > For better visualization, only five images are read from the dataset without performing `shuffle`, `Normalize`, nor `HWC2CHW` operations. + + ![augment](./images/auto_augmentation.png) + + The images above visualize the effect of Auto Augmentation. The horizontal direction displays five images in one batch, and the vertical direction displays five batches. + +## References + +[1] [AutoAugment: Learning Augmentation Policies from Data](https://arxiv.org/abs/1805.09501). diff --git a/tutorials/experts/source_en/data_engine/enable_cache.md b/tutorials/experts/source_en/data_engine/enable_cache.md new file mode 100644 index 0000000000000000000000000000000000000000..cb0cac5cac1ee62795df73980d607d9bfd857b01 --- /dev/null +++ b/tutorials/experts/source_en/data_engine/enable_cache.md @@ -0,0 +1,401 @@ +# Application of Single-Node Tensor Cache + +`Ascend` `GPU` `CPU` `Data Preparation` + + + +## Overview + +If you need to repeatedly access remote datasets or read datasets from disks, you can use the single-node cache operator to cache datasets in the local memory to accelerate dataset reading. + +This tutorial demonstrates how to use the single-node cache service, and shows several best practices of using cache to improve the performance of network training or evaluating. + +## Quick Start + +1. Configuring the Environment + + Before using the cache service, you need to install MindSpore and set related environment variables. The Conda environment is used as an example. The setting method is as follows: + + ```text + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore:{path_to_conda}/envs/{your_env_name}/lib/python3.7/site-packages/mindspore/lib + export PATH=$PATH:{path_to_conda}/envs/{your_env_name}/bin + ``` + +2. Starting the Cache Server + + Before using the single-node cache service, you need to start the cache server. + + ```text + $ cache_admin --start + Cache server startup completed successfully! + The cache server daemon has been created as process id 10394 and is listening on port 50052 + + Recommendation: + Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup + ``` + + If the system displays a message indicating that the `libpython3.7m.so.1.0` file cannot be found, search for the file path in the virtual environment and set environment variables. + + ```text + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:{path_to_conda}/envs/{your_env_name}/lib + ``` + +3. Creating a Cache Session + + If no cache session exists on the cache server, a cache session needs to be created to obtain the cache session ID. + + ```text + $ cache_admin -g + Session created for server on port 50052: 1493732251 + ``` + + The cache session ID is randomly allocated by the server. + +4. 
Creating a Cache Instance + + Create the Python script `my_training_script.py`, use the `DatasetCache` API to define a cache instance named `some_cache` in the script, and specify the `session_id` parameter to a cache session ID created in the previous step. + + ```python + import mindspore.dataset as ds + + some_cache = ds.DatasetCache(session_id=1493732251, size=0, spilling=False) + ``` + +5. Inserting a Cache Instance + + The following uses the CIFAR-10 dataset as an example. Before running the sample, download and store the CIFAR-10 dataset by referring to [Loading Dataset](https://www.mindspore.cn/docs/programming_guide/en/master/dataset_loading.html#cifar-10-100-dataset). The directory structure is as follows: + + ```text + ├─my_training_script.py + └─cifar-10-batches-bin + ├── batches.meta.txt + ├── data_batch_1.bin + ├── data_batch_2.bin + ├── data_batch_3.bin + ├── data_batch_4.bin + ├── data_batch_5.bin + ├── readme.html + └── test_batch.bin + ``` + + To cache the enhanced data processed by data augmentation of the map operator, the created `some_cache` instance is used as the input parameter of the `cache` API in the map operator. + + ```python + import mindspore.dataset.vision.c_transforms as c_vision + + dataset_dir = "cifar-10-batches-bin/" + data = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=5, shuffle=False, num_parallel_workers=1) + + # apply cache to map + rescale_op = c_vision.Rescale(1.0 / 255.0, -1.0) + data = data.map(input_columns=["image"], operations=rescale_op, cache=some_cache) + + num_iter = 0 + for item in data.create_dict_iterator(num_epochs=1): # each data is a dictionary + # in this example, each dictionary has a key "image" + print("{} image shape: {}".format(num_iter, item["image"].shape)) + num_iter += 1 + ``` + + Run the Python script `my_training_script.py`. The following information is displayed: + + ```text + 0 image shape: (32, 32, 3) + 1 image shape: (32, 32, 3) + 2 image shape: (32, 32, 3) + 3 image shape: (32, 32, 3) + 4 image shape: (32, 32, 3) + ``` + + You can run the `cache_admin --list_sessions` command to check whether there are five data records in the current session. If yes, the data is successfully cached. + + ```text + $ cache_admin --list_sessions + Listing sessions for server on port 50052 + + Session Cache Id Mem cached Disk cached Avg cache size Numa hit + 1493732251 3618046178 5 n/a 12442 5 + ``` + +6. Destroying a Cache Session + + After the training is complete, you can destroy the current cache and release the memory. + + ```text + $ cache_admin --destroy_session 1493732251 + Drop session successfully for server on port 50052 + ``` + + The preceding command is used to destroy the cache whose session ID is 1493732251. + +7. Stopping the Cache Server + + After using the cache server, you can stop the cache server. This operation will destroy all cache sessions on the current server and release the memory. + + ```text + $ cache_admin --stop + Cache server on port 50052 has been stopped successfully. + ``` + +## Best Practices + +### Using Cache to Speed Up ResNet Evaluation During Training + +For a complex network, epoch training usually needs to be performed for dozens or even hundreds of times. Before training, it is difficult to know when a model can achieve required accuracy in epoch training. Therefore, the accuracy of the model is usually validated at a fixed epoch interval during training and the corresponding model is saved. 
After the training is completed, users can quickly select the optimal model by viewing the change of the corresponding model accuracy. + +Therefore, the performance of evaluation during training will have a great impact on the total end-to-end time required. In this section, we will show an example of leveraging the cache service and caching data after augmentation in Tensor format in memory to speed up the evaluation procedure. + +The inference data processing procedure usually does not contain random operations. For example, the dataset processing in ResNet50 evaluation only contains augmentations like `Decode`, `Resize`, `CenterCrop`, `Normalize`, `HWC2CHW`, `TypeCast`. Therefore, it's usually better to inject cache after the last augmentation step and directly cache data that's fully augmented, to minimize repeated computations and to yield the most performance benefits. In this section, we will follow this approach and take ResNet as an example. + +For the complete sample code, please refer to [ResNet](https://gitee.com/mindspore/models/tree/master/official/cv/resnet) in ModelZoo. + +1. Create a Shell script named `cache_util.sh` for cache management: + + ```bash + bootup_cache_server() + { + echo "Booting up cache server..." + result=$(cache_admin --start 2>&1) + echo "${result}" + } + + generate_cache_session() + { + result=$(cache_admin -g | awk 'END {print $NF}') + echo "${result}" + } + ``` + + > Complete sample code: [cache_util.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/cache_util.sh) + +2. In the Shell script for starting the distributed training i.e., `run_distributed_train.sh`, start a cache server for evaluation during training scenarios and generate a cache session, saved in `CACHE_SESSION_ID` Shell variable: + + ```bash + source cache_util.sh + + if [ "x${RUN_EVAL}" == "xTrue" ] + then + bootup_cache_server + CACHE_SESSION_ID=$(generate_cache_session) + fi + ``` + +3. Pass the `CACHE_SESSION_ID` as well as other arguments when start the Python training script: + + ```text + python train.py \ + --net=$1 \ + --dataset=$2 \ + --run_distribute=True \ + --device_num=$DEVICE_NUM \ + --dataset_path=$PATH2 \ + --run_eval=$RUN_EVAL \ + --eval_dataset_path=$EVAL_DATASET_PATH \ + --enable_cache=True \ + --cache_session_id=$CACHE_SESSION_ID \ + &> log & + ``` + +4. In Python training script `train.py`, use the following code to receive `cache_session_id` that's passed in and use it when defining a eval dataset `eval_dataset`: + + ```python + import argparse + + parser.add_argument('--enable_cache', + type=ast.literal_eval, + default=False, + help='Caching the eval dataset in memory to speedup evaluation, default is False.') + parser.add_argument('--cache_session_id', + type=str, + default="", + help='The session id for cache service.') + args_opt = parser.parse_args() + + eval_dataset = create_dataset( + dataset_path=args_opt.eval_dataset_path, + do_train=False, + batch_size=config.batch_size, + target=target, + enable_cache=args_opt.enable_cache, + cache_session_id=args_opt.cache_session_id) + ``` + +5. In Python `dataset.py` script which creates the dataset processing pipeline,create a `DatasetCache` instance according to `enable_cache` and `cache_session_id` arguments, and inject the cache instance after the last step of data augmentation, i.e., after `TyepCast`: + + ```python + def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False, enable_cache=False, cache_session_id=None): + ... 
+        if enable_cache:
+            if not cache_session_id:
+                raise ValueError("A cache session_id must be provided to use cache.")
+            eval_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
+            data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8, cache=eval_cache)
+        else:
+            data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
+    ```
+
+6. Execute the training script:
+
+    ```text
+    ...
+    epoch: 40, acc: 0.5665486653645834, eval_cost:30.54
+    epoch: 41, acc: 0.6212361653645834, eval_cost:2.80
+    epoch: 42, acc: 0.6523844401041666, eval_cost:3.77
+    ...
+    ```
+
+    By default, the evaluation starts after the 40th epoch, and `eval_cost` shows the time each evaluation run takes, in seconds.
+
+    The following table compares the average evaluation time with/without cache:
+
+    ```text
+    | Configuration              | without cache | with cache |
+    | -------------------------- | ------------- | ---------- |
+    | 4p, resnet50, imagenet2012 | 10.59s        | 3.62s      |
+    ```
+
+    On an Ascend machine with 4 parallel pipelines, a training epoch generally takes around 88 seconds and ResNet training usually requires 90 epochs. Therefore, using cache can shorten the total end-to-end time from 8849 seconds to 8101 seconds, a reduction of 748 seconds.
+
+7. After the training run is completed, you can stop the cache server to destroy the cache sessions and release the memory:
+
+    ```text
+    $ cache_admin --stop
+    Cache server on port 50052 has been stopped successfully.
+    ```
+
+### Using Cache to Speed Up Training with Datasets on NFS
+
+To share a large dataset across multiple servers, many users resort to NFS (Network File System) to store their datasets (see [Huawei cloud - Creating an NFS Shared Directory on ECS](https://support.huaweicloud.com/intl/en-us/usermanual-functiongraph/functiongraph_01_0561.html) for how to set up and configure an NFS server).
+
+However, because accessing NFS is usually costly, running training with a dataset located on NFS is relatively slow. To improve training performance in this scenario, we can leverage the cache service to cache the dataset in memory in the form of Tensors. Once cached, subsequent training epochs can read the data directly from memory, avoiding costly remote dataset access.
+
+Note that typically, after the dataset is read, random operations such as `RandomCropDecodeResize` are performed in the dataset processing procedure. Caching after these random operations would remove the randomness of the data and therefore affect the final accuracy. As a result, we choose to cache the source dataset directly. In this section, we follow this approach and take MobileNetV2 as an example.
+
+For the complete sample code, please refer to [MobileNetV2](https://gitee.com/mindspore/models/tree/master/official/cv/mobilenetv2) in ModelZoo.
+
+1. Create a Shell script named `cache_util.sh` for cache management:
+
+    ```bash
+    bootup_cache_server()
+    {
+        echo "Booting up cache server..."
+        result=$(cache_admin --start 2>&1)
+        echo "${result}"
+    }
+
+    generate_cache_session()
+    {
+        result=$(cache_admin -g | awk 'END {print $NF}')
+        echo "${result}"
+    }
+    ```
+
+    > Complete sample code: [cache_util.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/cache/cache_util.sh)
+
+2. In the Shell script that starts the distributed training with the NFS dataset, i.e. `run_train_nfs_cache.sh`, start a cache server for the scenario where the dataset is on NFS. 
Then generate a cache session, saving its ID in the `CACHE_SESSION_ID` Shell variable:
+
+    ```bash
+    source cache_util.sh
+
+    bootup_cache_server
+    CACHE_SESSION_ID=$(generate_cache_session)
+    ```
+
+3. Pass the `CACHE_SESSION_ID` as well as the other arguments when starting the Python training script:
+
+    ```text
+    python train.py \
+    --platform=$1 \
+    --dataset_path=$5 \
+    --pretrain_ckpt=$PRETRAINED_CKPT \
+    --freeze_layer=$FREEZE_LAYER \
+    --filter_head=$FILTER_HEAD \
+    --enable_cache=True \
+    --cache_session_id=$CACHE_SESSION_ID \
+    &> log$i.log &
+    ```
+
+4. In the `train_parse_args()` function of the argument-parsing script `args.py`, use the following code to receive the `cache_session_id` that is passed in:
+
+    ```python
+    import argparse
+    import ast
+
+    def train_parse_args():
+        ...
+        train_parser.add_argument('--enable_cache',
+                    type=ast.literal_eval,
+                    default=False,
+                    help='Caching the dataset in memory to speedup dataset processing, default is False.')
+        train_parser.add_argument('--cache_session_id',
+                    type=str,
+                    default="",
+                    help='The session id for cache service.')
+        train_args = train_parser.parse_args()
+        return train_args
+    ```
+
+    In the Python training script `train.py`, call `train_parse_args()` to parse the arguments that are passed in, such as `cache_session_id`, and use them when defining the training dataset:
+
+    ```python
+    from src.args import train_parse_args
+    args_opt = train_parse_args()
+
+    dataset = create_dataset(
+        dataset_path=args_opt.dataset_path,
+        do_train=True,
+        config=config,
+        enable_cache=args_opt.enable_cache,
+        cache_session_id=args_opt.cache_session_id)
+    ```
+
+5. In the Python script `dataset.py`, which creates the dataset processing pipeline, create a `DatasetCache` instance according to the `enable_cache` and `cache_session_id` arguments, and inject the cache instance directly into the `ImageFolderDataset`:
+
+    ```python
+    def create_dataset(dataset_path, do_train, config, repeat_num=1, enable_cache=False, cache_session_id=None):
+        ...
+        if enable_cache:
+            nfs_dataset_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
+        else:
+            nfs_dataset_cache = None
+
+        if config.platform == "Ascend":
+            rank_size = int(os.getenv("RANK_SIZE", '1'))
+            rank_id = int(os.getenv("RANK_ID", '0'))
+            if rank_size == 1:
+                data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True, cache=nfs_dataset_cache)
+            else:
+                data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True, num_shards=rank_size, shard_id=rank_id, cache=nfs_dataset_cache)
+    ```
+
+6. Execute the training run via `run_train_nfs_cache.sh`:
+
+    ```text
+    epoch: [  0/ 200], step:[ 2134/ 2135], loss:[4.682/4.682], time:[3364893.166], lr:[0.780]
+    epoch time: 3384387.999, per step time: 1585.193, avg loss: 4.682
+    epoch: [  1/ 200], step:[ 2134/ 2135], loss:[3.750/3.750], time:[430495.242], lr:[0.724]
+    epoch time: 431005.885, per step time: 201.876, avg loss: 4.286
+    epoch: [  2/ 200], step:[ 2134/ 2135], loss:[3.922/3.922], time:[420104.849], lr:[0.635]
+    epoch time: 420669.174, per step time: 197.035, avg loss: 3.534
+    epoch: [  3/ 200], step:[ 2134/ 2135], loss:[3.581/3.581], time:[420825.587], lr:[0.524]
+    epoch time: 421494.842, per step time: 197.421, avg loss: 3.417
+    ...
+ ``` + + The following table compares the average epoch time with/without cache: + + ```text + | 4p, MobileNetV2, imagenet2012 | without cache | with cache | + | ---------------------------------------- | ------------- | ---------- | + | first epoch time | 1649s | 3384s | + | average epoch time (exclude first epoch) | 458s | 421s | + ``` + + With cache, the first epoch time increases significantly due to cache writing overhead, but all later epochs can benefit from caching the dataset in memory. Therefore, the more epochs, the more cache case shows benefits due to per-step-time savings. + + MobileNetV2 generally requires 200 epochs in total, therefore, using cache can shorten the total end-to-end time from 92791 seconds to 87163 seconds, thus bringing 5628 seconds total time reduction. + +7. After the training run is completed, you can destroy the current cache and release the memory: + + ```text + $ cache_admin --stop + Cache server on port 50052 has been stopped successfully. + ``` diff --git a/tutorials/experts/source_en/data_engine/images/auto_augmentation.png b/tutorials/experts/source_en/data_engine/images/auto_augmentation.png new file mode 100644 index 0000000000000000000000000000000000000000..3daa904f181d2c7a6a2b6f7f2271c8e33f2ba933 Binary files /dev/null and b/tutorials/experts/source_en/data_engine/images/auto_augmentation.png differ diff --git a/tutorials/experts/source_en/data_engine/images/eager_mode.png b/tutorials/experts/source_en/data_engine/images/eager_mode.png new file mode 100644 index 0000000000000000000000000000000000000000..4be8958b7bb92244710bad82fc6b80169924648c Binary files /dev/null and b/tutorials/experts/source_en/data_engine/images/eager_mode.png differ diff --git a/tutorials/experts/source_en/data_engine/images/pipeline_mode_en.jpeg b/tutorials/experts/source_en/data_engine/images/pipeline_mode_en.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..bbfa93af5be8c6a658d21c6e0b482cdde8df3f17 Binary files /dev/null and b/tutorials/experts/source_en/data_engine/images/pipeline_mode_en.jpeg differ diff --git a/tutorials/experts/source_en/data_engine/optimize_data_processing.ipynb b/tutorials/experts/source_en/data_engine/optimize_data_processing.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..2b57d4ca36a6408f0566fe5f50b92bf1c2d817a3 --- /dev/null +++ b/tutorials/experts/source_en/data_engine/optimize_data_processing.ipynb @@ -0,0 +1,787 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Optimizing the Data Processing\n", + "\n", + "`Ascend` `GPU` `CPU` `Data Preparation`\n", + "\n", + "[![Run in ModelArts](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_modelarts_en.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tYXN0ZXIvcHJvZ3JhbW1pbmdfZ3VpZGUvZW4vbWluZHNwb3JlX29wdGltaXplX2RhdGFfcHJvY2Vzc2luZy5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c) [![Download Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_notebook_en.png)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/master/programming_guide/en/mindspore_optimize_data_processing.ipynb) [![View Source On 
Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.png)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/programming_guide/source_en/optimize_data_processing.ipynb)" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Overview\n", + "\n", + "Data is the most important factor of deep learning. Data quality determines the upper limit of deep learning result, whereas model quality enables the result to approach the upper limit. Therefore, high-quality data input is beneficial to the entire deep neural network. During the entire data processing and data augmentation process, data continuously flows through a pipeline to the training system." + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "![pipeline](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/mindspore/programming_guide/source_en/images/pipeline.png)" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "MindSpore provides data processing and data augmentation functions for users. In the pipeline process, if each step can be properly used, the data performance will be greatly improved. This section describes how to optimize performance during data loading, data processing, and data augmentation based on the [CIFAR-10 dataset[1]](#references).\n", + "\n", + "In addition, the storage, architecture and computing resources of the operating system will influence the performance of data processing to a certain extent.\n", + "\n", + "## Preparations\n", + "\n", + "### Importing Modules\n", + "\n", + "The `dataset` module provides APIs for loading and processing datasets." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 1, + "source": [ + "import mindspore.dataset as ds" + ], + "outputs": [], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "The `numpy` module is used to generate ndarrays." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 2, + "source": [ + "import numpy as np" + ], + "outputs": [], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "### Downloading the Required Dataset\n", + "\n", + "Run the following command to download the dataset:\n", + "Download the CIFAR-10 Binary format dataset, decompress them and store them in the `./datasets` path, use this dataset when loading data." 
+ ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": null, + "source": [ + "import os\n", + "import requests\n", + "import tarfile\n", + "import zipfile\n", + "import shutil\n", + "\n", + "requests.packages.urllib3.disable_warnings()\n", + "\n", + "def download_dataset(url, target_path):\n", + " \"\"\"download and decompress dataset\"\"\"\n", + " if not os.path.exists(target_path):\n", + " os.makedirs(target_path)\n", + " download_file = url.split(\"/\")[-1]\n", + " if not os.path.exists(download_file):\n", + " res = requests.get(url, stream=True, verify=False)\n", + " if download_file.split(\".\")[-1] not in [\"tgz\", \"zip\", \"tar\", \"gz\"]:\n", + " download_file = os.path.join(target_path, download_file)\n", + " with open(download_file, \"wb\") as f:\n", + " for chunk in res.iter_content(chunk_size=512):\n", + " if chunk:\n", + " f.write(chunk)\n", + " if download_file.endswith(\"zip\"):\n", + " z = zipfile.ZipFile(download_file, \"r\")\n", + " z.extractall(path=target_path)\n", + " z.close()\n", + " if download_file.endswith(\".tar.gz\") or download_file.endswith(\".tar\") or download_file.endswith(\".tgz\"):\n", + " t = tarfile.open(download_file)\n", + " names = t.getnames()\n", + " for name in names:\n", + " t.extract(name, target_path)\n", + " t.close()\n", + " print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(url), target_path))\n", + "\n", + "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz\", \"./datasets\")\n", + "test_path = \"./datasets/cifar-10-batches-bin/test\"\n", + "train_path = \"./datasets/cifar-10-batches-bin/train\"\n", + "os.makedirs(test_path, exist_ok=True)\n", + "os.makedirs(train_path, exist_ok=True)\n", + "if not os.path.exists(os.path.join(test_path, \"test_batch.bin\")):\n", + " shutil.move(\"./datasets/cifar-10-batches-bin/test_batch.bin\", test_path)\n", + "[shutil.move(\"./datasets/cifar-10-batches-bin/\"+i, train_path) for i in os.listdir(\"./datasets/cifar-10-batches-bin/\") if os.path.isfile(\"./datasets/cifar-10-batches-bin/\"+i) and not i.endswith(\".html\") and not os.path.exists(os.path.join(train_path, i))]" + ], + "outputs": [], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "The directory structure of the downloaded dataset file is as follows:\n", + "\n", + "```text\n", + "./datasets/cifar-10-batches-bin\n", + "├── readme.html\n", + "├── test\n", + "│ └── test_batch.bin\n", + "└── train\n", + " ├── batches.meta.txt\n", + " ├── data_batch_1.bin\n", + " ├── data_batch_2.bin\n", + " ├── data_batch_3.bin\n", + " ├── data_batch_4.bin\n", + " └── data_batch_5.bin\n", + "```" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "Download cifar-10 Python file format dataset, decompress them in the `./datasets/cifar-10-batches-py` path, use this dataset when converting data." 
+ ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": null, + "source": [ + "download_dataset(\"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-python.tar.gz\", \"./datasets\")" + ], + "outputs": [], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "The directory structure of the extracted dataset file is as follows:\n", + "\n", + "```text\n", + "./datasets/cifar-10-batches-py\n", + "├── batches.meta\n", + "├── data_batch_1\n", + "├── data_batch_2\n", + "├── data_batch_3\n", + "├── data_batch_4\n", + "├── data_batch_5\n", + "├── readme.html\n", + "└── test_batch\n", + "```" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Optimizing the Data Loading Performance\n", + "\n", + "MindSpore provides multiple data loading methods, including common dataset loading, user-defined dataset loading, and the MindSpore data format loading. The dataset loading performance varies depending on the underlying implementation method.\n", + "\n", + "| | Common Dataset | User-defined Dataset | MindRecord Dataset |\n", + "| :----: | :----: | :----: | :----: |\n", + "| Underlying implementation | C++ | Python | C++ |\n", + "| Performance | High | Medium | High |\n", + "\n", + "### Performance Optimization Solution" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "![data-loading-performance-scheme](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/mindspore/programming_guide/source_en/images/data_loading_performance_scheme.png)" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "Suggestions on data loading performance optimization are as follows:\n", + "\n", + "- Built-in loading operators are preferred for supported dataset formats. For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.html), if the performance cannot meet the requirements, use the multi-thread concurrency solution. For details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-thread-optimization-solution).\n", + "- For a dataset format that is not supported, convert the format to the mindspore data format and then use the `MindDataset` class to load the dataset (Please refer to the [API](https://www.mindspore.cn/docs/api/en/master/api_python/dataset/mindspore.dataset.MindDataset.html) for detailed use). Please refer to [Converting Dataset to MindRecord](https://www.mindspore.cn/docs/programming_guide/en/master/convert_dataset.html), if the performance cannot meet the requirements, use the multi-thread concurrency solution, for details, see [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-thread-optimization-solution).\n", + "- For dataset formats that are not supported, the user-defined `GeneratorDataset` class is preferred for implementing fast algorithm verification (Please refer to the [API](https://www.mindspore.cn/docs/api/en/master/api_python/dataset/mindspore.dataset.GeneratorDataset.html) for detailed use), if the performance cannot meet the requirements, the multi-process concurrency solution can be used. 
For details, see [Multi-process Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-process-optimization-solution).\n", + "\n", + "### Code Example\n", + "\n", + "Based on the preceding suggestions of data loading performance optimization, the `Cifar10Dataset` class of built-in loading operators (Please refer to the [API](https://www.mindspore.cn/docs/api/en/master/api_python/dataset/mindspore.dataset.Cifar10Dataset.html) for detailed use), the `MindDataset` class after data conversion, and the `GeneratorDataset` class are used to load data. The sample code is displayed as follows:\n", + "\n", + "1. Use the `Cifar10Dataset` class of built-in operators to load the CIFAR-10 dataset in binary format. The multi-thread optimization solution is used for data loading. Four threads are enabled to concurrently complete the task. Finally, a dictionary iterator is created for the data and a data record is read through the iterator." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 5, + "source": [ + "cifar10_path = \"./datasets/cifar-10-batches-bin/train\"\n", + "\n", + "# create Cifar10Dataset for reading data\n", + "cifar10_dataset = ds.Cifar10Dataset(cifar10_path, num_parallel_workers=4)\n", + "# create a dictionary iterator and read a data record through the iterator\n", + "print(next(cifar10_dataset.create_dict_iterator()))" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'image': Tensor(shape=[32, 32, 3], dtype=UInt8, value=\n", + "[[[209, 206, 192],\n", + " [211, 209, 201],\n", + " [221, 217, 213],\n", + " ...\n", + " [172, 175, 194],\n", + " [169, 173, 190],\n", + " [115, 121, 145]],\n", + " [[226, 230, 211],\n", + " [227, 229, 218],\n", + " [230, 232, 221],\n", + " ...\n", + " [153, 153, 171],\n", + " [156, 156, 173],\n", + " [106, 111, 129]],\n", + " [[214, 226, 203],\n", + " [214, 222, 204],\n", + " [217, 227, 206],\n", + " ...\n", + " [167, 166, 176],\n", + " [147, 147, 156],\n", + " [ 78, 84, 96]],\n", + " ...\n", + " [[ 40, 69, 61],\n", + " [ 37, 63, 57],\n", + " [ 43, 68, 66],\n", + " ...\n", + " [ 55, 70, 69],\n", + " [ 40, 54, 51],\n", + " [ 27, 44, 36]],\n", + " [[ 33, 61, 50],\n", + " [ 37, 65, 56],\n", + " [ 54, 72, 74],\n", + " ...\n", + " [ 47, 60, 56],\n", + " [ 58, 66, 64],\n", + " [ 36, 50, 46]],\n", + " [[ 29, 41, 37],\n", + " [ 38, 60, 59],\n", + " [ 51, 76, 81],\n", + " ...\n", + " [ 32, 51, 43],\n", + " [ 47, 61, 54],\n", + " [ 56, 67, 66]]]), 'label': Tensor(shape=[], dtype=UInt32, value= 5)}\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "2. Use the `Cifar10ToMR` class to convert the CIFAR-10 dataset into the MindSpore data format. In this example, the CIFAR-10 dataset in Python file format is used. Then use the `MindDataset` class to load the dataset in the MindSpore data format. The multi-thread optimization solution is used for data loading. Four threads are enabled to concurrently complete the task. Finally, a dictionary iterator is created for data and a data record is read through the iterator." 
+ ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 6, + "source": [ + "import os\n", + "from mindspore.mindrecord import Cifar10ToMR\n", + "\n", + "trans_path = \"./transform/\"\n", + "\n", + "if not os.path.exists(trans_path):\n", + " os.mkdir(trans_path)\n", + "\n", + "os.system(\"rm -f {}cifar10*\".format(trans_path))\n", + "\n", + "cifar10_path = './datasets/cifar-10-batches-py'\n", + "cifar10_mindrecord_path = './transform/cifar10.record'\n", + "\n", + "cifar10_transformer = Cifar10ToMR(cifar10_path, cifar10_mindrecord_path)\n", + "# execute transformation from CIFAR-10 to MindRecord\n", + "cifar10_transformer.transform(['label'])\n", + "\n", + "# create MindDataset for reading data\n", + "cifar10_mind_dataset = ds.MindDataset(dataset_files=cifar10_mindrecord_path, num_parallel_workers=4)\n", + "# create a dictionary iterator and read a data record through the iterator\n", + "print(next(cifar10_mind_dataset.create_dict_iterator()))" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'data': Tensor(shape=[1283], dtype=UInt8, value= [255, 216, 255, 224, 0, 16, 74, 70, 73, 70, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 255, 219, 0, 67, \n", + " 107, 249, 17, 58, 213, 185, 117, 181, 143, 255, 217]), 'id': Tensor(shape=[], dtype=Int64, value= 32476), 'label': Tensor(shape=[], dtype=Int64, value= 9)}\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "3. The `GeneratorDataset` class is used to load the user-defined dataset, and the multi-process optimization solution is used. Four processes are enabled to concurrently complete the task. Finally, a dictionary iterator is created for the data, and a data record is read through the iterator." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 7, + "source": [ + "def generator_func(num):\n", + " for i in range(num):\n", + " yield (np.array([i]),)\n", + "\n", + "# create a GeneratorDataset object for reading data\n", + "dataset = ds.GeneratorDataset(source=generator_func(5), column_names=[\"data\"], num_parallel_workers=4)\n", + "# create a dictionary iterator and read a data record through the iterator\n", + "print(next(dataset.create_dict_iterator()))" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'data': Tensor(shape=[1], dtype=Int64, value= [0])}\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Optimizing the Shuffle Performance\n", + "\n", + "The shuffle operation is used to shuffle ordered datasets or repeated datasets. MindSpore provides the `shuffle` function for users. A larger value of `buffer_size` indicates a higher shuffling degree, consuming more time and computing resources. This API allows users to shuffle the data at any time during the entire pipeline process.Please refer to [shuffle](https://www.mindspore.cn/docs/programming_guide/en/master/pipeline.html#shuffle). 
However, because the underlying implementation methods are different, the performance of this method is not as good as that of setting the `shuffle` parameter to directly shuffle data by referring to the [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.html).\n", + "\n", + "### Performance Optimization Solution" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "![shuffle-performance-scheme](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/mindspore/programming_guide/source_en/images/shuffle_performance_scheme.png)" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "Suggestions on shuffle performance optimization are as follows:\n", + "\n", + "- Use the `shuffle` parameter of built-in loading operators to shuffle data.\n", + "- If the `shuffle` function is used and the performance still cannot meet the requirements, adjust the value of the `buffer_size` parameter to improve the performance.\n", + "\n", + "### Code Example\n", + "\n", + "Based on the preceding shuffle performance optimization suggestions, the `shuffle` parameter of the `Cifar10Dataset` class of built-in loading operators and the `Shuffle` function are used to shuffle data. The sample code is displayed as follows:\n", + "\n", + "1. Use the `Cifar10Dataset` class of built-in operators to load the CIFAR-10 dataset. In this example, the CIFAR-10 dataset in binary format is used, and the `shuffle` parameter is set to True to perform data shuffle. Finally, a dictionary iterator is created for the data and a data record is read through the iterator." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 8, + "source": [ + "cifar10_path = \"./datasets/cifar-10-batches-bin/train\"\n", + "\n", + "# create Cifar10Dataset for reading data\n", + "cifar10_dataset = ds.Cifar10Dataset(cifar10_path, shuffle=True)\n", + "# create a dictionary iterator and read a data record through the iterator\n", + "print(next(cifar10_dataset.create_dict_iterator()))" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'image': Tensor(shape=[32, 32, 3], dtype=UInt8, value=\n", + "[[[119, 193, 196],\n", + " [121, 192, 204],\n", + " [123, 193, 209],\n", + " ...\n", + " [110, 168, 177],\n", + " [109, 167, 176],\n", + " [110, 168, 178]],\n", + " [[110, 188, 199],\n", + " [109, 185, 202],\n", + " [111, 186, 204],\n", + " ...\n", + " [107, 173, 179],\n", + " [107, 173, 179],\n", + " [109, 175, 182]],\n", + " [[110, 186, 200],\n", + " [108, 183, 199],\n", + " [110, 184, 199],\n", + " ...\n", + " [115, 183, 189],\n", + " [117, 185, 190],\n", + " [117, 185, 191]],\n", + " ...\n", + " [[210, 253, 250],\n", + " [212, 251, 250],\n", + " [214, 250, 249],\n", + " ...\n", + " [194, 247, 247],\n", + " [190, 246, 245],\n", + " [184, 245, 244]],\n", + " [[215, 253, 251],\n", + " [218, 252, 250],\n", + " [220, 251, 249],\n", + " ...\n", + " [200, 248, 248],\n", + " [195, 247, 245],\n", + " [189, 245, 244]],\n", + " [[216, 253, 253],\n", + " [222, 251, 250],\n", + " [225, 250, 249],\n", + " ...\n", + " [204, 249, 248],\n", + " [200, 246, 244],\n", + " [196, 245, 244]]]), 'label': Tensor(shape=[], dtype=UInt32, value= 0)}\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "2. Use the `shuffle` function to shuffle data. Set `buffer_size` to 3 and use the `GeneratorDataset` class to generate data." 
+ ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 9, + "source": [ + "def generator_func():\n", + " for i in range(5):\n", + " yield (np.array([i, i+1, i+2, i+3, i+4]),)\n", + "\n", + "ds1 = ds.GeneratorDataset(source=generator_func, column_names=[\"data\"])\n", + "print(\"before shuffle:\")\n", + "for data in ds1.create_dict_iterator():\n", + " print(data[\"data\"])\n", + "\n", + "ds2 = ds1.shuffle(buffer_size=3)\n", + "print(\"after shuffle:\")\n", + "for data in ds2.create_dict_iterator():\n", + " print(data[\"data\"])" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "before shuffle:\n", + "[0 1 2 3 4]\n", + "[1 2 3 4 5]\n", + "[2 3 4 5 6]\n", + "[3 4 5 6 7]\n", + "[4 5 6 7 8]\n", + "after shuffle:\n", + "[2 3 4 5 6]\n", + "[0 1 2 3 4]\n", + "[1 2 3 4 5]\n", + "[4 5 6 7 8]\n", + "[3 4 5 6 7]\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Optimizing the Data Augmentation Performance\n", + "\n", + "During image classification training, especially when the dataset is small, users can use data augmentation to preprocess images to enrich the dataset. MindSpore provides multiple data augmentation methods, including:\n", + "\n", + "- Use the built-in C operator (`c_transforms` module) to perform data augmentation.\n", + "- Use the built-in Python operator (`py_transforms` module) to perform data augmentation.\n", + "- Users can define Python functions as needed to perform data augmentation.\n", + "\n", + "Please refer to [Data Augmentation](https://www.mindspore.cn/docs/programming_guide/en/master/augmentation.html). The performance varies according to the underlying implementation methods.\n", + "\n", + "| Module | Underlying API | Description |\n", + "| :----: | :----: | :----: |\n", + "| c_transforms | C++ (based on OpenCV) | High performance |\n", + "| py_transforms | Python (based on PIL) | This module provides multiple image augmentation functions and the method for converting PIL images into NumPy arrays |\n", + "\n", + "### Performance Optimization Solution" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "![data-enhancement-performance-scheme](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/mindspore/programming_guide/source_en/images/data_enhancement_performance_scheme.png)" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "Suggestions on data augmentation performance optimization are as follows:\n", + "\n", + "- The `c_transforms` module is preferentially used to perform data augmentation for its highest performance. 
If the performance cannot meet the requirements, refer to [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-thread-optimization-solution), [Compose Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#compose-optimization-solution), or [Operator Fusion Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#operator-fusion-optimization-solution).\n", + "- If the `py_transforms` module is used to perform data augmentation and the performance still cannot meet the requirements, refer to [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-thread-optimization-solution), [Multi-process Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-process-optimization-solution), [Compose Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#compose-optimization-solution), or [Operator Fusion Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#operator-fusion-optimization-solution).\n", + "- The `c_transforms` module maintains buffer management in C++, and the `py_transforms` module maintains buffer management in Python. Because of the performance cost of switching between Python and C++, it is advised not to use different operator types together.\n", + "- If the user-defined Python functions are used to perform data augmentation and the performance still cannot meet the requirements, use the [Multi-thread Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-thread-optimization-solution) or [Multi-process Optimization Solution](https://www.mindspore.cn/docs/programming_guide/en/master/optimize_data_processing.html#multi-process-optimization-solution). If the performance still cannot be improved, in this case, optimize the user-defined Python code.\n", + "\n", + "MindSpore also supports users to use the data enhancement methods in the `c_transforms` and `py_transforms` modules at the same time, but due to the different underlying implementations of the two, excessive mixing will increase resource overhead and reduce processing performance. It is recommended that users can use the operators in `c_transforms` or `py_transforms` alone; or use one of them first, and then use the other. Please do not switch frequently between the data enhancement interface of two different implementation modules.\n", + "\n", + "### Code Example\n", + "\n", + "Based on the preceding suggestions of data augmentation performance optimization, the `c_transforms` module and user-defined Python function are used to perform data augmentation. The code is displayed as follows:\n", + "\n", + "1. The `c_transforms` module is used to perform data augmentation. During data augmentation, the multi-thread optimization solution is used. Four threads are enabled to concurrently complete the task. The operator fusion optimization solution is used and the `RandomResizedCrop` fusion class is used to replace the `RandomResize` and `RandomCrop` classes." 
+ ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 10, + "source": [ + "import mindspore.dataset.vision.c_transforms as C\n", + "import matplotlib.pyplot as plt\n", + "\n", + "cifar10_path = \"./datasets/cifar-10-batches-bin/train\"\n", + "\n", + "# create Cifar10Dataset for reading data\n", + "cifar10_dataset = ds.Cifar10Dataset(cifar10_path, num_parallel_workers=4)\n", + "transforms = C.RandomResizedCrop((800, 800))\n", + "# apply the transform to the dataset through dataset.map()\n", + "cifar10_dataset = cifar10_dataset.map(operations=transforms, input_columns=\"image\", num_parallel_workers=4)\n", + "\n", + "data = next(cifar10_dataset.create_dict_iterator())\n", + "plt.imshow(data[\"image\"].asnumpy())\n", + "plt.show()" + ], + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQEAAAD8CAYAAAB3lxGOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAACRKElEQVR4nO39bcxu3XYWhl1j3S+YiBBsH1PLsml8/CGQ/4AdK7ZFVLlYVOAinB+U2I2CiSy5iiBy1FSx3UpNFbWSqaoQo0ZurCapqWgMcUKDEIISxyiKVFwMQU6xcThQE9vyRyDGuKCGPPcc/THHxzXGnOu+72e/+z372ec8c+/7WWvNtdZc82tc4xpjzjWXqCpew2t4DZ++4XjXGXgNr+E1vNvwCgKv4TV8modXEHgNr+HTPLyCwGt4DZ/m4RUEXsNr+DQPryDwGl7Dp3n4SEBARH67iPyEiHxCRL7jo3jGa3gNr+HtBHnb8wRE5ALgvwTw2wD8NIC/COCbVPXH3uqDXsNreA1vJXwUTOCfBPAJVf2bqvoPAXw/gG/4CJ7zGl7Da3gL4YOPIM3PB/BTdPzTAL6qXyQi3wrgWwHgMz7jV/0Tn/8Fn3+e4oasyNmJs1v09vWAeKKb5zwQpB9snvWWJ2dqT/RtT/68md7JyTt50AczqdCTKtQeUeO1XlOZru+3Vo2ml30cdw2R2BcRunbXU57RPwEspHxfBXH3c0n8f/Vf/dTfVtVf1+M/ChB4KKjq9wL4XgD44i/5Ev2D//v/w+6q2HShX0DAa0S5grwWdR6f1ZrMPyI9bo0XOjjbPwsf1vTSKFtuXVh4W55zBn7bXpi70o/b9aIkavyM6LhKUW2ft5FXOhPlyf15ndo+5jmLV1UMHdBhWx0YQ+dWbTtGPEtEZnvJbFs/Xn7H3B7HgeMQ+819kaMcH8dBFVcrVNvxLLe1n1VN7Gueg+8j46MvY+nyd0HhX/gX/sDf2sV/FObAzwD49XT8BRb3SQ6v70S8i/DWa13K5q0+4GGWt9zx/Ds/ivCA3nkofBQg8BcBfKmIfFxEfiWAbwTwJz+C5zw/3Ky0T2XQeBmd9rGwz6s+VIQ75l678vEW3/D0txWe3TR5w9vy6b91c0BVn0TkDwD4swAuAP4dVf2rb/s5bxQ+SXJ+4hH4yMO5vf0OcvPsR76NWnvcFyN4E2jUN77zZpIU5PzUzdgPEz4Sn4Cq/mkAf/qjSPuTH7zhX354q+DzSccNbduT028SZPo0Vn2uJw69TQJl+9H1iXehPN6ZY7AEd4RsgjsAi19X23F3SlF/0hZZvM5ub+p0/0YeJOPm4zQcf2X/4U50Vr5T928GoYz2e+/GnMe+tbBNXreHWhtmvYSj72DCc8MuGbF/FFF/mHb3Wvv32vye+XD72Ny9i1vx/nPezER4GSCAW0Xd9QzaZwGPnqQZTR708KyqztaNjQLKowBVp4rIHghIIVRwaCXYjP3caqzVwfxYyy5M4JNmHXD991NNmqk5c1ivX5Pp+SgBYoQA0aZqLnPf7grmdVLqpgl7jAbg9iiBDSjMJGg/875yMR66XMWazqluG89P9TqOEYTY9/gbbXESXg4IbHI9q5Ti1bVvRmT9sPArsQAeUkLWmLWgxPAg6/WUbgaAmcfZGTju5vDgMwEginOaXGvkrkLvocCZ5r6rcKiMu7F3XffLSGVjA2dDg37sQ4IVxEHDokrZpjRDmFqBSKM7AzgdGhTBsQi/DSta++dcgbN63NQRlfb0OseyVoZV6HNIMeqlAcGj4cWAwBqagEc0mwK9050Jv8W75sAc507aP8VfVKE0GWQXJgA0dmCgspoHKwAsQpp4cyewltlK/PKo++E5nSVrfXdn79PKEcs+AQAJfra50jlE2yUrcFnPeQLc8auDlPd5XgeRARPy41i1/yGVBYhPDJJmSmye1/OxnFEqB1z2aZ+EneurzilYgeG54UWAwDLBpZ3dAYHfB9DElaCGFQiYNvJzZuNqAAEDAMtmZwNTI+QVsdcEuhZp1VynsrwFhQ4gm/O2XRjUzes3z3/oOWj53wleLXMR+ADzzXHcTrTf2xMox51G78rd2zTZH/+wsINDDsixmgvBKqzRTwY11/0OgHy6lGNtkCr8t7fPDS8CBICTTsuV1iqQj6NzFWFHFfxiR85UssG1uIhSllP7rn4BZwVn5kDrBAwABQPq05Zbd+nqbvtAOO1kbxp0c1g1+Nzu2+3W8V7z+04FB2YEoUY3+XPBFZoFurX9DweABIjOCLaTmCSfWcAeDlwUt80j5X8LCl4XO+F/cyB4OSCwZFzXk2edSblyOhjUDuL7UyM0TSBKU0r94SsQ8P46amC9YdO2a+l0e8SBCOwClG8myh8SALa3d6lHaZN5xoFwE7+5zpNTPxcmAoNDgn3NQ4JHCUHl0wRIUGj+gLKP7COmLAgHNhVT60Fb/M5ZyPcGDgQDqIIOdOGfx3OqNAPOY+EFgcBZxk8EfqMRCwCMHQjkvgAT6enf0YmdpI5m4ff87oYKnSGU/C9l6fs49Qv0DhSC1DUo1c2zVMEenc6PN/evzdM6e0/nAWAoHv/FLCBlqVkvN8sd/p9s0yLczQFY/APlPJpDAbvhnH37tDJzGVjYlwr2O0nw+/EYSum8ryBw1tuoY4e+1P1Wt8I/SrwDgQgg4zCqB0CAcdR51DzBhIW/HrN9mHle9UImtgox756gQR
CSVk+PtveuY+iyc/78B9Jb8C2argt5u4iBYxH8mnh0/MIG7NoOrK1E6vvi5sA6SrBzDrLzMFhEqaneJmvrI8qyO9f22awhBuv1039jrGDwnPAiQCBAfz0TF6hfeAICaReOrKDRwICOJ9UbEEMAgeAYgnEcAQRbRknCD2h0ppOSbWR231GXMtenNqni67RsHk7y9CR3Rjk9tUZ2YCOp5S1pchYMpWuq4FMbkuDv2IH3g3wm53M6f20P6RjsZuF8IzD8Aibt6VD0/ebHofpoULevGxfyEyAIpsApMuhtwYBY7zPCiwABIAvaIhHV4KgIUEeg+7rQD43XSDM+9+NVUgggB44DGAAOHRiYQOCPc1uf6/a28LcibAWf0pN7KWnbq51c2zWes2djzRuEXT9e8SoFQ9eTGyAg7X5iFqymgWLJCKwerDKcBUCS2q+Ufx0a5NGAAgbdObgIe8Zl/97UQTMHsp9Ru98EgBEA4H6B54QXAgI30KtQyazktAXzfKX+uT8YCOy9cjE7QMYBHAMYUwOMARxH2muQOknoZiks3U0beikz/3xBx4hTLdPTfKS1z645A92ziFuF1/220+JF6Ft9BOOjZELIi06I5Ivm5zRaIJctkvu5ObCyAV8vACAfj+Rw4Nbvs2Se66D116Lp+T6qqzAFvN7YHzAW7R9rKIxNBdwILwQETrrqDgCiEuyuja00Ghh45fDxIQLFgcsxMIZgUoGB4zjmtTigirp4xp1Qhwqp8VvnSI13VhnZLVbGsdH7pf/kwV028JzAtGX73L6/CvbiC9nGNRbgbXyDIXgarG3XTCEHgRenXmr/Q+roQDapGnOs6SZAcB2dg2K0nvL1Nc1bpgADwOL30vfYJ7ALheYvAMBaRbPjmKZnoQ8WMGyVGWMCKoIDCuhhq8KMCQSqwADkcIqmwSnvMYG1DPv950hmgN0i0aepl/BWgaCnyYlv9ln76nLzIw84CST8C4icJRcUnhkA/XLcsJoDgPkS0r0o5F/YZflZdV7pwEkbEwAq9csT0yDl47HwYkHAqzIVvhZA4EpAsYtM+McwwadjzXgRweU4oBdHznAHAscx21gnKBw+o7BXbHiYHwytbbbJ3bj37E6RVCq8z+f7wzT7dM2AUj64r7uJ0p7jTCVNIc23Mk1gfLsW6H5cgPBabIpjtla1tFphfZZF+gKIDHSPf/MDwEqREZpxUTc5hbyWOrfZOF6x2TAKr9CKL7s+kganQqAYriBDFs4qax9eMAhkKECpgOqYhe3DgWMU4R+jxRFLOMzWv6hCL8QwDAwE3rDzekzeADs5r/AOKLehYG2OPV1NTfKI/X2eegWCppbLM2Z8AEj88VPpsUww8PNzpiVUoDbjMudO7MSeO30rn6XD2S16lVBNt6yoJVfvNkAioeULGQywgkECovWHcn8+yetTRcyEXAVfkXNLOCt+NhXfCjRCZxGpcQrOjVriD4QXAgJR7SU46fODFHYsws9aftBvfzx9AhdjAZeeFQhExuybKuTIQqgXsQYHvA/fX1tgr6/2191jF10OUuvlcT5IUnhQ4yfQrZnizlrAAanEvLMKAQHr/+j4Lk2949t7G5SVVkKujRSGxASl+Oq7SZHN+6ZQ0WgAWq/r2OT3FAQ04ZQoBZco+oXAABK9Lfi+WlfeQ24RnzUkAPQ6eTS8CBBQYF1DjjuMGsopbNjPhd/pf4LAuLrAX/dAcJ0sYToGL7h0rWxaPTzGOiB6wDU+VE3gxPa5CdcGPGWwvLOR+BunMrTlckLrdWkAugkb8aopKKn5HOhqBoS1NWBvYjLd750ae61Nwm+W1rxMskijp6dB6BEdfanIbRG9OHM3hH/vF2BGkMJOLKDYVqtJoJqqYJansQBnCbJhAM6+Sj0+wiS1/p7pGLy70KiI/Dsi8gsi8v+muM8WkT8nIn/dtp9l8SIif1jm58d+VES+4vGsSPtlUDAQ5M8p/1DFdQxcrwPXcZ2/qx1fr3i6XvH0ZL/rFdcn/s1rrg0kxjBP65g2l4NQOGYWBL4VdlelB/hWEntcv4X2qeVq9E71gWbCsQggxsdBQhPHnr7LeYyn1+E2lP2TiTnL0Bzqm3uSz96Ny7MnfVMVcVMp+r6bRYUwICz3Aig1FXXX696uoTK48zFmIYKfU0cj+Jm1ZbLsdQsXkqiV/u8sPLLa8P8FwG9vcd8B4AdV9UsB/KAdA8DvAPCl9vtWAN/zQPrb4FSbqbj7PoYNhQzz/icAtP3rNYDAf09PT/Nn4DBBYwr/9ZrsIdavb0MwDkRe2S7FDr73IOGMGegtuS6nci8EUTZXSvTP9ZIdGFB87/CeiKeX6bvw25Xe2RsAhPAfJATL9Fy0+A4mnFEqaVReLVpx/BWtv6qb7izcVhMJb9SMsHCSSDN4eQ46GKBeU4SeMiLcpuhmonc61v4n7ODDgICq/qcA/psW/Q0Avs/2vw/AP03xf0Rn+AsAPlNEPu/eM04DAcEyRTLGRdnmv05hdgAYLvjOBp6SGVyfCjg4GwhGMJRMj/n84WzAkTX2W509ULQq1Ep7u8bzM+uzFu22rUeYcG8oryyXkcbvnXHuhXACKRxxveSzTrR/avyd8CcwOCAlJd8VkQThrLNTOZc8kCD3X5ajaWUGpQCYzGcATxF8qjvPB93H3AD0fJTY3hYAA2D0Rh8ifAwD3tgn8Lmq+rO2/3MAPtf2d58g+3wAP4tbIVs6g67iMBgMhqaw2v716n6ByQCGgcEwDT/3J1hcjgOla4tAhmCIYIwDxzGgQ6DHEUAAVaioOQs1nWFhAmuWo5gMpWC399VsyRvVRRZ3VB8VpKbLprNfSH4A7mx+PYtbP+/FnI+SfIC4b6A8CICUvz4S0MvgWc+RiikwKgIfiRARmrdxq2d7fAJJyRZ6xKa2pZa9piEoIw0MUlS+qblnmdNXYH2EmKMAsy8pHcfWjiIDVG7tcQ9I/CZ8aMegqqrI4na6G4S+Rfixz/mc3RXo9o5i2ufD/QHEBK7GBEL4x9XA4EpUP4/1oqbtALmmhjhELM35MtFQhaiNfTsAmXs8hmbFMbgi9lIE3mrpqq1SAe2db73kBlDQg70zaztN+Sh3nmhdIc8t47W6N8+vKZ11L+ylXhgv6JwM0pIEDgkAaDdSYh20bMd1zf7H7KDDQ2ZSWspSyksPikZMR6DHTcz0OvOaSsDiqRHi81MUtT12TDHk/5MDAj8vIp+nqj9rdP8XLP7hT5ApfYvwC7/4i1sRgaTIkuyGWQD5Ba4bIEjhTzbgtH+MK1TnwGA0/hg4xhXXITjGwNCD/AGCoTInDdEkpdmQ/rulu5fSL0c7ICgk42bD7p7dEKhpeyeg68yUDQCQOi1ntb9FmWkVThDaMUcRmi5dc78RzF4kywJp+1qTZLHEzmISodcei3lPi8CgVbmbR+7D8jrT3rOpqtTLafd47nsxHbxK7nrlhfD7fJfHw5t+huxPAvhm2/9mAP8Rxf9eGyX4agC/RGbD8wJVujHxtM0DCNyTPwoQXMfGIeiOQBsdcJ/A09M1QcLT8A9Y+lRjdU2/t9h3YY2/3
TDb6w3a749AuKrQ6EXd2bV0ROkHebGUKCmny109cXpo0aobmx/kD4iVfY/9aIE/eBVYpXqqNclZguU1ulShBEiWED938K0CGOUmOBGqmOoDaEAWqZe7ox5LNVI93w7ZT97EFAAeYAIi8u8B+FoAnyMiPw3gXwXwXQD+uIh8C4C/BeD32OV/GsDXA/gEgH8A4J9/OCcLEZDS/XeOwckC2CdAv7EygOv1CdenyQQOvUTnPK72ddnrwDg2jkHo8vyYn623q13vbPnKrXng5p/c6BBFOFkbeu/VhU766c4eFwZQ4gTSEqk+EUssjvniDd8Jlad5IAIZWgCgAE5Jrx9XlFqBoAt7CiyDJgPDfL5u063H/GzjPFS5GqBirMj2nSn4U2YtaDAEZwvcftUUaLVwb5hpE+6CgKp+08mpr9tcqwB+/7NyEOE+5gGEdwwIJLRj1BGDHCpkdjBwuYA+LT1wXK9zfwwchWXM9Nwpla9zJjtwJ+a7Dr0G2W+3yCPfs6l6WVOrANEmGWXasjzLie78toPfP290/0q4D5UAwJ9ZkeluWC+h9Fwz37ijkBv+2zFseVDC+Hycl8lTINuFJw3pFOzcZ8dixdTIz4lSoFZ5OLyIGYO12i0mVgCW02uW4LQ9ZhMqgcOI4cMxrhBRjOuBcVynA/BI4ND4uU/gmPvizAANDPAWgGA2305b02Z7ru62DiIpsNovPsnwQzK3+7if3xDxjX4onZYZcZbMG4dScFAPaiym0HS/8TFFVB6TySH0eKlovVfdNUWqEBGDRnWgzIad+znMyMzluTMGXwQIOB3jML3OBADcuHlX0mUQALjvgF4gCgeimQsAcBwDl/ZuwShDj/4SkmLImFNdOxvgCj+r+x08byG7S/x5Y8pSH/vDfFbVflokcn/74yLRwybfy2SW2/cTvN5PG0AylUbOrXMlEGxG3YtWfbDUDMBST1TIJV7fKnzfLSSmG4d5IJNFBai53yWy7wCRfpbnhBcBAjN0JkDIdgOlp1ZG8Y+UyUTNRHDzAAAuxxXX48Bx2YBBzBacpoGoVDbQgCANhA8jPGudJI2MqOd32Bsp799bfdN0z9S6nu6dC/ubhewzzAIQAhJCX2nBGxZ5NZqc7ki86+C56vvJD9IMQBV8pKkQzxKSBiF2UBjNe8gEgBW91DwilSUkC9i6P7T5CpYZhSMmDgmA6+WYwn8dGBcGgPpGYpoDA0NlojQ9663VARdxPXPaUXv1nCaxzapsr7knExuX1J07HrnmVqqP13PUo+3kkGBjCw0IdmXegXrhGrTRtncayumeek1FBOY7YJOAcmCAlkxAPpWYgFc/Nd+JlDgbUFThLG8XtncKAOByHbheBi7jiuv1wOXSzYL0K4iZA8IjBv5cnzvwjLKunxLZFm05d3rJI+3+iJJIBdXCThw+GsF//jM8MEci9eiaJIYvU+SLwDyEAisP61tn/udsYJNz8UcJul4JXkNDifWlL2LOdu45yunlgMDSv9L2yfPtIq0HCQQ8oWg6+NwhyD6B6+XA5XokI+jrDvh8AZswVIcHyf5oWdlXvwBy7gjbgXcv8nLJm7BYvuGWY3DbZx/r0DcTp3Pa4mrsmzKsKqTJAqpKAQPBqTjffURGBAIEoaeLH3II5Vmp2OOMIPLKIxziQJbxubjLY+FNJwu99SD9H9l16RfIq/04/AB2xmUzJhRthgxjrQEfLbiO1WyI0YG6omv9PdsR+4z6wO2+eM80ePghN6JOAeiegO8deo/C5XPcgfsg3EVQIICdzXxtf9490yluXY0FNhek33DXEbtC1/mkIjJ0nOgA76s50IXcYzcVKet0U7U/p28a0hChv1wEyDQFuj+gzDlwJjCB4DhykpCzDuDD+wUeaTNZdtYae7Tpi566xwweUfo3E7gVn+c/LJYW/3EIJ80LICFJzzpzgJ3Qbp5TtslSY0zA6ksAe/kpmUGpSuXUOP110liYF57PsPuFcI38Bc9kAi8DBILSZJhDWMwCGlC0XlOZAJkE4RDUst4AAFyvB66XC40aDFwudA+9pHQc5hw80ifg7xDM5+7A4ESCInp//pQF3ACAW+FOUpt8PXDtEp7nH9iL/RuaBCmLpZcsWtgFCKBtv7L1sdPCC+0xlCgJv1+nN7ZrtXuW+P0L8SFzBq4GdgEUz2QCL8Yc6GGHzD1mrrUaB1bvPEloLOaAv0rMaxB0kyAmCo10LioNGRZzAG9uErDLoxV+PbwDAEsca75nZer2s/eBKkB6/D1yf155uuydeltuhNScxYtO50sZT/XMSU8sPqt9Qo+0QaRIiq+oPaFjYgLVbXbWoW6HF8EEduilivSAMv2xO7RdO4XR6T+bBhUIKhO4mImwMwnSNJjrGB7JApTZAPAc2/YE99uR1uOH2rV3xk3yN/JabjlhBPuwo2TrzUoe0efU1uOhM8Y+IzAvY4Hb1f5Z6mW/NcwUQH8XoC44fosFcKrx2nDL7GQCmrLgJYg8fDgm8CJAYIbWHFTAfQe3g94HCQBGdwxe/VXjKwQwFnCpAu/zBGhpsTEU41AcZXQAy1Dh26uGc7V/OorwQNxNRNEVgM6Dnly2iQzfiR8yEjSj4EGQ2nvb6WinEIsApcDEDeiAsD7xvJ5Z+z+jIxS/QJoG2d8JFBiwJHlCvmdBTOB9nDb8WPDqqROIOs2rzp4auI8tcwqI8sfIwNiYE+PAOCZIHHrkcCGll0j8tpDhpEA3wk1zdpekoHSeYpvfYu/bDrcDCdZyZPtrjZlJun8lza7Ikdbr/F4WYVOau1LWLDoQ2A3+mIKVaqnHNxbyGtfyYisH9Tyeb1tGSigPSK0fLVrBqvbpJfGHwosAAafy6wlthQrJJ6/vMX8HfUfuOOzz0gft8z1MFanzgE0KWrdwHPXFIn+pqI1CLB3VtSB34lvAcE9qH5XsDxlKlRdzfNeJ6aLeuRdLIeuCQYABpwPy8vo2ku35/hTaWjkVCEziAXQJUhL22C0KXTKfxB74afNZXDd1/yzOGSRhYwsVENbL1EAKoYx8nsxz9M+LAAGk/CzRHNze8THTQ+aagEd8S/6IeF4uzMGigwFzQp4AVCcb9dGCfKlo+QKS/cQo2UNK803q6iMBAl3+7rR0KcRJ79VdPAu/dhCgLRAmWK3fyhBqwlIEMQybBgTzWxGopkDPYgjcSWMpXSP5jHqLLnWwA4IEWB9svhWo0RuAlPoZeE9BACca8qRyQoPLATnGXI/ukFgfIBjACRhUICiPCyQdun69qPoNjBG0jlo6QO5U4fpQ9YT1k2BvKbiA5jGDAWntjmY7oWGBoGsKCLhzFazZ98DKk7MqINAj7VCIFSyMIItSw2LB9Qii+yg40O7zvFSg2oGCckY8a+X567PpAbmNftf74mPhxYBAz3UIP6MAD5+I4BBAZa4I3JmAmwVyHBsz4MRzzBUZDCDNgQIAZUnyCgBC9nUVJC7dg9UCtwd7kHbNm4Z9XtikyU68aq9200nyJ8K/sAAX9P7pbWYA93Vm1tkOCE7uuaX99w9Y2MC9tDowbMGohxuOvhymbmxp3DQ6l/Bi
QGBbTq2NHmMFMXR4QGSu+lOYgByFAex8AjEJicbDWMsM1f1biPHVo2QBWDprFIA2H54FtMpB+Xbbh2UFlMfahSoAdB/HinGbnh337UGgOs4UucDrKACwrWNLI0eTEFVzCwi6oFYvf6+D9AvwoBSzgb2cboCA6idqW08vJye39ddtfde6Gc+0O18ICCgW7Gp9Sdo+OwYP0cIExJcN6w7EZSHLmoPoaOb8i6XHt/MG0oEVQ4Xcmqea/7lgsOtIb8cSkNqV1kdubfdK7/dZ3Gi5qCPer8LPNP/W+xp+j9e7FJ+AcBYW0yDKzvH9JKbQaRF1T3aeWVDgNNxob2JTe91dYKYmab+dibR1st8Ij3yL8NeLyA+JyI+JyF8VkW+z+Lf7PUKtv9AUpTypxQsQHMQExE0D2ZgCDgSHEQExp45QheZqxjFEeL0W/0A4DQ0AhmZGy0zC+PeA6G870r27dLv7WLhHGVNrx5FyHAmt/yNBjevv/cpLWgOdAcRn6IvJtc/5jobvqbnGOeW2A07zOFp+Oc+rE7PVxYmwltwQaOqi8na5J8XFCuzG7yw8Mm34CcC/rKpfBuCrAfx+EfkyvMXvEe4qH95AFJa5AeHwI5+AAUL4B47DwGGznPVMreQkKZVT//w8mQ8RDuqofRiLSrQp5wkgFO8S/zL2Tu2dPvO51wU0kPCEpp4HVFZHOwcIBoPspEXzK1/Dgq9NmDbff1zYwyb/O6EvcRpZ3oFB7Yt7UOA2jzUnNCenPSqMer85MgixFiqXZhW3en0cBB5ZbfhnYZ8RU9VfFpEfx/y02DcA+Fq77PsA/HkA3w76HiGAvyAinyn2oZKbz1kfTCUFwmrbsIBDBVqYAA8briMC+S72Zo6AV6TPNhRaZSgmE2kZIsRS2ZnupuneSqhPWY/fLMXcT+2UgoIon2LpUxvNVoBso80T8Bt4MCj0zowEAN7ySI/qju4jJvX4vAJngDnZB8iPqe3abPUNxAOav2E7s9P8Flx3Sn/7s+KNwo1z0Osi9htoPSc8yycgIl8I4MsB/DA+5PcIpXyG7NcthQQcnbcZaRRfIYOYAPkG1q/f2jWAzTfwxqmabujAMQRDQH6BCgh76ldnDHr2e59Q3PdY54UPXPfmN5ymo9bhK+1MLVqbrLZVB4Iq8Dvm1xhCmAe1bv2hHQC2QIAqjSyc8zzsnQaet39mhwt4PcaSlv2pPqbqbEw5pjTb9WtIYFpusJsKMD6g9XfhYRAQkX8UwH8A4F9S1b9XK/z53yNU/gzZF32JntK7mov4x0BwyAE9hlH+ozACYTawMIH2oE77xhx5KAuNKA8RsoMwc+tlaQ7rPRCclPRdBu1HIajJlro6KxpN2+3gzrmj2qMI+fI5+HLfeSdfgAB1hCDj8+UeMeFWAoKqeAnQu6Bb0kLHhP9l5CmD1Gs2V9ClNvLgk+Xt4rjPd1bT5TnhIRAQkV+BCQB/VFX/Q4v+0N8jrKHTHUTjRz5ggCxWMTTcNzX8UbX9QaaBmwdHgkB9MzFwIDTREIVsRgjWOQL1t2gILqEkvu+QYavHFXe/VHx+8zMCaVpOMvetPRwQFtVP1/FxpHnfXn6OTbvt6o9Qc1b0RUNXFkCEHKCz4BhFYQjYXO39de47mNz+FuOt4P00miKY2ZuBwCOjAwLg3wbw46r6r9Opt/o9QtX6yxPU+aIxc6LPbhJQflnItxccxwWXy4EPLhdcPrDf5cBxyenER4AKirMu7K/SKfPjJF1zrbYsdZ6uNR8Nm8tPhaCB5+NBNhJz49J2fCuk9j/7nQn+eTkCyMv8DzvHcS2TfaboWVyXpc7mk/llfCNINY76bzncBkHriCeBWJIfPQiiHh5hAr8FwD8H4L8Qkb9icf9LvNXvEZ41dtqhHtKOmwfiWtJ+/ALRcRy4XC64XC744IMLxvgAlzHwK9Q+PsIs4WgOxSRgMydhe6WmC5obw1wjVyHaaSB4nulbdO05N2vJcfDkhqIETzXUPjjLcqzlfssMZj7DVrM1RhUdjO/vanNRo9p+jwdplbBTBNjEzegOBDxfpO3fzkU5iuomMuGiHqNa5B/oQLCb96UtZkdoSu2RAn0OGXhkdOA/2zzfw1v7HmEX9ko3aUuCz5Udw4U+HHgILscF18sVl8uBMS744HLB+OAC6AeAatH+PoxY3itYap1Qt7GCMXjfzUHSTH6oVIAow60Wu0EB7mntRo17uAc+whaL0gczhfznDnbOcBYg0Hr+ROgfYQDsXeei70DABX4LEFb6lMUuZIIWEQdLc7HpEcVfASQBgIGggi0fb8u9azE2C4IVnKW0Dy9kxuCt4FpGS41N1PXeZY1+kGPwmFT/Mg4MW0dwfHDBBwYAqmoAgOg0fDzBZUX7tMVM+xcgGLYg6VzINPR8CIYUb/SZGN6S3VXb36m5M0ayBO++lqy4NsnuGaCgmF/IMWqiBmz9a1up/DuYZw4rVU3h32s0rrcs/FbYcWzZQNf2ZU/acdt7fqC5KNSfpD1nS5hcVwAI5UeAxfd0JvCRjQ589EHXIz3rPFg5qze0OwTdHDguc+kwHVD9ADyU4iaEMwlLJqmc/ynOCs1/W9trLjwigip8BAS1Xz2TBbTzTs/5jqXb3gELzl/dN9eY41gAGeZHWAwpAjBKx7UHF8qA0mOVey/YZ5Dt7mkzKHUAPWUC/q+BABeZmUCNf1D4N6xgQyAQQNDMkVmSCpUFCIBwKu6evTMFgA6et8OLAIGpYXe5brTQeyLWBnRBZjt/mgHGBvQS7wQwCLj2y+TaMeXC89l//SWj44gcUVmcvqzk86xOHuqGj6r7RxPcAYA6O/BhNeSweQjp3GFrZyP3CIEH7UcHvuUMdOGf+2wSnNF+YVSHDy17IbdPyDq4EW5WZZzMXpSW5RkTENrn+BuS3MhTmgIf0RDhJydsmMAuVtaOFZpckC8PkWPQVwxmN64OFm1DY+qku6cH9PZRguYbABSHYBl78U7GKP9offQzH4aknoUk2TY0xoAgCOBUwJybU9pjok0mUP02/QFAgLlSnScQtHs2B/d9AqtvAA4M+8RPasTzV8NjjIvUCbETr0++xUGR9Fzez8yH0s9eSnUZoPB4eEEg0EIRSBbWipbwTkmN7a8PT3PggF4upGEcBMasvDTyp3C7aiugABN8mBlAjCDmjtMbhgoMCGRMRgIcC8WLTD+/UtC73xTKVI29L+467PmTTTe5hofEaMyMsxEBOZloQ7k8z78ucTkhqJ03W2TNr2tVE6ybQLAbLjwr+7143eyVHLU7Oht0AJAESxf+IugMCJWbrrWXwEE9/KQsa3ghILAOBdaza3CNBFjHcwA4BOOQ6RTUA6oHLnoJuhRpBjugd9Z1hJBDfW4A5yGBoLIBzTkDw4Yfp00AHQAOTQCI4UFv/BsdU88OiAuXS07iT5J+6Mqi1U3Y6V621x0w2FxITVdHFOa9utk6nXVh8cQyQ9L20xl4BgaI37bUevNwrZC7ApZA4IZIPx9MQJx5TSDP+koWsAwela6gpT7fb3Ng1xBNe0egdnA7T2QO+U3te0yNbObAFOh
LTZ8/NEoafK7KMsIZw+DEdiuzgUEvvMwvGNNXi+xGHx5MDYpzSXShiNObRj299yT+3jlgofEiPh+A6b/Hwy9KNcRlCoQuKIqsUa1HBAQzCWcYNYMuyLxd5gfgFhNIHq6ZbMkj0/N9PXoFtUtumBo7YFIwUE5Hq2v/XM3A/DBS0+1soAv+e+cYBKqwcWzZd8eaV5yYjlEyBQ6Zgq0H9OIMYwJAILMIdFwxxrx2yLAFGoGBAQyZnm/rKTx2HSyhsYB8vXjOEwggwgE2JcRafssBniHr/UwBjIfmGO9C6qD4ayZAaCTxl3XmNeIAh84KKgvYFzTrF/B6tnjJFNIBiDjm4b46NHhuGsDAQa0NhLLiTI2BIGuU8t+lrx020YebALu6nuTKaX9/ePa/dG6eV6Nv2Mn6aHgxIHCWaabx3KFCAVF/nx3igEj9eCgUwAeIDnMcgnH19wGuEJmrFuuY1w7xN9NqxbOZsAwR0odOJshMANBDp71Owu/U7zHqflYvt9JYQeYcTJoa5GhtV1ili5kGcNNADRC6X6AwAhT1VOcGOAPwmoU9J3OcY+xoQv0YCDgAeBqABF4iyrCpB6Q2rqf2ZsFSz2y6FIbibWgDhEIrJDELaPVWn0XSQTvvrznQg2ZnsIg8F3QV1pBTI/k8gUMFwFE6mjhYiECu1nGu14gX2OfMD4XgsPT4qdlFdnMD8jcXJcVxJCzrNAFUHRCq8O6QO14v8b7HQpmMdqbndzAB2ACEC+wWDYrqlni+ho0Qp8AG/zwkreWJ1QkDW4qdfbWpMi4kJ7kR6kfMAYD385H8drAWptNrzYGg5n8Bgl63khsBn0ueEbMqifYXVnBPnl3w4YrpzvWb8IJAQO8eBS1rjqQK0rkXAu4dwgHCBfSYZoOYto4FShYtgqUDuXAPmxwU3zq07xxyEADmLjThmM9zkFpFdS0/O4dYTnKM2euBnE6aHfguLdClS4csA8QCLL6PBJR7GSC2tWGATW/UibMvs4tv2/RvBgDchlwF00yT5fsk7qCMnC8I6qNJCEbaWYnnq+YNCH9Lq7x0PKdQJ+t0xrRpKCCf98zwgkCgh50OnlulfY8hPlTvzzaoHeQQiKYPwY/lOHAEQAhEDngrB5vwp/mowDFNgesYELmid/3QAgrImIg/REOLrzK53s9XppDPuKCNMCpp1CEZA2k5f5ZrwiCebxiapHNez5SSm3IpDEjhOMRY3gYANoJUhP0OAEhWSGU+HCwua+WWajV+kDhQhb6BEgjEAjWoDwQjDGFP4ec3UxHnap2C8/CM8IJA4KyySbiVAMD3nQ4R9U4K17SVYNuBDlm1/4yfNr13Su8x0VY+OmB+ALkKrtGrSIPYwwWKIWNqnaHT234jLARUWMhTw4eQF5OgetcVCB/CDnR28lDzct5C90SlXGt0N5b6ktSwInNGhQ+bndn1WwA4vXYFgcoCHhF2LqnflXE+/HkGULuyiD0ye6qz2wSAumaFm5vUzVveAijfVxDYN0HV7EXoOwAUhoAKuFYxSrRzFfhZgQkG00w4/O05bzjKizfY0Ln4SL0mu8wU1hEoPeArFqlNJtrUwNKO9IKOCYzXgYTgdJMg6XYmo0lN7kl+Cy7AboTcF5tNmUILGxAg8zc3k3nttGgVJk/wOQDg25avYEWZ6xvvdy0ldBBZmQuqKYCedycDlfUuTmdXNrS/R4J8xsOojBcEAmsgUlkKy97PFQB4NAGgfidJ1bpJ0E0B9eOG5KVXEBPQMTBEIGPEU6OhQ6vBrlHTgtSgVF7PdPOrNUZTKX/YtPHIygJ8GGU7cnizs98RdRPmE3tgLVe/DZLvH4AprQsOFgFmAesa9gwUKgAkENSin5V1F7/eLyHUFaiKKSB0dZfT+IagkrbXWMWqLFTTenmk6o9xU+PB8EJAoApujbe9ru15SInspNSVdqVpf6egMOE/1DU/lg5TWcF05kUbBr1OygZjAjuXYHSEIRAZUBEMlWAOy3DOVA1rlO3M0645rVxAjDf757MDCIBV+G8J/h0NcpcBbE9S+7Lw23Fk5wCiEl2ImvDuAADt/C1mcFs69qXbQ4CgLxDGQr81CxgUIp2axuzmu9EnZgHY9BErm6V/7pFZwwsBgV1gDdlBws0BZgK4LTyhmRUH5tTivWNQIEqmwCGZSCJAMoFhYIBhbWMgtOsIYwKRyPQHPDzFk3kqAJ/ookypw7udrEBoUpL7DXJWmnnjgXAaOs0vLbCRi4wSLN3tBkq42PjbibV4FmF+2DQBsBHkxhBuaP+874h7QXXn5bSqW8wCz3cvFItxHC+CT2BFXCGYgI8+LP6tZJnJAtrq1rMQ2eeJBby3PoFbCBzbwgaYCcAhlH5wmUeo0Nm75inJznGIbh2DcS6ut7TgZop5+nXOMhQFcPRO2bSB2jCkzRuI2XLIzsddbnYWihdUIe8v9ACoHLv37KyDe/Wf4SYS3Ax7LYpSl3mlxEXdJNiBwJn2r8zgKG2wzWCJfrBg5foJbAsIUD7tgPoRauMiFXz4BEj4+1eOoN3wzSSj/h4Md0FARH4VgP8UwGfY9T+gqv+qiHwcwPcD+BiAvwTgn1PVfyginwHgjwD4JwD8HQD/jKr+5MM5KqELdhhLUXcuQmEo7Hqd7/JyX4sJcMwXfRZWUDtdPEKRWnBiwKT7LohXRCdw7X8YAxh+fCQIbLseRVYAyEkloe3VTYL8Ll9lAeQTUEq795V7MkD06nmksz0rJiTJeg4MAgm+yQ5q3KL5b5oGXO707Jdof8fjPPPbwiUAeP5RBF8aCIjSM6LzNlOgsQBnDWs3l6wzkWfh2COfIftvAfxWVf1NAH4zgN8ucxXhPwjgD6nqlwD4RQDfYtd/C4BftPg/ZNe9QdDtUcHNAATab/go/Ota+Whb4e8VbH4tR/wCUS4qcs2JQ2WZ8vyqcQ4ttgbnDsBFRNMSvO/UMa43WCRqmfW1qcNSUzVq1SXPo5lnt4YQT2nZ1zW1DVobVQdfp/39RyygbWvbngPHs38hjP1bF8z1+JfN4YS2mASjjwwgOwZXriQYnALgJtwFAZ3h/2uHv8J+CuC3AvgBi/8+AP+07X+DHcPOf508ZKRUjc8afj2X224alPSonrPxTzpd63zxVeP4qCkKjc7hQfsOgQHA9TpnDY6rCf2VAMK9vYPsO/qWXc4KQwEHlC0S7EjonR56rXkevSp2imEPszWcNdwu/pFWzpuzLVgo+YMxsQisr/78RsLqvoDNvdgI6hlIbNKsTPL2PUBPF4QBxDC9L2vrF95/GPQbYt+rm7Pw6MdHLpiU/0sA/JsA/gaAv6uqT3aJf2oMoM+QqeqTiPwSpsnwt1ua8Rmyz/7YxzZPXTslA4JXVAUFuk9Ar2naARG/GAHow4Sbxh3idl/NW6C2A46mlruKQMZ8Q3FuDxxjzO0xG/ewRq6pt3K7HRBlMv9Fof86Oxifg72AQ3EwE1TYT3DLNKAs1Ha4ZzPUqwvfEHoLMYYx2SxgfZn03+8NYfIrbnb6w+6tguDyUHtEL+cjdlGW0fMnDeD2x1663qNWBeCT0TpbDN
DXTM3r5SGdS+EhEFDVK4DfLCKfCeBPAPiNz3rKPs34DNk//vGP69ZLvtD9FDwXvqIp88bYCxkKb7C44TeddLKfLSjOAlTttWAt+VLajw053UTsy0UyIOPAkPmOwXEMDJ2AoD3dJYh1RV7MM3ZSdgUVCJxuh6zP69NLcCLzLSvbrOnmwuV6Zm5Y5GkCwQSjua3HLjQhKG5jbwCA66r/CDMKCISjVVyQJKvV65OKeV+mZgEni0ATRCn3d+Bxs877cB8VGBuTchklKHUt4Fmlj4RnjQ6o6t8VkR8C8DUAPlNEPjA2wJ8a88+Q/bSIfADg12I6CJ8dmOCnriczwMGAL7IQ/Y5JAGu/0A4IVgAHhMPfImwdTavRURHZK37ODJwvEwlExpxDcIj5CQ47trcNTUvzNFoW8ew0TWeV5XuoE3sRFfDXU0GX7rR/75irjOtywrUS2i87ZmtIQoIG0bQrmZFgBvOPtjvE619KzlHGHlsmsl69OE0T0wIo87GPUKPO5Gp+YQIZ+VZXSiTg1+k/ul6v8zeutD9iy2bl6MfKDkSqjwfCI58h+3XGACAi/wiA3wbgxwH8EIDfbZd9M+pnyL7Z9n83gP9kr+YzFIfYxgFyzgQojRMgCFXgEaQVdv6AxbY70v7jTlE9uPbFYk2HYGk0O3+l/dzWGWHsLEwbEGXf62HdT5vR68KrMWREN91Dy2Z7onhfqK5VN2lGnGtlP1+FKuMEXYh8y6Ab+U/8Jx9KalXuT0MRP6/H0dOg5+rmlx1nKQKd8/4hJV1vDxfca+kj1/xdr3ja/Z6e4ne9PuF6vWL4jxjCspjuis+n4REm8HkAvs/8AgeAP66qf0pEfgzA94vI/xbAf475vULY9v8qIp8A8N8A+MYHnrFksmjXbafXAIjqESei7NoPSAF2NXLDljwmPYB/38x9AsE+KOw+TAqxBUtEosF9hmCCwoFxzJmGaozEJjTCV+ypimjV/skCJotg1sOswJkAuuK069x0WJqkaxSvZ1DdU4Pt+tp6LMsZFsIex1MeMgWfacnMrj3HZXbRP3yDlM0mG5GXVdvvrytoi00prU5VNZzI1+vA05W0//WKp+vTBIeneTz8m5e0sG0oisF+gm4a3w+PfIbsRwF8+Sb+bwL4Jzfx/z8A/5Nn5QLYZjy7XNFD2FhCRAW8m1ADq+sckoobIBAe4EPDWQggZ+W5WcA2HI3nTlPDXigyABhuAjQqJyaB4m/UgWg72fG7d/+yfCjmQAi7YxabAl5xDACc0K4tCsXXeqzZHgujuNEXb4OB58euU47JMvnzN0mhCDjHbyui3iabeqYm2Z9raNv7ZdaR9WPVYIyTDczt07iGxp/AYNp/JNU/Y87BnjfK6lZ4MTMGt3neUGCc7VtI8pmtxXZ9mgjO4FaToIwS2AxAHQYcLgBR8Qh7zCk+IDZfQJpJ0MBA8+3D2RcbGIBYTeu4of1ZTzUgYMzzOBagE4u3N0E+tygz8o94x+zXoO7v2lhF9ieW0GbqB0sRviQwoVwtgM+s3PeP9TlF8weJbBOI6MDVDANilJsEf24R/eU6rni6kj/giUyDJ2cDT6b5URWQPUS5T+J5TkHgpYBAUeke5QVE3XJlxo0N+WTd9UZNxbe+rroCwVyvMMeVE+HzyZoTgRwEBDjOAKD9JByDYVJSjrXkflX5fKmBA0m/akuKQaEly4uFRgtQnYZpEHVvkdEBs17W/aqqt2CwPX4EprBve62nxcaLy/hINxEhZH4ksC6J77LllUf1noKqse+0fagxgKvNLxmm+Yf7A57SLLg+xUdt9u2T+xu37N3wMkAAjpYtLsUstlnBiD7pHdSI9fIueA4zxdHc7jR/A4JDjhiPR2hsy104n3JlIQeB8AOMARmKIxYirT/xNxSRRoybBEexD4DSCdU7dr5CzBof8HkC9daFVPC9pT0QFbwhA40JuHbKVqNbKL0qPZon1vgT3p0wwsCY5zqGrqs6ExpaHbrJUUdm0BEg2v6MDXA9Rq/tlB1G69kUuF6Lb+CJ/ALuEByLiXZ68OzwckBgF9OYwMIKdLl8M6pDACBMCWXDBKrmP4752m8AARmFQc0CCNzbP9cWGkcdCbiOMScLtaGdXIxE5qfL6Dds9yg9bWUBQk4ybfIR/fkWGHSB077rYFCZQDhk+S+D9Jrcpp1XOr9mZTUZmHEs5fLdqBs+3n/7ITlUX3DN0SIfsl2XIUCH+alT/z6El+tRXsfAEw8POgDQiMDT01NOKqN88PEpMD0QXgwIbDWC003lzte2SQXsJhSFmXM2DACErWwGAKfkuZ7AaExhp8ncKcgTOwCBXAfkqILv7MBnCoY5gHyV3p/CY7ejHJNU88qYZ9rfUKEMn3d1Rlqwavvc9nM5IlOFv9221Neyz/R5d30fy9+BSweDzgqw0eB0/TqBMhc89aXOSvrt/pqNmYp3y9GBwN4fiWHC68C4Tofg0yAGEGAwj72OXBHxVpBLsnl5nhNeBAgklTw55/s7JhAYUNep5xA02zR/VFQfDRCf9dVYwSHzVWFucDIFkgnMhgUkJ4EUIFhngU3QOeZbiEcK+xT8JgDlqKl0pvRtX/g24DHp6IIfAKAbAGhAcMICdpFn/VV3ZWeAAtnEG3MmgoFMrS2vlJR+sWcSOYi/cdyp/25fmQmAHMf0IlBhhNfCBNwX4Pv/3dM89qnW2Yclj0Xg34WcTKb21XvhRYAAkB2pxTY2QOwgriiX7w22aEGrnBiUB1UkMNcPyNeC8130agpwnp3exUtE1wkCnQHEsfqxjQ6oYs4yPCBGBSYYCOz7RQDm0mTBBqLzExsoLECTArHg3AIE1sZF8L2+W71vAaC1i/oD2E/Q1+NpoZkERQG0/QIOpAHLfitihZbK63PRldlXgvbHRAruT1j37dnFMch+I+0mIs0ItBGCp2IOXMMcUNV8ua35r2LGq0NXq8N74cWAwC4sGj90EQgYUjPNNrXGIvgOubedaNioSDUAGGkC2L8DvvR4mgyeh4n0tYGvNhU4qH8AgeLgTmA+BAkwmt8uHACOAYxDzUQw4dcqSLFHWlD51MIEZoV0Wnmz/kt9U717/A4ACHR2+eW+ebYfcdq2Pf1+U2FAhTZstPnC4zsmlAh1Bum3+NyRklSCsvePgWkS1JmiOau0TA2+9pmDT+EcVCiOIZDjKB/fncvfzT574Ig6eC+ZQA/7F4oYECKqXJDr71Fcter2QTR/N1DUBZ+dgvEJMgMD8eFBbXO8mR1cB47D1hmU+eoyfM7AMd8vgNgLRzgMGPo8b2MLAtQFP73DuuAHCXY7KMRbCmBS3TZ169d3Ady3Q41Yzt2i7wuK7Z6jy/MLszE07JMrPf8Sv9sSI9Sf3IRMcpMA6XfX6eA0dfzap5Hv3g24xpDhnByUIwbXcbU8WBuawE8zR8Ph6ebAcwAAeMEgUEflN8GhNo6Z3q0TJoZqUCcVAN5gmo4attV0kBOHfhwXHyAtnyHT3F8a3ljGOOZ6g9dZxkPs46VyQI4ptMcxvdiKucTZBAZiL8ZSHAhE/PNrXn9sBnBtSkCAL4+W7
...[base64-encoded PNG image data truncated]...",
+      "text/plain": [
+       "
" + ] + }, + "metadata": { + "needs_background": "light" + } + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "2. A user-defined Python function is used to perform data augmentation. During data augmentation, the multi-process optimization solution is used, and four processes are enabled to concurrently complete the task." + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 11, + "source": [ + "def generator_func():\n", + " for i in range(5):\n", + " yield (np.array([i, i+1, i+2, i+3, i+4]),)\n", + "\n", + "ds3 = ds.GeneratorDataset(source=generator_func, column_names=[\"data\"])\n", + "print(\"before map:\")\n", + "for data in ds3.create_dict_iterator():\n", + " print(data[\"data\"])\n", + "\n", + "func = lambda x: x**2\n", + "ds4 = ds3.map(operations=func, input_columns=\"data\", python_multiprocessing=True, num_parallel_workers=4)\n", + "print(\"after map:\")\n", + "for data in ds4.create_dict_iterator():\n", + " print(data[\"data\"])" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "before map:\n", + "[0 1 2 3 4]\n", + "[1 2 3 4 5]\n", + "[2 3 4 5 6]\n", + "[3 4 5 6 7]\n", + "[4 5 6 7 8]\n", + "after map:\n", + "[ 0 1 4 9 16]\n", + "[ 1 4 9 16 25]\n", + "[ 4 9 16 25 36]\n", + "[ 9 16 25 36 49]\n", + "[16 25 36 49 64]\n" + ] + } + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Optimizing the Operating System Performance\n", + "\n", + "Data processing is performed on the host. Therefore, configurations of the host or operating system may affect the performance of data processing. Major factors include storage, NUMA architecture, and CPU (computing resources).\n", + "\n", + "1. Storage\n", + "\n", + " The data loading process involves frequent disk operations, and the performance of disk reading and writing directly affects the speed of data loading. Solid State Drive (SSD) is recommended for storing large datasets. SSD reduces the impact of I/O on data processing.\n", + "\n", + " > In most cases, after a dataset is loaded, it is stored in page cache of the operating system. To some extent, this reduces I/O overheads and accelerates reading subsequent epochs.\n", + "\n", + "2. NUMA architecture\n", + "\n", + " NUMA (Non-uniform Memory Architecture) is developed to solve the scalability problem of traditional Symmetric Multi-processor systems. The NUMA system has multiple memory buses. Several processors are connected to one memory via memory bus to form a group. This way, the entire large system is divided into several groups, the concept of this group is called a node in the NUMA system. Memory belonging to this node is called local memory, memory belonging to other nodes (with respect to this node) is called foreign memory. Therefore, the latency for each node to access its local memory is different from accessing foreign memory. This needs to be avoided during data processing. Generally, the following command can be used to bind a process to a node:\n", + "\n", + " ```bash\n", + " numactl --cpubind=0 --membind=0 python train.py\n", + " ```\n", + "\n", + " The example above binds the `train.py` process to `numa node` 0." + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "3. CPU (computing resource)\n", + "\n", + " Although the data processing speed can be accelerated through multi-threaded parallel technology, there is actually no guarantee that CPU computing resources will be fully utilized. 
If you can artificially complete the configuration of computing resources in advance, it will be able to improve the utilization of CPU computing resources to a certain extent.\n", + "\n", + " - Resource allocation\n", + "\n", + " In distributed training, multiple training processes are run on one device. These training processes allocate and compete for computing resources based on the policy of the operating system. When there is a large number of processes, data processing performance may deteriorate due to resource contention. In some cases, users need to manually allocate resources to avoid resource contention.\n", + "\n", + " ```bash\n", + " numactl --cpubind=0 python train.py\n", + " ```\n", + "\n", + " or\n", + "\n", + " ```bash\n", + " taskset -c 0-15 python train.py\n", + " ```\n", + "\n", + " > The `numactl` method directly specifies `numa node id`. The `taskset` method allows for finer control by specifying `cpu core` within a `numa node`. The `core id` range from 0 to 15.\n", + "\n", + " - CPU frequency\n", + "\n", + " The setting of CPU frequency is critical to maximizing the computing power of the host CPU. Generally, the Linux kernel supports the tuning of the CPU frequency to reduce power consumption. Power consumption can be reduced to varying degrees by selecting power management policies for different system idle states. However, lower power consumption means slower CPU wake-up which in turn impacts performance. Therefore, if the CPU's power setting is in the conservative or powersave mode, `cpupower` command can be used to switch performance modes, resulting in significant data processing performance improvement.\n", + "\n", + " ```bash\n", + " cpupower frequency-set -g performance\n", + " ```" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Performance Optimization Solution Summary\n", + "\n", + "### Multi-thread Optimization Solution\n", + "\n", + "During the data pipeline process, the number of threads for related operators can be set to improve the concurrency and performance. If the user does not manually specify the `num_parallel_workers` parameter, each data processing operation will use 8 sub-threads for concurrent processing by default. 
For example:\n", + "\n", + "- During data loading, the `num_parallel_workers` parameter in the built-in data loading class is used to set the number of threads.\n", + "- During data augmentation, the `num_parallel_workers` parameter in the `map` function is used to set the number of threads.\n", + "- During batch processing, the `num_parallel_workers` parameter in the `batch` function is used to set the number of threads.\n", + "\n", + "For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.html).\n", + "\n", + "When using MindSpore for standalone or distributed training, the setting of the `num_parallel_workers` parameter should follow the following principles:\n", + "\n", + "- The summary of the `num_parallel_workers` parameter set for each data loading and processing operation should not be greater than the maximum number of CPU cores of the machine, otherwise it will cause resource competition between each operation.\n", + "- Before setting the `num_parallel_workers` parameter, it is recommended to use MindSpore's Profiler (performance analysis) tool to analyze the performance of each operation in the training, and allocate more resources to the operation with pool performance, that is, set a large `num_parallel_workers` to balance the throughput between various operations and avoid unnecessary waiting.\n", + "- In a standalone training scenario, increasing the `num_parallel_workers` parameter can often directly improve processing performance, but in a distributed scenario, due to increased CPU competition, blindly increasing `num_parallel_workers` may lead to performance degradation. You need to try to use a compromise value.\n", + "\n", + "### Multi-process Optimization Solution\n", + "\n", + "During data processing, operators implemented by Python support the multi-process mode. For example:\n", + "\n", + "- By default, the `GeneratorDataset` class is in multi-process mode. The `num_parallel_workers` parameter indicates the number of enabled processes. The default value is 1. For details, see [GeneratorDataset](https://www.mindspore.cn/docs/api/en/master/api_python/dataset/mindspore.dataset.GeneratorDataset.html).\n", + "- If the user-defined Python function or the `py_transforms` module is used to perform data augmentation and the `python_multiprocessing` parameter of the `map` function is set to True, the `num_parallel_workers` parameter indicates the number of processes and the default value of the `python_multiprocessing` parameter is False. In this case, the `num_parallel_workers` parameter indicates the number of threads. For details, see [Built-in Loading Operators](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.html).\n", + "\n", + "### Compose Optimization Solution\n", + "\n", + "Map operators can receive the Tensor operator list and apply all these operators based on a specific sequence. 
Compared with the Map operator used by each Tensor operator, such Fat Map operators can achieve better performance, as shown in the following figure:" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "![compose](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/mindspore/programming_guide/source_en/images/compose.png)" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "### Operator Fusion Optimization Solution\n", + "\n", + "Some fusion operators are provided to aggregate the functions of two or more operators into one operator. For details, see [Augmentation Operators](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.vision.html). Compared with the pipelines of their components, such fusion operators provide better performance. As shown in the figure:\n", + "\n", + "![operator-fusion](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/mindspore/programming_guide/source_en/images/operator_fusion.png)\n", + "\n", + "### Operating System Optimization Solution\n", + "\n", + "- Use Solid State Drives to store the data.\n", + "- Bind the process to a NUMA node.\n", + "- Manually allocate more computing resources.\n", + "- Set a higher CPU frequency.\n", + "\n", + "## References\n", + "\n", + "[1] Alex Krizhevsky. [Learning Multiple Layers of Features from Tiny Images](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)." + ], + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "MindSpore", + "language": "python", + "name": "mindspore" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/tutorials/experts/source_en/debug/auto_tune.md b/tutorials/experts/source_en/debug/auto_tune.md new file mode 100644 index 0000000000000000000000000000000000000000..4f38c8858bd5ee62271a520ac1aedb2f10467e72 --- /dev/null +++ b/tutorials/experts/source_en/debug/auto_tune.md @@ -0,0 +1,86 @@ +# AutoTune + +`Ascend` `Model Optimization` + +   + +## Overview + +AutoTune is a tool that uses hardware resources and automatically tune the performance of TBE operators. Comparing with manually debugging the performance of operator, it takes less time and labor cost, and a model with better performance can be obtained. This document mainly introduces how to use the AutoTune tool to Online tune. The detail guidelines about the AutoTune framework, function description, and the fault handling can be got in [AutoTune Guides](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/31d1d888/about-this-document). + +## TuneMode + +The AutoTune tool includes `RL` and `GA` tuning modes. The`RL`tuning mode mainly supports`broadcast`,`reduce`, and`elewise`operators. The`GA`tuning mode mainly supports`cube`operators. The more information about the GA, RL, and the operators supported by the two tune mode can be got in [Tune Mode](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/41bb2c07) and [Operators](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/74e08a9c/operator-list). + +## EnvironmentVariables + +When using the AutoTune tool to tune the operators, some environment variables need to be configured (Required). 
+ +```shell +# Run package installation directory +LOCAL_ASCEND=/usr/local/Ascend +# Run package startup depends path +export LD_LIBRARY_PATH=${LOCAL_ASCEND}/fwkacllib/lib64:$LD_LIBRARY_PATH +export PATH=${LOCAL_ASCEND}/fwkacllib/ccec_compiler/bin:${LOCAL_ASCEND}/fwkacllib/bin:$PATH +export PYTHONPATH=${LOCAL_ASCEND}/fwkacllib/python/site-packages:$PYTHONPATH +export ASCEND_OPP_PATH=${LOCAL_ASCEND}/opp + +# Offline tuning environment variables +export ENABLE_TUNE_DUMP=True +``` + +Try to find the detailed description of environment variables, or other optional environment variables descriptions in [Environment Variable](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/3f0a50ba/environment-variable-configuration). + +## EnablingTune + +The AutoTune tool supports two tuning modes, `Online tune` and `Offline Tune`. + +1. Online Tune + + Set `auto_tune_mode` in context to turn on Online tune. The value of `auto_tune_mode` should be in `["NO_TUNE", "RL", "GA", "RL,GA"]`. + + NO_TUNE: turn off tune. + + RL: turn on RL tune. + + GA: turn on GA tune. + + RL,GA: turn on GA and RL at the same time, the tool will select RL or GA automatically according to different types of operators which are used in the network. + + Example of online tuning: + + ```python + import mindspore.context as context + context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", auto_tune_mode="GA,RL") + .... + ``` + + After setting the above context, you can start the tuning according to the normal execution of the training script. During the execution of the use case, no operation is required. The result of the model is the result after tuning. + +2. Offline Tune + + The Offline Tune is using the dump data (The output description file, and the binary file of operators) of network model (Generate when training network) to tune the operators. The method of Offline Tune and related environment variables can be found in [Offline Tune](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/2fa72dd0) in `CANN` development tool guide, which is not described here. + +## TuningResult + +After the tuning starts, a file named `tune_result_{timestamp}_pidxxx.json` will be generated in the working directory to record the tuning process and tuning results. Please refer to [tuning result file analysis](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/b6ae7c6a) for specific analysis of this file. + +After the tuning is complete. The custom knowledge base will be generated if the conditions are met. If the `TUNE_BANK_PATH`(Environment variable of the knowledge base storage path) is specified, the knowledge base(generated after tuning) will be saved in the specified directory. Otherwise, the knowledge base will be in the following default path. Please refer to [Custom knowledge base](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/b6ae7c6a) for the storage path. + +## MergeKnowledgeBase + +After operator tuning, the generated tuning knowledge base supports merging, which is convenient for re-executing, or the other models.(Only the same Ascend AI Processor can be merged). The more specific merging methods can be found in [merging knowledge base](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/c1a94cfc/repository-merging). + +## Notice + +Pay attention to the following points when using the AutoTune tool: + +1. The AutoTune tool can only be used on `Ascend` platform. + +2. 
Ensure that the available disk space in the home directory of the user who performs tuning in the operating environment is at least 20 GB. + +3. The AutoTune tool depends on some third-party software, For example: `TensorFlow` and `pciutils`. Get more information about the [Depends](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/480d602c/environment-setup). + +4. The AutoTune tool can not support all TBE operators, and can not guarantee the operator will have a performance benefit after tune (The operator has reached the best performance after multi-networks and multi-debugging manually). + +5. After the tuning tool is turned on, it is obvious that the compilation time of the perception operator becomes longer. diff --git a/tutorials/experts/source_en/debug/custom_debugging_info.md b/tutorials/experts/source_en/debug/custom_debugging_info.md new file mode 100644 index 0000000000000000000000000000000000000000..5fe5f9770e8d90a5a4047dc92bc8b50d53c0d2ee --- /dev/null +++ b/tutorials/experts/source_en/debug/custom_debugging_info.md @@ -0,0 +1,404 @@ +# Custom Debugging Information + +`Ascend` `Model Optimization` + + + +## Overview + +This section describes how to use the customized capabilities provided by MindSpore, such as `callback`, `metrics`, `Print` operators and log printing, to help you quickly debug the training network. + +## Introduction to Callback + +Here, callback is not a function but a class. You can use callback to observe the internal status and related information of the network during training or perform specific actions in a specific period. +For example, you can monitor the loss, save model parameters, dynamically adjust parameters, and terminate training tasks in advance. + +### Callback Capabilities of MindSpore + +MindSpore provides the callback capabilities to allow users to insert customized operations in a specific phase of training or inference, including: + +- Callback classes such as `ModelCheckpoint`, `LossMonitor`, and `SummaryCollector` provided by the MindSpore framework. +- Custom callback classes. + +Usage: Transfer the callback object in the `model.train` method. The callback object can be a list, for example: + +```python +from mindspore.train.callback import ModelCheckpoint, LossMonitor, SummaryCollector + +ckpt_cb = ModelCheckpoint() +loss_cb = LossMonitor() +summary_cb = SummaryCollector(summary_dir='./summary_dir') +model.train(epoch, dataset, callbacks=[ckpt_cb, loss_cb, summary_cb]) +``` + +`ModelCheckpoint` can save model parameters for retraining or inference. +`LossMonitor` can output loss information in logs for users to view. In addition, `LossMonitor` monitors the loss value change during training. When the loss value is `Nan` or `Inf`, the training terminates. +`SummaryCollector` can save the training information to files for later use. +During the training process, the callback list will execute the callback function in the defined order. Therefore, in the definition process, the dependency between callbacks needs to be considered. + +### Custom Callback + +You can customize callback based on the `callback` base class as required. 
+ +The callback base class is defined as follows: + +```python +class Callback(): + """Callback base class""" + def begin(self, run_context): + """Called once before the network executing.""" + pass + + def epoch_begin(self, run_context): + """Called before each epoch beginning.""" + pass + + def epoch_end(self, run_context): + """Called after each epoch finished.""" + pass + + def step_begin(self, run_context): + """Called before each step beginning.""" + pass + + def step_end(self, run_context): + """Called after each step finished.""" + pass + + def end(self, run_context): + """Called once after network training.""" + pass +``` + +The callback can record important information during training and transfer the information to the callback object through a dictionary variable `RunContext.original_args()`, +You can obtain related attributes from each custom callback and perform customized operations. You can also customize other variables and transfer them to the `RunContext.original_args()` object. + +The main attributes of `RunContext.original_args()` are as follows: + +- loss_fn: Loss function +- optimizer: Optimizer +- train_dataset: Training dataset +- cur_epoch_num: Number of current epochs +- cur_step_num: Number of current steps +- batch_num: Number of batches in an epoch +- epoch_num: Number of training epochs +- batch_num: Number of training batch +- train_network: Training network +- parallel_mode: Parallel mode +- list_callback: All callback functions +- net_outputs: Network output results +- ... + +You can inherit the callback base class to customize a callback object. + +Here are two examples to further explain the usage of custom Callback. + +> custom `Callback` sample code: +> +> + +- Terminate training within the specified time. + + ```python + from mindspore.train.callback import Callback + + class StopAtTime(Callback): + def __init__(self, run_time): + super(StopAtTime, self).__init__() + self.run_time = run_time*60 + + def begin(self, run_context): + cb_params = run_context.original_args() + cb_params.init_time = time.time() + + def step_end(self, run_context): + cb_params = run_context.original_args() + epoch_num = cb_params.cur_epoch_num + step_num = cb_params.cur_step_num + loss = cb_params.net_outputs + cur_time = time.time() + if (cur_time - cb_params.init_time) > self.run_time: + print("epoch: ", epoch_num, " step: ", step_num, " loss: ", loss) + run_context.request_stop() + ``` + + The output is as follows: + + ```text + epoch: 20 step: 32 loss: 2.298344373703003 + ``` + + The implementation principle is: You can use the `run_context.original_args` method to obtain the `cb_params` dictionary, which contains the main attribute information described above. + In addition, you can modify and add values in the dictionary. In the preceding example, an `init_time` object is defined in `begin` and transferred to the `cb_params` dictionary. + A decision is made at each `step_end`. When the training time is longer than the configured time threshold, a training termination signal will be sent to the `run_context` to terminate the training in advance and the current values of epoch, step, and loss will be printed. + +- Save the checkpoint file with the highest accuracy during training. 
+ + ```python + from mindspore.train.callback import Callback + + class SaveCallback(Callback): + def __init__(self, eval_model, ds_eval): + super(SaveCallback, self).__init__() + self.model = eval_model + self.ds_eval = ds_eval + self.acc = 0 + + def step_end(self, run_context): + cb_params = run_context.original_args() + result = self.model.eval(self.ds_eval) + if result['accuracy'] > self.acc: + self.acc = result['accuracy'] + file_name = str(self.acc) + ".ckpt" + save_checkpoint(save_obj=cb_params.train_network, ckpt_file_name=file_name) + print("Save the maximum accuracy checkpoint,the accuracy is", self.acc) + ``` + + The specific implementation principle is: define a callback object, and initialize the object to receive the model object and the ds_eval (verification dataset). Verify the accuracy of the model in the step_end phase. When the accuracy is the current highest, automatically trigger the save checkpoint method to save the current parameters. + +## MindSpore Metrics + +After the training is complete, you can use metrics to evaluate the training result. + +MindSpore provides multiple metrics, such as `accuracy`, `loss`, `tolerance`, `recall`, and `F1`. + +You can define a metrics dictionary object that contains multiple metrics and transfer them to the `model` object and use the `model.eval` function to verify the training result. + +> `metrics` sample code: +> +> + +```python +from mindspore import Model +import mindspore.nn as nn + +metrics = { + 'accuracy': nn.Accuracy(), + 'loss': nn.Loss(), + 'precision': nn.Precision(), + 'recall': nn.Recall(), + 'f1_score': nn.F1() +} +model = Model(network=net, loss_fn=net_loss, optimizer=net_opt, metrics=metrics) +result = model.eval(ds_eval) +``` + +The `model.eval` method returns a dictionary that contains the metrics and results transferred to the metrics. + +The callback function can also be used in the eval process, and the user can call the related API or customize the callback method to achieve the desired function. + +You can also define your own metrics class by inheriting the `Metric` base class and rewriting the `clear`, `update`, and `eval` methods. + +The `Accuracy` operator is used as an example to describe the internal implementation principle. + +The `Accuracy` inherits the `EvaluationBase` base class and rewrites the preceding three methods. + +- The `clear` method initializes related calculation parameters in the class. +- The `update` method accepts the predicted value and tag value and updates the internal variables of Accuracy. +- The `eval` method calculates related indicators and returns the calculation result. + +By invoking the `eval` method of `Accuracy`, you will obtain the calculation result. + +You can understand how `Accuracy` runs by using the following code: + +```python +from mindspore import Tensor +from mindspore.nn import Accuracy +import numpy as np + +x = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]])) +y = Tensor(np.array([1, 0, 1])) +metric = Accuracy() +metric.clear() +metric.update(x, y) +accuracy = metric.eval() +print('Accuracy is ', accuracy) +``` + +The output is as follows: + +```text +Accuracy is 0.6667 +``` + +## MindSpore Print Operator + +MindSpore-developed `Print` operator is used to print the tensors or character strings input by users. Multiple strings, multiple tensors, and a combination of tensors and strings are supported, which are separated by comma (,). The `Print` operator is only supported in Ascend environment. 
+The method of using the MindSpore `Print` operator is the same as using other operators. You need to assert MindSpore `Print` operator in `__init__` and invoke it using `construct`. The following is an example. + +```python +import numpy as np +from mindspore import Tensor +import mindspore.ops as ops +import mindspore.nn as nn +import mindspore.context as context + +context.set_context(mode=context.GRAPH_MODE) + +class PrintDemo(nn.Cell): + def __init__(self): + super(PrintDemo, self).__init__() + self.print = ops.Print() + + def construct(self, x, y): + self.print('print Tensor x and Tensor y:', x, y) + return x + +x = Tensor(np.ones([2, 1]).astype(np.int32)) +y = Tensor(np.ones([2, 2]).astype(np.int32)) +net = PrintDemo() +output = net(x, y) +``` + +The output is as follows: + +```text +print Tensor x and Tensor y: +Tensor(shape=[2, 1], dtype=Int32, value= +[[1] + [1]]) +Tensor(shape=[2, 2], dtype=Int32, value= +[[1 1] + [1 1]]) +``` + +## Data Dump Introduction + +When training the network, if the training result deviates from the expectation, the input and output of the operator can be saved for debugging through the data dump function. For detailed Dump function introduction, please refer to [Dump Mode](https://www.mindspore.cn/docs/programming_guide/en/master/dump_in_graph_mode.html#dump-introduction). + +### Synchronous Dump + +Synchronous Dump function usage reference [Synchronous Dump Step](https://www.mindspore.cn/docs/programming_guide/en/master/dump_in_graph_mode.html#synchronous-dump-step). + +### Asynchronous Dump + +Asynchronous Dump function usage reference [Asynchronous Dump Step](https://www.mindspore.cn/docs/programming_guide/en/master/dump_in_graph_mode.html#asynchronous-dump-step)。 + +## Running Data Recorder + +Running Data Recorder(RDR) is the feature MindSpore provides to record data while training program is running. If a running exception occurs in MindSpore, the pre-recorded data in MindSpore is automatically exported to assist in locating the cause of the running exception. Different exceptions will export different data, for instance, the occurrence of `Run task error` exception, the computational graph, execution sequence of the graph, memory allocation and other information will be exported to assist in locating the cause of the exception. + +> Not all run exceptions export data, and only partial exception exports are currently supported. +> +> Only supports the data collection of CPU/Ascend/GPU in the training scenario with the graph mode. + +### Usage + +#### Set RDR By Configuration File + +1. Create the configuration file `mindspore_config.json`. + + ```json + { + "rdr": { + "enable": true, + "mode": 1, + "path": "/path/to/rdr/dir" + } + } + ``` + + > enable: Controls whether the RDR is enabled. + > + > mode: Controls RDR data exporting mode. When mode is set to 1, RDR exports data only in exceptional scenario. When mode is set to 2, RDR exports data in exceptional or normal scenario. + > + > path: Set the path to which RDR stores data. Only absolute path is supported. + +2. Configure RDR via `context`. + + ```python + context.set_context(env_config_path="./mindspore_config.json") + ``` + +#### Set RDR By Environment Variables + +Set `export MS_RDR_ENABLE=1` to enable RDR, and set `export MS_RDR_MODE=1` or `export MS_RDR_MODE=2` to control exporting mode for RDR data, and set the root directory by `export MS_RDR_PATH=/path/to/root/dir` for recording data. The final directory for recording data is `/path/to/root/dir/rank_{RANK_ID}/rdr/`. 
`{RANK_ID}` is the unique ID for multi-cards training, the single card scenario defaults to `RANK_ID=0`. + +> The configuration file set by the user takes precedence over the environment variables. + +#### Exception Handling + +If MindSpore is used for training on Ascend 910, there is an exception `Run task error` in training. + +When we go to the directory for recording data, we can see several files appear in this directory, each file represents a kind of data. For example, `hwopt_d_before_graph_0.ir` is a computational graph file. You can use a text tool to open this file to view the calculational graph and analyze whether the calculational graph meets your expectations. + +#### Diagnosis Handling + +When enable RDR and set `export MS_RDR_MODE=2`, it is diagnostic mode. After Compiling graph, we also can see several files in above `MS_RDR_PATH` directory. the files are same with exception handling's. + +## Log-related Environment Variables and Configurations + +MindSpore uses glog to output logs. The following environment variables are commonly used: + +- `GLOG_v` + + The environment variable specifies the log level. After the log level is specified, the log information greater than or equal to this level will be output. The values are as follows: 0: DEBUG; 1: INFO; 2: WARNING; 3: ERROR; 4: CRITICAL. + The default value is 2, indicating the WARNING level. ERROR level indicates that an error occurred during program execution. The error log will be output and the program may not be terminated. CRITICAL level indicates that an exception occurs during program execution and the program execution will be terminated. + +- `GLOG_logtostderr` + + The environment variable specifies the log output mode. + When `GLOG_logtostderr` is set to 1, logs are output to the screen. If the value is set to 0, logs are output to a file. The default value is 1. + +- `GLOG_log_dir` + + The environment variable specifies the log output path. Log files will be saved to the path of `the_specified_directory/rank_${rank_id}/logs/`. During the distributed training, `rank_id` is the ID of the current device in the cluster. Otherwise, `rank_id` is `0`. + If `GLOG_logtostderr` is set to 0, value of this variable must be specified. + If `GLOG_log_dir` is specified and the value of `GLOG_logtostderr` is 1, logs are output to the screen but not to a file. + Logs of C++ and Python will be output to different files. The file name of C++ log complies with the naming rule of `GLOG` log file. Here, the name is `mindspore.MachineName.UserName.log.LogLevel.Timestamp`. The file name of Python log is `mindspore.log`. + `GLOG_log_dir` can only contains characters such as uppercase letters, lowercase letters, digits, "-", "_" and "/". + +- `GLOG_stderrthreshold` + + The log module will print logs to the screen when these logs are output to a file. This environment variable is used to control the log level printed to the screen in this scenario. + The default value is 2, indicating the WARNING level. The values are as follows: 0: DEBUG; 1: INFO; 2: WARNING; 3: ERROR; 4: CRITICAL. + +- `MS_SUBMODULE_LOG_v` + + The environment variable specifies log levels of C++ sub modules of MindSpore. + The environment variable is assigned as: `MS_SUBMODULE_LOG_v="{SubModule1:LogLevel1,SubModule2:LogLevel2,...}"`. + The specified sub module log level will overwrite the global log level. The meaning of sub module log level is the same as `GLOG_v`, the sub modules of MindSpore are categorized by source directory is shown in the below table. + E.g. 
+    For example, when `GLOG_v=1 MS_SUBMODULE_LOG_v="{PARSER:2,ANALYZER:2}"` is set, the log levels of `PARSER` and `ANALYZER` are WARNING, while the log levels of the other modules are INFO.
+
+Sub modules of MindSpore grouped by source directory:
+
+| Source Files | Sub Module Name |
+| ------------ | --------------- |
+| mindspore/ccsrc/backend/kernel_compiler | KERNEL |
+| mindspore/ccsrc/backend/optimizer | PRE_ACT |
+| mindspore/ccsrc/backend/session | SESSION |
+| mindspore/ccsrc/common | COMMON |
+| mindspore/ccsrc/debug | DEBUG |
+| mindspore/ccsrc/frontend/operator | ANALYZER |
+| mindspore/ccsrc/frontend/optimizer | OPTIMIZER |
+| mindspore/ccsrc/frontend/parallel | PARALLEL |
+| mindspore/ccsrc/minddata/dataset | MD |
+| mindspore/ccsrc/minddata/mindrecord | MD |
+| mindspore/ccsrc/pipeline/jit/*.cc | PIPELINE |
+| mindspore/ccsrc/pipeline/jit/parse | PARSER |
+| mindspore/ccsrc/pipeline/jit/static_analysis | ANALYZER |
+| mindspore/ccsrc/pipeline/pynative | PYNATIVE |
+| mindspore/ccsrc/profiler | PROFILER |
+| mindspore/ccsrc/pybind_api | COMMON |
+| mindspore/ccsrc/runtime/device | DEVICE |
+| mindspore/ccsrc/transform/graph_ir | GE_ADPT |
+| mindspore/ccsrc/transform/express_ir | EXPRESS |
+| mindspore/ccsrc/utils | UTILS |
+| mindspore/ccsrc/vm | VM |
+| mindspore/ccsrc | ME |
+| mindspore/core/gvar | COMMON |
+| mindspore/core/ | CORE |
+
+- `GLOG_log_max`
+
+    It is used to control the size of the MindSpore C++ module log files. The default maximum is 50MB. You can change the default maximum value of the log file through this environment variable. If the currently written log file exceeds the maximum value, the newly output log content will be written to a new log file.
+
+- `logger_maxBytes`
+
+    It is used to control the size of the MindSpore Python module log file. The default is 52428800 bytes.
+
+- `logger_backupCount`
+
+    It is used to control the number of MindSpore Python module log files. The default is 30.
+
+> glog does not support log rotation. To control the disk space occupied by log files, use the log file management tool provided by the operating system, for example, logrotate on Linux.
diff --git a/tutorials/experts/source_en/debug/dataset_autotune.md b/tutorials/experts/source_en/debug/dataset_autotune.md
new file mode 100644
index 0000000000000000000000000000000000000000..f95dae5b03c0c76ec65039951312c9d3ed48320b
--- /dev/null
+++ b/tutorials/experts/source_en/debug/dataset_autotune.md
@@ -0,0 +1,225 @@
+# Dataset AutoTune for Dataset Pipeline
+
+`Ascend` `GPU` `Data Preparation`
+
+
+
+## Overview
+
+MindSpore provides a tool named Dataset AutoTune for optimizing the dataset pipeline.
+Dataset AutoTune can automatically tune dataset pipelines to improve performance.
+
+This feature can automatically detect a bottleneck operator in the dataset pipeline and respond by automatically adjusting tunable parameters for dataset ops, like increasing the number of parallel workers or updating the prefetch size of dataset ops.
+
+![autotune](images/autotune.png)
+
+With Dataset AutoTune enabled, MindSpore will sample dataset statistics at a given interval, which is tuneable by the user.
+
+Once Dataset AutoTune collects enough information, it will analyze whether the performance bottleneck is on the dataset side or not.
+If so, it will adjust the parallelism and speed up the dataset pipeline.
+If not, Dataset AutoTune will also try to reduce the memory usage of the dataset pipeline to release memory for the CPU.
+
+> Dataset AutoTune is disabled by default.
+
+## Enable Dataset AutoTune
+
+To enable Dataset AutoTune and not save the optimized dataset pipeline:
+
+```python
+import mindspore.dataset as ds
+ds.config.set_enable_autotune(True)
+```
+
+To enable Dataset AutoTune and also save the optimized dataset pipeline in a configuration file:
+
+```python
+import mindspore.dataset as ds
+ds.config.set_enable_autotune(True, "/path/to/autotune_out.json")
+```
+
+## Tuning Interval for Dataset AutoTune
+
+The frequency at which Dataset AutoTune adjusts the dataset pipeline can be customized.
+To set the tuning interval in steps:
+
+```python
+import mindspore.dataset as ds
+ds.config.set_autotune_interval(100)
+```
+
+> To set the tuning interval to be after every epoch, set the tuning interval to 0.
+
+To query the tuning interval for dataset pipeline autotuning:
+
+```python
+import mindspore.dataset as ds
+print("tuning interval:", ds.config.get_autotune_interval())
+```
+
+## Constraints
+
+- Dataset Profiling and Dataset AutoTune cannot be enabled concurrently, otherwise neither Dataset AutoTune nor Profiling will work properly. If both of them are enabled at the same time, a warning message will prompt the user to check whether it is a mistake. Please make sure Profiling is disabled when using Dataset AutoTune.
+- When [Offload for Dataset](https://www.mindspore.cn/docs/programming_guide/en/master/enable_dataset_offload.html) and Dataset AutoTune are enabled simultaneously, if any dataset node has been offloaded for hardware acceleration, the optimized dataset pipeline configuration file will not be written and a warning will be logged, because the dataset pipeline that is actually running is not the predefined one.
+- If the dataset pipeline contains a node that does not support deserialization (e.g. user-defined Python functions, GeneratorDataset), any attempt to deserialize the saved optimized dataset pipeline configuration file will report an error. In this case, it is recommended to open the pipeline configuration file and modify the dataset pipeline script manually.
+
+## Example
+
+Take ResNet training as an example.
+
+### Dataset AutoTune Config
+
+To enable Dataset AutoTune, only one statement is needed.
+
+```python
+# dataset.py of ResNet in ModelZoo
+# models/official/cv/resnet/src/dataset.py
+
+def create_dataset(...):
+    """
+    create dataset for train or test
+    """
+    # enable Dataset AutoTune
+    ds.config.set_enable_autotune(True, "/path/to/autotune_out.json")
+
+    # define dataset
+    data_set = ds.Cifar10Dataset(data_path)
+    ...
+```
+
+### Start Training
+
+Start the training process as described in [resnet/README.md](https://gitee.com/mindspore/models/blob/master/official/cv/resnet/README.md). Dataset AutoTune will display its analysis result through LOG messages.
+
+```text
+[INFO] [auto_tune.cc:73 LaunchThread] Launching Dataset AutoTune thread
+[INFO] [auto_tune.cc:35 Main] Dataset AutoTune thread has started.
+[INFO] [auto_tune.cc:191 RunIteration] Run Dataset AutoTune at epoch #1
+[INFO] [auto_tune.cc:203 RecordPipelineTime] Epoch #1, Average Pipeline time is 21.6624 ms. The avg pipeline time for all epochs is 21.6624ms
+[INFO] [auto_tune.cc:231 IsDSaBottleneck] Epoch #1, Device Connector Size: 0.0224, Connector Capacity: 1, Utilization: 2.24%, Empty Freq: 97.76%
+epoch: 1 step: 1875, loss is 1.1544309
+epoch time: 72110.166 ms, per step time: 38.459 ms
+
+[WARNING] [auto_tune.cc:236 IsDSaBottleneck] Utilization: 2.24% < 75% threshold, dataset pipeline performance needs tuning.
+[WARNING] [auto_tune.cc:297 Analyse] Op (MapOp(ID:3)) is slow, input connector utilization=0.975806, output connector utilization=0.298387, diff= 0.677419 > 0.35 threshold. +[WARNING] [auto_tune.cc:253 RequestNumWorkerChange] Added request to change "num_parallel_workers" of Operator: MapOp(ID:3)From old value: [2] to new value: [4]. +[WARNING] [auto_tune.cc:309 Analyse] Op (BatchOp(ID:2)) getting low average worker cpu utilization 1.64516% < 35% threshold. +[WARNING] [auto_tune.cc:263 RequestConnectorCapacityChange] Added request to change "prefetch_size" of Operator: BatchOp(ID:2)From old value: [1] to new value: [5]. +epoch: 2 step: 1875, loss is 0.64530635 +epoch time: 24519.360 ms, per step time: 13.077 ms + +[WARNING] [auto_tune.cc:236 IsDSaBottleneck] Utilization: 0.0213516% < 75% threshold, dataset pipeline performance needs tuning. +[WARNING] [auto_tune.cc:297 Analyse] Op (MapOp(ID:3)) is slow, input connector utilization=1, output connector utilization=0, diff= 1 > 0.35 threshold. +[WARNING] [auto_tune.cc:253 RequestNumWorkerChange] Added request to change "num_parallel_workers" of Operator: MapOp(ID:3)From old value: [4] to new value: [6]. +[WARNING] [auto_tune.cc:309 Analyse] Op (BatchOp(ID:2)) getting low average worker cpu utilization 4.39062% < 35% threshold. +[WARNING] [auto_tune.cc:263 RequestConnectorCapacityChange] Added request to change "prefetch_size" of Operator: BatchOp(ID:2)From old value: [5] to new value: [9]. +epoch: 3 step: 1875, loss is 0.9806979 +epoch time: 17116.234 ms, per step time: 9.129 ms + +... + +[INFO] [profiling.cc:703 Stop] MD Autotune is stopped. +[INFO] [auto_tune.cc:52 Main] Dataset AutoTune thread is finished. +[INFO] [auto_tune.cc:53 Main] Printing final tree configuration +[INFO] [auto_tune.cc:66 PrintTreeConfiguration] CifarOp(ID:5) num_parallel_workers: 2 prefetch_size: 2 +[INFO] [auto_tune.cc:66 PrintTreeConfiguration] MapOp(ID:4) num_parallel_workers: 1 prefetch_size: 2 +[INFO] [auto_tune.cc:66 PrintTreeConfiguration] MapOp(ID:3) num_parallel_workers: 10 prefetch_size: 2 +[INFO] [auto_tune.cc:66 PrintTreeConfiguration] BatchOp(ID:2) num_parallel_workers: 8 prefetch_size: 17 +[INFO] [auto_tune.cc:55 Main] Suggest to set proper num_parallel_workers for each Operation or use global setting API: mindspore.dataset.config.set_num_parallel_workers +[INFO] [auto_tune.cc:57 Main] Suggest to choose maximum prefetch_size from tuned result and set by global setting API: mindspore.dataset.config.set_prefetch_size +``` + +Some analysis to explain the meaning of the log information: + +- **How to check process of Dataset AutoTune:** + + Dataset AutoTune displays common status log information at INFO level. However, when AutoTune detects a bottleneck in the dataset pipeline, it will try to modify the parameters of dataset pipeline ops, and display this analysis log information at WARNING level. + +- **How to read LOG messages:** + + The initial configuration of the dataset pipeline is suboptimal (Utilization Device Connector is low). + + ```text + [INFO] [auto_tune.cc:231 IsDSaBottleneck] Epoch #1, Device Connector Size: 0.0224, Connector Capacity: 1, Utilization: 2.24%, Empty Freq: 97.76% + [WARNING] [auto_tune.cc:236 IsDSaBottleneck] Utilization: 2.24% < 75% threshold, dataset pipeline performance needs tuning. + ``` + + Then, Dataset AutoTune increases the number of parallel workers from 2 to 4 for MapOp(ID:3) and increases the prefetch size from 1 to 5 for BatchOp(ID:2). 
+ + ```text + [WARNING] [auto_tune.cc:297 Analyse] Op (MapOp(ID:3)) is slow, input connector utilization=0.975806, output connector utilization=0.298387, diff= 0.677419 > 0.35 threshold. + [WARNING] [auto_tune.cc:253 RequestNumWorkerChange] Added request to change "num_parallel_workers" of Operator: MapOp(ID:3)From old value: [2] to new value: [4]. + [WARNING] [auto_tune.cc:309 Analyse] Op (BatchOp(ID:2)) getting low average worker cpu utilization 1.64516% < 35% threshold. + [WARNING] [auto_tune.cc:263 RequestConnectorCapacityChange] Added request to change "prefetch_size" of Operator: BatchOp(ID:2)From old value: [1] to new value: [5]. + ``` + + After tuning the configuration of the dataset pipeline, the step time is reduced. + + ```text + epoch: 1 step: 1875, loss is 1.1544309 + epoch time: 72110.166 ms, per step time: 38.459 ms + epoch: 2 step: 1875, loss is 0.64530635 + epoch time: 24519.360 ms, per step time: 13.077 ms + epoch: 3 step: 1875, loss is 0.9806979 + epoch time: 17116.234 ms, per step time: 9.129 ms + ``` + + At the end of training, an improved configuration is created by Dataset AutoTune. + For num_parallel_workers, Dataset AutoTune suggests to set new value for each Operation or using global setting API. + For prefetch_size, Dataset AutoTune suggests to choose the maximum value and set by global setting API. + + ```text + [INFO] [auto_tune.cc:66 PrintTreeConfiguration] CifarOp(ID:5) num_parallel_workers: 2 prefetch_size: 2 + [INFO] [auto_tune.cc:66 PrintTreeConfiguration] MapOp(ID:4) num_parallel_workers: 1 prefetch_size: 2 + [INFO] [auto_tune.cc:66 PrintTreeConfiguration] MapOp(ID:3) num_parallel_workers: 10 prefetch_size: 2 + [INFO] [auto_tune.cc:66 PrintTreeConfiguration] BatchOp(ID:2) num_parallel_workers: 8 prefetch_size: 17 + [INFO] [auto_tune.cc:55 Main] Suggest to set proper num_parallel_workers for each Operation or use global setting API: mindspore.dataset.config.set_num_parallel_workers + [INFO] [auto_tune.cc:57 Main] Suggest to choose maximum prefetch_size from tuned result and set by global setting API: mindspore.dataset.config.set_prefetch_size + ``` + +### The Saved AutoTune Recommended Configuration + +Since Dataset AutoTune was enabled to generate an optimized dataset pipeline, a JSON serialization of the +dataset pipeline can be saved (by passing in the `json_filepath` parameter) in a configuration file. + +Example of the JSON configuration file: + +```text +{ + "remark": "The following file has been auto-generated by the Dataset AutoTune.", + "summary": [ + "CifarOp(ID:5) (num_parallel_workers: 2, prefetch_size: 2)", + "MapOp(ID:4) (num_parallel_workers: 1, prefetch_size: 2)", + "MapOp(ID:3) (num_parallel_workers: 10, prefetch_size: 2)", + "BatchOp(ID:2) (num_parallel_workers: 8, prefetch_size: 17)" + ], + "tree": {...} +} +``` + +The file starts with a summary of the configuration and then is followed by the actual pipeline (`tree`). The file is +loadable using the deserialization API `mindspore.dataset.deserialize`. + +Notes on the JSON configuration file: + +- Non-parallel dataset operations will show `NA` for `num_parallel_workers`. + +### Before Next Training + +Before starting the next training process, users can apply the recommended configuration changes to the dataset Python scripts. 
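+
+For instance, based on the tuned summary shown above, the per-operation workers and the global prefetch size could be applied manually in the dataset script, roughly as in the following sketch (`data_path`, the transform list and the batch size are placeholders, and the numbers are taken from the example summary):
+
+```python
+import mindspore.dataset as ds
+import mindspore.dataset.vision.c_transforms as c_vision
+
+# Use the maximum tuned prefetch_size as the global setting.
+ds.config.set_prefetch_size(17)
+
+data_path = "/path/to/cifar-10-batches-bin"  # placeholder
+
+# Apply the tuned num_parallel_workers to each operation.
+data_set = ds.Cifar10Dataset(data_path, num_parallel_workers=2)
+data_set = data_set.map(operations=[c_vision.HWC2CHW()], num_parallel_workers=10)
+data_set = data_set.batch(32, num_parallel_workers=8)
+```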
+ +If Dataset AutoTune generated an optimized pipeline configuration file, use deserialize support to load the dataset pipeline: + +```python +import mindspore.dataset as ds +ds.deserialize(json_filepath="/path/to/autotune_out.json") +``` + +This allows the dataset pipeline to be run at an improved speed from the beginning of the training process. + +By the way, MindSpore also provides APIs to set the global value of num_parallel_workers and prefetch_size. + +Please refer to [mindspore.dataset.config](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.config.html): + +- [mindspore.dataset.config.set_num_parallel_workers](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.config.html#mindspore.dataset.config.set_num_parallel_workers) +- [mindspore.dataset.config.set_prefetch_size](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.dataset.config.html#mindspore.dataset.config.set_prefetch_size) + diff --git a/tutorials/experts/source_en/debug/debug_in_pynative_mode.md b/tutorials/experts/source_en/debug/debug_in_pynative_mode.md new file mode 100644 index 0000000000000000000000000000000000000000..07a123708db73021fa679ef641f9200add136756 --- /dev/null +++ b/tutorials/experts/source_en/debug/debug_in_pynative_mode.md @@ -0,0 +1,836 @@ +# Debugging in PyNative Mode + +`Ascend` `GPU` `CPU` `Model Running` + + + +## Overview + +MindSpore supports the following running modes which are optimized for debugging or running: + +- PyNative mode: dynamic graph mode. In this mode, operators in the neural network are delivered and executed one by one, facilitating the compilation and debugging of the neural network model. +- Graph mode: static graph mode. In this mode, the neural network model is compiled into an entire graph and then delivered for execution. This mode uses technologies such as graph optimization to improve the running performance and facilitates large-scale deployment and cross-platform running. + +By default, MindSpore is in Graph mode. You can switch it to PyNative mode by calling `context.set_context(mode=context.PYNATIVE_MODE)`. Similarly, MindSpore in PyNative mode can be switched to Graph mode through `context.set_context(mode=context.GRAPH_MODE)`. + +In PyNative mode, single operators, common functions, network inference, and separated gradient calculation can be executed. The following describes the usage and precautions. + +> In PyNative mode, operators are executed asynchronously on the device to improve performance. Therefore, when an error occurs during operator execution, the error information may be displayed after the program is executed. Therefore, in PyNative mode, a pynative_synchronize setting is added to control whether operators are executed asynchronously on the device. +> +> In the following example, the parameter initialization uses random values, and the output results in specific execution may be different from the results of local execution; if you need to stabilize the output of a fixed value, you can set a fixed random seed. For the setting method, please refer to [mindspore.set_seed()](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore/mindspore.set_seed.html). + +## Executing a Single Operator + +Execute a single operator and output the result, as shown in the following example. 
+ +```python +import numpy as np +import mindspore.nn as nn +from mindspore import context, Tensor + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +conv = nn.Conv2d(3, 4, 3, bias_init='zeros') +input_data = Tensor(np.ones([1, 3, 5, 5]).astype(np.float32)) +output = conv(input_data) +print(output.asnumpy()) +``` + +Output: + +```text +[[[[-0.02190447 -0.05208071 -0.05208071 -0.05208071 -0.06265172] +[-0.01529094 -0.05286242 -0.05286242 -0.05286242 -0.04228776] +[-0.01529094 -0.05286242 -0.05286242 -0.05286242 -0.04228776] +[-0.01529094 -0.05286242 -0.05286242 -0.05286242 -0.04228776] +[-0.01430791 -0.04892948 -0.04892948 -0.04892948 -0.01096004]] + +[[ 0.00802889 -0.00229866 -0.00229866 -0.00229866 -0.00471579] +[ 0.01172971 0.02172665 0.02172665 0.02172665 0.03261888] +[ 0.01172971 0.02172665 0.02172665 0.02172665 0.03261888] +[ 0.01172971 0.02172665 0.02172665 0.02172665 0.03261888] +[ 0.01784375 0.01185635 0.01185635 0.01185635 0.01839031]] + +[[ 0.04841832 0.03321705 0.03321705 0.03321705 0.0342317 ] +[ 0.0651359 0.04310361 0.04310361 0.04310361 0.03355784] +[ 0.0651359 0.04310361 0.04310361 0.04310361 0.03355784] +[ 0.0651359 0.04310361 0.04310361 0.04310361 0.03355784] +[ 0.04680437 0.03465693 0.03465693 0.03465693 0.00171057]] + +[[-0.01783456 -0.00459451 -0.00459451 -0.00459451 0.02316688] +[ 0.01295831 0.00879035 0.00879035 0.00879035 0.01178642] +[ 0.01295831 0.00879035 0.00879035 0.00879035 0.01178642] +[ 0.01295831 0.00879035 0.00879035 0.00879035 0.01178642] +[ 0.05016355 0.03958241 0.03958241 0.03958241 0.03443141]]]] +``` + +## Executing a Common Function + +Combine multiple operators into a function, call the function to execute the operators, and output the result, as shown in the following example: + +Example Code: + +```python +import numpy as np +from mindspore import context, Tensor +import mindspore.ops as ops + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +def add_func(x, y): + z = ops.add(x, y) + z = ops.add(z, x) + return z + +x = Tensor(np.ones([3, 3], dtype=np.float32)) +y = Tensor(np.ones([3, 3], dtype=np.float32)) +output = add_func(x, y) +print(output.asnumpy()) +``` + +Output: + +```text +[[3. 3. 3.] + [3. 3. 3.] + [3. 3. 3.]] +``` + +> Summary is not supported in PyNative mode, so summary related operators cannot be used. + +### Improving PyNative Performance + +MindSpore provides the Staging function to improve the execution speed of inference tasks in PyNative mode. This function compiles Python functions or Python class methods into computational graphs in PyNative mode and improves the execution speed by using graph optimization technologies, as shown in the following example: + +```python +import numpy as np +import mindspore.nn as nn +from mindspore import context, Tensor +import mindspore.ops as ops +from mindspore import ms_function + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +class TensorAddNet(nn.Cell): + def __init__(self): + super(TensorAddNet, self).__init__() + self.add = ops.Add() + + @ms_function + def construct(self, x, y): + res = self.add(x, y) + return res + +x = Tensor(np.ones([4, 4]).astype(np.float32)) +y = Tensor(np.ones([4, 4]).astype(np.float32)) +net = TensorAddNet() + +z = net(x, y) # Staging mode +add = ops.Add() +res = add(x, z) # PyNative mode +print(res.asnumpy()) +``` + +Output: + +```text +[[3. 3. 3. 3.] + [3. 3. 3. 3.] + [3. 3. 3. 3.] + [3. 3. 3. 
3.]] +``` + +In the preceding code, the `ms_function` decorator is added before `construct` of the `TensorAddNet` class. The decorator compiles the `construct` method into a computational graph. After the input is given, the graph is delivered and executed, `add` in the preceding code is executed in the common PyNative mode. + +It should be noted that, in a function to which the `ms_function` decorator is added, if an operator (such as `pooling` or `add`) that does not need parameter training is included, the operator can be directly called in the decorated function, as shown in the following example: + +Example Code: + +```python +import numpy as np +import mindspore.nn as nn +from mindspore import context, Tensor +import mindspore.ops as ops +from mindspore import ms_function + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +add = ops.Add() + +@ms_function +def add_fn(x, y): + res = add(x, y) + return res + +x = Tensor(np.ones([4, 4]).astype(np.float32)) +y = Tensor(np.ones([4, 4]).astype(np.float32)) +z = add_fn(x, y) +print(z.asnumpy()) +``` + +Output: + +```text +[[2. 2. 2. 2.] + [2. 2. 2. 2.] + [2. 2. 2. 2.] + [2. 2. 2. 2.]] +``` + +If the decorated function contains operators (such as `Convolution` and `BatchNorm`) that require parameter training, these operators must be instantiated before the decorated function is called, as shown in the following example: + +Example Code: + +```python +import numpy as np +import mindspore.nn as nn +from mindspore import context, Tensor +from mindspore import ms_function + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +conv_obj = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, stride=2, padding=0) +conv_obj.init_parameters_data() +@ms_function +def conv_fn(x): + res = conv_obj(x) + return res + +input_data = np.random.randn(2, 3, 6, 6).astype(np.float32) +z = conv_fn(Tensor(input_data)) +print(z.asnumpy()) +``` + +Output: + +```text +[[[[ 0.10377571 -0.0182163 -0.05221086] +[ 0.1428334 -0.01216263 0.03171652] +[-0.00673915 -0.01216291 0.02872104]] + +[[ 0.02906547 -0.02333629 -0.0358406 ] +[ 0.03805163 -0.00589525 0.04790922] +[-0.01307234 -0.00916951 0.02396654]] + +[[ 0.01477884 -0.06549098 -0.01571796] +[ 0.00526886 -0.09617482 0.04676902] +[-0.02132788 -0.04203424 0.04523344]] + +[[ 0.04590619 -0.00251453 -0.00782715] +[ 0.06099087 -0.03445276 0.00022781] +[ 0.0563223 -0.04832596 -0.00948266]]] + +[[[ 0.08444098 -0.05898955 -0.039262 ] +[ 0.08322686 -0.0074796 0.0411371 ] +[-0.02319113 0.02128408 -0.01493311]] + +[[ 0.02473745 -0.02558945 -0.0337843 ] +[-0.03617039 -0.05027632 -0.04603915] +[ 0.03672804 0.00507637 -0.08433761]] + +[[ 0.09628943 0.01895323 -0.02196114] +[ 0.04779419 -0.0871575 0.0055248 ] +[-0.04382382 -0.00511185 -0.01168541]] + +[[ 0.0534859 0.02526264 0.04755395] +[-0.03438103 -0.05877855 0.06530266] +[ 0.0377498 -0.06117418 0.00546303]]]] +``` + +## Debugging Network Train Model + +In PyNative mode, the gradient can be calculated separately. As shown in the following example, `GradOperation` is used to calculate all input gradients of the function or the network. Note that the inputs have to be Tensor. 
+ +Example Code: + +```python +import mindspore.ops as ops +import mindspore.context as context +from mindspore import dtype as mstype +from mindspore import Tensor + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +def mul(x, y): + return x * y + +def mainf(x, y): + return ops.GradOperation(get_all=True)(mul)(x, y) + +print(mainf(Tensor(1, mstype.int32), Tensor(2, mstype.int32))) +``` + +Output: + +```text +(Tensor(shape=[], dtype=Int32, value=2), Tensor(shape=[], dtype=Int32, value=1)) +``` + +During network training, obtain the gradient, call the optimizer to optimize parameters (the breakpoint cannot be set during the reverse gradient calculation), and calculate the loss values. Then, network training is implemented in PyNative mode. + +Complete LeNet Sample Code: + +```python +import numpy as np +import mindspore.nn as nn +import mindspore.ops as ops +from mindspore import dtype as mstype +from mindspore import context, Tensor, ParameterTuple +from mindspore.common.initializer import TruncatedNormal +from mindspore.nn import Dense, WithLossCell, SoftmaxCrossEntropyWithLogits, Momentum + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +def conv(in_channels, out_channels, kernel_size, stride=1, padding=0): + """weight initial for conv layer""" + weight = weight_variable() + return nn.Conv2d(in_channels, out_channels, + kernel_size=kernel_size, stride=stride, padding=padding, + weight_init=weight, has_bias=False, pad_mode="valid") + +def fc_with_initialize(input_channels, out_channels): + """weight initial for fc layer""" + weight = weight_variable() + bias = weight_variable() + return nn.Dense(input_channels, out_channels, weight, bias) + +def weight_variable(): + """weight initial""" + return TruncatedNormal(0.02) + + +class LeNet5(nn.Cell): + """ + Lenet network + Args: + num_class (int): Num classes. Default: 10. 
+ + Returns: + Tensor, output tensor + + Examples: + >>> LeNet(num_class=10) + """ + def __init__(self, num_class=10): + super(LeNet5, self).__init__() + self.num_class = num_class + self.batch_size = 32 + self.conv1 = conv(1, 6, 5) + self.conv2 = conv(6, 16, 5) + self.fc1 = fc_with_initialize(16 * 5 * 5, 120) + self.fc2 = fc_with_initialize(120, 84) + self.fc3 = fc_with_initialize(84, self.num_class) + self.relu = nn.ReLU() + self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2) + self.reshape = ops.Reshape() + + def construct(self, x): + x = self.conv1(x) + x = self.relu(x) + x = self.max_pool2d(x) + x = self.conv2(x) + x = self.relu(x) + x = self.max_pool2d(x) + x = self.reshape(x, (self.batch_size, -1)) + x = self.fc1(x) + x = self.relu(x) + x = self.fc2(x) + x = self.relu(x) + x = self.fc3(x) + return x + + +class GradWrap(nn.Cell): + """ GradWrap definition """ + def __init__(self, network): + super(GradWrap, self).__init__(auto_prefix=False) + self.network = network + self.weights = ParameterTuple(filter(lambda x: x.requires_grad, network.get_parameters())) + + def construct(self, x, label): + weights = self.weights + return ops.GradOperation(get_by_list=True)(self.network, weights)(x, label) + +net = LeNet5() +optimizer = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.1, 0.9) +criterion = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean') +net_with_criterion = WithLossCell(net, criterion) +train_network = GradWrap(net_with_criterion) +train_network.set_train() + +input_data = Tensor(np.ones([net.batch_size, 1, 32, 32]).astype(np.float32) * 0.01) +label = Tensor(np.ones([net.batch_size]).astype(np.int32)) +output = net(Tensor(input_data)) +loss_output = criterion(output, label) +grads = train_network(input_data, label) +success = optimizer(grads) +loss = loss_output.asnumpy() +print(loss) +``` + +Output: + +```text +2.3050091 +``` + +In the preceding execution, an intermediate result of network execution can be obtained at any required place in `construt` function, and the network can be debugged by using the Python Debugger (pdb). + +## Synchronous Execution Under PyNative + +In PyNative mode, the operators are executed asynchronously by default. You can control whether to execute asynchronously by setting the context. When the operator fails to execute, you can easily see the error code location through the call stack. + +Set context pynative_synchronize to True: + +```python +context.set_context(pynative_synchronize=True) +``` + +Example Code: + +```python +import numpy as np +import mindspore.context as context +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import dtype as mstype +import mindspore.ops as ops + +context.set_context(mode=context.PYNATIVE_MODE, device_target="Ascend", pynative_synchronize=True) + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.get_next = ops.GetNext([mstype.float32], [(1, 1)], 1, "test") + + def construct(self, x1,): + x = self.get_next() + x = x + x1 + return x + +context.set_context() +x1 = np.random.randn(1, 1).astype(np.float32) +net = Net() +output = net(Tensor(x1)) +print(output.asnumpy()) +``` + +Output: you can see the complete call stack. 
+ +```text +Traceback (most recent call last): + File "test_pynative_sync_control.py", line 41, in + output = net(Tensor(x1)) + File "mindspore/mindspore/nn/cell.py", line 406, in + output = self.run_construct(cast_inputs, kwargs) + File "mindspore/mindspore/nn/cell.py", line 348, in + output = self.construct(*cast_inputs, **kwargs) + File "test_pynative_sync_control.py", line 33, in + x = self.get_next() + File "mindspore/mindspore/ops/primitive.py", line 247, in + return _run_op(self, self.name, args) + File "mindspore/mindspore/common/api.py", line 77, in + results = fn(*arg, **kwargs) + File "mindspore/mindspore/ops/primitive.py", line 677, in _run_op + output = real_run_op(obj, op_name, args) +RuntimeError: mindspore/ccsrc/runtime/device/kernel_runtime.cc:1006 DebugStreamSync] Op Default/GetNext-op0 run failed! +``` + +## Hook + +Debugging deep learning network is a task that practitioners in every field of deep learning need to face and invest a lot of energy. Because the deep learning network hides the input, output data and gradients of the middle layer operator and only provides the gradients of the network input data (feature data and weight), developers can not accurately perceive the data changes of the middle layer operator, which affects the debugging efficiency. In order to facilitate developers to accurately and quickly debug the deep learning network, MindSpore designed the hook function in PyNative mode. Developers can use the hook function to capture the input, output data and gradients of the middle layer operator. At present, PyNative mode provides four forms of hook functions: HookBackward operator and the register_forward_pre_hook function, the register_forward_hook function, the register_backward_hook function for Cell object. + +### HookBackward operator + +HookBackward implements the hook function as an operator. The user initializes a HookBackward operator and inserts it into the position where the gradient needs to be captured in the deep learning network. When the network is executing forward process, the HookBackward operator outputs the input data as it is without any modification; When the network back propagates gradient, the hook function registered on HookBackward operator will capture the gradient back propagated to this point. You can customize the gradient operation in the hook function, such as printing the gradient or returning a new gradient. + +Example Code: + +```python +import mindspore +from mindspore import ops +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE) + +def hook_fn(grad_out): + print(grad_out) + +grad_all = GradOperation(get_all=True) +hook = ops.HookBackward(hook_fn) +def hook_test(x, y): + z = x * y + z = hook(z) + z = z * y + return z + +def net(x, y): + return grad_all(hook_test)(x, y) + +output = net(Tensor(1, mindspore.float32), Tensor(2, mindspore.float32)) +print(output) +``` + +Output: + +```python +(Tensor(shape=[], dtype=Float32, value= 2),) +(Tensor(shape=[], dtype=Float32, value= 4), Tensor(shape=[], dtype=Float32, value= 4)) +``` + +For more descriptions of HookBackward operator, please refer to [API document](https://mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.HookBackward.html). 
+ +### The register_forward_pre_hook function for Cell object + +Users can register a custom hook function by using the `register_forward_pre_hook` function on the Cell object, which is used to capture the forward data passed to this Cell object. This function does not work in graph mode or on Cell object decorated with `ms_function` . The `register_forward_pre_hook` function receives an user-defined hook function as the input parameter and returns a `handle` object corresponding to the hook function. Users can delete the corresponding hook function by calling the `remove()` function of the `handle` object. Each time the `register_forward_pre_hook` function is called, a different `handle` object is returned. The hook function should be defined as follows. + +Example Code: + +```python +def forward_pre_hook_fn(cell_id, inputs): + print("forward inputs: ", inputs) +``` + +The `cell_id` is the name and ID information of the Cell object, the `inputs` are the forward data passed to the Cell object. Therefore, users can use the `register_forward_pre_hook` function to capture the forward input data of a Cell object in the network. You can customize the operation of input data in the hook function, such as checking data, printing data, or returning new input data to the current Cell object. If the original input data of the Cell object is calculated in the hook function and then returned as new input data, these new calculation operations will act on the back propagation of the gradient. + +Example Code: + +```python +import numpy as np +import mindspore +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE) + +def forward_pre_hook_fn(cell_id, inputs): + print("forward inputs: ", inputs) + input_x = inputs[0] + inputs[1] + return input_x, inputs[1] + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.mul = nn.MatMul() + self.handle = self.mul.register_forward_pre_hook(forward_pre_hook_fn) + + def construct(self, x, y): + x = x + x + x = self.mul(x, y) + return x + +grad = GradOperation(get_all=True) +net = Net() +output = net(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(output) +gradient = grad(net)(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(gradient) +net.handle.remove() +gradient = grad(net)(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(gradient) +``` + +Output: + +```python +forward inputs: (Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 1.00000000e+00])) +3.0 +forward inputs: (Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 1.00000000e+00])) +(Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 4.00000000e+00])) +(Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00])) +``` + +If user returns the newly created data directly in the hook function instead of returning the calculated data from the original input data, the back propagation of the gradient will be cut off on this Cell object. 
+ +Example Code: + +```python +import numpy as np +import mindspore +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE) + +def forward_pre_hook_fn(cell_id, inputs): + print("forward inputs: ", inputs) + return Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32)) + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.mul = nn.MatMul() + self.handle = self.mul.register_forward_pre_hook(forward_pre_hook_fn) + + def construct(self, x, y): + x = x + x + x = self.mul(x, y) + return x + +grad = GradOperation(get_all=True) +net = Net() +gradient = grad(net)(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(gradient) +``` + +Output: + +```python +forward inputs: (Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 1.00000000e+00])) +(Tensor(shape=[1], dtype=Float32, value= [ 0.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 0.00000000e+00])) +``` + +In order to avoid script failure when switching to graph mode, it is not recommended to call the `register_forward_pre_hook` function or the `remove()` function of the `handle` object in the `construct` function of the Cell object. In the PyNative mode, if the `register_forward_pre_hook` function is called in the `construct` function of the Cell object, a hook function will be added at each run time of Cell object. + +More about the `register_forward_pre_hook` interface, please refer to [API Document](https://mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.Cell.html#mindspore.nn.Cell.register_forward_pre_hook). + +### The register_forward_hook function for Cell object + +Users can register a custom hook function by using the `register_forward_hook` function on the Cell object, which is used to capture the forward input and output data passed to this Cell object. This function does not work in graph mode or on Cell object decorated with `ms_function` . The `register_forward_hook` function receives an user-defined hook function as the input parameter and returns a `handle` object corresponding to the hook function. Users can delete the corresponding hook function by calling the `remove()` function of the `handle` object. Each time the `register_forward_hook` function is called, a different handle object is returned. The hook function should be defined as follows. + +Example Code: + +```python +def forward_hook_fn(cell_id, inputs, outputs): + print("forward inputs: ", inputs) + print("forward outputs: ", outputs) +``` + +The `cell_id` is the name and ID information of the Cell object, the `inputs` are the forward data passed to the Cell object, the `outputs` are the forward output data of the Cell object. Therefore, users can use the `register_forward_hook` function to capture the forward input and output data of a Cell object in the network. You can customize the operation of the input and output data in the hook function, such as checking data, printing data, or returning new output data. If the original output data of the Cell object is calculated in the hook function and then returned as new output data, these new calculation operations will act on the back propagation of the gradient. 
+ +Example Code: + +```python +import numpy as np +import mindspore +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE) + +def forward_hook_fn(cell_id, inputs, outputs): + print("forward inputs: ", inputs) + print("forward outputs: ", outputs) + outputs = outputs + outputs + return outputs + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.mul = nn.MatMul() + self.handle = self.mul.register_forward_hook(forward_hook_fn) + + def construct(self, x, y): + x = x + x + x = self.mul(x, y) + return x + +grad = GradOperation(get_all=True) +net = Net() +gradient = grad(net)(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(gradient) +net.handle.remove() +gradient = grad(net)(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(gradient) +``` + +Output: + +```python +forward inputs: (Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 1.00000000e+00])) +forward outputs: 2.0 +(Tensor(shape=[1], dtype=Float32, value= [ 4.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 4.00000000e+00])) +(Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00])) +``` + +If user returns the newly created data directly in the hook function instead of returning the calculated data from the original output data, the back propagation of the gradient will be cut off on this Cell object. Refer to the same case description of the `register_forward_pre_hook` function for this phenomenon. + +In order to avoid script failure when switching to graph mode, it is not recommended to call the `register_forward_hook` function or the `remove()` function of the `handle` object in the `construct` function of the Cell object. In the PyNative mode, if the `register_forward_hook` function is called in the `construct` function of the Cell object, a hook function will be added at each run time of Cell object. + +More about the `register_forward_hook` interface, please refer to [API Document](https://mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.Cell.html#mindspore.nn.Cell.register_forward_hook). + +### The register_backward_hook function for Cell object + +Users can register a custom hook function by using the `register_backward_hook` function on the Cell object, which is used to capture the gradients associated with the Cell object during network back propagation. This function does not work in graph mode or on Cell object decorated with `ms_function` . The `register_backward_hook` function receives an user-defined hook function as the input parameter and returns a `handle` object corresponding to the hook function. Users can delete the corresponding hook function by calling the `remove()` function of the `handle` object. Each time the `register_backward_hook` function is called, a different `handle` object is returned. + +Different from the HookBackward operator, the input parameters of the hook function registered in register_backward_hook function contain the `cell_id` representing name and ID information of the Cell object, the incoming gradient and the output gradient of Cell object. 
+ +Example Code: + +```python +def backward_hook_function(cell_id, grad_input, grad_output): + print(grad_input) + print(grad_output) +``` + +The `cell_id` is the name and ID information of the Cell object. The `grad_input` is the input gradient of the Cell object, which corresponds to the output gradient of the next operator in the forward process. The `grad_output` is the output gradient of the Cell object. Therefore, users can use `register_backward_hook` interface to capture the input gradient and output gradient of the Cell object. Users can customize the gradient operation in the hook function, such as checking gradient, printing gradient or returning new output gradient. If users need to return new output gradient in the hook function, the return gradient must be in the form of `tuple` . + +Example Code: + +```python +import numpy as np +import mindspore +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE) + +def backward_hook_function(cell_id, grad_input, grad_output): + print(grad_input) + print(grad_output) + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.conv = nn.Conv2d(1, 2, kernel_size=2, stride=1, padding=0, weight_init="ones", pad_mode="valid") + self.bn = nn.BatchNorm2d(2, momentum=0.99, eps=0.00001, gamma_init="ones") + self.handle = self.bn.register_backward_hook(backward_hook_function) + self.relu = nn.ReLU() + + def construct(self, x): + x = self.conv(x) + x = self.bn(x) + x = self.relu(x) + return x + +net = Net() +grad_all = GradOperation(get_all=True) +output = grad_all(net)(Tensor(np.ones([1, 1, 2, 2]).astype(np.float32))) +print(output) +net.handle.remove() +output = grad_all(net)(Tensor(np.ones([1, 1, 2, 2]).astype(np.float32))) +print(output) +``` + +Output: + +```python +(Tensor(shape=[1, 2, 1, 1], dtype=Float32, value= +[[[[ 1.00000000e+00]], + [[ 1.00000000e+00]]]]),) +(Tensor(shape=[1, 2, 1, 1], dtype=Float32, value= +[[[[ 9.99994993e-01]], + [[ 9.99994993e-01]]]]),) +(Tensor(shape=[1, 1, 2, 2], dtype=Float32, value= +[[[[ 1.99998999e+00, 1.99998999e+00], + [ 1.99998999e+00, 1.99998999e+00]]]]),) +(Tensor(shape=[1, 1, 2, 2], dtype=Float32, value= +[[[[ 1.99998999e+00, 1.99998999e+00], + [ 1.99998999e+00, 1.99998999e+00]]]]),) +``` + +When `register_forward_pre_hook` function, `register_forward_hook` function and `register_backward_hook` function are registered on the same Cell object and new operators are added in hook function for data processing, these new operators will participate in the forward calculation before or after the execution of the Cell object, but the gradients of these new operators is not within the capture range of the backward hook function. The backward hook function registered by `register_backward_hook` interface only captures the input and output gradients of the original Cell object. 
+ +Example Code: + +```python +import numpy as np +import mindspore +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE) + +def forward_pre_hook_fn(cell_id, inputs): + print("forward inputs: ", inputs) + input_x = inputs[0] + inputs[1] + input_y = inputs[0] + inputs[1] + return input_x, input_y + +def forward_hook_fn(cell_id, inputs, outputs): + print("forward inputs: ", inputs) + print("forward outputs: ", outputs) + outputs = outputs + outputs + return outputs + +def backward_hook_fn(cell_id, grad_input, grad_output): + print("grad input: ", grad_input) + print("grad output: ", grad_output) + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.mul = nn.MatMul() + self.handle = self.mul.register_forward_pre_hook(forward_pre_hook_fn) + self.handle2 = self.mul.register_forward_hook(forward_hook_fn) + self.handle3 = self.mul.register_backward_hook(backward_hook_fn) + + def construct(self, x, y): + x = x + x + x = self.mul(x, y) + return x + +net = Net() +grad = GradOperation(get_all=True) +gradient = grad(net)(Tensor(np.ones([1]).astype(np.float32)), Tensor(np.ones([1]).astype(np.float32))) +print(gradient) +``` + +Output: + +```python +forward inputs: (Tensor(shape=[1], dtype=Float32, value= [ 2.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 1.00000000e+00])) +forward inputs: (Tensor(shape=[1], dtype=Float32, value= [ 3.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 3.00000000e+00])) +forward outputs: 9.0 +grad input: (Tensor(shape=[], dtype=Float32, value= 2),) +grad output: (Tensor(shape=[1], dtype=Float32, value= [ 6.00000000e+00]), Tensor(shape=[1], dtype=Float32, value= [ 6.00000000e+00])) +(Tensor(shape=[1], dtype=Float32, value= [ 2.40000000e+01]), Tensor(shape=[1], dtype=Float32, value= [ 1.20000000e+01])) +``` + +The `grad input` is the input gradient of `self.mul` . It is not the input gradient of the `Add` operator in the `forward_hook_funcntion` . The `grad output` is the output gradient of `self.mul` . It is not the output gradient of the `Add` operators in the `forward_pre_hook_funcntion` . `register_forward_pre_hook` function and `register_forward_hook` function work before or after the execution of the Cell object. They will not affect the gradient capture range of the backward hook function on the Cell object. + +In order to avoid script failure when switching to graph mode, it is not recommended to call the `register_backward_hook` function or the `remove()` function of the `handle` object in the `construct` function of the Cell object. In the PyNative mode, if the `register_backward_hook` function is called in the `construct` function of the Cell object, a hook function will be added at each run time of Cell object. + +More about the `register_backward_hook` interface, please refer to [API Document](https://mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.Cell.html#mindspore.nn.Cell.register_backward_hook). + +## Custom bprop + +Users can customize the back propagation (calculation) function of the Cell object to control the gradient calculation process and positioning gradient problem. The custom bprop is implemented by defining a `bprop function` for Cell object. During the back propagation process, the custom bprop function will run. 
+ +Example Code: + +```python +import mindspore +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import context +from mindspore.ops import GradOperation + +context.set_context(mode=context.PYNATIVE_MODE, device_target="GPU") + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + + def construct(self, x, y): + z = x * y + z = z * y + return z + + def bprop(self, x, y, out, dout): + x_dout = x + y + y_dout = x * y + return x_dout, y_dout + +grad_all = GradOperation(get_all=True) +output = grad_all(Net())(Tensor(1, mindspore.float32), Tensor(2, mindspore.float32)) +print(output) +``` + +Output: + +```python +(Tensor(shape=[], dtype=Float32, value= 3), Tensor(shape=[], dtype=Float32, value= 2)) +``` diff --git a/tutorials/experts/source_en/debug/dump_in_graph_mode.md b/tutorials/experts/source_en/debug/dump_in_graph_mode.md new file mode 100644 index 0000000000000000000000000000000000000000..53ff44ab8f804222ff74c6515a369036788c7263 --- /dev/null +++ b/tutorials/experts/source_en/debug/dump_in_graph_mode.md @@ -0,0 +1,580 @@ +# Using Dump in the Graph Mode + +`Ascend` `GPU` `CPU` `Model Optimization` + + + +## Overview + +The input and output of the operator can be saved for debugging through the data dump when the training result deviates from the expectation. + +- For the dynamic graph mode, MindSpore provides native Python execution capabilities. Users can view and record the corresponding input and output during the running of the network script. For details, see [Use PyNative Mode to Debug](https://www.mindspore.cn/docs/programming_guide/en/master/debug_in_pynative_mode.html). + +- For the static graph mode, MindSpore provides the Dump function to save the graph and the input and output data of the operator during model training to a disk file. + +Aiming at the static graph mode, this tutorial introduces how to analyze and compare network data based on the Dump function. + +### Debugging Process + +Using dump to help debugging is divided into two steps: 1. Data preparation; 2. Data analysis. + +#### Data preparation + +The data preparation phase uses synchronous dump or asynchronous dump to generate dump data. See [Synchronous Dump Step](#synchronous-dump-step) and [Asynchronous Dump Step](#asynchronous-dump-step) for details. + +When preparing data, you can refer to the following best practices: + +1. Set the `iteration` parameter to save only the data of the iteration with the problem and the previous iteration. For example, if the problem to be analyzed will appear in the 10th iteration (counting from 1), you can set it as follows: `"iteration": "8 | 9"`. Note that the `iteration` parameter evaluates iterations from 0. Saving the data of the above two iterations can help problem analysis under most scenarios. +2. After the iteration with problems is completed, it is recommended that you use [run_context.request_stop()](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.train.html#mindspore.train.callback.RunContext.request_stop) or other methods to stop the training. + +#### Data analysis + +If you have installed MindInsight, you can use offline debugger of MindInsight to analyze it. See [Using the Offline Debugger](https://www.mindspore.cn/mindinsight/docs/en/master/debugger_offline.html) for the usage of offline debugger. + +If MindInsight is not installed, you need to analyze the data through the following steps. + +1. Find the corresponding operator from the script. 
+
+    The Dump function needs to use the IR file of the final execution graph. The IR file can be viewed with the `vi` command. The IR file contains the full name of the operator, the dependency of the operator on the input and output of the computational graph, and also the trace information from the operator to the corresponding script code. For the configuration of the Dump function, see [Synchronous Dump Step](#synchronous-dump-step) and [Asynchronous Dump Step](#asynchronous-dump-step). For the naming and directory structure of the final execution graph IR file, see [Synchronous Dump Data Object Directory](#synchronous-dump-data-object-directory) and [Asynchronous Dump Data Object Directory](#asynchronous-dump-data-object-directory). Then find the operator corresponding to the code in the script through the graph file; refer to [Synchronous Dump Data Analysis Sample](#synchronous-dump-data-analysis-sample) and [Asynchronous Dump Data Analysis Sample](#asynchronous-dump-data-analysis-sample).
+
+2. From operator to dump data.
+
+    After understanding the mapping relationship between the script and the operator, you can determine the name of the operator you want to analyze and find the dump file corresponding to the operator. Please refer to [Synchronous Dump Data Object Directory](#synchronous-dump-data-object-directory) and [Asynchronous Dump Data Object Directory](#asynchronous-dump-data-object-directory).
+
+3. Analyze Dump data.
+
+    By analyzing Dump data, it can be compared with the data of other third-party frameworks. For the synchronous dump data format, please refer to [Introduction to Synchronous Dump Data File](#introduction-to-synchronous-dump-data-file). For the asynchronous Dump data format, please refer to [Introduction to Asynchronous Dump Data File](#introduction-to-asynchronous-dump-data-file).
+
+### Applicable Scene
+
+1. Analysis of static graph operator results.
+
+    Through the IR diagram obtained by the Dump function, you can understand the mapping relationship between the script code and the execution operator (for details, see [MindSpore IR Introduction](https://www.mindspore.cn/docs/programming_guide/en/master/design/mindir.html#overview)). Combining the input and output data of the execution operator, it is possible to analyze possible overflow, gradient explosion and gradient vanishing during the training process, and backtrack to the code that may have problems in the script.
+
+2. Analysis of the feature map.
+
+    Analyze the information of the feature map by obtaining the output data of the layer.
+
+3. Model migration.
+
+    In the scenario of migrating a model from a third-party framework (TensorFlow, PyTorch) to MindSpore, by comparing the output data of the operator at the same position, you can analyze whether the training results of the third-party framework and MindSpore for the same model are close enough, and thereby locate model precision issues.
+
+## Dump Introduction
+
+MindSpore provides two modes: synchronous dump and asynchronous dump:
+
+- The mechanism of synchronous dump is that after the execution of each step in the network training process, the host side initiates a dump action, copies the data in the operator address from the device to the host, and saves the file. Synchronous Dump will turn off memory reuse between operators by default to avoid reading dirty data.
+- Asynchronous Dump is a function developed specifically for the whole-graph sinking scenario on Ascend. It can dump data while executing the operator.
The data will be dumped immediately after the execution of an operator. Therefore, the correct data can be generated by turning on the memory reuse, but the corresponding network training speed will be slower. + +The configuration files required for different modes and the data format of dump are different: + +- When Dump is enabled on Ascend, the operator to Dump will automatically close memory reuse. +- Synchronous Dump supports the graphics mode both on Ascend, GPU and CPU, and currently does not support PyNative mode. +- Asynchronous Dump only supports graph mode on Ascend, not PyNative mode. Memory reuse will not be turned off when asynchronous dump is enabled. +- Default is Asynchronous mode. If synchronous mode is needed, "e2e_dump_settings" should be set in configure file. +- Dump does not support heterogeneous training. If Dump is enabled for heterogeneous training, the saved dump data object directory maybe not in expected directory structure. + +## Synchronous Dump + +### Synchronous Dump Step + +1. Create dump json file:`data_dump.json`, the name and location of the JSON file can be customized. + + ```json + { + "common_dump_settings": { + "dump_mode": 0, + "path": "/absolute_path", + "net_name": "ResNet50", + "iteration": "0|5-8|100-120", + "saved_data": "tensor", + "input_output": 0, + "kernels": ["Default/Conv-op12"], + "support_device": [0,1,2,3,4,5,6,7] + }, + "e2e_dump_settings": { + "enable": true, + "trans_flag": true + } + } + ``` + + - `dump_mode`: 0: dump all kernels data in graph, 1: dump kernels data in kernels list. + - `path`: The absolute path to save dump data. + - `net_name`: The net name eg:ResNet50. + - `iteration`: Specify the iterations to dump, type is string. Use "|" to separate the step data of different intervals to be saved. For example, "0 | 5-8 | 100-120" represents dump the data of the 1st, 6th to 9th, and 101st to 121st steps. If iteration set to "all", data of every iteration will be dumped. + - `saved_data`: Specify what data is to be dumped, type is string. Use "tensor" to dump tensor data, use "statistic" to dump tensor statistics, use "full" to dump both tensor data and statistics. Default setting is "tensor". Synchronous statistics dump is only supported on GPU, using "statistic" or "full" on CPU or Ascend will result in exception. + - `input_output`: 0: dump input and output of kernel, 1:dump input of kernel, 2:dump output of kernel. This configuration parameter only supports Ascend and CPU, and GPU can only dump the output of operator. + - `kernels`: List of operator names. Turn on the IR save switch `context.set_context(save_graphs=True)` and execute the network to obtain the operator name from the generated `trace_code_graph_{graph_id}`IR file. For details, please refer to [Saving IR](https://www.mindspore.cn/docs/programming_guide/en/master/design/mindir.html#saving-ir). + - `support_device`: Supported devices, default setting is `[0,1,2,3,4,5,6,7]`. You can specify specific device ids to dump specific device data. This configuration parameter is invalid on the CPU, because there is no concept of device on the CPU, but it is still need to reserve this parameter in the json file. + - `enable`: When set to true, enable Synchronous Dump. When set to false, asynchronous dump will be used on Ascend and synchronous dump will still be used on GPU. + - `trans_flag`: Enable trans flag. Transform the device data format into NCHW. 
If it is `True`, the data will be saved in the 4D format (NCHW) format on the Host side; if it is `False`, the data format on the Device side will be retained. This configuration parameter is invalid on the CPU, because there is no format conversion on the CPU, but it is still need to reserve this parameter in the json file. + +2. Set Dump environment. + + Specify the json configuration file of Dump. + + ```bash + export MINDSPORE_DUMP_CONFIG=${xxx} + ``` + + "xxx" represents the absolute path of data_dump.json + + ```bash + export MINDSPORE_DUMP_CONFIG=/path/to/data_dump.json + ``` + + If the `path` field is not set or set to an empty string in the Dump configuration file, you also need to configure the environment variable `MS_DIAGNOSTIC_DATA_PATH`. + + ```bash + export MS_DIAGNOSTIC_DATA_PATH=${yyy} + ``` + + Then "$MS_DIAGNOSTIC_DATA_PATH/debug_dump" is regarded as `path`. If the `path` field in configuration file is not empty, it is still used as the path to save Dump data. + + - Set the environment variables before executing the training script. Setting environment variables during training will not take effect. + - Dump environment variables need to be configured before calling `mindspore.communication.init`. + +3. Execute the training script to dump data. + + After the training is started, if the `MINDSPORE_DUMP_CONFIG` environment variable is correctly configured, the content of the configuration file will be read and the operator data will be saved according to the data storage path specified in the Dump configuration. + In synchronous mode, if you want to dump data in GPU environment, you must use the non-data sink mode (set the `dataset_sink_mode` parameter in `model.train` or `DatasetHelper` to `False`) to ensure that you can get the dump data of each step. + If `model.train` or `DatasetHelper` is not called in the script, the default is non-data sinking mode. Using the Dump function will automatically generate the IR file of the final execution graph. + + You can set `context.set_context(reserve_class_name_in_scope=False)` in your training script to avoid dump failure because of file name is too long. + +4. Read and parse synchronous dump data through `numpy.load`, refer to [Introduction to Synchronous Dump Data File](#introduction-to-synchronous-dump-data-file). + +### Synchronous Dump Data Object Directory + +After starting the training, the data objects saved by the synchronous Dump include the final execution graph (`ms_output_trace_code_graph_{graph_id}.ir` file) and the input and output data of the operators in the graph. The data directory structure is as follows: + +```text +{path}/ + - rank_{rank_id}/ + - .dump_metadata/ + - {net_name}/ + - {graph_id}/ + - {iteration_id}/ + statistic.csv + {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.npy + - constants/ + Parameter.data-{data_id}.0.0.{timestamp}.output.0.DefaultFormat.npy + ... + - graphs/ + ms_output_trace_code_graph_{graph_id}.pb + ms_output_trace_code_graph_{graph_id}.ir + - execution_order/ + ms_execution_order_graph_{graph_id}.csv + ms_global_execution_order_graph_{graph_id}.csv + +``` + +- `path`: the absolute path set in the `data_dump.json` configuration file. +- `rank_id`: the id of the logic device. +- `net_name`: the network name set in the `data_dump.json` configuration file. +- `graph_id`: the id of the training graph. +- `iteration_id`: the iteration of the training. +- `op_type`: the type of the operator. +- `op_name`: the name of the operator. 
+
+- `task_id`: the id of the task.
+- `stream_id`: the id of the stream.
+- `timestamp`: the time stamp.
+- `input_output_index`: the index of input or output. For example, `output_0` means that the file is the data of the first output Tensor of the operator.
+- `slot`: the id of the slot.
+- `format`: the format of the data.
+- `data_id`: the id of constant data.
+
+For multi-graph networks, due to the control flow, some subgraphs may not be executed, but Dump only saves the executed nodes, so the {graph_id} in the `.pb` file name under the 'graphs' directory may not always have a corresponding {graph_id} directory in the {net_name} directory.
+
+Only when `saved_data` is "statistic" or "full" will tensor statistics be dumped in `statistic.csv`. Only when `saved_data` is "tensor" or "full" will full tensor data be dumped in `{op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.npy`.
+
+### Introduction to Synchronous Dump Data File
+
+The data file generated by the synchronous Dump is a binary file with the suffix `.npy`, and the file naming format is:
+
+```text
+{op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.npy
+```
+
+The constant data file generated by the synchronous Dump is in the same format as the data file, except that {op_type}, {task_id}, {stream_id}, {input_output_index}, {slot}, and {format} are the same for all constant data. Note that data of non-Tensor type will not generate a dump file.
+
+```text
+Parameter.data-{data_id}.0.0.{timestamp}.output.0.DefaultFormat.npy
+```
+
+Users can use the NumPy interface `numpy.load` to read the data.
+
+The statistics file generated by the synchronous dump is named `statistic.csv`. This file stores key statistics for all tensors dumped under the same directory as itself (whose file names follow `{op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.npy`). Each row in `statistic.csv` summarizes a single tensor and contains the following statistics: Op Type, Op Name, Task ID, Stream ID, Timestamp, IO, Slot, Data Size, Data Type, Shape, Max Value, Min Value, Avg Value, Count, Negative Zero Count, Positive Zero Count, NaN Count, Negative Inf Count, Positive Inf Count, Zero Count. Note that opening this file with Excel may cause data to be displayed incorrectly. Please use commands like `vi` or `cat`, or use Excel to import the csv from text for viewing.
+
+The suffixes of the final execution graph files generated by synchronous Dump are `.pb` and `.ir` respectively, and the file naming format is:
+
+```text
+ms_output_trace_code_graph_{graph_id}.pb
+ms_output_trace_code_graph_{graph_id}.ir
+```
+
+The files with the suffix `.ir` can be opened and viewed with the `vi` command.
+
+The suffix of the node execution sequence file generated by the synchronous Dump is `.csv`, and the file naming format is:
+
+```text
+ms_execution_order_graph_{graph_id}.csv
+```
+
+The suffix of the graph execution history file is `.csv`. The file naming format is:
+
+```text
+ms_global_execution_order_graph_{graph_id}.csv
+```
+
+This file stores the list of iterations in which the graph was executed. After the graph is compiled, it may be split into multiple sub-graphs.
+Since sub-graphs share the same graph execution history with the root graph, only the root graph generates an execution history file.
+
+`.dump_metadata` records the original training information, and `data_dump.json` saves the dump configuration set by the user.
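+
+As a quick illustration of working with these files, the following minimal sketch loads one dumped tensor and scans `statistic.csv` for unusually large values. The tensor file name here is hypothetical (it merely follows the naming rule above), and the sketch assumes that the header row of `statistic.csv` uses the column names listed above.
+
+```python
+import csv
+import numpy as np
+
+# Hypothetical tensor file name following the synchronous dump naming rule.
+tensor = np.load("Conv2D.Conv2D-op12.0.0.1623124369613540.output.0.DefaultFormat.npy")
+print(tensor.shape, tensor.dtype)
+
+# statistic.csv is generated only when saved_data is "statistic" or "full".
+with open("statistic.csv", newline="") as f:
+    for row in csv.DictReader(f):
+        if row["Max Value"] and float(row["Max Value"]) > 1e4:
+            print(row["Op Name"], row["IO"], row["Slot"], row["Max Value"])
+```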
+ +### Synchronous Dump Data Analysis Sample + +For the Ascend scene, after the graph corresponding to the script is saved to the disk through the Dump function, the final execution graph file `ms_output_trace_code_graph_{graph_id}.ir` will be generated. This file saves the stack information of each operator in the corresponding graph, and records the generation script corresponding to the operator. + +Take [AlexNet script](https://gitee.com/mindspore/models/blob/master/official/cv/alexnet/src/alexnet.py) as an example: + +```python +import mindspore.nn as nn +import mindspore.ops as ops + + +def conv(in_channels, out_channels, kernel_size, stride=1, padding=0, pad_mode="valid", has_bias=True): + return nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, + has_bias=has_bias, pad_mode=pad_mode) + + +def fc_with_initialize(input_channels, out_channels, has_bias=True): + return nn.Dense(input_channels, out_channels, has_bias=has_bias) + + +class AlexNet(nn.Cell): + """ + Alexnet + """ + def __init__(self, num_classes=10, channel=3, phase='train', include_top=True): + super(AlexNet, self).__init__() + self.conv1 = conv(channel, 64, 11, stride=4, pad_mode="same", has_bias=True) + self.conv2 = conv(64, 128, 5, pad_mode="same", has_bias=True) + self.conv3 = conv(128, 192, 3, pad_mode="same", has_bias=True) + self.conv4 = conv(192, 256, 3, pad_mode="same", has_bias=True) + self.conv5 = conv(256, 256, 3, pad_mode="same", has_bias=True) + self.relu = ops.ReLU() + self.max_pool2d = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="valid") + self.include_top = include_top + if self.include_top: + dropout_ratio = 0.65 + if phase == 'test': + dropout_ratio = 1.0 + self.flatten = nn.Flatten() + self.fc1 = fc_with_initialize(6 * 6 * 256, 4096) + self.fc2 = fc_with_initialize(4096, 4096) + self.fc3 = fc_with_initialize(4096, num_classes) + self.dropout = nn.Dropout(dropout_ratio) + + def construct(self, x): + """define network""" + x = self.conv1(x) + x = self.relu(x) + x = self.max_pool2d(x) + x = self.conv2(x) + x = self.relu(x) + x = self.max_pool2d(x) + x = self.conv3(x) + x = self.relu(x) + x = self.conv4(x) + x = self.relu(x) + x = self.conv5(x) + x = self.relu(x) + x = self.max_pool2d(x) + if not self.include_top: + return x + x = self.flatten(x) + x = self.fc1(x) + x = self.relu(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.relu(x) + x = self.dropout(x) + x = self.fc3(x) + return x +``` + +If the user wants to view the code at line 58 in the script: + +```python +x = self.conv3(x) +``` + +After executing the network training, you can find multiple operator information corresponding to the line of code from the final execution graph (`ms_output_trace_code_graph_{graph_id}.ir` file). The content of the file is as follows: + +```text + %24(equivoutput) = Conv2D(%23, %21) {instance name: conv2d} primitive_attrs: {compile_info: , pri_format: NC1HWC0, stride: (1, 1, 1, 1), pad: (0, 0, 0, 0), pad_mod: same, out_channel: +192, mode: 1, dilation: (1, 1, 1, 1), output_names: [output], group: 1, format: NCHW, offset_a: 0, kernel_size: (3, 3), groups: 1, input_names: [x, w], pad_list: (1, 1, 1, 1), +IsFeatureMapOutput: true, IsFeatureMapInputList: (0)} + : (, ) -> () + : (, ) -> () + : (Default/network-WithLossCell/_backbone-AlexNet/conv3-Conv2d/Conv2D-op107) + ... + # In file {Absolute path of model_zoo}/official/cv/alexnet/src/alexnet.py(58)/ x = self.conv3(x)/ + ... 
+ %25(equivoutput) = BiasAdd(%24, %22) {instance name: bias_add} primitive_attrs: {output_used_num: (1), input_names: [x, b], format: NCHW, compile_info: , output_names: [output], +IsFeatureMapOutput: true, IsFeatureMapInputList: (0), pri_format: NC1HWC0} + : () -> () -> () + : () -> () -> () + : (Default/network-WithLossCell/_backbone-AlexNet/conv3-Conv2d/BiasAdd-op105) + ... + # In file {Absolute path of model_zoo}/official/cv/alexnet/src/alexnet.py(58)/ x = self.conv3(x)/ + ... +``` + +The meanings of the lines in the file content shown above are as follows: + +- The input and output of the operator on the Host side (the first line) and the Device side (the second line, some operators may not exist). It can be seen from the execution graph that the operator has two inputs (left side of the arrow) and one output (right side of the arrow). + + ```text + : (, ) -> () + : (, ) -> () + ``` + +- Operator name. It can be seen from the execution graph that the full name of the operator in the final execution graph is `Default/network-WithLossCell/_backbone-AlexNet/conv3-Conv2d/Conv2D-op107`. + + ```text + : (Default/network-WithLossCell/_backbone-AlexNet/conv3-Conv2d/Conv2D-op107) + ``` + +- The training script code corresponding to the operator. By searching the training script code to be queried, multiple matching operators can be found. + + ```text + # In file {Absolute path of model_zoo}/official/cv/alexnet/src/alexnet.py(58)/ x = self.conv3(x)/ + ``` + +Through the operator name and input and output information, you can find the only corresponding Tensor data file. For example, if you want to view the dump file corresponding to the first output data of the Conv2D-op107 operator, you can obtain the following information: + +- `operator_name`: `Conv2D-op107`. + +- `input_output_index`: `output.0` indicates that the file is the data of the first output Tensor of the operator. + +- `slot`: 0, this tensor only has one slot. + +Search for the corresponding file name in the data object file directory saved by Dump: +`Conv2d.Conv2D-op107.2.2.1623124369613540.output.0.DefaultFormat.npy`. + +When restoring data, execute: + +```python +import numpy +numpy.load("Conv2d.Conv2D-op107.2.2.1623124369613540.output.0.DefaultFormat.npy") +``` + +Restore the data as `numpy.array' format. + +## Asynchronous Dump + +Large networks (such as Bert Large) will cause memory overflow when using synchronous dumps. MindSpore provides debugging capabilities for large networks through asynchronous dumps. + +### Asynchronous Dump Step + +1. Create dump json file:`data_dump.json`. + + The name and location of the JSON file can be customized. + + ```json + { + "common_dump_settings": { + "dump_mode": 0, + "path": "/absolute_path", + "net_name": "ResNet50", + "iteration": "0|5-8|100-120", + "saved_data": "tensor", + "input_output": 0, + "kernels": ["Default/Conv-op12"], + "support_device": [0,1,2,3,4,5,6,7], + "op_debug_mode": 0, + "file_format": "npy" + } + } + ``` + + - `dump_mode`: 0: dump all kernels data in graph, 1: dump kernels data in kernels list, 2: dump the kernels data specified by `set_dump` in the scripts, see [mindspore.dump](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore/mindspore.set_dump.html) for the usage of `set_dump`. + - `path`: The absolute path to save dump data. It is not a mandatory option and can be left unset or set to an empty string. + - `net_name`: The net name eg:ResNet50. + - `iteration`: Specify the iterations to dump, type is string. 
Use "|" to separate the step data of different intervals to be saved. For example, "0 | 5-8 | 100-120" represents dump the data of the 1st, 6th to 9th, and 101st to 121st steps. If iteration set to "all", data of every iteration will be dumped. + - `saved_data`: Specify what data is to be dumped, type is string. Use "tensor" to dump tensor data, use "statistic" to dump tensor statistics, use "full" to dump both tensor data and statistics. Default setting is "tensor". Asynchronous statistics dump is only supported when `file_format` is set to `npy`, using "statistic" or "full" when `file_format` is set to `bin` will result in exception. + - `input_output`: When set to 0, it means to Dump the operator's input and output; setting it to 1 means to Dump the operator's input; setting it to 2 means to Dump the output of the operator. + - `kernels`: List of operator names. Turn on the IR save switch `context.set_context(save_graphs=True)` and execute the network to obtain the operator name from the generated `trace_code_graph_{graph_id}`IR file. `kernels` only supports TBE operator, AiCPU operator and communication operator. The data of communication operation input operator will be dumped if `kernels` is set to the name of communication operator. For details, please refer to [Saving IR](https://www.mindspore.cn/docs/programming_guide/en/master/design/mindir.html#saving-ir). + - `support_device`: Supported devices, default setting is `[0,1,2,3,4,5,6,7]`. You can specify specific device ids to dump specific device data. + - `enable`: Enable Asynchronous Dump. If synchronous dump and asynchronous dump are enabled at the same time, only synchronous dump will take effect. + - `op_debug_mode`: Reserved field, set to 0. + - `file_format`: Dump file type. It can be either `npy` and `bin`. `npy`: data will be dumped in npy files as host format. `bin`: data will be dumped in protobuf file as device format and need to be transformed to parse using the provided data analysis tool. Please refer to [Asynchronous Dump Data Analysis Sample](#asynchronous-dump-data-analysis-sample) for details. The default value is `bin`. + +2. Set Dump environment. + + Specify the json configuration file of Dump. + + ```bash + export MINDSPORE_DUMP_CONFIG=${Absolute path of data_dump.json} + ``` + + If the `path` field is not set or set to an empty string in the Dump configuration file, you also need to configure the environment variable `MS_DIAGNOSTIC_DATA_PATH`. + + ```bash + export MS_DIAGNOSTIC_DATA_PATH=${yyy} + ``` + + Then "$MS_DIAGNOSTIC_DATA_PATH/debug_dump" is regarded as `path`. If the `path` field in configuration file is not empty, it is still used as the path to save Dump data. + + - Set the environment variables before executing the training script. Setting environment variables during training will not take effect. + - Dump environment variables need to be configured before calling `mindspore.communication.init`. + +3. Execute the training script to dump data. + + You can set `context.set_context(reserve_class_name_in_scope=False)` in your training script to avoid dump failure because of file name is too long. + +4. Refer to [Asynchronous Dump Data Analysis Sample](#asynchronous-dump-data-analysis-sample) to analyze the Dump data file. + +- If you need to dump all or part of the operator, you can modify the `dump_mode` option in the json configuration file to 0 or 1. +- Using the Dump function will automatically generate the IR file of the final execution graph. 
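+
+For the `dump_mode: 2` option mentioned above, the operators to be dumped are marked in the script with `mindspore.set_dump`. The following is a minimal sketch of this usage; the network and the layer chosen for dumping are hypothetical, and the JSON configuration file and the `MINDSPORE_DUMP_CONFIG` environment variable still need to be prepared as described in the steps above.
+
+```python
+import mindspore.nn as nn
+import mindspore.context as context
+from mindspore import set_dump
+
+context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
+
+class Net(nn.Cell):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.conv1 = nn.Conv2d(3, 8, 3)
+        self.relu = nn.ReLU()
+
+    def construct(self, x):
+        return self.relu(self.conv1(x))
+
+net = Net()
+set_dump(net.conv1)  # with "dump_mode": 2, only the data of conv1 is dumped
+```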
+ +### Asynchronous Dump Data Object Directory + +If set `file_format` to `npy`, see [Synchronous Dump Data Object Directory](https://www.mindspore.cn/docs/programming_guide/en/master/dump_in_graph_mode.html#synchronous-dump-data-object-directory) for the dump data object directory. + +The data objects saved by asynchronous Dump include the final execution graph (`ms_output_trace_code_graph_{graph_id}.ir` file) and the input and output data of the operators in the graph. The directory structure is as follows: + +```text +{path}/ + - rank_{rank_id}/ + - .dump_metadata/ + - {net_name}/ + - {graph_id}/ + - {iteration_id}/ + statistic.csv + {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} + mapping.csv + - constants/ + Parameter.data-{data_id}.0.0.{timestamp}.output.0.DefaultFormat.npy + ... + - graphs/ + ms_output_trace_code_graph_{graph_id}.pb + ms_output_trace_code_graph_{graph_id}.ir + - execution_order/ + ms_execution_order_graph_{graph_id}.csv + ms_global_execution_order_graph_{graph_id}.csv +``` + +- `path`: the absolute path set in the `data_dump.json` configuration file. +- `rank_id`: the id of the logic device. +- `net_name`: the network name set in the `data_dump.json` configuration file. +- `graph_id`: the id of the training graph. +- `iteration_id`: the iteration of the training. +- `op_type`: the type of the operator. +- `op_name`: the name of the operator. +- `task_id`: the id of the task. +- `stream_id`: the id of the stream. +- `timestamp`: the time stamp. +- `data_id`: the id of constant data. + +Due to the control flow, some sub-graphs may not be executed, but Dump only saves the executed nodes, so the {graph_id} in the `.pb` file name under the 'graphs' directory may not always have a corresponding {graph_id} directory in {net_name} directory. + +For multi-graph networks, such as dynamic shape scenario, the iterations of all graphs on each device are counted uniformly. + +If the length of the tensor file name defined according to the naming rules exceeds the OS file name length limit (usually 255 characters), the tensor file will be renamed to a string of random numbers. The mapping relationship will be written to the file 'mapping.csv' in the same directory. + +### Introduction to Asynchronous Dump Data File + +If set `file_format` to `npy`, see [Introduction to Synchronous Dump Data File](https://www.mindspore.cn/docs/programming_guide/en/master/dump_in_graph_mode.html#introduction-to-synchronous-dump-data-file) for the introduction to dump data file. + +If not configured `file_format` or set `file_format` to `bin`, after the training is started, the original data file generated by asynchronous Dump is in protobuf format. It needs to be parsed using the data analysis tool that comes with the HiSilicon Run package. For details, please refer to [How to view dump data files](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/640e796d/how-do-i-view-a-dump-file). + +The data format on the Device side may be different from the definition in the calculation diagram on the Host side. The data format of the asynchronous dump is the Device side format. If you want to convert to the Host side format, you can refer to [How to convert dump data file format](https://support.huawei.com/enterprise/en/doc/EDOC1100206689/130949fb/how-do-i-convert-the-format-of-a-dump-file). 
+
+If the file is saved in `bin` format, the file naming format is:
+
+```text
+{op_type}.{op_name}.{task_id}.{timestamp}
+```
+
+Take the Dump result of a simple network as an example: `Add.Default_Add-op1.2.161243956333802`, where `Add` is `{op_type}`, `Default_Add-op1` is `{op_name}`, `2` is `{task_id}`, and `161243956333802` is `{timestamp}`.
+
+If ".", "/", "\", and spaces appear in `op_type` and `op_name`, they will be converted to underscores.
+
+The original data file generated by dump can also be parsed by using the data parsing tool DumpParser of MindInsight. Please refer to [DumpParser Introduction](https://gitee.com/mindspore/mindinsight/blob/master/mindinsight/parser/README.md) for the usage of DumpParser.
+The data format parsed by MindInsight is exactly the same as that of synchronous dump.
+
+If `file_format` is set to `npy`, the naming convention of the data files generated by asynchronous dump is the same as that of synchronous dump. Please refer to [Introduction to Synchronous Dump Data File](#introduction-to-synchronous-dump-data-file).
+
+The `saved_data` option only takes effect when `file_format` is "npy". If `saved_data` is "statistic" or "full", tensor statistics will be dumped in `statistic.csv`. When `saved_data` is "tensor" or "full", full tensor data will be dumped in `{op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input_output_index}.{slot}.{format}.npy`. The format of the statistic file is the same as that of synchronous dump. Please refer to [Introduction to Synchronous Dump Data File](#introduction-to-synchronous-dump-data-file).
+
+The constant dump file, final execution graph file, and execution order file naming rules generated by asynchronous Dump are the same as those of synchronous Dump. You can refer to [Introduction to Synchronous Dump Data File](#introduction-to-synchronous-dump-data-file).
+
+### Asynchronous Dump Data Analysis Sample
+
+Through the asynchronous Dump function, the data files generated by the asynchronous Dump of operators can be obtained.
+
+1. Parse the dumped file using `msaccucmp.py` provided in the run package. The path where the `msaccucmp.py` file is located may differ across environments; you can find it through the `find` command:
+
+    ```bash
+    find ${run_path} -name "msaccucmp.py"
+    ```
+
+    - `run_path`: The installation path of the run package.
+
+2. After training, change directory to `/absolute_path` and execute the following command to parse the Dump data file:
+
+    ```bash
+    python ${The absolute path of msaccucmp.py} convert -d {file path of dump} -out {file path of output}
+    ```
+
+    You can also use `msaccucmp.py` to convert the format of the dump file.
+
+    For example, the data file generated by Dump is:
+
+    ```text
+    BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491
+    ```
+
+    Then execute:
+
+    ```bash
+    python3.7.5 msaccucmp.py convert -d BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491 -out ./output -f NCHW -t npy
+    ```
+
+    Then all input and output data of the operator can be generated under `./output`. Each piece of data is saved as a file with the suffix `.npy`, and the data format is `NCHW`.
+ + The generated results are as follows: + + ```text + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.0.30x1024x17x17.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.1.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.2.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.3.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.4.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.5.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.6.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.output.0.30x1024x17x17.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.output.1.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.output.2.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.output.3.1x1024x1x1.npy + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.output.4.1x1024x1x1.npy + ``` + + At the end of the file name, you can see which input or output the file is the operator, and the dimensional information of the data. For example, by the first `.npy` file name + + ```text + BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.0.30x1024x17x17.npy + ``` + + It can be seen that the file is the 0th input of the operator, and the dimension information of the data is `30x1024x17x17`. + +3. The corresponding data can be read through `numpy.load("file_name")`. 
For example: + + ```python + import numpy + numpy.load("BNTrainingUpdate.Default_network-YoloWithLossCell_yolo_network-YOLOV3DarkNet53_feature_map-YOLOv3_backblock0-YoloBlock_conv3-SequentialCell_1-BatchNorm2d_BNTrainingUpdate-op5489.137.1608983934774491.input.0.30x1024x17x17.npy") + ``` diff --git a/tutorials/experts/source_en/debug/images/autotune.png b/tutorials/experts/source_en/debug/images/autotune.png new file mode 100644 index 0000000000000000000000000000000000000000..1a37d87651c007a794ea6ff09a44ca2c4317d345 Binary files /dev/null and b/tutorials/experts/source_en/debug/images/autotune.png differ diff --git a/tutorials/experts/source_en/debug/images/dot_to_png.png b/tutorials/experts/source_en/debug/images/dot_to_png.png new file mode 100644 index 0000000000000000000000000000000000000000..9689a1e575c9900fbde64b4af206cc48ebad6dbb Binary files /dev/null and b/tutorials/experts/source_en/debug/images/dot_to_png.png differ diff --git a/tutorials/experts/source_en/debug/incremental_compilation.md b/tutorials/experts/source_en/debug/incremental_compilation.md new file mode 100644 index 0000000000000000000000000000000000000000..5ac5e6c836788b6de50001ddc27afa09d847191c --- /dev/null +++ b/tutorials/experts/source_en/debug/incremental_compilation.md @@ -0,0 +1,88 @@ +# Incremental Operator Build + +`Ascend` `Model Optimization` + + + +## Overview + +When a network model is executed, MindSpore builds the used operators. The time consumed in this stage increases with the scale of the network model. To improve the performance of secondary model execution, an incremental operator build mechanism is provided. When MindSpore executes a network model, the default `rank_0/kernel_meta` folder is generated in the directory where the execution is performed. During the execution, operator cache files (in the `.o`, `.info`, or `.json` format) generated during network build are saved to this directory. If you execute the same network model again or only part of the model changes, MindSpore automatically calls the reusable operator cache files in the `rank_0/kernel_meta` folder, which significantly reduces the network build time and improves the execution performance. Currently, the incremental operator build function can be used only on the Ascend AI chips. + +The following demonstrates how to use the incremental operator build function. + +## Usage + +Incremental operator build is enabled by default on MindSpore and does not need to be controlled. The following describes how to build a simple network model case `test_square.py`. + +Execute the following test case: + +```python +import numpy as np +import mindspore.nn as nn +import mindspore.context as context +import mindspore.ops as ops +from mindspore import Tensor + +context.set_context(mode=context.GRAPH_MODE, device_target="Ascend") + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.square = ops.Square() + + def construct(self, data): + return self.square(data) + +def test_net(): + x = np.array([1.0, 4.0, 9.0]).astype(np.float32) + square = Net() + output = square(Tensor(x)) + print("x: ", x) + print("output: ", output) + + +if __name__ == "__main__": + test_net() + +``` + +The network model consists of a single operator `Square`, and the output is a square value of the input. + +```text +x: [1. 4. 9.] +output: [1. 16. 81.] +``` + +The `rank_0/kernel_meta` folder is generated in the directory where the execution is performed, which contains the `.o`, `.json`, `.info` and other files. 
+
+```text
+└─src
+    ├── test_square.py
+    └── rank_0
+        └──kernel_meta
+            ├── square_12484080525657478220_2.info
+            ├── square_12484080525657478220_2.json
+            └── square_12484080525657478220_2.o
+```
+
+For an operator:
+
+The `.o` file is an executable file generated by MindSpore for the operator during network model execution.
+
+The `.info` file records all valid information about the operator, including the operator name, attributes, input and output formats, and input and output data types. The `.info` file is used to search for and determine whether the `.o` file of the operator can be reused.
+
+The `.json` file stores the operator build result, which will be used during running.
+
+After the preceding three types of operator cache files are generated, you can perform incremental operator build when executing the network model. That is, only new or modified operators will be built, greatly improving the network build performance.
+
+## FAQs
+
+- Cache files cannot be shared in different scenarios, such as multi-device and single-device scenarios, or training and inference scenarios.
+
+- `rank_0` is the default directory used when the environment variable `RANK_ID` is empty. If `RANK_ID` is not empty, for example `RANK_ID=3`, the path `rank_3/kernel_meta` will be generated.
+
+- The path of `kernel_meta` can be specified by the environment variable `MS_COMPILER_CACHE_PATH`. For example, with `export MS_COMPILER_CACHE_PATH=/home/xxx/` and `export RANK_ID=2`, the operator compilation cache files will be saved in `/home/xxx/rank_2/kernel_meta/`.
+
+- When multiple devices are running, the `rank_{ID}/kernel_meta` folder is generated in multiple `device` directories when the network model is executed (the `ID` is the value of the environment variable `RANK_ID`).
+
+  Note that when multiple devices are running, if the operator cache files in `rank_{ID}/kernel_meta` of some devices are deleted and the same network model is executed again, devices that do not need to be rebuilt may time out. As a result, the execution fails. In this case, you can set the environment variable `HCCL_CONNECT_TIMEOUT`, that is, the waiting time between multiple devices, to avoid failure. However, this method takes a long time, which is equivalent to deleting the cache and rebuilding on all devices.
diff --git a/tutorials/experts/source_en/debug/read_ir_files.md b/tutorials/experts/source_en/debug/read_ir_files.md
new file mode 100644
index 0000000000000000000000000000000000000000..98cc1190dc52b2b8e6bad69bf01ab2dbee755091
--- /dev/null
+++ b/tutorials/experts/source_en/debug/read_ir_files.md
@@ -0,0 +1,422 @@
+# Reading IR
+
+`Ascend` `GPU` `CPU` `Model Optimization`
+
+
+
+## Overview
+
+When a model compiled using MindSpore runs in graph mode `context.set_context(mode=context.GRAPH_MODE)` and `context.set_context(save_graphs=True)` is set in the configuration, some intermediate files will be generated during graph compilation. These intermediate files are called IR files. Currently, there are three kinds of IR files:
+
+- .ir file: An IR file that describes the model structure in text format and can be directly viewed using any text editor.
+- .dat file: An IR file that describes the model structure more strictly than the .ir file. It contains more contents and can be directly viewed using any text editor.
+- .dot file: An IR file that describes the topology relationships between different nodes.
You can use this file as input to [graphviz](http://graphviz.org/) to generate images for users to view the model structure. For models with multiple operators, it is recommended to use the visualization component [MindInsight](https://www.mindspore.cn/mindinsight/docs/en/master/dashboard.html#computational-graph-visualization) to visualize computing graphs.
+
+## Saving IR
+
+`context.set_context(save_graphs=True)` is used to save the intermediate code in each compilation phase. The intermediate code can be saved in two formats. One is the text format with the suffix `.ir`, and the other is the graphical format with the suffix `.dot`. When the network scale is small, you are advised to use the graphical format, which is more intuitive. When the network scale is large, you are advised to use the text format, which is more efficient.
+
+You can run the graphviz command to convert a .dot file to the picture format. For example, you can run the `dot -Tpng *.dot -o *.png` command to convert a `.dot` file to a .png file.
+
+Add the following code to `train.py`. When running the script, MindSpore will automatically store the IR files generated during compilation under the specified path.
+
+```python
+if __name__ == "__main__":
+    context.set_context(save_graphs=True, save_graphs_path="path/to/ir/files")
+```
+
+After the training command is executed, some files are generated in the path of `save_graphs_path`.
+
+```text
+.
+├──00_parse_0000.ir
+├──00_parse_0001.dat
+├──00_parse_0002.dot
+├──01_symbol_resolve_0003.ir
+├──01_symbol_resolve_0004.dat
+├──01_symbol_resolve_0005.dot
+├──02_combine_like_graphs_0006.ir
+├──02_combine_like_graphs_0007.dat
+├──02_combine_like_graphs_0008.dot
+├──03_inference_opt_prepare_0009.ir
+├──03_inference_opt_prepare_0010.dat
+├──03_inference_opt_prepare_0011.dot
+├──04_abstract_specialize_0012.ir
+├──04_abstract_specialize_0013.dat
+├──04_abstract_specialize_0014.dot
+...
+```
+
+The IR files starting with digits and underscores are generated during the ME graph compilation. The compute graph is
+saved in each phase of the `pipeline`. Let's look at the important phases.
+
+- The `parse` phase parses the entry `construct` function. If you view the IR file of this phase, you can see that only the
+  graph information of the top cell is parsed in this phase.
+- The `symbol_resolve` phase recursively parses other functions and objects directly or indirectly referenced by the
+  entry function. If unsupported syntax is used, an error will be reported in this phase.
+- The `abstract_specialize` phase infers every node's `data type` and `shape` from the cell's inputs. When you want to
+  know the shape or data type of a specific operator in IR, you can view this IR file (a small scanning sketch follows this list).
+- In the `optimize` phase, hardware-independent optimization is performed, and the automatic differentiation and automatic
+  parallel functions are also applied. Some ir files with the prefix `opt_pass` are saved here. You do not need to pay too
+  much attention to those files unless you are a framework developer.
+- The `validate` phase will check the temporary operators which should have been removed in the prior phases. If any temporary
+  operator exists, the process will report an error and exit.
+- The `task_emit` phase will transfer the compute graph to the backend for further processing.
+- The `execute` phase will execute the compute graph. This is the final graph of the frontend phase.
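+
+As a small aid for the `abstract_specialize` point above, the following sketch (assuming the default file names shown earlier; the operator name is only an example) scans that IR file for an operator and prints its node line together with the inferred type/shape line and the source line that follow it.
+
+```python
+# Minimal sketch: locate an operator in the abstract_specialize IR file and
+# print its inferred data type/shape and the corresponding source line.
+ir_file = "path/to/ir/files/04_abstract_specialize_0012.ir"
+op_name = "Conv2D"  # hypothetical operator of interest
+
+with open(ir_file, "r") as f:
+    lines = f.readlines()
+
+for i, line in enumerate(lines):
+    if op_name in line and line.lstrip().startswith("%"):
+        print("".join(lines[i:i + 3]))
+```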
+ +In addition, you don't need to pay too much attention to the IR files (such as files beginning with `hwopt`) if you are +not the framework developer because the backend is close to the hardware. Only need pay attention to the +file `graph_build_[graph_id]_[IR_id].ir`. It is the MindIR after the frontend and backend optimization. + +> Multiple files may be saved because the backend only can handle the single graph. +> It is different with the frontend when the front save all sub-graphs in the one file. + +## IR File Contents Introduction + +The following is an example to describe the contents of the IR file. The content may have some changes with the version upgrade of MindSpore. + +```python +import mindspore.context as context +import mindspore.nn as nn +from mindspore import Tensor +from mindspore import ops +from mindspore import dtype as mstype + +context.set_context(mode=context.GRAPH_MODE) +context.set_context(save_graphs=True, save_graphs_path="./") + +class Net(nn.Cell): + def __init__(self): + super().__init__() + self.add = ops.Add() + self.sub = ops.Sub() + self.mul = ops.Mul() + self.div = ops.Div() + + def func(x, y): + return self.div(x, y) + + def construct(self, x, y): + a = self.sub(x, 1) + b = self.add(a, y) + c = self.mul(b, self.func(a, b)) + return c + +input1 = Tensor(3, mstype.float32) +input2 = Tensor(2, mstype.float32) +net = Net() +out = net(input1, input2) +print(out) +``` + +### ir Introduction + +Use a text editing software (for example, vi) to open the `04_abstract_specialize_0012.ir` file. The file contents are as follows: + +```text + 1 #IR entry : @1_construct_wrapper.21 + 2 #attrs : + 3 #Total params : 2 + 4 + 5 %para1_x : + 6 %para2_y : + 7 + 8 #Total subgraph : 3 + 9 + 10 subgraph attr: + 11 Undeterminate : 0 + 12 subgraph @2_construct.22(%para3_x, %para4_y) { + 13 %0(a) = Sub(%para3_x, Tensor(shape=[], dtype=Float32, value= 1)) {instance name: sub} primitive_attrs: {input_names: [x, y], output_names: [output]} + 14 : (, ) -> () + 15 # In file train.py(34)/ a = self.sub(x, 1)/ + 16 %1(b) = Add(%0, %para4_y) {instance name: add} primitive_attrs: {input_names: [x, y], output_names: [output]} + 17 : (, ) -> () + 18 # In file train.py(35)/ b = self.add(a, y)/ + 19 %2([CNode]5) = call @3_func.23(%0, %1) + 20 : (, ) -> () + 21 # In file train.py(36)/ c = self.mul(b, self.func(a, b))/ + 22 %3(c) = Mul(%1, %2) {instance name: mul} primitive_attrs: {input_names: [x, y], output_names: [output]} + 23 : (, ) -> () + 24 # In file train.py(36)/ c = self.mul(b, self.func(a, b))/ + 25 Return(%3) + 26 : () + 27 # In file train.py(37)/ return c/ + 28 } + 29 + 30 subgraph attr: + 31 Undeterminate : 0 + 32 subgraph @3_func.23(%para5_x, %para6_y) { + 33 %0([CNode]20) = Div(%para5_x, %para6_y) {instance name: div} primitive_attrs: {input_names: [x, y], output_names: [output]} + 34 : (, ) -> () + 35 # In file train.py(31)/ return self.div(x, y)/ + 36 Return(%0) + 37 : () + 38 # In file train.py(31)/ return self.div(x, y)/ + 39 } + 40 + 41 subgraph attr: + 42 subgraph @1_construct_wrapper.21() { + 43 %0([CNode]2) = call @2_construct.22(%para1_x, %para2_y) + 44 : (, ) -> () + 45 # In file train.py(37)/ return c/ + 46 Return(%0) + 47 : () + 48 # In file train.py(37)/ return c/ + 49 } +``` + +The above contents can be divided into two parts, the first part is the input list and the second part is the graph structure. +The first line tells us the name of the top MindSpore graph about the network, `1_construct_wrapper.21`, or the entry graph. 
+
+Line 3 tells us how many inputs are in the network.
+Lines 5 to 6 are the input list, which is in the format of `%para[No.]_[name] : <[data_type]x[shape]>`.
+Line 8 tells us the number of subgraphs parsed by the network. There are 3 graphs in this IR. Line 42 is the entry graph `1_construct_wrapper.21`. Line 32 is graph `3_func.23`, parsed from the `func(x, y)` in the source script. Line 12 is graph `2_construct.22`, parsed from the function `construct`.
+Taking graph `2_construct.22` as an example, lines 10 to 28 indicate the graph structure, which contains several nodes, namely, `CNode`. In this example, there are `Sub`, `Add`, and `Mul`, which are defined in the function `__init__`. Line 19 calls a graph by `call @3_func.23`. It indicates calling the graph `func(x, y)` to execute a division operation.
+
+The [`CNode`](https://www.mindspore.cn/docs/programming_guide/en/master/design/mindir.html#syntax) information format is as follows, including the node name, attribute, input node, the specs of the inputs and outputs, and the source code parsing call stack. The ANF graph is a directed acyclic graph, so the connection between nodes is displayed only based on the input relationship. The corresponding source code reflects the relationship between the `CNode` and the script source code. For example, line 15 is parsed from `a = self.sub(x, 1)`.
+
+```text
+  %[No.]([debug_name]) = [op_name]([arg], ...) primitive_attrs: {[key]: [value], ...}
+      : (<[input data_type]x[input shape]>, ...) -> (<[output data_type]x[output shape]>, ...)
+      # Corresponding source code
+```
+
+About the corresponding source code:
+
+- There are two modes for displaying the corresponding source code. The first mode is to display the complete call stack, such as `15_execute_0141.ir` on the frontend and `graph_build_0_136.ir` on the backend. The second mode only displays one line of code and eliminates the call stack in order to reduce the size of the IR file, such as `04_abstract_specialize_0012.ir`.
+- If the operator is a back propagation operator, the associated code line will not only display its own code, but also the corresponding forward code, identified by "Corresponding forward node candidate:".
+- If the operator is a fusion operator, the associated code line will display the fusion related code, identified by "Corresponding code candidate:", where the separator "-" is used to distinguish different codes.
+
+> - After several optimizations by the compiler, the node may undergo several changes (such as operator splitting and operator merging). The source code parsing call stack information of the node may not be in a one-to-one correspondence with the script. This is only an auxiliary method.
+> - After the `kernel select` phase at the backend, two lines of input and output specification information (that is, the content after `:`) will appear. The first line represents the specifications on the HOST side, and the second line represents the specifications on the DEVICE side.
+
+### dat Introduction
+
+Use a text editing software (for example, vi) to open the `04_abstract_specialize_0013.dat` file.
The file contents are as follows: + +```text + 1 # [No.1] 1_construct_wrapper.21 + 2 # In file train.py(33)/ def construct(self, x, y):/ + 3 funcgraph fg_21( + 4 %para1 : Tensor(F32)[] # x + 5 , %para2 : Tensor(F32)[] # y + 6 ) { + 7 %1 : Tensor(F32)[] = FuncGraph::fg_22(%para1, %para2) #(Tensor(F32)[], Tensor(F32)[]) # fg_22=2_construct.22 #scope: Default + 8 # In file train.py(37)/ return c/#[CNode]2 + 9 Primitive::Return{prim_type=1}(%1) #(Tensor(F32)[]) #scope: Default + 10 # In file train.py(37)/ return c/#[CNode]1 + 11 } + 12 # order: + 13 # 1: 1_construct_wrapper.21:[CNode]2{[0]: ValueNode 2_construct.22, [1]: x, [2]: y} + 14 # 2: 1_construct_wrapper.21:[CNode]1{[0]: ValueNode Return, [1]: [CNode]2} + 15 + 16 + 17 # [No.2] 2_construct.22 + 18 # In file train.py(33)/ def construct(self, x, y):/ + 19 funcgraph fg_22( + 20 %para3 : Tensor(F32)[] # x + 21 , %para4 : Tensor(F32)[] # y + 22 ) { + 23 %1 : Tensor(F32)[] = PrimitivePy::Sub{prim_type=2}[input_names=["x", "y"], output_names=["output"]](%para3, Tensor(43)[]) #(Tensor(F32)[], Tenso r(F32)[]) #scope: Default + 24 # In file train.py(34)/ a = self.sub(x, 1)/#a + 25 %2 : Tensor(F32)[] = PrimitivePy::Add{prim_type=2}[input_names=["x", "y"], output_names=["output"]](%1, %para4) #(Tensor(F32)[], Tensor(F32)[]) #scope: Default + 26 # In file train.py(35)/ b = self.add(a, y)/#b + 27 %3 : Tensor(F32)[] = FuncGraph::fg_23(%1, %2) #(Tensor(F32)[], Tensor(F32)[]) # fg_23=3_func.23 #scope: Default + 28 # In file train.py(36)/ c = self.mul(b, self.func(a, b))/#[CNode]5 + 29 %4 : Tensor(F32)[] = PrimitivePy::Mul{prim_type=2}[input_names=["x", "y"], output_names=["output"]](%2, %3) #(Tensor(F32)[], Tensor(F32)[]) #sco pe: Default + 30 # In file train.py(36)/ c = self.mul(b, self.func(a, b))/#c + 31 Primitive::Return{prim_type=1}(%4) #(Tensor(F32)[]) #scope: Default + 32 # In file train.py(37)/ return c/#[CNode]4 + 33 } + 34 # order: + 35 # 1: 2_construct.22:a{[0]: ValueNode Sub, [1]: x, [2]: ValueNode Tensor(shape=[], dtype=Float32, value= 1)} + 36 # 2: 2_construct.22:b{[0]: ValueNode Add, [1]: a, [2]: y} + 37 # 3: 2_construct.22:[CNode]5{[0]: ValueNode 3_func.23, [1]: a, [2]: b} + 38 # 4: 2_construct.22:c{[0]: ValueNode Mul, [1]: b, [2]: [CNode]5} + 39 # 5: 2_construct.22:[CNode]4{[0]: ValueNode Return, [1]: c} + 40 + 41 + 42 # [No.3] 3_func.23 + 43 # In file train.py(30)/ def func(x, y):/ + 44 funcgraph fg_23( + 45 %para5 : Tensor(F32)[] # x + 46 , %para6 : Tensor(F32)[] # y + 47 ) { + 48 %1 : Tensor(F32)[] = PrimitivePy::Div{prim_type=2}[input_names=["x", "y"], output_names=["output"]](%para5, %para6) #(Tensor(F32)[], Tensor(F32) []) #scope: Default + 49 # In file train.py(31)/ return self.div(x, y)/#[CNode]20 + 50 Primitive::Return{prim_type=1}(%1) #(Tensor(F32)[]) #scope: Default + 51 # In file train.py(31)/ return self.div(x, y)/#[CNode]19 + 52 } + 53 # order: + 54 # 1: 3_func.23:[CNode]20{[0]: ValueNode Div, [1]: x, [2]: y} + 55 # 2: 3_func.23:[CNode]19{[0]: ValueNode Return, [1]: [CNode]20} + 56 + 57 + 58 # num of total function graphs: 3 +``` + +Above, it lists all the graphs beginning with the entry graph. +Line 1 indicates graph `1_construct_wrapper.21` whose id is `No.1`. And line 7 calls graph `2_construct.22`. +line 17 to 39 shows the information of graph `2_construct.22`. +Taking graph `2_construct.22` as an example. Line 18 tells us which function this graph is parsed from. Line 20 to 21 indicates the input information which is in the format of `%para[No.] : [data_type][shape] # [name]`. 
+
+Lines 23 to 32 indicate the graph structure, which contains several nodes, namely, `CNode`. In this example, there are `Sub`, `Add`, and `Mul`, which are defined in the function `__init__`.
+Lines 34 to 39 show the execution order of the `CNode` from graph `2_construct.22`, corresponding to the order of code execution. The information format is: `No.: belonging graph:node name{[0]: the first input, [1]: the second input, ...}`. For a `CNode`, the first input indicates how to compute for this `CNode`.
+Line 58 indicates the total number of function graphs, which is 3 here.
+
+The [CNode](https://www.mindspore.cn/docs/programming_guide/en/master/design/mindir.html#syntax) information format is as follows, including the node name, attribute, input node, output information, format, and the corresponding source code.
+
+```text
+%[No.] : [outputs' Spec] = [op_name]{[prim_type]}[attr0, attr1, ...](arg0, arg1, ...) #(inputs' Spec)#[scope]
+ # Corresponding source code/#debug_name
+```
+
+### dot Introduction
+
+We can use this file as input to [graphviz](http://graphviz.org/) to generate images for users to view the model structure. For example, under the Linux operating system, we can convert a `.dot` file to a PNG image with the following command.
+
+```shell
+dot -Tpng -o 04_abstract_specialize_0014.png 04_abstract_specialize_0014.dot
+```
+
+The transformed image is shown below, and we can visually see the model structure. Different black boxes distinguish different subgraphs, and the blue arrows between graphs represent calling another graph. The blue area represents a parameter, the rectangle represents the parameter list of the graph, and the hexagon and the black arrow represent the parameter being used as an input of a CNode in the calculation process. The yellow rectangle represents a CNode. As can be seen from the picture, the CNode inputs start from index 0, and the 0th input (that is, the purple or green area) represents what calculation the operator will perform, which is connected by a dotted arrow. The type is usually an operator primitive, or it can also be another graph. The rest of the inputs are the parameters required for the calculation.
+
+![04_abstract_specialize_0014.png](./images/dot_to_png.png)
+
+For models with multiple operators, the picture will be very large. It is recommended to use the visualization component [MindInsight](https://www.mindspore.cn/mindinsight/docs/en/master/dashboard.html#computational-graph-visualization) to visualize computing graphs.
+
+## Reading analyze_fail.dat
+
+In the process of `MindSpore` compiling a graph, exceptions about graph evaluation failure sometimes occur. We can find
+the reason by analyzing the exception information and the `analyze_fail.dat` file.
+
+For example, we run the script below.
+ +```python + 1 import mindspore.context as context + 2 import mindspore.nn as nn + 3 from mindspore import Tensor + 4 from mindspore.nn import Cell + 5 from mindspore import ops + 6 from mindspore import dtype as mstype + 7 + 8 context.set_context(mode=context.GRAPH_MODE) + 9 context.set_context(save_graphs=True) + 10 + 11 class Net(nn.Cell): + 12 def __init__(self): + 13 super().__init__() + 14 self.add = ops.Add() + 15 self.sub = ops.Sub() + 16 self.mul = ops.Mul() + 17 self.div = ops.Div() + 18 + 19 def func(x, y): + 20 return self.div(x, y) + 21 + 22 def construct(self, x, y): + 23 a = self.sub(x, 1) + 24 b = self.add(a, y) + 25 c = self.mul(b, self.func(a, a, b)) + 26 return c + 27 + 28 input1 = Tensor(3, mstype.float32) + 29 input2 = Tensor(2, mstype.float32) + 30 net = Net() + 31 out = net(input1, input2) + 32 print(out) +``` + +An error happens. + +```text + 1 [EXCEPTION] ANALYZER(31946,7f6f03941740,python):2021-09-18-15:10:49.094.863 [mindspore/ccsrc/pipeline/jit/static_analysis/stack_frame.cc:85] DoJump] The parameters number of the function is 2, but the number of provided arguments is 3. + 2 FunctionGraph ID : func.18 + 3 NodeInfo: In file test.py(19) + 4 def func(x, y): + 5 + 6 Traceback (most recent call last): + 7 File "test.py", line 31, in + 8 out = net(input1, input2) + 9 File "/home/workspace/mindspore/mindspore/nn/cell.py", line 404, in __call__ + 10 out = self.compile_and_run(*inputs) + 11 File "/home/workspace/mindspore/mindspore/nn/cell.py", line 682, in compile_and_run + 12 self.compile(*inputs) + 13 File "/home/workspace/mindspore/mindspore/nn/cell.py", line 669, in compile + 14 _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode) + 15 File "/home/workspace/mindspore/mindspore/common/api.py", line 542, in compile + 16 result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name) + 17 TypeError: mindspore/ccsrc/pipeline/jit/static_analysis/stack_frame.cc:85 DoJump] The parameters number of the function is 2, but the number of provided arguments is 3. + 18 FunctionGraph ID : func.18 + 19 NodeInfo: In file test.py(19) + 20 def func(x, y): + 21 + 22 The function call stack (See file '/home/workspace/mindspore/rank_0/om/analyze_fail.dat' for more details): + 23 # 0 In file test.py(26) + 24 return c + 25 ^ + 26 # 1 In file test.py(25) + 27 c = self.mul(b, self.func(a, a, b)) + 28 ^ +``` + +Above exception is 'TypeError: mindspore/ccsrc/pipeline/jit/static_analysis/stack_frame.cc:85 DoJump] The parameters number of the function is 2, but the number of provided arguments is 3...'. +And it tells us `FunctionGraph ID : func.18` only needs two parameters, but actually gives 3. +We can find the related code is `self.func(a, a, b)` from 'The function call stack ... In file test.py(25)'. +Easily, by checking the code, we know that we gave too much parameter to the calling function. + +Sometimes the exception information is not enough easy to understand. Or we want to see the part of graph information that have evaluated. +Then we can open `/home/workspace/mindspore/rank_0/om/analyze_fail.dat` that indicated in the exception text by using a text editing software (for example, vi). 
+ +```text + 1 # [No.1] construct_wrapper.0 + 2 # In file test.py(22)/ def construct(self, x, y):/ + 3 funcgraph fg_0( + 4 %para1 : Tensor(F32)[] # x + 5 , %para2 : Tensor(F32)[] # y + 6 ) { + 7 + 8 #------------------------> 0 + 9 %1 = FuncGraph::fg_3(%para1, %para2) #(Tensor(F32)[], Tensor(F32)[]) # fg_3=construct.3 #scope: Default + 10 # In file test.py(26)/ return c/#[CNode]2 + 11 Primitive::Return{prim_type=1}(%1) #(Undefined) #scope: Default + 12 # In file test.py(26)/ return c/#[CNode]1 + 13 } + 14 # order: + 15 # 1: construct_wrapper.0:[CNode]2{[0]: ValueNode construct.3, [1]: x, [2]: y} + 16 # 2: construct_wrapper.0:[CNode]1{[0]: ValueNode Return, [1]: [CNode]2} + 17 + 18 + 19 # [No.2] construct.3 + 20 # In file test.py(22)/ def construct(self, x, y):/ + 21 funcgraph fg_3( + 22 %para3 : Tensor(F32)[] # x + 23 , %para4 : Tensor(F32)[] # y + 24 ) { + 25 %1 : Tensor(F32)[] = DoSignaturePrimitive::S-Prim-Sub{prim_type=1}[input_names=["x", "y"], output_names=["output"]](%para3, I64(1)) #(Tensor(F32)[], I64) #scope: Default + 26 # In file test.py(23)/ a = self.sub(x, 1)/#a + 27 %2 : Tensor(F32)[] = DoSignaturePrimitive::S-Prim-Add{prim_type=1}[input_names=["x", "y"], output_names=["output"]](%1, %para4) #(Tensor(F32)[], Tensor(F32)[]) #scope: Default + 28 # In file test.py(24)/ b = self.add(a, y)/#b + 29 + 30 #------------------------> 1 + 31 %3 = FuncGraph::fg_18(%1, %1, %2) #(Tensor(F32)[], Tensor(F32)[], Tensor(F32)[]) # fg_18=func.18 #scope: Default + 32 # In file test.py(25)/ c = self.mul(b, self.func(a, a, b))/#[CNode]5 + 33 %4 = DoSignaturePrimitive::S-Prim-Mul{prim_type=1}[input_names=["x", "y"], output_names=["output"]](%2, %3) #(Tensor(F32)[], Undefined) #scope: Default + 34 # In file test.py(25)/ c = self.mul(b, self.func(a, a, b))/#c + 35 Primitive::Return{prim_type=1}(%4) #(Undefined) #scope: Default + 36 # In file test.py(26)/ return c/#[CNode]4 + 37 } + 38 # order: + 39 # 1: construct.3:a{[0]: a, [1]: ValueNode 1, [2]: ValueNode Float32} + 40 # 2: construct.3:a{[0]: ValueNode S-Prim-Sub, [1]: x, [2]: ValueNode 1} + 41 # 3: construct.3:b{[0]: ValueNode S-Prim-Add, [1]: a, [2]: y} + 42 # 4: construct.3:[CNode]5{[0]: ValueNode func.18, [1]: a, [2]: a, [3]: b} + 43 # 5: construct.3:c{[0]: ValueNode S-Prim-Mul, [1]: b, [2]: [CNode]5} + 44 # 6: construct.3:[CNode]4{[0]: ValueNode Return, [1]: c} + 45 + 46 + 47 #=============================================================================== + 48 # num of function graphs in stack: 2 +``` + +The file `analyze_fail.dat` has the same information format with the file `.dat`. The only difference is `analyze_fail.dat` will locate the node which inferring failed. +Searching the point by the text of `------------------------>`, we reach the last position of the `------------------------> 1` at line 30. +The node at line 31 to 32 have an error. Its IR expression is `%3 = FuncGraph::fg_18(%1, %1, %2) ...`. We can know the node have 3 parameters from `(%1, %1, %2)`. But actually the function only need 2. So the compiler will fail when evaluating the node. To solve th problem, we should decrease the parameter number. diff --git a/tutorials/experts/source_en/index.rst b/tutorials/experts/source_en/index.rst index 606e593624f4b034ee2ab2ac5d3dfe160e65181e..e9aa6c6dfe365f2831a124dfe549a05daa8566a9 100644 --- a/tutorials/experts/source_en/index.rst +++ b/tutorials/experts/source_en/index.rst @@ -9,6 +9,63 @@ For Experts .. 
toctree:: :glob: :maxdepth: 1 - :caption: Class + :caption: Data Processing - test/test1 + data_engine/auto_augmentation + data_engine/eager + data_engine/cache + data_engine/optimize_data_processing + +.. toctree:: + :glob: + :maxdepth: 1 + :caption: Operator Execution + + operation/op_classification + operation/op_overload + operation/op_cpu + operation/op_gpu + operation/op_ascend + operation/op_custom + +.. toctree:: + :glob: + :maxdepth: 1 + :caption: Model Inference + + model_infer/inference + model_infer/online_inference + model_infer/offline_inference + +.. toctree:: + :glob: + :maxdepth: 1 + :caption: Debugging and Tuning + + debug/read_ir_files + debug/debug_in_pynative_mode + debug/dump_in_graph_mode + debug/custom_debugging_info + debug/incremental_compilation + debug/auto_tune + debug/dataset_autotune + debug/ms_class + +.. toctree:: + :glob: + :maxdepth: 1 + :caption: Distributed Parallel + + parallel/distributed_training + parallel/distributed_advanced + parallel/distributed_example + +.. toctree:: + :glob: + :maxdepth: 1 + :caption: Advanced Features + + others/mixed_precision + others/gradient_accumulation + + \ No newline at end of file diff --git a/tutorials/experts/source_en/model_infer/inference_ascend_310_air.md b/tutorials/experts/source_en/model_infer/inference_ascend_310_air.md new file mode 100644 index 0000000000000000000000000000000000000000..9343bedbee4f9856c1a19d71d0ccffa168e0cd3e --- /dev/null +++ b/tutorials/experts/source_en/model_infer/inference_ascend_310_air.md @@ -0,0 +1,237 @@ +# Inference on the Ascend 310 AI Processor + +`Ascend` `Inference Application` + + + +## Overview + +Ascend 310 is a highly efficient and integrated AI processor oriented to edge scenarios. The Atlas 200 Developer Kit (Atlas 200 DK) is a developer board that uses the Atlas 200 AI accelerator module. Integrated with the HiSilicon Ascend 310 AI processor, the Atlas 200 allows data analysis, inference, and computing for various data such as images and videos, and can be widely used in scenarios such as intelligent surveillance, robots, drones, and video servers. + +This tutorial describes how to use MindSpore to perform inference on the Atlas 200 DK based on the AIR model file. The process is as follows: + +1. Prepare the development environment, including creating an SD card for the Atlas 200 DK, configuring the Python environment, and updating the development software package. + +2. Export the AIR model file. The ResNet-50 model is used as an example. + +3. Use the ATC tool to convert the AIR model file into an OM model. + +4. Build the inference code to generate an executable `main` file. + +5. Load the saved OM model, perform inference, and view the result. + +> You can obtain the complete executable sample code at . + +## Preparing the Development Environment + +### Hardware Preparation + +- A server or PC with the Ubuntu OS is used to prepare a bootable SD card for the Atlas 200 DK and deploy the development environment. +- An SD card with a capacity of at least 16 GB. + +### Software Package Preparation + +The following five types of scripts and software packages are required for configuring the development environment: + +1. Entry script for SD card preparation: [make_sd_card.py](https://gitee.com/ascend/tools/blob/master/makesd/for_1.0.9.alpha/make_sd_card.py) + +2. Script for preparing a bootable SD card: [make_ubuntu_sd.sh](https://gitee.com/ascend/tools/blob/master/makesd/for_1.0.9.alpha/make_ubuntu_sd.sh) + +3. 
Ubuntu OS image package: [ubuntu-18.04.xx-server-arm64.iso](http://cdimage.ubuntu.com/ubuntu/releases/18.04/release/ubuntu-18.04.6-server-arm64.iso) + +4. Driver package and running package of Atlas 200 DK: + + - `Ascend310-driver-*{software version}*-ubuntu18.04.aarch64-minirc.tar.gz` + + - `Ascend310-aicpu_kernels-*{software version}*-minirc.tar.gz` + + - `Ascend-acllib-*{software version}*-ubuntu18.04.aarch64-minirc.run` + +5. Package for installing the development kit: `Ascend-Toolkit-*{version}*-arm64-linux_gcc7.3.0.run` + +In the preceding information: + +- For details about the first three items, see [Creating an SD Card with a Card Reader](https://support.huaweicloud.com/intl/en-us//usermanual-A200dk_3000/atlas200dk_02_0011.html). +- You are advised to obtain other software packages from [Firmware and Driver](https://ascend.huawei.com/en/#/hardware/firmware-drivers). On this page, select `Atlas 200 DK` from the product series and product model and select the required files to download. + +### Preparing the SD Card + +A card reader is connected to the Ubuntu server through a USB port, and the SD card is prepared using the script for SD card preparation. For details, see [Procedure](https://support.huaweicloud.com/intl/en-us/usermanual-A200dk_3000/atlas200dk_02_0011.html#section2). + +### Connecting the Atlas 200 DK to the Ubuntu Server + +The Atlas 200 DK can be connected to the Ubuntu server through a USB port or network cable. For details, see [Connecting the Atlas 200 DK to the Ubuntu Server](https://support.huaweicloud.com/intl/en-us/usermanual-A200dk_3000/atlas200dk_02_0013.html). + +### Configuring the Python Environment + +Install Python and GCC software. For details, see [Installing Dependencies](https://support.huaweicloud.com/intl/en-us/usermanual-A200dk_3000/atlas200dk_02_0016.html#section4). + +### Installing the Development Kit + +Install the development kit software package `Ascend-Toolkit-*{version}*-arm64-linux_gcc7.3.0.run`. For details, see [Installing the Development Kit](https://support.huaweicloud.com/intl/en-us/usermanual-A200dk_3000/atlas200dk_02_0017.html). + +## Inference Directory Structure + +Create a directory to store the inference code project, for example, `/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/acllib_linux.arm64/sample/acl_execute_model/acl_resnet50_sample`. The `inc`, `src`, and `test_data` [sample code](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/acl_resnet50_sample) can be obtained from the official website, and the `model` directory stores the exported `AIR` model file and the converted `OM` model file. The `out` directory stores the executable file generated after building and the output result directory. 
The directory structure of the inference code project is as follows: + +```text +└─acl_resnet50_sample + ├── inc + │ ├── model_process.h // Header file that declares functions related to resource initialization/destruction + │ ├── sample_process.h // Header file that declares functions related to model processing + │ ├── utils.h // Header file that declares common functions (such as the file reading function) + ├── model + │ ├── resnet50_export.air // AIR model file + │ ├── resnet50_export.om // Converted OM model file + ├── src + │ ├── acl.json // Configuration file for system initialization + │ ├── CMakeLists.txt // Build script + │ ├── main.cpp // /Main function, which is the implementation file of image classification + │ ├── model_process.cpp // Implementation file of model processing functions + │ ├── sample_process.cpp // Implementation file of functions related to resource initialization and destruction + │ ├── utils.cpp // Implementation file of common functions (such as the file reading function) + ├── test_data + │ ├── test_data_1x3x224x224_1.bin // Input sample data 1 + │ ├── test_data_1x3x224x224_2.bin // input sample data 2 + ├── out + │ ├── main // Executable file generated during building + │ ├── result // Directory for storing the output result +``` + +> The output result directory `acl_resnet50_sample/out/result` must be created before inference. + +## Exporting the AIR Model + +Train the target network on the Ascend 910 AI Processor, save it as a checkpoint file, and export the model file in AIR format through the network and checkpoint file. For details about the export process, see [Export AIR Model](https://www.mindspore.cn/docs/programming_guide/en/master/save_model.html#export-air-model). + +> The [resnet50_export.air](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com:443/sample_resources/acl_resnet50_sample/resnet50_export.air) is a sample AIR file exported using the ResNet-50 model. + +## Converting the AIR Model File into an OM Model + +Log in to the Atlas 200 DK environment, create the `model` directory for storing the AIR file `resnet50_export.air`, for example, `/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/acllib_linux.arm64/sample/acl_execute_model/acl_resnet50_sample/model`, go to the directory, and set the following environment variables where `install_path` specifies the actual installation path: + +```bash +export install_path=/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1 +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages/te:${install_path}/atc/python/site-packages/topi:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +``` + +Take `resnet50_export.air` as an example. Run the following command to convert the model and generate the `resnet50_export.om` file in the current directory. 
+ +```bash +/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/atc/bin/atc --framework=1 --model=./resnet50_export.air --output=./resnet50_export --input_format=NCHW --soc_version=Ascend310 +``` + +In the preceding information: + +- `--model`: path of the original model file +- `--output`: path of the converted OM model file +- `--input_format`: input image format + +For detailed information about ATC tools, please select the corresponding CANN version in the [Developer Documentation(Community Edition)](https://ascend.huawei.com/en/#/document?tag=developer), and then search for the chapter of "ATC Tool Instructions". + +## Building Inference Code + +Go to the project directory `acl_resnet50_sample` and set the following environment variables: + +```bash +export DDK_PATH=/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1 +export NPU_HOST_LIB=/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/acllib_linux.arm64/lib64/stub/ +``` + +> The `include` directory of the `acllib` package in the `CMakeLists.txt` file must be correctly specified. Otherwise, an error indicating that `acl/acl.h` cannot be found is reported. The code location of the `include` directory is as follows. If the location is inconsistent with the actual installation directory, modify it. + +```text +... +#Header path + + include_directories( + + ${INC_PATH}/acllib_linux.arm64/include/ + + ../ + + ) +... +``` + +Run the following command to create a build directory: + +```bash +mkdir -p build/intermediates/minirc +``` + +Run the following command to switch to the build directory: + +```bash +cd build/intermediates/minirc +``` + +Run the `cmake` command: + +```bash +cmake ../../../src -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ -DCMAKE_SKIP_RPATH=TRUE +``` + +Run the `make` command for building: + +```bash +make +``` + +After building, the executable `main` file is generated in `acl_resnet50_sample/out`. + +## Performing Inference and Viewing the Result + +Copy the generated OM model file `resnet50_export.om` to the `acl_resnet50_sample/out` directory (the same path as the executable `main` file) and ensure that the input data sample is ready in the `acl_resnet50_sample/test_data` directory. Then, you can perform inference. + +Note that the following environment variables must be set. Otherwise, the inference fails. + +```bash +export LD_LIBRARY_PATH=/home/HwHiAiUser/Ascend/acllib/lib64/ +``` + +Go to the `acl_resnet50_sample/out` directory. If the `result` directory does not exist in the current directory, run the `mkdir result` command to create one and run the following command to perform inference: + +```bash +./main ./resnet50_export.om ../test_data +``` + +After the execution is successful, the following inference result is displayed. The `top5` probability label is displayed, and the output result is saved in the `.bin` file format in the `acl_resnet50_sample/out/result` directory. 
+ +```text +[INFO] acl init success +[INFO] open device 0 success +[INFO] create context success +[INFO] create stream success +[INFO] get run mode success +[INFO] load model ./resnet50_export.om success +[INFO] create model description success +[INFO] create model output success +[INFO] start to process file:../test_data/test_data_1x3x224x224_1.bin +[INFO] model execute success +[INFO] top 1: index[2] value[0.941406] +[INFO] top 2: index[3] value[0.291992] +[INFO] top 3: index[1] value[0.067139] +[INFO] top 4: index[0] value[0.013519] +[INFO] top 5: index[4] value[-0.226685] +[INFO] output data success +[INFO] dump data success +[INFO] start to process file:../test_data/test_data_1x3x224x224_2.bin +[INFO] model execute success +[INFO] top 1: index[2] value[0.946289] +[INFO] top 2: index[3] value[0.296143] +[INFO] top 3: index[1] value[0.072083] +[INFO] top 4: index[0] value[0.014549] +[INFO] top 5: index[4] value[-0.225098] +[INFO] output data success +[INFO] dump data success +[INFO] unload model success, modelId is 1 +[INFO] execute sample success +[INFO] end to destroy stream +[INFO] end to destroy context +[INFO] end to reset device is 0 +[INFO] end to finalize acl +``` diff --git a/tutorials/experts/source_en/model_infer/inference_ascend_310_mindir.md b/tutorials/experts/source_en/model_infer/inference_ascend_310_mindir.md new file mode 100644 index 0000000000000000000000000000000000000000..6419e628d5b9e5b41d40ee4d1b63bca623ef6d02 --- /dev/null +++ b/tutorials/experts/source_en/model_infer/inference_ascend_310_mindir.md @@ -0,0 +1,408 @@ +# Inference Using the MindIR Model on Ascend 310 AI Processors + +`Ascend` `Inference Application` + + + +## Overview + +Ascend 310 is a highly efficient and integrated AI processor oriented to edge scenarios. This tutorial describes how to use MindSpore to perform inference on the Ascend 310 based on the MindIR model file. The process is as follows: + +1. Export the MindIR model file. The ResNet-50 model is used as an example. + +2. Build the inference code to generate an executable file. + +3. Load the saved MindIR model, perform inference, and view the result. + +> You can obtain the complete executable sample code at . + +## Preparing the Development Environment + +Refer to [Installation Guide](https://www.mindspore.cn/install/en) to install Ascend environment and MindSpore. + +## Exporting the MindIR Model + +Train the target network on the CPU/GPU/Ascend 910 AI Processor, save it as a checkpoint file, and export the model file in MindIR format through the network and checkpoint file. For details about the export process, see [Export MindIR Model](https://www.mindspore.cn/docs/programming_guide/en/master/save_model.html#export-mindir-model). + +> The [resnet50_imagenet.mindir](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/sample_resources/ascend310_resnet50_preprocess_sample/resnet50_imagenet.mindir) is a sample MindIR file exported using the ResNet-50 model, whose BatchSize is 1. We also provide a ResNet-50 MindIR with data preprocess [resnet50_imagenet_preprocess.mindir](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/sample_resources/ascend310_resnet50_preprocess_sample/resnet50_imagenet_preprocess.mindir). + +## Inference Directory Structure + +Create a directory to store the inference code project, for example, `/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/acllib_linux.arm64/sample/acl_execute_model/ascend310_resnet50_preprocess_sample`. 
The directory code can be obtained from the [official website](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/ascend310_resnet50_preprocess_sample). The `model` directory stores the exported `MindIR` model files and the `test_data` directory stores the images to be classified. The directory structure of the inference code project is as follows: + +```text +└─ascend310_resnet50_preprocess_sample + ├── CMakeLists.txt // Build script + ├── README.md // Usage description + ├── main.cc // Main function, infer with defining preprocess manually + ├── main_hide_preprocess.cc // Main function2, infer without defining preprocess + ├── model + │ ├── resnet50_imagenet.mindir // MindIR model file + │ └── resnet50_imagenet_preprocess.mindir // MindIR model file with data preprocess + └── test_data + ├── ILSVRC2012_val_00002138.JPEG // Input sample image 1 + ├── ILSVRC2012_val_00003014.JPEG // Input sample image 2 + ├── ... // Input sample image n +``` + +## Inference Code + +### Infer model with defining preprocess manually: main.cc + +#### Data-preprocessing by CPU operators + +Inference sample code: . + +Using namespace of `mindspore` and `mindspore::dataset`. + +```c++ +namespace ms = mindspore; +namespace ds = mindspore::dataset; +``` + +Set global context, device target is `Ascend 310` and device id is `0`: + +```c++ +auto context = std::make_shared(); +auto ascend310_info = std::make_shared(); +ascend310_info->SetDeviceID(0); +context->MutableDeviceInfo().push_back(ascend310_info); +``` + +Load MindIR file: + +```c++ +// Load MindIR model +ms::Graph graph; +ms::Status ret = ms::Serialization::Load(resnet_file, ms::ModelType::kMindIR, &graph); +// Build model with graph object +ms::Model resnet50; +ret = resnet50.Build(ms::GraphCell(graph), context); +``` + +Get information of this model: + +```c++ +std::vector model_inputs = resnet50.GetInputs(); +``` + +Load image file: + +```c++ +// Readfile is a function to read images +ms::MSTensor ReadFile(const std::string &file); +auto image = ReadFile(image_file); +``` + +Image preprocess(CPU operators): + +```c++ +// Create the CPU operator provided by MindData to get the function object + +// Decode the input to RGB format +std::shared_ptr decode(new ds::vision::Decode()); +// Resize the image to the given size +std::shared_ptr resize(new ds::vision::Resize({256})); +// Normalize the input +std::shared_ptr normalize(new ds::vision::Normalize( + {0.485 * 255, 0.456 * 255, 0.406 * 255}, {0.229 * 255, 0.224 * 255, 0.225 * 255})); +// Crop the input image at the center +std::shared_ptr center_crop(new ds::vision::CenterCrop({224, 224})); +// shape (H, W, C) to shape (C, H, W) +std::shared_ptr hwc2chw(new ds::vision::HWC2CHW()); + +// // Define a MindData preprocessor +ds::Execute preprocessor({decode, resize, normalize, center_crop, hwc2chw}); + +// Call the function object to get the processed image +ret = preprocessor(image, &image); +``` + +Execute the model: + +```c++ +// Create outputs vector +std::vector outputs; +// Create inputs vector +std::vector inputs; +inputs.emplace_back(model_inputs[0].Name(), model_inputs[0].DataType(), model_inputs[0].Shape(), + image.Data().get(), image.DataSize()); +// Call the Predict function of Model for inference +ret = resnet50.Predict(inputs, &outputs); +``` + +Print the result: + +```c++ +// Output the maximum probability to the screen +std::cout << "Image: " << image_file << " infer result: " << GetMax(outputs[0]) << std::endl; +``` + +#### Data pre-processing by Ascend 310 operators + +Dvpp module 
is a hardware decoder embedded in Ascend 310 AI chip which has a better performance on image processing compare with CPU operators. Several transforms applied on JPEG format image are supported. + +Using namespace of `mindspore` and `mindspore::dataset`. + +```c++ +namespace ms = mindspore; +namespace ds = mindspore::dataset; +``` + +Set global context, device target is `Ascend 310` and device id is `0`: + +```c++ +auto context = std::make_shared(); +auto ascend310_info = std::make_shared(); +ascend310_info->SetDeviceID(0); +context->MutableDeviceInfo().push_back(ascend310_info); +``` + +Load image file: + +```c++ +// Readfile is a function to read images +ms::MSTensor ReadFile(const std::string &file); +auto image = ReadFile(image_file); +``` + +Image preprocess(Ascend 310 operators): + +```c++ +// Create the CPU operator provided by MindData to get the function object + +// Decode the input to YUV420 format +std::shared_ptr decode(new ds::vision::Decode()); +// Resize the image to the given size +std::shared_ptr resize(new ds::vision::Resize({256})); +// Normalize the input +std::shared_ptr normalize(new ds::vision::Normalize( + {0.485 * 255, 0.456 * 255, 0.406 * 255}, {0.229 * 255, 0.224 * 255, 0.225 * 255})); +// Crop the input image at the center +std::shared_ptr center_crop(new ds::vision::CenterCrop({224, 224})); +``` + +Image preprocess (Ascend 310 operators, 130% performance increasing compare to CPU operators). + +Explicitly specify the computing hardware as Ascend 310. + +```c++ +// Define a MindData preprocessor, set deviceType = kAscend310, device id = 0 +ds::Execute preprocessor({decode, resize, center_crop, normalize}, MapTargetDevice::kAscend310, 0); + +// Call the function object to get the processed image +ret = preprocessor(image, &image); +``` + +Load MindIR file: Ascend 310 operators must bind with Aipp module, insert Aipp module for model graph compiling. + + ```c++ +// Load MindIR model +ms::Graph graph; +ms::Status ret = ms::Serialization::Load(resnet_file, ms::ModelType::kMindIR, &graph); +// Build model with graph object +ascend310_info->SetInsertOpConfigPath(preprocessor.AippCfgGenerator()); +ms::Model resnet50; +ret = resnet50.Build(ms::GraphCell(graph), context); + ``` + +Get input information of this model: + +```c++ +std::vector model_inputs = resnet50.GetInputs(); +``` + +Execute the model: + +```c++ +// Create outputs vector +std::vector outputs; +// Create inputs vector +std::vector inputs; +inputs.emplace_back(model_inputs[0].Name(), model_inputs[0].DataType(), model_inputs[0].Shape(), + image.Data().get(), image.DataSize()); +// Call the Predict function of Model for inference +ret = resnet50.Predict(inputs, &outputs); +``` + +Print the result: + +```c++ +// Output the maximum probability to the screen +std::cout << "Image: " << image_file << " infer result: " << GetMax(outputs[0]) << std::endl; +``` + +### Infer model without defining preprocess: main_hide_preprocess.cc + +> Note: Only supports CV models currently. + +Inference sample code: . + +Using namespace of `mindspore` and `mindspore::dataset`. 
+ +```c++ +namespace ms = mindspore; +namespace ds = mindspore::dataset; +``` + +Set global context, device target is `Ascend 310` and device id is `0`: + +```c++ +auto context = std::make_shared(); +auto ascend310_info = std::make_shared(); +ascend310_info->SetDeviceID(0); +context->MutableDeviceInfo().push_back(ascend310_info); +``` + +Load MindIR file: + +```c++ +// Load MindIR model +ms::Graph graph; +ms::Status ret = ms::Serialization::Load(resnet_file, ms::ModelType::kMindIR, &graph); +// Build model with graph object +ms::Model resnet50; +ret = resnet50.Build(ms::GraphCell(graph), context); +``` + +Get information of this model and check if model has preprocess: + +```c++ +std::vector model_inputs = resnet50.GetInputs(); +if (!resnet50.HasPreprocess()) { + std::cout << "data preprocess not exists in MindIR" << std::endl; + return 1; +} +``` + +Read image and start data preprocessing and prediction: + +```c++ +std::vector> inputs; +ms::MSTensor *t1 = ms::MSTensor::CreateTensorFromFile(image_file); +inputs = {{*t1}}; + +std::vector outputs; +ret = resnet50.PredictWithPreprocess(inputs, &outputs); +if (ret.IsError()) { + std::cout << "ERROR: PredictWithPreprocess failed." << std::endl; + return 1; +} +``` + +Print the result: + +```c++ +// Output the maximum probability to the screen +std::cout << "Image: " << image_file << " infer result: " << GetMax(outputs[0]) << std::endl; +// Destroy the tensor pointer +ms::MSTensor::DestroyTensorPtr(t1); +``` + +## Introduce to Building Script + +The building script is used to building applications: . + +Add head files to gcc search path: + +```cmake +option(MINDSPORE_PATH "mindspore install path" "") +include_directories(${MINDSPORE_PATH}) +include_directories(${MINDSPORE_PATH}/include) +``` + +Find the shared libraries in MindSpore: + +```cmake +find_library(MS_LIB libmindspore.so ${MINDSPORE_PATH}/lib) +file(GLOB_RECURSE MD_LIB ${MINDSPORE_PATH}/_c_dataengine*) +``` + +Use the source files to generate the target executable file, and link the MindSpore libraries for the executable file: + +```cmake +add_executable(resnet50_sample main.cc) +target_link_libraries(resnet50_sample ${MS_LIB} ${MD_LIB}) + +add_executable(resnet50_hide_preprocess main_hide_preprocess.cc) +target_link_libraries(resnet50_hide_preprocess ${MS_LIB} ${MD_LIB}) +``` + +## Building Inference Code + +Go to the project directory `ascend310_resnet50_preprocess_sample` and set the following environment variables: + +```bash +# control log level. 0-DEBUG, 1-INFO, 2-WARNING, 3-ERROR, 4-CRITICAL, default level is WARNING. 
+export GLOG_v=2 + +# Conda environmental options +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package + +# lib libraries that the run package depends on +export LD_LIBRARY_PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/fwkacllib/lib64:${LOCAL_ASCEND}/driver/lib64:${LOCAL_ASCEND}/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH} + +# lib libraries that the mindspore depends on, modify "pip3" according to the actual situation +export LD_LIBRARY_PATH=`pip3 show mindspore-ascend | grep Location | awk '{print $2"/mindspore/lib"}' | xargs realpath`:${LD_LIBRARY_PATH} +# if MindSpore is installed by binary, run "export LD_LIBRARY_PATH=path-to-your-custom-dir:${LD_LIBRARY_PATH}" + +# Environment variables that must be configured +export TBE_IMPL_PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe # TBE operator implementation tool path +export ASCEND_OPP_PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/opp # OPP path +export PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin/:${PATH} # TBE operator compilation tool path +export PYTHONPATH=${TBE_IMPL_PATH}:${PYTHONPATH} # Python library that TBE implementation depends on +``` + +Run the `cmake` command, modify `pip3` according to the actual situation: + +```bash +cmake . -DMINDSPORE_PATH=`pip3 show mindspore-ascend | grep Location | awk '{print $2"/mindspore"}' | xargs realpath` +# if MindSpore is installed by binary, run "cmake . -DMINDSPORE_PATH=path-to-your-custom-dir" +``` + +Run the `make` command for building. + +```bash +make +``` + +After building, the executable file is generated in `ascend310_resnet50_preprocess_sample`. + +## Performing Inference and Viewing the Result + +Log in to the Ascend 310 server, and create the `model` directory for storing the MindIR file `resnet50_imagenet.mindir`, for example, `/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/acllib_linux.arm64/sample/acl_execute_model/ascend310_resnet50_preprocess_sample/model`. +Create the `test_data` directory to store images, for example, `/home/HwHiAiUser/Ascend/ascend-toolkit/20.0.RC1/acllib_linux.arm64/sample/acl_execute_model/ascend310_resnet50_preprocess_sample/test_data`. +Then, perform the inference. + +If your MindIR file does not contain preprocess information, you can execute the following command: + +```bash +./resnet50_sample +``` + +Inference is performed on all images stored in the `test_data` directory. 
For example, if there are 9 images whose label is 0 in the [ImageNet2012](http://image-net.org/download-images) validation set, the inference result is as follows: + +```text +Image: ./test_data/ILSVRC2012_val_00002138.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00003014.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00006697.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00007197.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009111.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009191.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009346.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009379.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009396.JPEG infer result: 0 +``` + +If you export the preprocess information simultaneously when you export a MindIR file, you can execute the following command: + +```bash +./resnet50_hide_preprocess +``` + +The model will load the image file inside the `test_data` directory (for example: ILSVRC2012_val_00002138.JPEG, +configable in main_hide_preprocess.cc) and start prediction, then you get the inference result as follows: + +```text +Image: ./test_data/ILSVRC2012_val_00002138.JPEG infer result: 0 +``` diff --git a/tutorials/experts/source_en/model_infer/inference_ascend_910.md b/tutorials/experts/source_en/model_infer/inference_ascend_910.md new file mode 100644 index 0000000000000000000000000000000000000000..1b8bce4b66ab3cbcc3fbab4210751993e98f8c74 --- /dev/null +++ b/tutorials/experts/source_en/model_infer/inference_ascend_910.md @@ -0,0 +1,203 @@ +# Inference on the Ascend 910 AI processor + +`Ascend` `Inference Application` + + + +## Overview + +Users can create C++ applications and call MindSpore C++ interface to inference MindIR models. + +## Inference Directory Structure + +Create a directory to store the inference code project, for example, `/home/HwHiAiUser/mindspore_sample/ascend910_resnet50_preprocess_sample`. The directory code can be obtained from the [official website](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/ascend910_resnet50_preprocess_sample). The `model` directory stores the exported `MindIR` [model files](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/sample_resources/ascend310_resnet50_preprocess_sample/resnet50_imagenet.mindir) and the `test_data` directory stores the images to be classified. The directory structure of the inference code project is as follows: + +```text +└─ascend910_resnet50_preprocess_sample + ├── CMakeLists.txt // Build script + ├── README.md // Usage description + ├── main.cc // Main function + ├── model + │ └── resnet50_imagenet.mindir // MindIR model file + └── test_data + ├── ILSVRC2012_val_00002138.JPEG // Input sample image 1 + ├── ILSVRC2012_val_00003014.JPEG // Input sample image 2 + ├── ... // Input sample image n +``` + +## Inference Code + +Inference sample code: . + +Using namespace of `mindspore` and `mindspore::dataset`. 
+ +```c++ +namespace ms = mindspore; +namespace ds = mindspore::dataset; +``` + +Set global context, device target is `Ascend910` and evice id is `0`: + +```c++ +auto context = std::make_shared(); +auto ascend910_info = std::make_shared(); +ascend910_info->SetDeviceID(0); +context->MutableDeviceInfo().push_back(ascend910_info); +``` + +Load mindir file: + +```c++ +// Load MindIR model +ms::Graph graph; +ms::Status ret = ms::Serialization::Load(resnet_file, ms::ModelType::kMindIR, &graph); +// Build model with graph object +ms::Model resnet50; +ret = resnet50.Build(ms::GraphCell(graph), context); +``` + +Get informance of this model: + +```c++ +std::vector model_inputs = resnet50.GetInputs(); +``` + +Load image file: + +```c++ +// Readfile is a function to read images +ms::MSTensor ReadFile(const std::string &file); +auto image = ReadFile(image_file); +``` + +Image preprocess: + +```c++ +// Create the CPU operator provided by MindData to get the function object + +// Decode the input to RGB format +std::shared_ptr decode(new ds::vision::Decode()); +// Resize the image to the given size +std::shared_ptr resize(new ds::vision::Resize({256})); +// Normalize the input +std::shared_ptr normalize(new ds::vision::Normalize( + {0.485 * 255, 0.456 * 255, 0.406 * 255}, {0.229 * 255, 0.224 * 255, 0.225 * 255})); +// Crop the input image at the center +std::shared_ptr center_crop(new ds::vision::CenterCrop({224, 224})); +// shape (H, W, C) to shape (C, H, W) +std::shared_ptr hwc2chw(new ds::vision::HWC2CHW()); + +// // Define a MindData preprocessor +ds::Execute preprocessor({decode, resize, normalize, center_crop, hwc2chw}); + +// Call the function object to get the processed image +ret = preprocessor(image, &image); +``` + +Execute the model: + +```c++ +// Create outputs vector +std::vector outputs; +// Create inputs vector +std::vector inputs; +inputs.emplace_back(model_inputs[0].Name(), model_inputs[0].DataType(), model_inputs[0].Shape(), + image.Data().get(), image.DataSize()); +// Call the Predict function of Model for inference +ret = resnet50.Predict(inputs, &outputs); +``` + +Print the result: + +```c++ +// Output the maximum probability to the screen +std::cout << "Image: " << image_file << " infer result: " << GetMax(outputs[0]) << std::endl; +``` + +## Introduce to Building Script + +The building script is used to building applications: . + +Add head files to gcc search path: + +```cmake +option(MINDSPORE_PATH "mindspore install path" "") +include_directories(${MINDSPORE_PATH}) +include_directories(${MINDSPORE_PATH}/include) +``` + +Find the shared libraries in MindSpore: + +```cmake +find_library(MS_LIB libmindspore.so ${MINDSPORE_PATH}/lib) +file(GLOB_RECURSE MD_LIB ${MINDSPORE_PATH}/_c_dataengine*) +``` + +Use the source files to generate the target executable file, and link the MindSpore libraries for the executable file: + +```cmake +add_executable(resnet50_sample main.cc) +target_link_libraries(resnet50_sample ${MS_LIB} ${MD_LIB}) +``` + +## Building Inference Code + +Go to the project directory `ascend910_resnet50_preprocess_sample` and set the following environment variables: + +```bash +# control log level. 0-DEBUG, 1-INFO, 2-WARNING, 3-ERROR, 4-CRITICAL, default level is WARNING. 
+export GLOG_v=2 + +# Conda environmental options +LOCAL_ASCEND=/usr/local/Ascend # the root directory of run package + +# lib libraries that the run package depends on +export LD_LIBRARY_PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/fwkacllib/lib64:${LOCAL_ASCEND}/driver/lib64/common:${LOCAL_ASCEND}/driver/lib64/driver:${LOCAL_ASCEND}/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH} + +# lib libraries that the mindspore depends on, modify "pip3" according to the actual situation +export LD_LIBRARY_PATH=`pip3 show mindspore-ascend | grep Location | awk '{print $2"/mindspore/lib"}' | xargs realpath`:${LD_LIBRARY_PATH} + +# Environment variables that must be configured +export TBE_IMPL_PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe # TBE operator implementation tool path +export ASCEND_OPP_PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/opp # OPP path +export PATH=${LOCAL_ASCEND}/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin/:${PATH} # TBE operator compilation tool path +export PYTHONPATH=${TBE_IMPL_PATH}:${PYTHONPATH} # Python library that TBE implementation depends on +``` + +Run the `cmake` command, modify `pip3` according to the actual situation: + +```bash +cmake . -DMINDSPORE_PATH=`pip3 show mindspore-ascend | grep Location | awk '{print $2"/mindspore"}' | xargs realpath` +``` + +Run the `make` command for building. + +```bash +make +``` + +After building, the executable file is generated in `ascend910_resnet50_preprocess_sample`. + +## Performing Inference and Viewing the Result + +Log in to the Ascend 910 server, and create the `model` directory for storing the MindIR file `resnet50_imagenet.mindir`, for example, `/home/HwHiAiUser/mindspore_sample/ascend910_resnet50_preprocess_sample/model`. +Create the `test_data` directory to store images, for example, `/home/HwHiAiUser/mindspore_sample/ascend910_resnet50_preprocess_sample/test_data`. +Then, perform the inference. + +```bash +./resnet50_sample +``` + +Inference is performed on all images stored in the `test_data` directory. For example, if there are 9 images whose label is 0 in the [ImageNet2012](http://image-net.org/download-images) validation set, the inference result is as follows: + +```text +Image: ./test_data/ILSVRC2012_val_00002138.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00003014.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00006697.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00007197.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009111.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009191.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009346.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009379.JPEG infer result: 0 +Image: ./test_data/ILSVRC2012_val_00009396.JPEG infer result: 0 +``` diff --git a/tutorials/experts/source_en/model_infer/inference_cpu.md b/tutorials/experts/source_en/model_infer/inference_cpu.md new file mode 100644 index 0000000000000000000000000000000000000000..0d33b40d5bbd5d05ae507bdc86529aaf843ec02d --- /dev/null +++ b/tutorials/experts/source_en/model_infer/inference_cpu.md @@ -0,0 +1,13 @@ +# Inference on a CPU + +`CPU` `Inference Application` + + + +## Inference Using an ONNX File + +Similar to the inference on a GPU, the following steps are required: + +1. Generate a model in ONNX format on the training platform. For details, see [Export ONNX Model](https://www.mindspore.cn/docs/programming_guide/en/master/save_model.html#export-onnx-model). + +2. 
Perform inference on a CPU by referring to the runtime or SDK document. For details about how to use the ONNX Runtime, see the [ONNX Runtime document](https://github.com/microsoft/onnxruntime). diff --git a/tutorials/experts/source_en/model_infer/inference_gpu.md b/tutorials/experts/source_en/model_infer/inference_gpu.md new file mode 100644 index 0000000000000000000000000000000000000000..61cb66cb15739d2f163e16fdf5439c8363bfe490 --- /dev/null +++ b/tutorials/experts/source_en/model_infer/inference_gpu.md @@ -0,0 +1,186 @@ +# Inference on a GPU + +`GPU` `Inference Application` + + + +## Use C++ Interface to Load a MindIR File for Inferencing + +### Inference Directory Structure + +Create a directory to store the inference code project, for example, `/home/mindspore_sample/gpu_resnet50_inference_sample`. You can download the [sample code](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/gpu_resnet50_inference_sample) from the official website. The `model` directory is used to store the exported `MindIR` [model file](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/sample_resources/ascend310_resnet50_preprocess_sample/resnet50_imagenet.mindir). The directory structure of the inference code project is as follows: + +```text +└─gpu_resnet50_inference_sample + ├── build.sh // Build script + ├── CMakeLists.txt // CMake script + ├── README.md // Usage description + ├── src + │ └── main.cc // Main function + └── model + └── resnet50_imagenet.mindir // MindIR model file +``` + +### Inference Code + +Namespaces that reference `mindspore`. + +```c++ +using mindspore::Context; +using mindspore::Serialization; +using mindspore::Model; +using mindspore::Status; +using mindspore::ModelType; +using mindspore::GraphCell; +using mindspore::kSuccess; +using mindspore::MSTensor; +``` + +Initialize the environment, specify the hardware platform used for inference, and set DeviceID. + +Set the hardware to GPU, set DeviceID to 0 and Precision Mode to "fp16". The code example is as follows: + +```c++ +auto gpu_device_info = std::make_shared(); +gpu_device_info->SetDeviceID(device_id); +gpu_device_info->SetPrecisionMode("fp16"); +context->MutableDeviceInfo().push_back(gpu_device_info); +``` + +Load the model file. + +```c++ +// Load the MindIR model. +mindspore::Graph graph; +Serialization::Load(mindir_path, ModelType::kMindIR, &graph); +// Build a model using a graph. +ms::Model model; +model.Build(ms::GraphCell(graph), context); +``` + +Obtain the input information required by the model. + +```c++ +std::vector model_inputs = model->GetInputs(); +``` + +Start inference. + +```c++ +// Create an output vector. +std::vector outputs; +// Create an input vector. +std::vector inputs; +inputs.emplace_back(model_inputs[0].Name(), model_inputs[0].DataType(), model_inputs[0].Shape(), + image.Data().get(), image.DataSize()); +// Call the Predict function of the model for inference. +ret = model.Predict(inputs, &outputs); +``` + +### Introduce to Building Script + +Add the header file search path for the compiler: + +```cmake +option(MINDSPORE_PATH "mindspore install path" "") +include_directories(${MINDSPORE_PATH}) +include_directories(${MINDSPORE_PATH}/include) +``` + +Search for the required dynamic library in MindSpore. + +```cmake +find_library(MS_LIB libmindspore.so ${MINDSPORE_PATH}/lib) +``` + +Use the specified source file to generate the target executable file and link the target file to the MindSpore library. 
+ +```cmake +add_executable(main src/main.cc) +target_link_libraries(main ${MS_LIB}) +``` + +>For details, see +> + +### Building Inference Code + +Go to the project directory `gpu_resnet50_inference_sample` and modify the `pip3` in the `build.sh` based on the actual situation. And then execute the building script. + +```bash +bash build.sh +``` + +After building, the executable `main` file is generated in `gpu_resnet50_inference_sample/out`. + +### Performing Inference and Viewing the Result + +After completing the preceding operations, you can learn how to perform inference. + +Log in to the GPU environment, and create the `model` directory to store the `resnet50_imagenet.mindir` file, for example, `/home/mindspore_sample/gpu_resnet50_inference_sample/model`. + +Set the environment variable base on the actual situation, where the `TensorRT` is an optional configuration item. It is recommended to add `TensorRT` path to `LD_LIBRARY_PATH` to improve mode inference performance. + +```bash +export LD_PRELOAD=/home/miniconda3/lib/libpython37m.so +export LD_LIBRARY_PATH=/usr/local/TensorRT-7.2.2.3/lib/:$LD_LIBRARY_PATH +``` + +Then, perform inference for 1000 times after 10 times warmup. + +```bash +cd out/ +./main ../model/resnet50_imagenet.mindir 1000 10 +``` + +In this example, we print the inference delay for per step and average step. + +```text +Start to load model.. +Load model successuflly +Start to warmup.. +Warmup finished +Start to infer.. +step 0 cost 1.54004ms +step 1 cost 1.5271ms +... ... +step 998 cost 1.30688ms +step 999 cost 1.30493ms +infer finished. +=================Average inference time: 1.35195 ms +``` + +### Notices + +- During the training process, some networks set operator precision to FP16 artificially. For example, the [Bert mode](https://gitee.com/mindspore/models/blob/master/official/nlp/bert/src/bert_model.py) set the `Dense` and `LayerNorm` to FP16: + +```python +class BertOutput(nn.Cell): + def __init__(self, + in_channels, + out_channels, + initializer_range=0.02, + dropout_prob=0.1, + compute_type=mstype.float32): + super(BertOutput, self).__init__() + # Set the nn.Dense to fp16. + self.dense = nn.Dense(in_channels, out_channels, + weight_init=TruncatedNormal(initializer_range)).to_float(compute_type) + self.dropout = nn.Dropout(1 - dropout_prob) + self.dropout_prob = dropout_prob + self.add = P.Add() + # Set the nn.LayerNorm to fp16. + self.layernorm = nn.LayerNorm((out_channels,)).to_float(compute_type) + self.cast = P.Cast() + ... ... +``` + +It is recommended that export the MindIR model with fp32 precision mode before deploying inference. If you want to further improve the inference performance, you can set `precision_mode is` to "fp16". + +- Some inference scripts may introduce some unique network structures in the training process. For example, the model requires the image label, which are transmitted to the network output directly. It is suggested to delete this part of operators and then export MindIR model to improve inference performance. + +## Inference Using an ONNX File + +1. Generate a model in ONNX format on the training platform. For details, see [Export ONNX Model](https://www.mindspore.cn/docs/programming_guide/en/master/save_model.html#export-onnx-model). + +2. Perform inference on a GPU by referring to the runtime or SDK document. For example, use TensorRT to perform inference on the NVIDIA GPU. For details, see [TensorRT backend for ONNX](https://github.com/onnx/onnx-tensorrt). 
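
The ONNX model in step 1 can be generated with the `mindspore.export` API. The following is a minimal sketch under stated assumptions: the `resnet50` network constructor, the checkpoint path, and the input shape are placeholders for illustration and are not part of this sample.

```python
import numpy as np
from mindspore import Tensor, export, load_checkpoint, load_param_into_net

# Hypothetical network and checkpoint; replace with your own model definition and file.
net = resnet50(num_classes=1001)
load_param_into_net(net, load_checkpoint("./resnet50.ckpt"))

# A dummy input with the shape the network expects; file_format="ONNX" selects ONNX output.
dummy_input = Tensor(np.zeros([1, 3, 224, 224], np.float32))
export(net, dummy_input, file_name="resnet50", file_format="ONNX")
```

The generated `resnet50.onnx` file can then be passed to an ONNX-capable runtime such as TensorRT, as described in step 2.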
diff --git a/tutorials/experts/source_en/model_infer/offline_inference.rst new file mode 100644 index 0000000000000000000000000000000000000000..973a38f9c0f1b7c714957d5b3619ae861bba3fad --- /dev/null +++ b/tutorials/experts/source_en/model_infer/offline_inference.rst @@ -0,0 +1,11 @@
Using Offline Model for Inference
=================================

.. toctree::
   :maxdepth: 1

   inference_ascend_910
   inference_ascend_310
   inference_gpu
   inference_cpu
   post_training_quantization
\ No newline at end of file
diff --git a/tutorials/experts/source_en/model_infer/online_inference.md new file mode 100644 index 0000000000000000000000000000000000000000..3f28852c7d4a943bc40cfbbc67ee6bcbf69814fd --- /dev/null +++ b/tutorials/experts/source_en/model_infer/online_inference.md @@ -0,0 +1,59 @@
# Online Inference with Checkpoint

`Ascend` `Inference Application`



## Use the `model.eval` interface for model validation

### Local Storage

When the pre-trained model is saved locally, the steps for performing inference on the validation dataset are as follows: first create a model, then load the network and parameters using `load_checkpoint` and `load_param_into_net` from the `mindspore` module, and finally run inference on the created validation dataset. The validation dataset is processed in the same way as the training dataset.

```python
network = LeNet5(cfg.num_classes)
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
model = Model(network, net_loss, metrics={"Accuracy": Accuracy()})

print("============== Starting Testing ==============")
param_dict = load_checkpoint(args.ckpt_path)
load_param_into_net(network, param_dict)
dataset = create_dataset(os.path.join(args.data_path, "test"),
                         cfg.batch_size,)
acc = model.eval(dataset, dataset_sink_mode=args.dataset_sink_mode)
print("============== {} ==============".format(acc))
```

In the preceding information:
`model.eval` is an API for model validation. For details about the API, see .
> Inference sample code: .

### Remote Storage

When the pre-trained model is stored remotely, the steps for performing inference on the validation dataset are as follows: first determine which model to use, then load the model and parameters using `mindspore_hub.load`, and finally run inference on the created validation dataset. The validation dataset is processed in the same way as the training dataset.

```python
model_uid = "mindspore/ascend/0.7/googlenet_v1_cifar10" # using GoogleNet as an example.
network = mindspore_hub.load(model_uid, num_classes=10)
net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
model = Model(network, net_loss, metrics={"Accuracy": Accuracy()})

print("============== Starting Testing ==============")
dataset = create_dataset(os.path.join(args.data_path, "test"),
                         cfg.batch_size,)
acc = model.eval(dataset, dataset_sink_mode=args.dataset_sink_mode)
print("============== {} ==============".format(acc))
```

In the preceding information:

`mindspore_hub.load` is an API for loading model parameters. For details, see .

## Use the `model.predict` API to perform inference

 ```python
 model.predict(input_data)
 ```

 In the preceding information:
 `model.predict` is an API for inference. For details about the API, see . 
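
For a runnable illustration of `model.predict`, the following minimal sketch reuses the `LeNet5` network and checkpoint variables from the `model.eval` examples above; the 1x1x32x32 input shape and the random input data are assumptions for demonstration only.

```python
import numpy as np
from mindspore import Tensor, Model, load_checkpoint, load_param_into_net

network = LeNet5(cfg.num_classes)                      # same network as in the examples above
load_param_into_net(network, load_checkpoint(args.ckpt_path))
model = Model(network)

# A single 32x32 single-channel image (assumed LeNet5 input shape for this sketch).
input_data = Tensor(np.random.rand(1, 1, 32, 32).astype(np.float32))
output = model.predict(input_data)
print(output.shape)
```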
diff --git a/tutorials/experts/source_en/model_infer/post_training_quantization.md new file mode 100644 index 0000000000000000000000000000000000000000..257cd04c944fdbb46a6fb110329f583049d2728b --- /dev/null +++ b/tutorials/experts/source_en/model_infer/post_training_quantization.md @@ -0,0 +1,30 @@
# Applying Post Training Quantization

Translator: [unseeme](https://gitee.com/unseenme)

`Device` `Ascend` `Inference Application`



## Concept

Post training quantization refers to performing weights quantization or full quantization on a pre-trained model. It can reduce the model size and also speed up inference.
This process does not require training; only a small amount of calibration data is needed to quantize the activations.

### Weights Quantization

Only the weights of the model are quantized, which reduces the model size; float32 operations are still performed during inference. The lower the number of quantization bits, the greater the model compression rate, but the accuracy loss usually also becomes larger.

### Full Quantization

Both the weights and the activations of the model are quantized, and integer operations are performed during inference. This reduces the model size, speeds up model inference, and lowers power consumption.
For scenarios that require faster inference and lower power consumption, you can use post training full quantization. To calculate the quantization parameters of the activations, the user needs to provide a calibration dataset.

## Post Training Quantization Tools

Choose the corresponding post training quantization tool according to the hardware platform on which the model will be deployed for inference.

| Post Training Quantization Tools | Quantization Method Supported | Inference Hardware Platform Supported | Quantization Model Deployment |
| --- | --- | --- | --- |
| [MindSpore Post Training Quantization Tools](https://www.mindspore.cn/lite/docs/en/master/use/post_training_quantization.html) | Weights Quantization
Full Quantization | CPU | [Inference on edge device](https://www.mindspore.cn/lite/docs/en/master/use/runtime.html) | +| Ascend Model Compression Tool | Full Quantization | Ascend 310 AI Processor | [Inference on Ascend 310 AI Processor](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference_ascend_310.html) | diff --git a/tutorials/experts/source_en/operation/op_ascend.md b/tutorials/experts/source_en/operation/op_ascend.md new file mode 100644 index 0000000000000000000000000000000000000000..a679c5dfabc764e32ceb126d4cd35b295c96109f --- /dev/null +++ b/tutorials/experts/source_en/operation/op_ascend.md @@ -0,0 +1,250 @@ +# Custom Operators (Ascend) + +`Ascend` `Model Development` + + + +## Overview + +When built-in operators cannot meet requirements during network development, you can call the Python API of MindSpore to quickly extend custom operators of the Ascend AI processor. + +To add a custom operator, you need to register the operator primitive, implement the operator, and register the operator information. + +The related concepts are as follows: + +- Operator primitive: defines the frontend API prototype of an operator on the network. It is the basic unit for forming a network model and includes the operator name, attribute (optional), input and output names, output shape inference method, and output dtype inference method. +- Operator implementation: describes the implementation of the internal computation logic for an operator through the DSL API provided by the Tensor Boost Engine (TBE). The TBE supports the development of custom operators based on the Ascend AI chip. +- Operator information: describes basic information about a TBE operator, such as the operator name and supported input and output types. It is the basis for the backend to select and map operators. + +This section takes a Square operator as an example to describe how to customize an operator. + +> For details, see cases in [tests/st/ops/custom_ops_tbe](https://gitee.com/mindspore/mindspore/tree/master/tests/st/ops/custom_ops_tbe) in the MindSpore source code. + +## Registering the Operator Primitive + +The primitive of an operator is a subclass inherited from `PrimitiveWithInfer`. The type name of the subclass is the operator name. + +The definition of the custom operator primitive is the same as that of the built-in operator primitive. + +- The attribute is defined by the input parameter of the constructor function `__init__`. The operator in this test case has no attribute. Therefore, `__init__` has only one input parameter. For details about test cases in which operators have attributes, see [custom add3](https://gitee.com/mindspore/mindspore/blob/master/tests/st/ops/custom_ops_tbe/cus_add3.py) in the MindSpore source code. +- The input and output names are defined by the `init_prim_io_names` function. +- The shape inference method of the output tensor is defined in the `infer_shape` function, and the dtype inference method of the output tensor is defined in the `infer_dtype` function. + +The only difference between a custom operator and a built-in operator is that the operator implementation function (`from square_impl import CusSquareImpl`) needs to be imported to the `__init__` function to register the operator implementation with the backend for the custom operator. In this test case, the operator implementation and information are defined in `square_impl.py`, and the definition will be described in the following parts. 
+ +The following code takes the Square operator primitive `cus_square.py` as an example: + +```python +from mindspore.ops import prim_attr_register, PrimitiveWithInfer +import mindspore.ops as ops +# y = x^2 +class CusSquare(PrimitiveWithInfer): + """ + The definition of the CusSquare primitive. + """ + @prim_attr_register + def __init__(self): + self.init_prim_io_names(inputs=['x'], outputs=['y']) + from square_impl import CusSquareImpl # Import the entry function of the kernel implementation from relative path or PYTHONPATH. + + def infer_shape(self, data_shape): + return data_shape + + def infer_dtype(self, data_dtype): + return data_dtype +``` + +## Implementing a TBE Operator and Registering the Operator Information + +### Implementing a TBE Operator + +To compile an operator implementation, you need to compile a computable function and an entry function first. + +The computable function of an operator is mainly used to encapsulate the computation logic of the operator for the main function to call. The computation logic is implemented by calling the combined API of the TBE. + +The entry function of an operator describes the internal process of compiling the operator. The process is as follows: + +1. Prepare placeholders to be input. A placeholder will return a tensor object that represents a group of input data. +2. Call the computable function. The computable function uses the API provided by the TBE to describe the computation logic of the operator. +3. Call the scheduling module. The model tiles the operator data based on the scheduling description and specifies the data transfer process to ensure optimal hardware execution. By default, the automatic scheduling module (`auto_schedule`) can be used. +4. Call `cce_build_code` to compile and generate an operator binary file. + +> The input parameters of the entry function require the input information of each operator, output information of each operator, operator attributes (optional), and `kernel_name` (name of the generated operator binary file). The input and output information is encapsulated in dictionaries, including the input and output shape and dtype when the operator is called on the network. + +For details about TBE operator development, visit the [TBE website](https://support.huaweicloud.com/odevg-A800_3000_3010/atlaste_10_0063.html). For details about how to debug and optimize the TBE operator, visit the [Mind Studio website](https://support.huaweicloud.com/usermanual-mindstudioc73/atlasmindstudio_02_0043.html). + +### Registering the Operator Information + +The operator information is key for the backend to select the operator implementation and guides the backend to insert appropriate type and format conversion operators. It uses the `TBERegOp` API for definition and uses the `op_info_register` decorator to bind the operator information to the entry function of the operator implementation. When the .py operator implementation file is imported, the `op_info_register` decorator registers the operator information to the operator information library at the backend. For details about how to use the operator information, see comments for the member method of `TBERegOp`. + +> The numbers and sequences of the input and output information defined in the operator information must be the same as those in the parameters of the entry function of the operator implementation and those listed in the operator primitive. +> +> If an operator has attributes, use `attr` to describe the attribute information in the operator information. 
The attribute names must be the same as those in the operator primitive definition. + +### Example + +The following takes the TBE implementation `square_impl.py` of the `Square` operator as an example. `square_compute` is a computable function of the operator implementation. It describes the computation logic of `x * x` by calling the API provided by `te.lang.cce`. `cus_square_op_info` is the operator information, which is defined by `TBERegOp`. For the specific field meaning of the operator information, visit the [TBE website](https://support.huaweicloud.com/odevg-A800_3000_3010/atlaste_10_0096.html). + +Note the following parameters when setting `TBERegOp`: + +- `OPAQUE` in `fusion_type("OPAQUE")` indicates that the custom operator uses the non-fusion strategy. +- `CusSquareImpl` in `kernel_name("CusSquareImpl")` must be the same as the name of the operator entry function. +- `dtype_format` is used to describe data types supported by the operator. In the following example, two types are registered, indicating that the operator supports two data types. Each type describes the supported format in order of input and output. The first `dtype_format` indicates that the data type input0 is in F32_Default format and the data type output0 is in F32_Default format. The second `dtype_format` indicates that the data type input0 is in F16_Default format and the data type output0 is in F16_Default format. +- About the interfaces `auto_schedule` and `cce_build_code`, please see the TBE documents [auto_schedule](https://support.huaweicloud.com/odevg-A800_3000_3010/atlaste_07_0071.html) and [cce_build_code](https://support.huaweicloud.com/odevg-A800_3000_3010/atlaste_07_0072.html) for details. + +```python +from __future__ import absolute_import +from te import tvm +from topi import generic +import te.lang.cce +from topi.cce import util +from mindspore.ops import op_info_register, TBERegOp, DataType + +def square_compute(input_x): + """ + The compute function of the CusSquare implementation. + """ + res = te.lang.cce.vmul(input_x, input_x) + return res + +# Define the kernel info of CusSquare. +cus_square_op_info = TBERegOp("CusSquare") \ + .fusion_type("OPAQUE") \ + .partial_flag(True) \ + .async_flag(False) \ + .binfile_name("square.so") \ + .compute_cost(10) \ + .kernel_name("CusSquareImpl") \ + .input(0, "x", False, "required", "all") \ + .output(0, "y", False, "required", "all") \ + .dtype_format(DataType.F32_Default, DataType.F32_Default) \ + .dtype_format(DataType.F16_Default, DataType.F16_Default) \ + .get_op_info() + +# Binding kernel info with the kernel implementation. +@op_info_register(cus_square_op_info) +def CusSquareImpl(input_x, output_y, kernel_name="CusSquareImpl"): + """ + The entry function of the CusSquare implementation. + """ + shape = input_x.get("shape") + dtype = input_x.get("dtype").lower() + + shape = util.shape_refine(shape) + data = tvm.placeholder(shape, name="data", dtype=dtype.lower()) + + with tvm.target.cce(): + res = square_compute(data) + sch = generic.auto_schedule(res) + + config = {"print_ir": False, + "name": kernel_name, + "tensor_list": [data, res]} + + te.lang.cce.cce_build_code(sch, config) +``` + +## Using Custom Operators + +The usage of custom operators is the same as that of built-in operators in the network. The operators can be directly used by importing primitives. The following takes the single-operator network test of `CusSquare` as an example. + +Define the network in the `test_square.py` file. 
+ +```python +import numpy as np +import mindspore.nn as nn +import mindspore.context as context +from mindspore import Tensor +# Import the definition of the CusSquare primitive. +from cus_square import CusSquare +context.set_context(mode=context.GRAPH_MODE, device_target="Ascend") + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.square = CusSquare() + + def construct(self, data): + return self.square(data) + +def test_net(): + x = np.array([1.0, 4.0, 9.0]).astype(np.float32) + square = Net() + output = square(Tensor(x)) + print("x: ", x) + print("output: ", output) +``` + +Execute the test case. + +```bash +pytest -s tests/st/ops/custom_ops_tbe/test_square.py::test_net +``` + +The execution result is as follows: + +```text +x: [1. 4. 9.] +output: [1. 16. 81.] +``` + +## Defining the bprop Function for an Operator + +If an operator needs to support automatic differentiation, the bprop function needs to be defined in the primitive of the operator. In the bprop function, you need to describe the backward computation logic that uses the forward input, forward output, and output gradients to obtain the input gradients. The backward computation logic can be composed of built-in operators or custom backward operators. + +Note the following points when defining the bprop function: + +- The input parameter sequence of the bprop function is the forward input, forward output, and output gradients. For a multi-output operator, the forward output and output gradients are provided in the form of tuples. +- The return value of the bprop function is tuples consisting of input gradients. The sequence of elements in a tuple is the same as that of the forward input parameters. Even if there is only one input gradient, the return value must be a tuple. + +For example, the `CusSquare` primitive after the bprop function is added is as follows: + +```python +class CusSquare(PrimitiveWithInfer): + @prim_attr_register + def __init__(self): + """init CusSquare""" + self.init_prim_io_names(inputs=['x'], outputs=['y']) + from square_impl import CusSquareImpl + + def infer_shape(self, data_shape): + return data_shape + + def infer_dtype(self, data_dtype): + return data_dtype + + def get_bprop(self): + def bprop(data, out, dout): + twos_like = ops.OnesLike()(data) * 2.0 + gradient = ops.Mul()(data, twos_like) + dx = ops.Mul()(gradient, dout) + return (dx,) + return bprop +``` + +Define backward cases in the `test_square.py` file. + +```python +import mindspore.ops as ops +def test_grad_net(): + x = np.array([1.0, 4.0, 9.0]).astype(np.float32) + sens = np.array([1.0, 1.0, 1.0]).astype(np.float32) + square = Net() + grad = ops.GradOperation(sens_param=True) + dx = grad(square)(Tensor(x), Tensor(sens)) + print("x: ", x) + print("dx: ", dx) +``` + +Execute the test case. + +```bash +pytest -s tests/st/ops/custom_ops_tbe/test_square.py::test_grad_net +``` + +The execution result is as follows: + +```text +x: [1. 4. 9.] +dx: [2. 8. 18.] 
+``` diff --git a/tutorials/experts/source_en/operation/op_classification.md b/tutorials/experts/source_en/operation/op_classification.md new file mode 100644 index 0000000000000000000000000000000000000000..7c37870add92b5bfecbc30948da78e0f3cfc57ee --- /dev/null +++ b/tutorials/experts/source_en/operation/op_classification.md @@ -0,0 +1,624 @@ +# Operators Classification + +`Ascend` `GPU` `CPU` `Beginner` + + + +## Overview + +Operators can be classified into some functional modules: tensor operations, network operations, array operations, image operations, encoding operations, debugging operations, and quantization operations. And they also involve some operator combinations related to graph transformation. For details about the supported operators on the Ascend AI processors, GPU, and CPU, see [Operator List](https://www.mindspore.cn/docs/note/en/master/operator_list.html). + +## Tensor Operations + +The tensor operations include the tensor structure operation and the tensor mathematical operation. + +Tensor structure operations include tensor creation, index sharding, dimension transformation, and integration and splitting. + +Tensor mathematical operations include scalar operations, vector operations, and matrix operations. + +The following describes how to use the tensor mathematical operation and operation broadcast mechanism. + +### Mathematical Operators + +Tensor mathematical operators can be classified into scalar operator, vector operator, and matrix operator. + +Scalar operators include addition, subtraction, multiplication, division, exponentiation, common functions such as trigonometric function, exponential function, and logarithmic function, and logical comparison operators. + +#### Scalar Operations + +Scalar operators are characterized by performing element-by-element operations on tensors. + +Some scalar operators overload commonly used mathematical operators. In addition, the broadcast feature similar to NumPy is supported. + + The following code implements the exponentiation, where the base is input_x and the exponent is input_y: + +```python +import numpy as np +import mindspore +from mindspore import Tensor + +input_x = mindspore.Tensor(np.array([1.0, 2.0, 4.0]), mindspore.float32) +input_y = 3.0 +print(input_x**input_y) +``` + + The following information is displayed: + +```text +[ 1. 8. 64.] +``` + +##### Addition + +The following code implements the addition of `input_x` and `input_y`: + +```python +print(input_x + input_y) +``` + + The following information is displayed: + +```text +[4. 5. 7.] +``` + +##### Element-wise Multiplication + +The following code implements the element-wise multiplication: + +```python +import numpy as np +import mindspore +from mindspore import Tensor +import mindspore.ops as ops + +input_x = Tensor(np.array([1.0, 2.0, 3.0]), mindspore.float32) +input_y = Tensor(np.array([4.0, 5.0, 6.0]), mindspore.float32) +mul = ops.Mul() +res = mul(input_x, input_y) + +print(res) +``` + + The following information is displayed: + +```text +[4. 10. 18.] 
+``` + +##### Trigonometric Function + +The following code implements Acos: + +```python +import numpy as np +import mindspore +from mindspore import Tensor +import mindspore.ops as ops + +acos = ops.ACos() +input_x = Tensor(np.array([0.74, 0.04, 0.30, 0.56]), mindspore.float32) +output = acos(input_x) +print(output) +``` + + The following information is displayed: + +```text +[0.7377037 1.5307858 1.2661037 0.97641146] +``` + +#### Vector Operations + +Vector operators perform operations on only one particular axis, mapping a vector to a scalar or another vector. + +##### Squeeze + +The following code implements the compression of a channel whose dimension of the third channel is 1: + +```python +import numpy as np +import mindspore +from mindspore import Tensor +import mindspore.ops as ops + +input_tensor = Tensor(np.ones(shape=[3, 2, 1]), mindspore.float32) +squeeze = ops.Squeeze(2) +output = squeeze(input_tensor) + +print(output) +``` + + The following information is displayed: + +```text +[[1. 1.] + [1. 1.] + [1. 1.]] +``` + +#### Matrix Operations + +Matrix operations include matrix multiplication, matrix norm, matrix determinant, matrix eigenvalue calculation, and matrix decomposition. + +##### Matrix Multiplication + + The following code implements the matrix multiplication of input_x and input_y: + +```python +import numpy as np +import mindspore +from mindspore import Tensor +import mindspore.ops as ops + +input_x = Tensor(np.ones(shape=[1, 3]), mindspore.float32) +input_y = Tensor(np.ones(shape=[3, 4]), mindspore.float32) +matmul = ops.MatMul() +output = matmul(input_x, input_y) + +print(output) +``` + +The following information is displayed: + +```text +[[3. 3. 3. 3.]] +``` + +### Broadcast Mechanism + +Broadcast indicates that when the number of channels of each input variable is inconsistent, change the number of channels to obtain the result. + +- The following code implements the broadcast mechanism: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np + +shape = (2, 3) +input_x = Tensor(np.array([1, 2, 3]).astype(np.float32)) +broadcast_to = ops.BroadcastTo(shape) +output = broadcast_to(input_x) + +print(output) +``` + +The following information is displayed: + +```text +[[1. 2. 3.] + [1. 2. 3.]] +``` + +## Network Operations + +Network operations include feature extraction, activation function, loss function, and optimization algorithm. + +### Feature Extraction + +Feature extraction is a common operation in machine learning. The core of feature extraction is to extract more representative tensors than the original input. + +Convolution Operation + +The following code implements the 2D convolution operation which is one of the common convolution operations: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +input = Tensor(np.ones([10, 32, 32, 32]), mindspore.float32) +weight = Tensor(np.ones([32, 32, 3, 3]), mindspore.float32) +conv2d = ops.Conv2D(out_channel=32, kernel_size=3) +res = conv2d(input, weight) + +print(res) +``` + +The following information is displayed: + +```text +[[[[288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + ... + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.]]] + + ... + + [[288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + ... + [288. 288. 288. ... 288. 288. 
288.] + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.]] + + + ... + + + [[288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + ... + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.] + [288. 288. 288. ... 288. 288. 288.]]]] +``` + +Convolutional Backward Propagation Operator Operation + +The following code implements the propagation operation of backward gradient operators. The outputs are stored in dout and weight: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +dout = Tensor(np.ones([10, 32, 30, 30]), mindspore.float32) +weight = Tensor(np.ones([32, 32, 3, 3]), mindspore.float32) +x = Tensor(np.ones([10, 32, 32, 32])) +conv2d_backprop_input = ops.Conv2DBackpropInput(out_channel=32, kernel_size=3) +res = conv2d_backprop_input(dout, weight, ops.shape(x)) + +print(res) +``` + +The following information is displayed: + +```text +[[[[ 32. 64. 96. ... 96. 64. 32.] + [ 64. 128. 192. ... 192. 128. 64.] + [ 96. 192. 288. ... 288. 192. 96.] + ... + [ 96. 192. 288. ... 288. 192. 96.] + [ 64. 128. 192. ... 192. 128. 64.] + [ 32. 64. 96. ... 96. 64. 32.]] + + ... + + [[ 32. 64. 96. ... 96. 64. 32.] + [ 64. 128. 192. ... 192. 128. 64.] + [ 96. 192. 288. ... 288. 192. 96.] + ... + [ 96. 192. 288. ... 288. 192. 96.] + [ 64. 128. 192. ... 192. 128. 64.] + [ 32. 64. 96. ... 96. 64. 32.]]]] +``` + +### Activation Function + +The following code implements the computation of the Softmax activation function: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +input_x = Tensor(np.array([1, 2, 3, 4, 5]), mindspore.float32) +softmax = ops.Softmax() +res = softmax(input_x) + +print(res) +``` + +The following information is displayed: + +```text +[0.01165623 0.03168492 0.08612853 0.23412164 0.63640857] +``` + +### Loss Function + + L1Loss + + The following code implements the L1 loss function: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +loss = ops.SmoothL1Loss() +input_data = Tensor(np.array([1, 2, 3]), mindspore.float32) +target_data = Tensor(np.array([1, 2, 2]), mindspore.float32) +res = loss(input_data, target_data) +print(res) +``` + + The following information is displayed: + +```text +[0. 0. 0.5] +``` + +### Optimization Algorithm + + The following code implements the stochastic gradient descent (SGD) algorithm. The output is stored in result. + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +sgd = ops.SGD() +parameters = Tensor(np.array([2, -0.5, 1.7, 4]), mindspore.float32) +gradient = Tensor(np.array([1, -1, 0.5, 2]), mindspore.float32) +learning_rate = Tensor(0.01, mindspore.float32) +accum = Tensor(np.array([0.1, 0.3, -0.2, -0.1]), mindspore.float32) +momentum = Tensor(0.1, mindspore.float32) +stat = Tensor(np.array([1.5, -0.3, 0.2, -0.7]), mindspore.float32) +result = sgd(parameters, gradient, learning_rate, accum, momentum, stat) + +print(result) +``` + + The following information is displayed: + +```text +(Tensor(shape=[4], dtype=Float32, value= [ 1.99000001e+00, -4.90300000e-01, 1.69500005e+00, 3.98009992e+00]),) +``` + +## Array Operations + +Array operations refer to operations on arrays. + +### DType + +Returns a Tensor variable that has the same data type as the input and adapts to MindSpore. 
It is usually used in a MindSpore project. + +The following is a code example: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +input_tensor = Tensor(np.array([[2, 2], [2, 2]]), mindspore.float32) +typea = ops.DType()(input_tensor) + +print(typea) +``` + + The following information is displayed: + +```text +Float32 +``` + +### Cast + +Converts the input data type and outputs variables of the same type as the target data type. + +The following is a code example: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +input_np = np.random.randn(2, 3, 4, 5).astype(np.float32) +input_x = Tensor(input_np) +type_dst = mindspore.float16 +cast = ops.Cast() +result = cast(input_x, type_dst) +print(result.dtype) +``` + + The following information is displayed: + +```text +Float16 +``` + +### Shape + +Returns the shape of the input data. + + The following code implements the operation of returning the input data input_tensor: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +input_tensor = Tensor(np.ones(shape=[3, 2, 1]), mindspore.float32) +shape = ops.Shape() +output = shape(input_tensor) +print(output) +``` + + The following information is displayed: + +```text +(3, 2, 1) +``` + +## Image Operations + +The image operations include image preprocessing operations, for example, image cropping (for obtaining a large quantity of training samples) and resizing (for constructing an image pyramid). + + The following code implements the cropping and resizing operations: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np + +BATCH_SIZE = 1 +NUM_BOXES = 5 +IMAGE_HEIGHT = 256 +IMAGE_WIDTH = 256 +CHANNELS = 3 +image = np.random.normal(size=[BATCH_SIZE, IMAGE_HEIGHT, IMAGE_WIDTH, CHANNELS]).astype(np.float32) +boxes = np.random.uniform(size=[NUM_BOXES, 4]).astype(np.float32) +box_index = np.random.uniform(size=[NUM_BOXES], low=0, high=BATCH_SIZE).astype(np.int32) +crop_size = (24, 24) +crop_and_resize = ops.CropAndResize() +output = crop_and_resize(Tensor(image), Tensor(boxes), Tensor(box_index), crop_size) +print(output.asnumpy()) +``` + +The following information is displayed: + +```text +[[[[ 6.51672244e-01 -1.85958534e-01 5.19907832e-01] +[ 1.53466597e-01 4.10562098e-01 6.26138210e-01] +[ 6.62892580e-01 3.81776541e-01 4.69261825e-01] +... +[-5.83377600e-01 -3.53377648e-02 -6.01786733e-01] +[ 1.36125124e+00 5.84172308e-02 -6.41442612e-02] +[-9.11651254e-01 -1.19495761e+00 1.96810793e-02]] + +[[ 6.06956100e-03 -3.73778701e-01 1.88935513e-03] +[-1.06859171e+00 2.00272346e+00 1.37180305e+00] +[ 1.69524819e-01 2.90421434e-02 -4.12243098e-01] +... + +[[-2.04489112e-01 2.36615837e-01 1.33802962e+00] +[ 1.08329034e+00 -9.00492966e-01 -8.21497202e-01] +[ 7.54147097e-02 -3.72897685e-01 -2.91040149e-02] +... +[ 1.12317121e+00 8.98950577e-01 4.22795087e-01] +[ 5.13781667e-01 5.12095273e-01 -3.68211865e-01] +[-7.04941899e-02 -1.09924078e+00 6.89047515e-01]]]] +``` + +> The preceding code runs on MindSpore of the Ascend version. + +## Encoding Operations + +The encoding operations include BoundingBox Encoding, BoundingBox Decoding, and IOU computing. + +### BoundingBoxEncode + +The box of the area where the object is located is encoded to obtain more concise information similar to PCA, facilitating subsequent tasks such as feature extraction, object detection, and image restoration. 
+ +The following code implements BoundingBox Encoding for anchor_box and groundtruth_box: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import mindspore + +anchor_box = Tensor([[2,2,2,3],[2,2,2,3]],mindspore.float32) +groundtruth_box = Tensor([[1,2,1,4],[1,2,1,4]],mindspore.float32) +boundingbox_encode = ops.BoundingBoxEncode(means=(0.0, 0.0, 0.0, 0.0), stds=(1.0, 1.0, 1.0, 1.0)) +res = boundingbox_encode(anchor_box, groundtruth_box) +print(res) +``` + + The following information is displayed: + +```text +[[-1. 0.25 0. 0.40546513] + [-1. 0.25 0. 0.40546513]] +``` + +### BoundingBoxDecode + +After decoding the area location information, the encoder uses this operator to decode the information. + + Code implementation: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import mindspore + +anchor_box = Tensor([[4,1,2,1],[2,2,2,3]],mindspore.float32) +deltas = Tensor([[3,1,2,2],[1,2,1,4]],mindspore.float32) +boundingbox_decode = ops.BoundingBoxDecode(means=(0.0, 0.0, 0.0, 0.0), stds=(1.0, 1.0, 1.0, 1.0), max_shape=(768, 1280), wh_ratio_clip=0.016) +res = boundingbox_decode(anchor_box, deltas) +print(res) +``` + + The following information is displayed: + +```text +[[ 4.194528 0. 0. 5.194528 ] + [ 2.1408591 0. 3.8591409 60.59815 ]] +``` + +### IOU Computing + +Computes the proportion of the intersection area and union area of the box where the predicted object is located and the box where the real object is located. It is often used as a loss function to optimize the model. + +The following code implements the IOU computing between `anchor_boxes` and `gt_boxes`. The output is stored in out: + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +import mindspore + +iou = ops.IOU() +anchor_boxes = Tensor(np.random.randint(1.0, 5.0, [3, 4]), mindspore.float16) +gt_boxes = Tensor(np.random.randint(1.0, 5.0, [3, 4]), mindspore.float16) +out = iou(anchor_boxes, gt_boxes) +print(out) +``` + + The following information is displayed: + +```text +[[ 0. -0. 0.] + [ 0. -0. 0.] + [ 0. 0. 0.]] +``` + +## Debugging Operations + +The debugging operations refer to some common operators and operations used to debug a network, for example, HookBackward. These operations are very convenient and important for entry-level deep learning, greatly improving learning experience. + +### HookBackward + +Displays the gradient of intermediate variables. It is a common operator. Currently, only the PyNative mode is supported. 
+ +The following code implements the function of printing the gradient of the intermediate variable (x,y in this example): + +```python +from mindspore import Tensor +import mindspore.ops as ops +import numpy as np +from mindspore import dtype as mstype +from mindspore import context + +context.set_context(mode=context.PYNATIVE_MODE) + +def hook_fn(grad_out): + print(grad_out) + +grad_all = ops.GradOperation(get_all=True) +hook = ops.HookBackward(hook_fn) + +def hook_test(x, y): + z = x * y + z = hook(z) + z = z * y + return z + +def backward(x, y): + return grad_all(hook_test)(Tensor(x, mstype.float32), Tensor(y, mstype.float32)) + +print(backward(1, 2)) +``` + +The following information is displayed: + +```text +(Tensor(shape=[], dtype=Float32, value= 2),) +(Tensor(shape=[], dtype=Float32, value= 4), Tensor(shape=[], dtype=Float32, value= 4)) +``` diff --git a/tutorials/experts/source_en/operation/op_cpu.md b/tutorials/experts/source_en/operation/op_cpu.md new file mode 100644 index 0000000000000000000000000000000000000000..9bd639fb1926ef794169da518eadbcce49ec3608 --- /dev/null +++ b/tutorials/experts/source_en/operation/op_cpu.md @@ -0,0 +1,270 @@ +# Custom Operators (CPU) + +Translator: [JuLyAi](https://gitee.com/julyai) + +`CPU` `Model Development` + + + +## Overview + +When the built-in operators are not enough for developing the network, you can extend your custom CPU operators fast and conveniently using MindSpore's Python API and C++ API. + +To add a custom operator, you need to complete 3 parts of the work, including operator primitives registration, operators implementation and operators information registration. + +Among them: + +- Operator primitives: Defining the front-end interface prototype of operators in the network; The basic unit of a network model, mainly including operator's name, attributes (optional), input / output name, output shape reasoning method, output dtype reasoning method, etc. +- Operators implementation: Using the C++ API provided by the framework and combining with the specific characteristics of the operators, the internal calculation logic of the operator can be realized. + +This paper will take the custom `Transpose` operator as an example to introduce the steps of customizing operators. + +## Registration Operator's Primitives + +Each operator's primitive is a subclass inherited from the class `PrimitiveWithCheck`, whose type name is the operator's name. + +The CPU operator primitives are defined under the path `mindspore/python/mindspore/ops/operations`, and the appropriate file is selected according to the operator type. Primitives need be added to export list as external interfaces in `/operations/__init__.py`. Definition of CPU operators' primitives' interface is as follows: + +- Attributes are defined by the input parameters of construction function `__init__`. Operators in this use case have no init attributes, thus `__init__` has no additional input parameters. +- The input and output names are defined by the function `init_prim_io_names`. +- Checking shape of the output tensor is defined in `check_shape` function. Checking dtype of the output tensor is defined in `check_dtype` function. +- `_checkparam` file defines a series of operations for validity checking, such as value checking, type checking, etc. + +Taking `Transpose` operator's primitive as an example, the following example codes are given. 
+
+```python
+from mindspore.ops import prim_attr_register, PrimitiveWithInfer
+
+class Transpose(PrimitiveWithInfer):
+    """
+    The definition of the Transpose primitive.
+    """
+    @prim_attr_register
+    def __init__(self):
+        """Initialize Transpose"""
+        self.init_prim_io_names(inputs=['x', 'perm'], outputs=['output'])
+
+    def infer_shape(self, x, perm):
+        x_shape = x['shape']
+        p_value = perm['value']
+        if len(x_shape) != len(p_value):
+            raise ValueError('The dimension of x and perm must be equal.')
+        out_shapes = []
+        for i in p_value:
+            out_shapes.append(x_shape[i])
+        return out_shapes
+
+    def infer_dtype(self, x_dtype, perm_dtype):
+        return x_dtype
+```
+
+## Implementing CPU Operators and Registering Operator Information
+
+### Implementing CPU Operators
+
+Implementing a CPU operator usually requires writing a header file and a source file. The file path is `mindspore/ccsrc/backend/kernel_compiler/cpu`. If the logic of the operator is implemented by calling the third-party library `MKL-DNN`, the files are placed in the subdirectory `mkldnn`. Please refer to [oneMkl](https://github.com/oneapi-src/oneMKL) and [oneDNN](https://github.com/oneapi-src/oneDNN) for details.
+
+The header file of the operator contains the registration information of the operator and the declaration of the operator class. The operator class inherits from the parent class `CPUKernel` and overloads `InitKernel` and `Launch`.
+
+The source file contains the implementation of the class, mainly the overloaded `InitKernel` and `Launch` functions. The example header file of the `Transpose` operator is as follows:
+
+```cpp
+class TransposeCPUFwdKernel : public CPUKernel {
+ public:
+  TransposeCPUFwdKernel() = default;
+  ~TransposeCPUFwdKernel() override = default;
+
+  void InitKernel(const CNodePtr &kernel_node) override;
+
+  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
+              const std::vector<AddressPtr> &outputs) override;
+
+ private:
+  std::vector<size_t> shape_;
+  std::vector<int> axis_;
+};
+```
+
+- The input parameter of the function `InitKernel` is a constant reference to the node pointer. Through the member functions of the class `AnfRuntimeAlgorithm`, the input and output shapes of the operator node and the attribute information of the operator can be obtained.
+- The input parameters of the function `Launch` are 3 vectors, containing all the input addresses, the workspace addresses, and all the output addresses, respectively. The concrete implementation logic of the operator is described in the function body.
+- `shape_` and `axis_` are the two member variables defined for the kernel.
+
+The definition of the function `InitKernel` in the source file is as follows:
+
+```cpp
+void TransposeCPUFwdKernel::InitKernel(const CNodePtr &kernel_node) {
+  MS_EXCEPTION_IF_NULL(kernel_node);
+  shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
+  axis_ = AnfAlgo::GetNodeAttr<std::vector<int>>(kernel_node, "perm");
+  if (shape_.size() != axis_.size()) {
+    MS_LOG(EXCEPTION) << "The size of input shape and transpose axis shape must be equal.";
+  }
+}
+```
+
+- The functions in the class `AnfRuntimeAlgorithm` implement various operations on operator nodes. `shape_` represents the shape of the first input of the operator. `axis_` represents the attribute "perm" of the operator.
+- The parameter "perm" of the `Transpose` operator's primitive is passed as an input, but "perm" is actually treated as an attribute of the operator when parsing. 
+
+> For details of the class `AnfRuntimeAlgorithm`, please refer to the declaration in the MindSpore source code under [mindspore/ccsrc/backend/common/session/anf_runtime_algorithm.h](https://gitee.com/mindspore/mindspore/blob/master/mindspore/ccsrc/backend/common/session/anf_runtime_algorithm.h).
+
+The definition of the function `Launch` in the source file is as follows: first, get the address of each input and output in turn; then, transform the dimensions according to `axis_` and assign the values to the space pointed to by the output address.
+
+```cpp
+bool TransposeCPUFwdKernel::Launch(const std::vector<AddressPtr> &inputs,
+                                   const std::vector<AddressPtr> & /*workspace*/,
+                                   const std::vector<AddressPtr> &outputs) {
+  auto input = reinterpret_cast<float *>(inputs[0]->addr);
+  auto output = reinterpret_cast<float *>(outputs[0]->addr);
+  size_t size = IntToSize(inputs[0]->size / sizeof(float));
+  size_t shape_size = IntToSize(shape_.size());
+  if (shape_size > kMaxDim) {
+    MS_LOG(EXCEPTION) << "Input is " << shape_size << "-D, but transpose supports max " << kMaxDim << "-D inputs.";
+  }
+  size_t pos_array[kMaxDim];
+  size_t size_offset[kMaxDim];
+  size_offset[0] = size / shape_[0];
+  for (size_t i = 1; i < shape_size; i++) {
+    size_offset[i] = size_offset[SizeToInt(i) - 1] / shape_[i];
+  }
+  for (size_t position = 0; position < size; position += 1) {
+    size_t temp_position = position;
+    pos_array[0] = temp_position / size_offset[0];
+    for (size_t i = 1; i < shape_size; i++) {
+      temp_position -= pos_array[SizeToInt(i) - 1] * size_offset[i - 1];
+      pos_array[i] = temp_position / size_offset[i];
+    }
+    size_t new_position = pos_array[axis_[SizeToInt(shape_size) - 1]];
+    size_t new_position_size = 1;
+    for (int j = shape_size - 2; j >= 0; j--) {
+      new_position_size *= shape_[axis_[j + 1]];
+      new_position += pos_array[axis_[j]] * new_position_size;
+    }
+    output[new_position] = input[position];
+  }
+  return true;
+}
+```
+
+### Registering Operator Information
+
+The operator information is the key information guiding the backend to select the operator implementation. The first parameter of `MS_REG_CPU_KERNEL` is the name of the registered operator, which is consistent with the operator name in the primitive. The second parameter indicates the type of each input and output in turn. The last parameter is the name of the class that implements the operator. The registration code of the `Transpose` operator is as follows:
+
+```cpp
+MS_REG_CPU_KERNEL(Transpose, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
+                  TransposeCPUFwdKernel);
+```
+
+> The number and order of the input and output information defined in the operator information, the number and order of the inputs and outputs in the operator implementation, and the number and order of the input and output name lists in the operator primitive should be consistent.
+
+## Compiling MindSpore
+
+After writing the custom CPU operator, you need to recompile and reinstall MindSpore. For details, please refer to the [Installation Document](https://gitee.com/mindspore/docs/blob/master/install/mindspore_cpu_install_source.md#).
+
+## Using Custom CPU Operators
+
+After compiling and installing, the custom CPU operator can be used directly by importing its primitive. The following takes the single-operator network test of `Transpose` as an example.
+
+Define the network in the file `test_transpose.py`. 
+ +```python +import numpy as np +import mindspore.nn as nn +import mindspore.context as context +from mindspore import Tensor +import mindspore.ops as ops + +context.set_context(mode=context.GRAPH_MODE, device_target="CPU") + +class Net(nn.Cell): + def __init__(self): + super(Net, self).__init__() + self.transpose = ops.Transpose() + + def construct(self, data): + return self.transpose(data, (1, 0)) + +def test_net(): + x = np.arange(2 * 3).reshape(2, 3).astype(np.float32) + transpose = Net() + output = transpose(Tensor(x)) + print("output: ", output) +``` + +Running case: + +```bash +pytest -s test_transpose.py::test_net +``` + +Running results: + +```text +output: [[0, 3] + [1, 4] + [2, 5]] +``` + +## Defining Operators' BProp Functions + +If an operator needs to support automatic differentiation, its back-propagation function (bprop) needs to be defined in its primitives. You need to describe the reverse computing logic that uses forward input, forward output, and output gradient to get the input gradient in bprop. Reverse computation logic can be composed of built-in operators or custom reverse operators. + +The following points should be paid attention to when defining operators' bprop functions: + +- The order of input parameters of bprop function is defined as positive input, positive output and output gradient. If the operator is a multi-output operator, the forward output and output gradient will be provided in the form of tuples. +- The form of the return values of bprop function is arranged as a tuple composed of input gradient, and the order of elements in the tuple is consistent with that of forward input parameters. Even if there is only one input gradient, the return value must be in the form of tuples. + +For example, the bprop primitives of `Transpose` are: + +```python +import mindspore as ms +import mindspore.ops as ops +from mindspore.ops._grad.grad_base import bprop_getters +fill = ops.Fill() +invert_permutation = ops.InvertPermutation() +transpose = ops.Transpose() +@bprop_getters.register(ops.Transpose) +def get_bprop_transpose(self): + """Generate bprop for Transpose""" + + def bprop(x, perm, out, dout): + return transpose(dout, invert_permutation(perm)), fill(ms.int32, (len(perm), ), 0) + + return bprop +``` + +- `Transpose` bprop operator uses `InvertPermutation` operator, which also needs a complete process of primitives, registration and implementation like `Transpose` operator. + +Define the bprop case in document `test_transpose.py`. + +```python +import mindspore.ops as ops +class Grad(nn.Cell): + def __init__(self, network): + super(Grad, self).__init__() + self.grad = ops.GradOperation(sens_param=True) + self.network = network + + def construct(self, input_data, sens): + gout = self.grad(self.network)(input_data, sens) + return gout + +def test_grad_net(): + x = np.arange(2 * 3).reshape(2, 3).astype(np.float32) + sens = np.arange(2 * 3).reshape(3, 2).astype(np.float32) + grad = Grad(Net()) + dx = grad(Tensor(x), Tensor(sens)) + print("dx: ", dx.asnumpy()) +``` + +Running case: + +```bash +pytest -s test_transpose.py::test_grad_net +``` + +Running results: + +```text +dx: [[0. 2. 4.] + [1. 3. 
5.]] +``` diff --git a/tutorials/experts/source_en/operation/op_custom.md b/tutorials/experts/source_en/operation/op_custom.md new file mode 100644 index 0000000000000000000000000000000000000000..f03154bfdced377eb4f8474aede179c8fb7bd0a8 --- /dev/null +++ b/tutorials/experts/source_en/operation/op_custom.md @@ -0,0 +1,891 @@ +# Custom Operators (Custom based) + +`Ascend` `GPU` `CPU` `Model Development` + + + +## Overview + +When built-in operators cannot meet requirements during network development, you can call the Python API [Custom](https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.Custom.html#mindspore-ops-custom) primitive defined in MindSpore to quickly create different types of custom operators for use. + +Traditional methods to add a custom operator need three steps: defining the operator primitive, implementing the operator, and registering the operator information. + +The related concepts are as follows: + +- Operator primitive: defines the frontend API prototype of an operator on the network. It is the basic unit for forming a network model and includes the operator name, attribute (optional), input and output names, output shape inference method, and output data type inference method. +- Operator implementation: defines a Python function(Ascend custom operators) or a C++ class(GPU and CPU custom operators), which describes the implementation of the internal computation logic of an operator. +- Operator information: describes basic information about an operator, such as the operator name, supported input and output data types, supported input and output data formats, and attributes. It is the basis for the backend to select and map operators. + +Compared with traditional custom operator creating methods, creating custom operators based on `Custom` primitive has several advantages: + +- Different custom operators use the same `Custom` primitive, there is no need to define a primitive for every operator. The above three parts of work can be implemented in a network script in a unified way and used as part of the network expression, there is no need to modify and recompile the source codes of MindSpore. +- It unifies the interface and usage for different kinds of custom operators, which is convenient for network developers to flexibly choose which kind of custom operator to use according to their needs. +- Supports defining custom operators with hybrid expression, which can be used across platforms. + +## Basic Usage + +The supported custom operator defining methods based on the [Custom](https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.Custom.html#mindspore-ops-custom) primitive include: hybrid, tbe, aot, pyfunc, julia, and akg. 
+ +The difference between these operator defining methods are as follows: + +| Defining Methods | Development Language | Compilation Method | Supported Platforms | Recommended Scenarios | +|:----------------:|:--------------------:| :------: | ------ |-------------------------------------------------------------------------------| +| hybrid | MindSpore HYBRID DSL | JIT | `Ascend` `GPU` | Ascend/GPU platform general scenarios and proof of concept| +| tbe | TBE DSL | JIT | `Ascend` | Ascend AICORE platform scenarios | +| aot | C/C++/CUDA | AOT | `GPU` `CPU` | high-performance scenarios / use third-party operators scenarios | +| pyfunc | Python | JIT | `CPU` | Fast algorithm verification, need to interact with Python and other scenarios | +| julia | Julia | JIT | `CPU` | Science compute scenarios / use Julia scenarios | +| akg | MindSpore AKG DSL | JIT | `Ascend` `GPU` | Ascend/GPU platform general scenarios | + +> - The full name of DSL is Domain Specific Language. +> - AOT(Ahead Of Time) compiling means the operator implementation needs to be compiled into a dynamic library in advance and then automatically called by the framework when the network is running. JIT(Just In Time) compiling does not need to compile the operator implementation in advance, the operator implementation will be directly called by the framework during network compilation or runtime. + +Different custom operator defining methods use different development languages to implement the operator, but the development process is the same, including operator implementation, operator output shape, data type inference, and operator information registration (optional). You can choose which one to use based on needs. The defining methods of these custom operators will be introduced here, and examples are provided for each method. + +> More examples can be found in the MindSpore source code [tests/st/ops/graph_kernel/custom](https://gitee.com/mindspore/mindspore/tree/master/tests/st/ops/graph_kernel/custom). + +### Defining Custom Operator of hybrid Type + +`hybrid` is the default `func_type` of `Custom`. By defining the custom operation with hybrid type, the user can use Python-like grammar to describe the logic of operation computation and focus on the algorithm itself as the details of framework-related operation engineering are blocked from the user. + +The internal computation logic of the custom operator of type `hybrid` is described by [MindSpore Hybrid DSL](#mindspore-hybrid-developer-guide). The function written by MindSpore Hybrid DSL can be parsed and compiled by the kernel compiler [AKG](https://gitee.com/mindspore/akg) to generate high-performance operators in a JIT way and then be used in training and inference workload of AI models. Meanwhile, such functions can be used as `numpy` functions, so that users can easily tune the algorithm as well as switch to [custom operators of pyfunc type](#defining-custom-operator-of-pyfunc-type). In this way, users will achieve the goal of using custom operations in multiply platforms and multiple scenarios in the same definition of the custom operator. + +The following example test_custom_hybrid.py shows how to write a custom operator of the hybrid type. The operator computes the sum of two tensors. 
+ +```python +import numpy as np +from mindspore import context, Tensor, ops +from mindspore.ops import ms_hybrid + +context.set_context(device_target="GPU") + +# the function written by MindSpore Hybrid DSL +@ms_hybrid +def add(a, b): + c = output_tensor(a.shape, a.dtype) + for i0 in range(a.shape[0]): + for i1 in range(a.shape[1]): + c[i0, i1] = a[i0, i1] + b[i0, i1] + return c + +if __name__ == "__main__": + # define the custom operator using the default func_type hybrid + op = ops.Custom(add) + + x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32) + x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32) + output = op(Tensor(x0), Tensor(x1)) + print(output) +``` + +In this case, + +- `hybrid` is the default `func_type` of `Custom`. +- The input of custom operators with hybrid type must be a function with decorator [`@ms_hybrid`](https://www.mindspore.cn/docs/api/zh-CN/master/api_python/ops/mindspore.ops.ms_hybrid.html). +- Users can use the automatic shape/dtype inference functionality of the custom operators with hybrid type, while they can still handwrite shape/dtype functions. + +Execute the example file: + +```bash +python test_custom_hybrid.py +``` + +Result: + +```text +[[2. 2.] + [4. 4.]] +``` + +### Defining Custom Operator of tbe Type + +The custom operator of tbe type uses the TBE(Tensor Boost Engine) operator DSL to describe the internal calculation logic of the operator. You can refer to the [TBE document](https://support.huaweicloud.com/odevg-A800_3000_3010/atlaste_10_0063.html) for the implementation details. + +Operator output shape and data type inference can be realized by defining Python functions to describe the inference logic. + +Operator information needs to be registered. For the creation of operator information, please refer to [Registering the Operator Information](#registering-the-operator-information). + +Takes test_custom_tbe.py as an example to introduce how to define a custom operator of tbe type, where the custom operator implements the function of adding two input tensors. 
+ +Here is the content of test_custom_tbe.py: + +```python +import numpy as np +from mindspore import context, Tensor +import mindspore.ops as ops +from mindspore.ops import DataType, CustomRegOp, custom_info_register + +context.set_context(device_target="Ascend") + +# Operator implementation, and operator information registration +@custom_info_register(CustomRegOp() \ + .input(0, "a") \ + .input(1, "b") \ + .output(0, "output") \ + .dtype_format(DataType.F16_Default, DataType.F16_Default, DataType.F16_Default) \ + .dtype_format(DataType.F32_Default, DataType.F32_Default, DataType.F32_Default) \ + .target("Ascend") \ + .get_op_info()) +def add(a, b, output, kernel_name="add"): + import te.lang.cce + from te import tvm + data0 = tvm.placeholder(a.get("shape"), name="data0", dtype=a.get("dtype").lower()) + data1 = tvm.placeholder(b.get("shape"), name="data1", dtype=b.get("dtype").lower()) + res = te.lang.cce.vadd(data0, data1) + with tvm.target.cce(): + sch = te.lang.cce.auto_schedule(res) + config = {"print_ir": False, "name": kernel_name, "tensor_list": [data0, data1, res]} + te.lang.cce.cce_build_code(sch, config) + +if __name__ == "__main__": + # Define a custom operator of tbe type + op = ops.Custom(add, out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="tbe") + + x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32) + x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32) + output = op(Tensor(x0), Tensor(x1)) + print(output) +``` + +The following points need to be explained in this example: + +- Use Python lambda functions to infer the output shape and data type, and pass them to the `out_shape` and `out_dtype` parameters of the `Custom` primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor. +- Use `CustomRegOp` to create the operator information and use `custom_info_register` decorator to register it. + +Running case: + +```bash +python test_custom_tbe.py +``` + +Running results: + +```text +[[2. 2.] + [4. 4.]] +``` + +### Defining Custom Operator of aot Type + +The custom operator of aot type adopts the AOT compilation method, which requires network developers to hand-write the source code file of the operator implementation based on a specific interface and compiles the source code file into a dynamic library in advance, and then the framework will automatically call and run the function defined in the dynamic library. In terms of the development language of the operator implementation, the GPU platform supports CUDA, and the CPU platform supports C and C++. The interface specification of the operator implementation in the source file is as follows: + +```cpp +extern "C" int func_name(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream, void *extra); +``` + +where the function name `func_name` can be replaced with any valid function name. The return value is of type int, and 0 means normal exit, non-zero means an exception occurs. The meaning of the parameter list is as follows: + +- nparam (int): The number of inputs and outputs. For example, if an operator has 2 inputs and 1 output, then the value of nparam is 3. +- params (void \*\*): An array of pointers, with each pointer pointing to the input or output data. For example, if an operator has 2 inputs and 1 output, then params[0] points to the first input data, params[1] points to the second input data, params[2] points to the output data. 
+- ndims (int \*): An array of integers, each integer represents the number of dimensions of the corresponding input or output. For example, if params[i] is a tensor with shape [1024, 1024], then ndims[i] is 2.
+- shapes (int64_t \*\*): An array of shapes, each element in the array represents the shape of the corresponding input or output. For example, if params[i] is a tensor with shape [1024, 1024], then shapes[i][0] is 1024 and shapes[i][1] is 1024.
+- dtypes (const char \*\*): An array of data types, each element in the array represents the data type of the corresponding input or output. The value of the data type can be "float32", "float16", "float", "float64", "int", "int8", "int16", "int32", "int64", "uint", "uint8", "uint16", "uint32", "uint64", "bool".
+- stream (void \*): Stream pointer, only used in the CUDA file.
+- extra (void \*): Used for further extension.
+
+Operator output shape and data type inference can be realized by defining Python functions to describe the inference logic.
+
+If the operator only supports some specific input and output data types, then the operator information needs to be registered. For the creation of operator information, please refer to [Registering the Operator Information](#registering-the-operator-information).
+
+The following examples introduce the development process of an aot type custom operator on the GPU and CPU platforms, where the custom operator implements the function of adding two input tensors.
+
+#### A GPU Example
+
+Use the CUDA language to write the source file add.cu for the operator implementation:
+
+```cpp
+#include <string.h>
+#define THREADS 1024
+__global__ void CustomAddKernel(float *input1, float *input2, float *output, size_t size) {
+  auto idx = blockIdx.x * THREADS + threadIdx.x;
+  if (idx < size) {
+    output[idx] = input1[idx] + input2[idx];
+  }
+}
+
+extern "C" int CustomAdd(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream,
+                         void *extra) {
+  cudaStream_t custream = static_cast<cudaStream_t>(stream);
+  if (nparam != 3) return 1;
+  void *input1 = params[0];
+  void *input2 = params[1];
+  void *output = params[2];
+  size_t size = 1;
+
+  for (int i = 0; i < ndims[2]; i++) {
+    size *= shapes[2][i];
+  }
+  int n = size / THREADS;
+  for (int i = 0; i < nparam; i++) {
+    if (strcmp(dtypes[i], "float32") != 0) {
+      return 2;
+    }
+  }
+  CustomAddKernel<<<n + 1, THREADS, 0, custream>>>(static_cast<float *>(input1), static_cast<float *>(input2),
+                                                   static_cast<float *>(output), size);
+  return 0;
+}
+```
+
+Compile add.cu into a dynamic library add.so:
+
+```bash
+nvcc --shared -Xcompiler -fPIC -o add.so add.cu
+```
+
+Write the test case test_custom_aot.py:
+
+```python
+import numpy as np
+from mindspore import context, Tensor
+import mindspore.ops as ops
+
+context.set_context(device_target="GPU")
+
+if __name__ == "__main__":
+    # Define a custom operator of aot type
+    op = ops.Custom("./add.so:CustomAdd", out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="aot")
+
+    x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32)
+    x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32)
+    output = op(Tensor(x0), Tensor(x1))
+    print(output)
+```
+
+The following points need to be explained in this example:
+
+- In this example, you need to place test_custom_aot.py and add.so in the same directory. If add.so is in another directory, you need to replace the value of the first parameter of `Custom` primitive with the absolute path of add.so.
+- Use Python lambda functions to infer the output shape and data type, and pass them to the `out_shape` and `out_dtype` parameters of the `Custom` primitive. 
In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor.
+- The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs.
+
+Running case:
+
+```bash
+python test_custom_aot.py
+```
+
+Running results:
+
+```text
+[[2. 2.]
+ [4. 4.]]
+```
+
+#### A CPU Example
+
+Use the C/C++ language to write the source file add.cc for the operator implementation:
+
+```cpp
+#include <string.h>
+using size_t = decltype(sizeof(int));
+using int64_t = decltype(sizeof(long));
+
+extern "C" int CustomAdd(int nparam, void **params, int *ndims, int64_t **shapes, const char **dtypes, void *stream, void *extra) {
+  if (nparam != 3) return 1;
+  float *input1 = static_cast<float *>(params[0]);
+  float *input2 = static_cast<float *>(params[1]);
+  float *output = static_cast<float *>(params[2]);
+  size_t size = 1;
+  for (int i = 0; i < ndims[2]; i++) {
+    size *= shapes[2][i];
+  }
+  for (int i = 0; i < nparam; i++) {
+    if (strcmp(dtypes[i], "float32") != 0) {
+      return 2;
+    }
+  }
+  for (int i = 0; i < size; i++) {
+    output[i] = input1[i] + input2[i];
+  }
+  return 0;
+}
+```
+
+Compile add.cc into a dynamic library add.so:
+
+```bash
+g++ --shared -fPIC -o add.so add.cc
+```
+
+Write the test case test_custom_aot.py:
+
+```python
+import numpy as np
+from mindspore import context, Tensor
+import mindspore.ops as ops
+
+context.set_context(device_target="CPU")
+
+if __name__ == "__main__":
+    # Define a custom operator of aot type
+    op = ops.Custom("./add.so:CustomAdd", out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="aot")
+
+    x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32)
+    x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32)
+    output = op(Tensor(x0), Tensor(x1))
+    print(output)
+```
+
+The following points need to be explained in this example:
+
+- In this example, you need to place test_custom_aot.py and add.so in the same directory. If add.so is in another directory, you need to replace the value of the first parameter of `Custom` primitive with the absolute path of add.so.
+- Use Python lambda functions to infer the output shape and data type, and pass them to the `out_shape` and `out_dtype` parameters of the `Custom` primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor.
+- The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs.
+
+Running case:
+
+```bash
+python test_custom_aot.py
+```
+
+Running results:
+
+```text
+[[2. 2.]
+ [4. 4.]]
+```
+
+### Defining Custom Operator of pyfunc Type
+
+The custom operator of pyfunc type uses native Python syntax to define the operator implementation, which describes the internal calculation logic of the operator. The framework will automatically call this function during the network runtime.
+
+Operator output shape and data type inference can be realized by defining Python functions to describe the inference logic.
+
+If the operator only supports some specific input and output data types, then the operator information needs to be registered. For the creation of operator information, please refer to [Registering the Operator Information](#registering-the-operator-information).
+
+The following takes test_custom_pyfunc.py as an example to introduce how to define a custom operator of pyfunc type, where the custom operator implements the function of adding two input tensors. 
+ +Here is the content of test_custom_pyfunc.py: + +```python +import numpy as np +from mindspore import context, Tensor +import mindspore.ops as ops + +context.set_context(device_target="CPU") + +def add(a, b): + return a + b + +if __name__ == "__main__": + # Define a custom operator of pyfunc type + op = ops.Custom(add, out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="pyfunc") + + x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32) + x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32) + output = op(Tensor(x0), Tensor(x1)) + print(output) +``` + +The following points need to be explained in this example: + +- Use Python lambda functions to infer the output shape and data type, and pass them to the `out_shape` and `out_dtype` parameters of the `Custom` primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor. +- The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs. + +Running case: + +```bash +python test_custom_pyfunc.py +``` + +Running results: + +```text +[[2. 2.] + [4. 4.]] +``` + +### Defining Custom Operator of julia Type + +The custom operator of julia type uses Julia to describe the internal calculation logic of the operator. The framework will automatically call this function during the network runtime. + +Operator output shape and data type inference can be realized by defining Python functions to describe the inference logic. + +If the operator has attributes or only supports specific input and output data types or data formats, the operator information needs to be registered. For the creation of operator information, please refer to [Registering the Operator Information](#registering-the-operator-information). If the operator information is not registered, then the operator information will be derived from the inputs of the current operator during the operator selection process. + +Takes the function of adding two input tensors as an example to introduce how to define a custom operator of julia type. + +Firstly, users should write a Julia function into a Julia file. Here is an example of add.jl: + +```julia +# add.jl +module Add +# inputs: x, y, output: z, output should use .= to inplace assign +function add(x, y, z) + z .= x + y +end +end +``` + +Secondly, use the `Custom` operator with julia func type in the script to call Julia function, here is an example of test_custom_julia.py: + +```python +import numpy as np +from mindspore import context, Tensor +import mindspore.ops as ops + +context.set_context(device_target="CPU") + +if __name__ == "__main__": + op = ops.Custom("./add.jl:Add:add", out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="julia") + x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32) + x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32) + output = op(Tensor(x0), Tensor(x1)) + print(output) +``` + +The following points need to be explained in this example: + +- Use Python lambda functions to infer the output shape and data type, and pass them to the `out_shape` and `out_dtype` parameters of the `Custom` primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor. +- The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs. 
+ +Running case: + +```bash +python test_custom_julia.py +``` + +Running results: + +```text +[[2. 2.] + [4. 4.]] +``` + +Matters need attention: + +1. User should use Julia version >= 1.6.0, +2. User should add `julia/lib` into `LD_LIBRARY_PATH`, consider julia-1.6.5: + + ```bash + # download julia-1.6.5 + wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.5-linux-x86_64.tar.gz + # extract file + tar xvf julia-1.6.5-linux-x86_64.tar.gz + # if $JULIA_DIR not exist + export LD_LIBRARY_PATH=$PWD/julia-1.6.5/lib:$LD_LIBRARY_PATH + # else + export LD_LIBRARY_PATH=$JULIA_DIR/lib:$LD_LIBRARY_PATH + ``` + +3. `Custom` operator's first arg `func` should keep format like `file_name:module_name:func_name`, `file_name` should include path, suggest using absolute path. +4. Julia file should include `module`, `module` include `function`, both ends with `end`. +5. The Julia function called by kernel should keep inputs and outputs order same with kernel. +6. The Julia function called by kernel should use `.=` to write function result into output memory. +7. User should make sure Julia code is runnable. +8. User should make sure Julia third-party package exists when using it. Install package when not exist: `import pkg; pkg.add("somepkg")`. +9. `julia array` is `column major`, and `numpy array` is `row major`, User should consider this when computing an un-elementwise function. Users can use the functions to transform layout between `numpy array` and `julia array` as below: + + ```julia + function change_input_to_row_major(x) + return permutedims(reshape(x, reverse(size(x))), length(size(x)):-1:1) + end + + function change_output_to_row_major(x) + return reshape(permutedims(x, length(size(x)):-1:1), size(x)) + end + ``` + + An example of MatMul: + + ```julia + # julia array is column-major, numpy aray is row-major + # user should change julia or numpy's layout to keep same behavior + #= EXAMPLE + A[2,3] B[3,4] C[2,4] + NUMPY: + [[1, 2, 3] [[1, 2, 3, 4] [[38, 44, 50, 56] + [4, 5, 6]] [5, 6, 7, 8] [83, 98, 113,128]] + [9,10,11,12]] + JULIA: + change_input_to_row_major: + 1.inputs read numpy data from memory: + [[1, 3, 5] [[1, 4, 7,10] + [2, 4, 6]] [2, 5, 8,11] + [3, 6, 9,12]] + 2.inputs after reshape(reverse(shape)): + [[1, 4] [[1, 5, 9] + [2, 5] [2, 6,10] + [3, 6]] [3, 7,11] + [4, 8,12]] + 3.inputs after transpose/permutedims: + [[1, 2, 3] [[1, 2, 3, 4] [[38, 44, 50, 56] + [4, 5, 6]] [5, 6, 7, 8] [83, 98, 113,128]] + [9,10,11,12]] + change_output_to_row_major: + 1.output after transpose/permutedims: + [[38, 83] + [44, 98] + [50,113] + [56,128] + 2.output after reshape: + [[38, 50, 83, 113] + [44, 56, 98, 128]] + 3.output read numpy data from memory: + [[38, 44, 50, 56] + [83, 98,113, 128]] + =# + function foo!(x, y, z) + x = change_input_to_row_major(x) + y = change_input_to_row_major(y) + z .= gemm(x, y, z) + z .= change_output_to_row_major(z) + end + ``` + +### Defining Custom Operator of akg Type + +The custom operator of akg type uses the [MindSpore AKG](https://gitee.com/mindspore/akg) operator DSL to describe the internal calculation logic of the operator. MindSpore AKG is an operator development and compilation framework based on TVM(Tensor Virtual Machine) and Polyhedral technology, it supports multiple types of operator DSL, such as Hybrid, IR builder and TVM compute. + +Operator output shape and data type inference can be realized by defining Python functions to describe the inference logic. 
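+
+The inline lambdas used in the examples can equally be written as named inference functions and passed to `out_shape` and `out_dtype`. The following is a minimal sketch for a two-input, element-wise operator; the function names are illustrative and not part of the original tutorial:
+
+```python
+def infer_add_shape(x_shape, y_shape):
+    # For an element-wise add, the output shape follows the first input.
+    return x_shape
+
+def infer_add_dtype(x_dtype, y_dtype):
+    # The output data type also follows the first input.
+    return x_dtype
+
+# These could then be passed instead of lambdas, e.g.:
+# op = ops.Custom(add, out_shape=infer_add_shape, out_dtype=infer_add_dtype, func_type="akg")
+```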
+ +If the operator has attributes or only supports specific input and output data types or data formats, the operator information needs to be registered. For the creation of operator information, please refer to [Registering the Operator Information](#registering-the-operator-information). If the operator information is not registered, then the operator information will be derived from the inputs of the current operator during the operator selection process. + +Takes test_custom_akg.py as an example of how to define a custom operator of akg type, where the operator computes the sum of two tensors. + +Here is the content of test_custom_akg.py: + +```python +import numpy as np +from mindspore import context, Tensor +import mindspore.ops as ops + +context.set_context(device_target="GPU") + +# Operator implementation, Hybrid DSL +def add(a, b): + c = output_tensor(a.shape, a.dtype) + for i0 in range(a.shape[0]): + for i1 in range(a.shape[1]): + c[i0, i1] = a[i0, i1] + b[i0, i1] + return c + +if __name__ == "__main__": + # Define a custom operator of akg type + op = ops.Custom(add, out_shape=lambda x, _: x, out_dtype=lambda x, _: x, func_type="akg") + + x0 = np.array([[0.0, 0.0], [1.0, 1.0]]).astype(np.float32) + x1 = np.array([[2.0, 2.0], [3.0, 3.0]]).astype(np.float32) + output = op(Tensor(x0), Tensor(x1)) + print(output) +``` + +The following points need to be explained in this example: + +- `context.set_context(device_target="GPU")` indicates that the operator runs on the GPU platform. To run on the Ascend platform, please compile an Ascend version of MindSpore and set the value of device_target to "Ascend". +- Use Python lambda functions to infer the output shape and data type, and pass them to the `out_shape` and `out_dtype` parameters of the `Custom` primitive. In this example, the lambda function indicates that the output shape and data type are the same as the information of the first input tensor. +- The operator information is not registered, so the operator information of the custom operator will be inferred from the inputs. + +Running case: + +```bash +python test_custom_akg.py +``` + +Running results: + +```text +[[2. 2.] + [4. 4.]] +``` + +## Advanced Usage + +### Registering the Operator Information + +The operator information describes the supported inputs and outputs data type, the supported inputs and outputs format, attributes, and target(platform information) of the operator implementation. It is used to select and map operators later. The operator information can be defined by using the [CustomRegOp](https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.CustomRegOp.html#mindspore-ops-customregop) API, then you can use the [custom_info_register](https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.custom_info_register.html#mindspore-ops-custom-info-register) decorator or just pass it to the `reg_info` parameter of [Custom](https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.Custom.html#mindspore-ops-custom) primitive to bind the information to the operator implementation. The operator information will be registered to the operator information library on the MindSpore C++ side at last. The `reg_info` parameter takes higher priority than the `custom_info_register` decorator. + +The target value in operator information can be "Ascend", "GPU" or "CPU". Which describes the operator information on a specific target. 
For the same operator implementation, the supported data types may differ across targets, so you can use the target value in the operator information to distinguish between them. The operator information on a specific target will be registered only once.
+
+> - The numbers and sequences of the input and output information defined in the operator information must be the same as those in the parameters of the operator implementation.
+> - For the custom operator of akg type, if the operator has attributes, you need to register the operator information, and the attribute name in the operator information must be consistent with the attribute name used in the operator implementation. For the custom operator of tbe type, you need to register the operator information. For the custom operator of aot type, since the operator implementation needs to be compiled into a dynamic library in advance, the decorator will not work, and the operator information can only be passed in through the `reg_info` parameter.
+> - If the custom operator only supports a specific input and output data type or data format, the operator information needs to be registered so that the data type and data format can be checked when the operator is selected in the backend. For the case where the operator information is not provided, the information will be derived from the inputs of the current operator.
+
+### Defining the bprop Function for Operators
+
+If an operator needs to support automatic differentiation, the backpropagation (bprop) function needs to be defined first and then passed to the `bprop` parameter of the `Custom` primitive. In the bprop function, you need to describe the backward computation logic that uses the forward input, forward output, and output gradients to obtain the input gradients. The backward computation logic can be composed of built-in operators or custom backward operators.
+
+Note the following points when defining the bprop function (a minimal sketch follows these notes):
+
+- The input parameter sequence of the bprop function is the forward input, forward output, and output gradients. For a multi-output operator, the forward output and output gradients are provided in the form of tuples.
+- The return value of the bprop function is a tuple consisting of the input gradients. The sequence of elements in the tuple is the same as that of the forward input parameters. Even if there is only one input gradient, the return value must be a tuple.
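+
+A minimal sketch of a bprop function that satisfies these rules, for a hypothetical two-input, single-output element-wise add whose backward logic is composed of built-in behavior only (the names here are illustrative, not part of MindSpore):
+
+```python
+def hypothetical_bprop(x, y, out, dout):
+    # Parameter order: forward inputs (x, y), then the forward output (out), then the output gradient (dout).
+    dx = dout  # gradient with respect to the first forward input
+    dy = dout  # gradient with respect to the second forward input
+    # Return a tuple ordered like the forward inputs, even if there is only one gradient.
+    return (dx, dy)
+```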
+
+Take test_grad.py as an example to show the usage of the backpropagation function:
+
+```python
+import numpy as np
+from mindspore import context, Tensor
+from mindspore.nn import Cell
+import mindspore.ops as ops
+
+context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
+
+# Forward computation of custom operator
+def square(x):
+    y = output_tensor(x.shape, x.dtype)
+    for i0 in range(x.shape[0]):
+        y[i0] = x[i0] * x[i0]
+    return y
+
+# Backward computation of custom operator
+def square_grad(x, dout):
+    dx = output_tensor(x.shape, x.dtype)
+    for i0 in range(x.shape[0]):
+        dx[i0] = 2.0 * x[i0]
+    for i0 in range(x.shape[0]):
+        dx[i0] = dx[i0] * dout[i0]
+    return dx
+
+# Backpropagation function
+def bprop():
+    op = ops.Custom(square_grad, lambda x, _: x, lambda x, _: x, func_type="akg")
+
+    def custom_bprop(x, out, dout):
+        dx = op(x, dout)
+        return (dx,)
+
+    return custom_bprop
+
+class Net(Cell):
+    def __init__(self):
+        super(Net, self).__init__()
+        # Define a custom operator of akg type and provide a backpropagation function
+        self.op = ops.Custom(square, lambda x: x, lambda x: x, bprop=bprop(), func_type="akg")
+
+    def construct(self, x):
+        return self.op(x)
+
+if __name__ == "__main__":
+    x = np.array([1.0, 4.0, 9.0]).astype(np.float32)
+    sens = np.array([1.0, 1.0, 1.0]).astype(np.float32)
+    dx = ops.GradOperation(sens_param=True)(Net())(Tensor(x), Tensor(sens))
+    print(dx)
+```
+
+The following points need to be explained in this example:
+
+- The backpropagation function uses a custom operator of akg type. The operator definition and use need to be separated, that is, the custom operator is defined outside the `custom_bprop` function and used inside the `custom_bprop` function.
+
+Running case:
+
+```bash
+python test_grad.py
+```
+
+Running results:
+
+```text
+[ 2. 8. 18.]
+```
+
+> More examples can be found in the MindSpore source code [tests/st/ops/graph_kernel/custom](https://gitee.com/mindspore/mindspore/tree/master/tests/st/ops/graph_kernel/custom).
+
+### MindSpore Hybrid Developer Guide
+
+MindSpore Hybrid DSL is written in Python-like code, with function definitions, indentation, and comments. With the decorator [`@ms_hybrid`](https://www.mindspore.cn/docs/api/zh-CN/master/api_python/ops/mindspore.ops.ms_hybrid.html), functions written in MindSpore Hybrid DSL can be called as `numpy` functions, as well as used in custom operators of the hybrid type.
+
+```python
+import numpy as np
+from mindspore import ops, Tensor
+from mindspore.ops import ms_hybrid
+
+@ms_hybrid
+def outer_product(a, b):
+    d = allocate(a.shape, a.dtype)
+    c = output_tensor(a.shape, a.dtype)
+
+    for i0 in range(a.shape[0]):
+        for i1 in range(b.shape[1]):
+            c[i0, i1] = 0.0
+            for i2 in range(a.shape[1]):
+                d[i0, i2] = 2 * a[i0, i2]
+                c[i0, i1] = c[i0, i1] + sin(d[i0, i2] * b[i2, i1])
+    return c
+
+np_x = np.random.normal(0, 1, [4, 4]).astype(np.float32)
+np_y = np.random.normal(0, 1, [4, 4]).astype(np.float32)
+
+print(outer_product(np_x, np_y))
+
+input_x = Tensor(np_x)
+input_y = Tensor(np_y)
+
+test_op_akg = ops.Custom(outer_product)
+out = test_op_akg(input_x, input_y)
+print(out)
+```
+
+The detailed developer guide of MindSpore Hybrid DSL is as follows.
+
+#### Variables
+
+Variables in MindSpore Hybrid DSL include Tensor and Scalar.
+
+Tensor variables, besides those in the inputs of the function, must be declared with `shape` and `dtype` before use.
+
+- declare an output tensor by `output_tensor`, such as `output_tensor(shape, dtype)`.
+- declare an intermediate tensor by `allocate`, such as `allocate(shape, dtype)`.
+
+Example of Tensor allocation:
+
+```python
+@ms_hybrid
+def kernel_func(a, b):
+    # We can use a and b directly as they are inputs of the function
+
+    # d is a tensor with dtype fp16 and shape (2,), and will be used as an intermediate tensor
+    d = allocate((2,), "float16")
+    # c is a tensor with the same shape as a and the same dtype as b, and will be used as an output tensor
+    c = output_tensor(a.shape, b.dtype)
+
+    # assign value to c by d
+    d[0] = b[0, 0]
+    for i in range(4):
+        for j in range(4):
+            c[i, j] = d[0]
+
+    # c as output
+    return c
+```
+
+A Scalar variable regards its first assignment as its declaration. The assignment can be either a number or an expression. The place of the first assignment of a Scalar variable defines its scope, such as inside a certain level of a for loop. Using the variable outside its scope will lead to an error.
+
+Example of using a Scalar variable:
+
+```python
+@ms_hybrid
+def kernel_func(a, b):
+    c = output_tensor(a.shape, a.dtype)
+
+    for i in range(10):  # i loop
+        for j in range(5):  # j loop
+            # assign a number to Scalar d
+            d = 2.0
+            # assign an expression to Scalar e
+            e = a[i, j]
+            # use scalars
+            c[i, j] = d + e
+
+    # Wrong: c[0, 0] = d
+    # Can't use Scalar d outside its scope (j loop)
+    return c
+```
+
+Unlike the native Python language, once a variable is defined, we can't change its `shape` and `dtype`.
+
+#### Expressions
+
+MindSpore Hybrid DSL supports basic math operators, including `+, -, *, /`, as well as self-assign operators, including `=, +=, -=, *=, /=`.
+Users can write code the same way as writing Python expressions.
+
+**All the expressions must be based on scalars. Computation on tensors must include all indices, such as `C[i, j] = A[i, j] + B[i, j]`. Currently, tensorized code such as `C = A + B` is not supported.**
+
+When writing assignment expressions, users must take care of the dtype of the expression and make it consistent on both sides of the equality. Otherwise, an error might be thrown at the stage of **operator compilation**. Any integer number in an expression will be treated as int32, while any float number will be treated as float32. There is no implicit dtype casting in MindSpore Hybrid DSL, and all dtype casting must be written with dtype names as casting functions, including:
+
+- int32
+- float16
+- float32
+- (GPU only) int8, int16, int64, float64
+
+Example of dtype casting:
+
+```python
+@ms_hybrid
+def kernel_func(a):
+    c = output_tensor((2,), "float16")
+
+    # Wrong: c[0] = 0.1, since c's dtype is fp16 while 0.1's dtype is fp32
+    c[0] = float16(0.1)  # float16(0.1) casts the number 0.1 to dtype fp16
+    c[1] = float16(a[0, 0])  # float16(a[0, 0]) casts the value of a[0, 0] to dtype fp16
+    return c
+```
+
+#### Loop
+
+Currently, only the `for` loop is supported. `while`, `break`, and `continue` are illegal in MindSpore Hybrid DSL.
+
+Loops are the same as those in Python. `range` and `grid` are supported to express the extents of loops. `range` is for one-dimensional loops and accepts a number as the upper bound of the loop, such as:
+
+```python
+@ms_hybrid
+def kernel_func(a, b):
+    c = output_tensor((3, 4, 5), "float16")
+
+    for i in range(3):
+        for j in range(4):
+            for k in range(5):
+                c[i, j, k] = a[i, j, k] + b[i, j, k]
+    return c
+```
+
+The iteration space of the above loops is `0 <= i < 3, 0 <= j < 4, 0 <= k < 5`.
+
+`grid` is for multi-dimensional loops and accepts a `tuple` as its input.
For example, the above code can also be written as follows with `grid`:
+
+```python
+@ms_hybrid
+def kernel_func(a, b):
+    c = output_tensor((3, 4, 5), "float16")
+
+    for arg in grid((3, 4, 5)):
+        c[arg] = a[arg] + b[arg]
+    return c
+```
+
+Here `arg` is equivalent to a three-dimensional index `(i, j, k)`, with upper bounds 3, 4, and 5 respectively. We also have access to each element in `arg`, such as:
+
+```python
+@ms_hybrid
+def kernel_func(a, b):
+    c = output_tensor(a.shape, "float16")
+
+    for arg in grid(a.shape):
+        c[arg] = a[arg] + b[arg[0]]
+    return c
+```
+
+Then the expression inside the loop is equivalent to `c[i, j, k] = a[i, j, k] + b[i]`.
+
+#### Attribute
+
+Currently, we support only the tensor attributes `shape` and `dtype`, such as `a.shape` and `c.dtype`.
+
+The `shape` attribute of a Tensor variable is a `tuple`. We have access to its elements with a **fixed** index, such as `a.shape[0]`.
+
+Once `grid` accepts one tensor's `shape` attribute as its input, the dimension of the loops is the same as the dimension of the tensor. For example:
+
+```python
+@ms_hybrid
+def kernel_func(a, b):
+    c = output_tensor(a.shape, "float16")
+
+    for arg in grid(a.shape):
+        c[arg] = a[arg] + b[arg[0]]
+    return c
+```
+
+If a is a two-dimensional tensor, the expression inside the loop is equivalent to `c[i, j] = a[i, j] + b[i]`, while if a is a three-dimensional tensor, the expression inside the loop is equivalent to `c[i, j, k] = a[i, j, k] + b[i]`.
+
+#### Keywords
+
+Currently, we support keywords including:
+
+- Math keywords (all platforms): `log`, `exp`, `sqrt`, `tanh`, `power`, `floor`
+- Allocate keywords: `allocate`, `output_tensor`
+- Datatype keywords: `int32`, `float16`, `float32`, `float64`
+- For keywords: `for`, `range`, `grid`
+- In the current version, some keywords are available only on the GPU platform:
+    - Math keywords: `rsqrt`, `erf`, `isnan`, `sin`, `cos`, `isinf`, `isfinite`, `atan`, `atan2`, `expm1`, `floor`, `ceil`, `trunc`, `round`, `ceil_div`
+    - Datatype keywords: `int8`, `int16`, `int64`
+
+#### Frequent Error Messages and Error Attributions
+
+To help users effectively develop and locate bugs, MindSpore Hybrid DSL provides the following error messages, including:
+
+- TypeError: there are Python keywords such as `while`, `break` and `continue` which are not supported by MindSpore Hybrid DSL.
+- ValueError:
+    - there are built-in function names which are not in the support list above;
+    - the DSL tries to get an attribute of a tensor, but the attribute name is neither `shape` nor `dtype`.
+- Other frequent error messages:
+    - “SyntaxError”: the DSL does not conform to the Python syntax (not the syntax defined by MindSpore Hybrid DSL), and is reported by the Python interpreter itself;
+    - “ValueError: Compile error” and “The pointer\[kernel_mod\] is null”: the kernel compiler fails in compiling the DSL. Check the error messages from AKG for further information;
+    - “Launch graph failed”: the compiled kernel fails to run. Check the error message from the hardware. For example, when the kernel fails on Ascend, there will be an “Ascend error occurred” message and the corresponding hardware error messages.
\ No newline at end of file diff --git a/tutorials/experts/source_en/operation/op_gpu.md b/tutorials/experts/source_en/operation/op_gpu.md new file mode 100644 index 0000000000000000000000000000000000000000..067ac1ef46f4ef3e63db5f7c0e426f34d4aa4b65 --- /dev/null +++ b/tutorials/experts/source_en/operation/op_gpu.md @@ -0,0 +1,287 @@ +# Custom Operators (GPU) + +Translator: [Leon_02](https://gitee.com/Leon_02) + +`GPU` `Model Development` + + + +## Overview + +Operator is the basic element of constructing neural network. When built-in operators cannot meet requirements during network development, you can utilize MindSpore to quickly extend custom operators of the Graphics Processing Unit. + +- Primitive registration: the register operator primitive is the basic unit of constructing network model. Users can directly or indirectly call the operator primitive to build a neural network model. +- GPU Kernel implementation: GPU kernel is used to call GPU to accelerate computing. +- GPU Kernel registration: operator registration is used to register the GPU kernel and necessary information to the framework, and the framework completes the call to the GPU kernel. + +In this tutorial, we will develop a TensorAddV2 operator using C++ and CUDA in the mindspore framework. TensorAddV2 is used to add two tensors of the same dimension element by element. + +## Registering the Operator Primitive + +Operator primitives usually include: + +- Aperator names: operator names are used to uniquely identify operators. +- Annotations: describe the algorithm and usage constraints of operators. The annotations will be exported as Mindspore API interface documentation for developers to refer to. +- Input: the tensor(s) for operator input. +- Attributes: for example, the `data_format` attribute in Conv2d describes that the input data is in `NCHW` or `NHWC` format. +- Validation of input data: verify the validity of input data and attributes, which is convenient for developers to find the problems of network model as soon as possible. +- Output data type and dimension derivation: used to derive the data type and dimension of output. + +The following code defines an operator called TensorAddV2: + +- `TensorAddV2` is a subclass inherited from `PrimitiveWithInfer`. +- The constructor `__init__` is used to initialize the operator, since TensorAddV2 doesn't have any attributes, there is none additional input for `__init__`. +- The function `infer_shape` constraints two input dimensions must be the same and the output dimension will be same as the dimension of x1. +- The function `infer_dtype` constrains that two input data must be of type float32 and the output data type is the same as the input data type. + +```python +# mindspore/ops/operations/math_ops.py +class TensorAddV2(PrimitiveWithInfer): + """ + Adds two input tensors element-wise. 
+    """
+    @prim_attr_register
+    def __init__(self):
+        self.init_prim_io_names(inputs=['x1', 'x2'], outputs=['y'])
+
+    def infer_shape(self, x1_shape, x2_shape):
+        validator.check_integer('input dims', len(x1_shape), len(x2_shape), Rel.EQ, self.name)
+        for i in range(len(x1_shape)):
+            validator.check_integer('input_shape', x1_shape[i], x2_shape[i], Rel.EQ, self.name)
+        return x1_shape
+
+    def infer_dtype(self, x1_dtype, x2_dtype):
+        validator.check_tensor_type_same({'x1_dtype': x1_dtype}, [mstype.float32], self.name)
+        validator.check_tensor_type_same({'x2_dtype': x2_dtype}, [mstype.float32], self.name)
+        return x1_dtype
+```
+
+Next, we'll export the TensorAddV2 type in `__init__.py` so that users can conveniently import and use it in the network.
+
+```python
+# mindspore/ops/operations/__init__.py
+from .math_ops import (Abs, ACos, ..., TensorAddV2)
+...
+...
+__all__ = [
+    'ReverseSequence',
+    'CropAndResize',
+    ...,
+    'TensorAddV2'
+]
+```
+
+## Implementing a GPU operator
+
+Custom GPU operators inherit from `GPUKernel`:
+
+- `Init()`: initializes the GPU kernel; it usually records the input/output dimensions of the operator and completes the preparation before launch.
+- `GetInputSizeList()`: tells the framework the number of bytes of device (video) memory required for the input tensors.
+- `GetOutputSizeList()`: tells the framework the number of bytes of device (video) memory required for the output tensors.
+- `GetWorkspaceSizeList()`: tells the framework the number of bytes required for the `Workspace`, where `Workspace` is the space used to store temporary data during calculation.
+- `Launch()`: generally, a CUDA kernel (a kernel function running on NVIDIA's GPU parallel computing architecture) or a cuDNN interface is called to complete the operator acceleration on the GPU.
+
+The following code shows the implementation of TensorAddV2.
+To support different data types, we use a class template to define `TensorAddV2GpuKernel`:
+
+- `Init()` records the number of tensor elements.
+- `GetInputSizeList()` returns the number of bytes the input tensors need to occupy. TensorAddV2 has two inputs, and each input occupies element_num * sizeof(T) bytes.
+- `GetOutputSizeList()` returns the number of bytes the output tensor needs to occupy. TensorAddV2 has one output, and the output occupies element_num * sizeof(T) bytes.
+- Since TensorAddV2 doesn't need a `Workspace`, `GetWorkspaceSizeList()` returns an empty `std::vector`.
+- `Launch()` receives the addresses of the inputs and the output in device memory, and then calls `TensorAddV2` to complete the acceleration.
+
+```c++
+// mindspore/ccsrc/backend/kernel_compiler/gpu/math/tensor_add_v2_gpu_kernel.h
+
+template <typename T>
+class TensorAddV2GpuKernel : public GpuKernel {
+ public:
+  TensorAddV2GpuKernel() : element_num_(1) {}
+  ~TensorAddV2GpuKernel() override = default;
+
+  bool Init(const CNodePtr &kernel_node) override {
+    auto shape = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 0);
+    for (size_t i = 0; i < shape.size(); i++) {
+      element_num_ *= shape[i];
+    }
+    InitSizeLists();
+    return true;
+  }
+
+  const std::vector<size_t> &GetInputSizeList() const override { return input_size_list_; }
+  const std::vector<size_t> &GetOutputSizeList() const override { return output_size_list_; }
+  const std::vector<size_t> &GetWorkspaceSizeList() const override { return workspace_size_list_; }
+
+  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &,
+              const std::vector<AddressPtr> &outputs, void *stream_ptr) override {
+    T *x1 = GetDeviceAddress<T>(inputs, 0);
+    T *x2 = GetDeviceAddress<T>(inputs, 1);
+    T *y = GetDeviceAddress<T>(outputs, 0);
+
+    TensorAddV2(element_num_, x1, x2, y, reinterpret_cast<cudaStream_t>(stream_ptr));
+    return true;
+  }
+
+ protected:
+  void InitSizeLists() override {
+    input_size_list_.push_back(element_num_ * sizeof(T));
+    input_size_list_.push_back(element_num_ * sizeof(T));
+    output_size_list_.push_back(element_num_ * sizeof(T));
+  }
+
+ private:
+  size_t element_num_;
+  std::vector<size_t> input_size_list_;
+  std::vector<size_t> output_size_list_;
+  std::vector<size_t> workspace_size_list_;
+};
+```
+
+`TensorAddV2` calls the CUDA kernel `TensorAddV2Kernel` to implement the parallel addition of `element_num` elements:
+
+```c++
+// mindspore/ccsrc/backend/kernel_compiler/gpu/math/tensor_add_v2_gpu_kernel.h
+
+template <typename T>
+__global__ void TensorAddV2Kernel(const size_t element_num, const T* x1, const T* x2, T* y) {
+  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < element_num; i += blockDim.x * gridDim.x) {
+    y[i] = x1[i] + x2[i];
+  }
+}
+
+template <typename T>
+void TensorAddV2(const size_t &element_num, const T* x1, const T* x2, T* y, cudaStream_t stream) {
+  size_t thread_per_block = 256;
+  size_t block_per_grid = (element_num + thread_per_block - 1) / thread_per_block;
+  TensorAddV2Kernel<<<block_per_grid, thread_per_block, 0, stream>>>(element_num, x1, x2, y);
+  return;
+}
+
+template void TensorAddV2(const size_t &element_num, const float* x1, const float* x2, float* y, cudaStream_t stream);
+```
+
+## Registering the Operator Information
+
+Operator information includes:
+
+- `Primitive`
+- `Input dtype, output dtype`
+- `GPU Kernel class`
+- `CUDA built-in dtype`
+
+The framework instantiates the `GPU Kernel class` template with the `CUDA built-in dtype` based on the `Primitive` and `Input dtype, output dtype`.
+
+The TensorAddV2 operators supporting float and int are registered in the code below:
+
+```c++
+// mindspore/ccsrc/backend/kernel_compiler/gpu/math/tensor_add_v2_gpu_kernel.cc
+
+MS_REG_GPU_KERNEL_ONE(TensorAddV2, KernelAttr()
+                                     .AddInputAttr(kNumberTypeFloat32)
+                                     .AddInputAttr(kNumberTypeFloat32)
+                                     .AddOutputAttr(kNumberTypeFloat32),
+                      TensorAddV2GpuKernel, float)
+
+MS_REG_GPU_KERNEL_ONE(TensorAddV2, KernelAttr()
+                                     .AddInputAttr(kNumberTypeInt32)
+                                     .AddInputAttr(kNumberTypeInt32)
+                                     .AddOutputAttr(kNumberTypeInt32),
+                      TensorAddV2GpuKernel, int)
+```
+
+## Compiling MindSpore
+
+After writing the custom GPU operator, you need to recompile and install MindSpore. For details, see the [Installation Documentation](https://gitee.com/mindspore/docs/blob/master/install/mindspore_gpu_install_source_en.md#).
+ +## Operator verification + +At the end of the tutorial, we construct a single operator network to validate the TensorAddV2 operator we just developed: + +```python +# tests/st/ops/gpu/test_tensoraddv2_op.py + +import mindspore.context as context +from mindspore import Tensor +import mindspore.ops as ops + +context.set_context(device_target='GPU') + +@pytest.mark.level0 +@pytest.mark.platform_x86_gpu_training +@pytest.mark.env_onecard +def test_TensorAdd(): + x1 = Tensor(np.ones((3, 4), np.float32)) + x2 = Tensor(np.ones((3, 4), np.float32)) + y = ops.TensorAddV2()(x1, x2) + print('result: ', y) +``` + +When the command `pytest -s tests/st/ops/gpu/test_tensoraddv2_op.py::test_TensorAdd` executes, you can see the results meeting expectations: + +```text +result: [[2. 2. 2. 2.] + [2. 2. 2. 2.] + [2. 2. 2. 2.]] +``` + +## Defining Operators' BProp Functions + +If an operator needs to support automatic differentiation, its back-propagation function (bprop) needs to be defined in its primitives. You need to describe the reverse computing logic that uses forward input, forward output, and output gradient to get the input gradient in bprop. Reverse computation logic can be composed of built-in operators or custom reverse operators. + +The following points should be paid attention to when defining operators' bprop functions: + +- The order of input parameters of bprop function is defined as positive input, positive output and output gradient. If the operator is a multi-output operator, the forward output and output gradient will be provided in the form of tuples. +- The form of the return values of bprop function is arranged as a tuple composed of input gradient, and the order of elements in the tuple is consistent with that of forward input parameters. Even if there is only one input gradient, the return value must be in the form of tuples. + +For example, the bprop primitives of `TensorAddV2` are: + +```python +import mindspore.ops as ops +@bprop_getters.register(ops.TensorAddV2) +def get_bprop_tensoraddv2(self): + """Generate bprop for TensorAddV2""" + + def bprop(x, y, out, dout): + return dout, dout + + return bprop +``` + +Define the bprop case in document `test_tensoraddv2_op.py`. + +```python +import mindspore.ops as ops +class Grad(nn.Cell): + def __init__(self, network): + super(Grad, self).__init__() + self.grad = ops.GradOperation(sens_param=True) + self.network = network + + def construct(self, x1, x2, sens): + gout = self.grad(self.network)(x1, x2, sens) + return gout + +def test_grad_net(): + x1 = Tensor(np.ones((3, 4), np.float32)) + x2 = Tensor(np.ones((3, 4), np.float32)) + sens = Tensor(np.arange(3 * 4).reshape(3, 4).astype(np.float32)) + grad = Grad(Net()) + dx = grad(x1, x2, sense) + print("dx[0]: ", dx[0].asnumpy()) +``` + +Running case: + +```bash +pytest -s tests/st/ops/gpu/test_tensoraddv2_op.py::test_grad_net +``` + +Running results: + +```text +dx[0]: [[0. 1. 2. 3.] + [4. 5. 6. 7.] + [8. 9. 10. 
11.]] +``` + diff --git a/tutorials/experts/source_en/others/gradient_accumulation.md b/tutorials/experts/source_en/others/gradient_accumulation.md new file mode 100644 index 0000000000000000000000000000000000000000..d60bea2bbe19b47b003e2bd06005b6d17af291d0 --- /dev/null +++ b/tutorials/experts/source_en/others/gradient_accumulation.md @@ -0,0 +1,271 @@ +# Gradient Accumulation Algorithm + +`GPU` `Model Optimization` + + + +## Overview + +This tutorial describes the gradient accumulation training methods to solve the problem that some large-scale networks cannot train large batch_size due to insufficient memory. + +In a traditional training method, after a loss and a gradient are computed each time, a parameter is directly updated by using the obtained gradient. + +Compared to the traditional training method, mini-batch is introduced to the gradient accumulation. The loss and gradient are computed for each mini-batch data, but the model parameters are not updated immediately. Instead, the obtained gradients are accumulated first, and then after a specified number (N) of mini-batches, the accumulated gradient is used to update the network parameters. Before the next training, the accumulated gradients are cleared and re-accumulated. The ultimate objective is to achieve the same effect as training with N x Mini-batch data. + +This tutorial describes how to implement gradient accumulation training in standalone mode and parallel mode, respectively. + +## Standalone Mode + +In standalone mode, the training process consists of three parts: forward and backward training, parameter update, and accumulated gradient clearance. MNIST is used as an example dataset. To customize a simple model to implement gradient accumulation, perform the following steps: + +> Download the main training sample code: +> +> `auto_parallel` and `semi_auto_parallel` mode don't support gradient accumulation now. + +Since you need to use the lenet network in the models repository, please execute the following command to pull the code of the models repository + +```text +git clone https://gitee.com/mindspore/models.git +``` + +If the models repository is not in the system path, it needs to be in ` train.py ` add the following two pieces of code at the beginning of the code. + +```python +import sys +sys.path.append(path to models repository) +``` + +### Importing Library Files + +The following are the required public modules and MindSpore modules and library files. + +```python +import argparse +import os +from collections.abc import Iterable + +import mindspore.nn as nn +from mindspore import ParameterTuple +from mindspore import context, DatasetHelper, save_checkpoint +from mindspore.nn import Cell +import mindspore.ops as ops +from models.official.cv.lenet.src.dataset import create_dataset +from models.official.cv.lenet.src.lenet import LeNet5 +``` + +### Loading the Dataset + +Use the `MnistDataset` API provided by `dataset` of MindSpore to load the MNIST dataset. The code is imported from [dataset.py](https://gitee.com/mindspore/models/blob/master/official/cv/lenet/src/dataset.py) in the `lenet` directory of `models`. + +### Defining the Network + +LeNet is used as an example network. You can also use other networks, such as ResNet-50 and BERT. The code is imported from [lenet.py](https://gitee.com/mindspore/models/blob/master/official/cv/lenet/src/lenet.py) in the `lenet` directory of `models`. 
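+
+For reference, a minimal sketch of how the imported helpers are typically instantiated (the data path and batch size below are illustrative; the complete training entry point at the end of this tutorial does the same thing):
+
+```python
+ds_train = create_dataset(os.path.join("./MNIST_Data", "train"), 32)  # batches of (image, label)
+network = LeNet5(10)  # LeNet-5 with 10 output classes
+```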
+ +### Defining the Training Process + +The training process consists of three parts: forward and backward training, parameter update, and accumulated gradient clearance. + +- `TrainForwardBackward` calculates the loss and gradient, and uses grad_sum to implement gradient accumulation. +- `TrainOptim` updates parameters. +- `TrainClear` clears the gradient accumulation variable grad_sum. + +```python +_sum_op = ops.MultitypeFuncGraph("grad_sum_op") +_clear_op = ops.MultitypeFuncGraph("clear_op") + + +@_sum_op.register("Tensor", "Tensor") +def _cumulative_grad(grad_sum, grad): + """Apply grad sum to cumulative gradient.""" + add = ops.AssignAdd() + return add(grad_sum, grad) + + +@_clear_op.register("Tensor", "Tensor") +def _clear_grad_sum(grad_sum, zero): + """Apply zero to clear grad_sum.""" + success = True + success = ops.depend(success, ops.assign(grad_sum, zero)) + return success + + +class TrainForwardBackward(Cell): + def __init__(self, network, optimizer, grad_sum, sens=1.0): + super(TrainForwardBackward, self).__init__(auto_prefix=False) + self.network = network + self.network.set_grad() + self.network.add_flags(defer_inline=True) + self.weights = ParameterTuple(network.trainable_params()) + self.optimizer = optimizer + self.grad_sum = grad_sum + self.grad = ops.GradOperation(get_by_list=True, sens_param=True) + self.sens = sens + self.hyper_map = ops.HyperMap() + + def construct(self, *inputs): + weights = self.weights + loss = self.network(*inputs) + sens = ops.Fill()(ops.DType()(loss), ops.Shape()(loss), self.sens) + grads = self.grad(self.network, weights)(*inputs, sens) + return ops.depend(loss, self.hyper_map(ops.partial(_sum_op), self.grad_sum, grads)) + + +class TrainOptim(Cell): + def __init__(self, optimizer, grad_sum): + super(TrainOptim, self).__init__(auto_prefix=False) + self.optimizer = optimizer + self.grad_sum = grad_sum + + def construct(self): + return self.optimizer(self.grad_sum) + + +class TrainClear(Cell): + def __init__(self, grad_sum, zeros): + super(TrainClear, self).__init__(auto_prefix=False) + self.grad_sum = grad_sum + self.zeros = zeros + self.hyper_map = ops.HyperMap() + + def construct(self): + success = self.hyper_map(ops.partial(_clear_op), self.grad_sum, self.zeros) + return success +``` + +### Defining the Training Model + +Each mini-batch computes the loss and gradient through forward and backward training, and uses mini_steps to control the accumulated times before each parameter update. After the number of accumulation times is reached, the parameter is updated +and the accumulated gradient variable is cleared. 
+ +```python +class GradientAccumulation: + def __init__(self, network, loss_fn, optimizer): + self._network = network + self._loss_fn = loss_fn + self._optimizer = optimizer + + params = self._optimizer.parameters + self._grad_sum = params.clone(prefix="grad_sum", init='zeros') + self._zeros = params.clone(prefix="zeros", init='zeros') + self._train_forward_backward = self._build_train_forward_backward_network() + self._train_optim = self._build_train_optim() + self._train_clear = self._build_train_clear() + + @staticmethod + def _transform_callbacks(callbacks): + """Transform callback to a list.""" + if callbacks is None: + return [] + + if isinstance(callbacks, Iterable): + return list(callbacks) + + return [callbacks] + + def _build_train_forward_backward_network(self): + """Build forward and backward network""" + network = self._network + network = nn.WithLossCell(network, self._loss_fn) + loss_scale = 1.0 + network = TrainForwardBackward(network, self._optimizer, self._grad_sum, loss_scale).set_train() + return network + + def _build_train_optim(self): + """Build optimizer network""" + network = TrainOptim(self._optimizer, self._grad_sum).set_train() + return network + + def _build_train_clear(self): + """Build clear network""" + network = TrainClear(self._grad_sum, self._zeros).set_train() + return network + + def train_process(self, epoch, train_dataset, mini_steps=None): + """ + Training process. The data would be passed to network directly. + """ + dataset_helper = DatasetHelper(train_dataset, dataset_sink_mode=False, epoch_num=epoch) + + for i in range(epoch): + step = 0 + for k, next_element in enumerate(dataset_helper): + loss = self._train_forward_backward(*next_element) + if (k + 1) % mini_steps == 0: + step += 1 + print("epoch:", i + 1, "step:", step, "loss is ", loss) + self._train_optim() + self._train_clear() + + train_dataset.reset() + + save_checkpoint(self._train_forward_backward, "gradient_accumulation.ckpt", ) +``` + +### Training and Saving the Model + +Call the network, optimizer, and loss function, and then customize the `train_process` API of `GradientAccumulation` to train the model. + +```python +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='MindSpore Grad Cumulative Example') + parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU'], + help='device where the code will be implemented (default: GPU)') + parser.add_argument('--data_path', type=str, default="./Data", + help='path where the dataset is saved') + args = parser.parse_args() + + context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target) + ds_train = create_dataset(os.path.join(args.data_path, "train"), 32) + + net = LeNet5(10) + net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean") + net_opt = nn.Momentum(net.trainable_params(), 0.01, 0.9) + model = GradientAccumulation(net, net_loss, net_opt) + + print("============== Starting Training ==============") + model.train_process(10, ds_train, mini_steps=4) +``` + +## Experiment Result + +After 10 epochs, the accuracy on the test set is about 96.31%. + +**Start training.** + +1. Run the training code and view the running result. + + ```bash + python train.py --data_path=./MNIST_Data + ``` + + The output is as follows. You can see that the loss value decreases with the training. + + ```text + epoch: 1 step: 27 loss is 0.3660637 + epoch: 1 step: 28 loss is 0.25238192 + ... + epoch: 3 step: 2 loss is 0.12296932 + epoch: 3 step: 3 loss is 0.15799297 + ... 
+ epoch: 10 step: 448 loss is 0.06443884 + epoch: 10 step: 449 loss is 0.0067842817 + ``` + +2. Check the saved checkpoint files. + + The checkpoint file `gradient_accumulation.ckpt`, that is, the model file, is saved during training. + +**Validate the model.** + +Use the saved checkpoint file to load the validation dataset through [eval.py]() in the lenet directory of models. + +```bash +python eval.py --data_path=./MNIST_Data --ckpt_path=./gradient_accumulation.ckpt --device_target=GPU +``` + +The output is as follows. The accuracy of the validation dataset is about 96.31%, which is the same as the result when the value of batch_size is 32. + +```text +============== Starting Testing ============== +============== {'Accuracy': 0.9631730769230769} ============== +``` diff --git a/tutorials/experts/source_en/others/images/fp16_vs_fp32.png b/tutorials/experts/source_en/others/images/fp16_vs_fp32.png new file mode 100644 index 0000000000000000000000000000000000000000..83f965ddfa7b2bfed024e774d554f8fe15e6ab3f Binary files /dev/null and b/tutorials/experts/source_en/others/images/fp16_vs_fp32.png differ diff --git a/tutorials/experts/source_en/others/images/mix_precision_fp16.png b/tutorials/experts/source_en/others/images/mix_precision_fp16.png new file mode 100644 index 0000000000000000000000000000000000000000..8c09ff28cf8479e1d9fd13b19cb90d3379c1aa2b Binary files /dev/null and b/tutorials/experts/source_en/others/images/mix_precision_fp16.png differ diff --git a/tutorials/experts/source_en/others/mixed_precision.md b/tutorials/experts/source_en/others/mixed_precision.md new file mode 100644 index 0000000000000000000000000000000000000000..be09d7c29f5294e4f96f7c538a53e9974a951558 --- /dev/null +++ b/tutorials/experts/source_en/others/mixed_precision.md @@ -0,0 +1,244 @@ +# Enabling Mixed Precision + +`Ascend` `GPU` `Model Optimization` + + + +## Overview + +Generally, when a neural network model is trained, the default data type is FP32. In recent years, to accelerate training time, reduce memory occupied during network training, and store a trained model with same precision, more and more mixed-precision training methods are proposed in the industry. The mixed-precision training herein means that both single precision (FP32) and half precision (FP16) are used in a training process. + +## Computation Process + +Floating-point data types include double-precision (FP64), single-precision (FP32), and half-precision (FP16). In a training process of a neural network model, an FP32 data type is generally used by default to indicate a network model weight and other parameters. The following is a brief introduction to floating-point data types. + +According to IEEE 754, floating-point data types are classified into double-precision (FP64), single-precision (FP32), and half-precision (FP16). Each type is represented by three different bits. FP64 indicates a data type that uses 8 bytes (64 bits in total) for encoding and storage. FP32 indicates a data type that uses 4 bytes (32 bits in total) and FP16 indicates a data type that uses 2 bytes (16 bits in total). As shown in the following figure: + +![fp16_vs_FP32](./images/fp16_vs_fp32.png) + +As shown in the figure, the storage space of FP16 is half that of FP32, and the storage space of FP32 is half that of FP64. It consists of three parts: + +- The leftmost bit indicates the sign bit. +- The middle bits indicate exponent bits. +- The rightmost bits indicate fraction bits. + +FP16 is used as an example. 
The first bit is the sign bit, the next five bits indicate the exponent, and the last 10 bits indicate the fraction. The formula for the represented value is as follows:
+
+$$x=(-1)^{S}\times2^{E-15}\times(1+\frac{fraction}{1024})$$
+
+Similarly, the true value of a formatted FP32 is as follows:
+
+$$x=(-1)^{S}\times2^{E-127}\times(1.M)$$
+
+The true value of a formatted FP64 is as follows:
+
+$$x=(-1)^{S}\times2^{E-1023}\times(1.M)$$
+
+The maximum value that can be represented by FP16 is 0 11110 1111111111, which is calculated as follows:
+
+$$(-1)^0\times2^{30-15}\times1.1111111111(b) = 1.9990234375(d)\times2^{15} = 65504$$
+
+The minimum positive normal value that can be represented by FP16 is 0 00001 0000000000, which is calculated as follows:
+
+$$(-1)^{0}\times2^{1-15}=2^{-14}\approx6.104\times10^{-5}$$
+
+Therefore, the value range of FP16 is [-65504, 65504], and the minimum representable precision is $2^{-24}$. A value smaller than the minimum precision is flushed to 0, and a value beyond the maximum range overflows.
+
+## FP16 Training Issues
+
+Why do we need mixed precision? Compared with FP32, FP16 has the following advantages:
+
+- Reduced memory usage: The bit width of FP16 is half of that of FP32. Therefore, the memory occupied by parameters such as the weight is also half of the original memory. The saved memory can be used to store larger network models or train more data.
+- Higher communication efficiency: For distributed training, especially large-scale model training, the communication overhead restricts the overall performance. A smaller communication bit width means that the communication performance can be improved, the waiting time can be reduced, and the data flow can be accelerated.
+- Higher computing efficiency: On special AI acceleration chips, such as the Huawei Ascend 910 and 310 series, or GPUs of the NVIDIA VOLTA architecture, the computing performance of FP16 is faster than that of FP32.
+
+> If a data type given to FP16 operators is FP32, the MindSpore framework performs precision reduction at the backend. You can enable the INFO log function and search for the keyword "Reduce precision" to view operators with reduced precision.
+
+However, using FP16 also brings some problems, the most important of which are precision overflow and rounding error.
+
+- Data overflow: The valid data range of FP16 is $[6.10\times10^{-5}, 65504]$, and that of FP32 is $[1.4\times10^{-45}, 1.7\times10^{38}]$. We can see that the valid range of FP16 is much narrower than that of FP32. When FP16 is used to replace FP32, overflow and underflow occur. In deep learning, the gradient (a first-order derivative) of a weight in a network model needs to be calculated. Since the gradient is usually smaller than the weight value, underflow often occurs.
+- Rounding error: When the backward gradient of a network model is small, it can still be properly represented in FP32. However, when it is converted to FP16, a value finer than the minimum representable interval is forcibly rounded. For example, 0.00006666666 can be properly represented in FP32, but it will be represented as 0.000067 in FP16; any number that does not meet the minimum interval requirement of FP16 is forcibly rounded.
+
+## Mixed-precision Computing Process
+
+The following figure shows the typical computation process of mixed precision in MindSpore.
+
+![mix precision](./images/mix_precision_fp16.png)
+
+1. Parameters are stored in FP32 format.
+2. During the forward computation, if an FP16 operator is involved, the operator input and parameters need to be cast from FP32 to FP16.
+3.
The loss layer is set to FP32. +4. During backward computation, the value is multiplied by Loss Scale to avoid underflow due to a small gradient. +5. The FP16 parameter is used for gradient computation, and the result is cast back to FP32. +6. Then, the value is divided by Loss scale to restore the multiplied gradient. +7. The optimizer checks whether the gradient overflows. If yes, the optimizer skips the update. If no, the optimizer uses FP32 to update the original parameters. + +This document describes the computation process by using examples of automatic and manual mixed precision. + +## MindSpore Mixed-precision + +### Automatic Mixed Precision + +To use the automatic mixed-precision, you need to call the `Model` API to transfer the network to be trained and optimizer as the input. This API converts the network model operators into FP16 operators. + +> Due to precision problems, the `BatchNorm` operator and operators involved in loss still use FP32. + +1. Introduce the MindSpore model API `Model`. + +2. Define a network: This step is the same as that for defining a common network (no new configuration is required). + +3. Create a dataset: For details, see [Quick Start of Dataset](https://www.mindspore.cn/docs/programming_guide/en/master/dataset_sample.html). + +4. Use the `Model` API to encapsulate the network model, optimizer, and loss function, and set the `amp_level` parameter. For details, see [MindSpore API](https://www.mindspore.cn/docs/api/en/master/api_python/mindspore.html#mindspore.Model). In this step, MindSpore automatically selects an appropriate operator to convert FP32 to FP16. + +The following is a basic code example. First, import the required libraries and declarations, and define the LeNet-5 network model. + +```python +import numpy as np +import mindspore.nn as nn +from mindspore.nn import Accuracy +from mindspore import context, Model +from mindspore.common.initializer import Normal +from mindspore import dataset as ds + +context.set_context(mode=context.GRAPH_MODE) +context.set_context(device_target="CPU") + +class LeNet5(nn.Cell): + """ + Lenet network + + Args: + num_class (int): Number of classes. Default: 10. + num_channel (int): Number of channels. Default: 1. + + Returns: + Tensor, output tensor + + + """ + def __init__(self, num_class=10, num_channel=1): + super(LeNet5, self).__init__() + self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid') + self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid') + self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02)) + self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02)) + self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02)) + self.relu = nn.ReLU() + self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2) + self.flatten = nn.Flatten() + + def construct(self, x): + x = self.max_pool2d(self.relu(self.conv1(x))) + x = self.max_pool2d(self.relu(self.conv2(x))) + x = self.flatten(x) + x = self.relu(self.fc1(x)) + x = self.relu(self.fc2(x)) + x = self.fc3(x) + return x +``` + +Create a virtual random dataset for data input of the sample model. 
+ +```python +# create dataset +def get_data(num, img_size=(1, 32, 32), num_classes=10, is_onehot=True): + for _ in range(num): + img = np.random.randn(*img_size) + target = np.random.randint(0, num_classes) + target_ret = np.array([target]).astype(np.float32) + if is_onehot: + target_onehot = np.zeros(shape=(num_classes,)) + target_onehot[target] = 1 + target_ret = target_onehot.astype(np.float32) + yield img.astype(np.float32), target_ret + +def create_dataset(num_data=1024, batch_size=32, repeat_size=1): + input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['data','label']) + input_data = input_data.batch(batch_size, drop_remainder=True) + input_data = input_data.repeat(repeat_size) + return input_data +``` + +Set the `amp_level` parameter and use the `Model` API to encapsulate the network model, optimizer, and loss function. + +```python +ds_train = create_dataset() + +# Initialize network +network = LeNet5(10) + +# Define Loss and Optimizer +net_loss = nn.SoftmaxCrossEntropyWithLogits(reduction="mean") +net_opt = nn.Momentum(network.trainable_params(),learning_rate=0.01, momentum=0.9) +model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2", loss_scale_manager=None) + +# Run training +model.train(epoch=10, train_dataset=ds_train) +``` + +## Manual Mixed Precision + +MindSpore also supports manual mixed-precision. (Manual mixed-precision is not recommended unless you want to customize special networks and features.) + +Assume that only one dense layer on the network uses FP16 for computation and other layers use FP32. + +> The mixed-precision is configured in the unit of Cell. The default type of a Cell is FP32. + +2. Configure the mixed-precision: Use `to_float(mstype.float16)` to set the operators involved in the Cell to FP16. + +3. Use `TrainOneStepCell` to encapsulate the network model and optimizer. + +The following is a basic code example. First, import the required libraries and declarations. + +```python +import numpy as np + +import mindspore.nn as nn +from mindspore import dtype as mstype +from mindspore import Tensor, context +import mindspore.ops as ops +from mindspore.nn import WithLossCell, TrainOneStepCell +from mindspore.nn import Momentum + +context.set_context(mode=context.GRAPH_MODE) +context.set_context(device_target="Ascend") +``` + +The network is defined in the same way regardless of whether FP32 or FP16 is used. The difference is that after the network is defined, the dense layer is declared to use FP16 for computing when the network model is initialized, that is, `net.dense.to_float(mstype.float16)`. 
+ +```python +# Define network +class Net(nn.Cell): + def __init__(self, input_channel, out_channel): + super(Net, self).__init__() + self.dense = nn.Dense(input_channel, out_channel) + self.relu = ops.ReLU() + + def construct(self, x): + x = self.dense(x) + x = self.relu(x) + return x + +# Initialize network +net = Net(512, 128) +# Set mixing precision +net.to_float(mstype.float16) +net.dense.to_float(mstype.float32) + +# Define training data, label +predict = Tensor(np.ones([64, 512]).astype(np.float32) * 0.01) +label = Tensor(np.zeros([64, 128]).astype(np.float32)) + +# Define Loss and Optimizer +loss = nn.SoftmaxCrossEntropyWithLogits() +optimizer = Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9) +net_with_loss = WithLossCell(net, loss) +train_network = TrainOneStepCell(net_with_loss, optimizer) +train_network.set_train() + +# Run training +output = train_network(predict, label) +``` + +> Constraint: When mixed-precision is used, the backward network can be generated only by the automatic differential function. Otherwise, MindSpore may generate exception information indicating that the data format does not match. diff --git a/tutorials/experts/source_en/parallel/auto_parallel.md b/tutorials/experts/source_en/parallel/auto_parallel.md new file mode 100644 index 0000000000000000000000000000000000000000..577ea1c0f2a7c1a45b48189a5a87030224e083bf --- /dev/null +++ b/tutorials/experts/source_en/parallel/auto_parallel.md @@ -0,0 +1,456 @@ +# Parallel Distributed Training Interfaces + +`Ascend` `GPU` `Distributed Parallel` + + + +## Overview + +In deep learning, as the number of datasets and parameters increases, the time and hardware resources required for training increase, and it finally become a bottleneck to the training. Parallel distributed training can reduce the requirements on hardware such as memory and computing performance and is an important optimization method for training. + +MindSpore provides the parallel distributed training function and supports multiple parallel modes, including data parallel and automatic parallel. + +## Parallel Distributed Training Configuration + +The parallel distributed training configuration of MindSpore is managed by `auto_parallel_context` in a centralized manner. You can customize the configuration based on the actual situation and your own requirements. These configurations can be classified into three types: + +- General configuration: takes effect on both data parallel and automatic parallel, for example, `device_num` and `global_rank` etc. +- Automatic parallel configuration: takes effect only in automatic parallel mode, for example, `gradient_fp32_sync` etc. + +You can use `context.set_auto_parallel_context` to configure the preceding parameters and use `context.get_auto_parallel_context` to obtain the parameters. + +### General Configuration + +#### device_num + +`device_num` indicates the number of available machines. The default value is 1. The value is of the int type and must range from 1 to 4096. If you do not configure this parameter, the `Model` interface obtains the value by using the `get_group_size` method. If you set this parameter, your configuration is used. This configuration allows you to manually transfer `device_num` without using the `Model` interface. + +> In semi_auto_parallel/auto_parallel mode, constrain device_num to be 1, 2, 4, or multiples of 8. 
+ +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(device_num=8) +context.get_auto_parallel_context("device_num") +``` + +#### global_rank + +`global_rank` indicates the logical sequence number of the current device. The default value is 0. The value is of the int type and must range from 0 to 4095. If you do not set this parameter, the `Model` interface obtains the value by using the `get_rank` method. If you set this parameter, your configuration is used. This configuration allows you to manually transfer `global_rank` without using the `Model` interface. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(global_rank=0) +context.get_auto_parallel_context("global_rank") +``` + +#### gradients_mean + +`gradients_mean` indicates whether to perform the averaging operation during reverse gradient aggregation. The value is of the Boolean type. The default value is False, indicating that only the SUM operation of AllReduce is performed for gradient aggregation. `gradients_means` affects network convergence. The setting of `gradients_means` may vary in different scenarios. Therefore, MindSpore provides this interface for users to configure parameters based on the actual situation. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(gradients_mean=False) +context.get_auto_parallel_context("gradients_mean") +``` + +#### parallel_mode + +`parallel_mode` indicates the parallel mode. The value is a character string. The options are as follows: + +- `stand_alone`: standalone mode. +- `data_parallel`: data parallel mode. +- `hybrid_parallel`: hybrid parallel mode. +- `semi_auto_parallel`: semi-automatic parallel mode. In this mode, you can use the `shard` method to configure a segmentation policy for an operator. If no policy is configured, the data parallel policy is used by default. +- `auto_parallel`: automatic parallel mode. In this mode, the framework automatically creates a cost model and selects the optimal segmentation policy for users. + +The complete examples of `auto_parallel` and `data_parallel` are provided in [Distributed Training](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/distributed_training.html). + +The following is a code example: + +```python +from mindspore import context +import mindspore.ops as ops + +context.set_auto_parallel_context(parallel_mode="semi_auto_parallel") +mul = ops.Mul().shard(((2, 1), (2, 1))) +context.get_auto_parallel_context("parallel_mode") +``` + +> In the semi_auto_parallel mode, if a parameter is used by multiple operators, please ensure that the parameter layout in each operator is consistent, otherwise an error will be reported during compilation. In the following example, mul1 and mul2 share the weight, but mul1 splits weight into 8 slices by row, while mul2 splits the weight into 8 slices by column. The layout of weight in the two operators is inconsistent, compilation will be failed. 
+ +```python +import numpy as np +import mindspore as ms +import mindspore.ops as ops +from mindspore import Tensor, Parameter +from mindspore.nn import Cell + +class Net(Cell): + """Net definition""" + def __init__(self): + super(Net, self).__init__() + self.mul1 = ops.Mul().shard(((8, 1), (8, 1))) + self.mul2 = ops.Mul().shard(((1, 8), (1, 8))) + self.weight = Parameter(Tensor(np.ones([16, 32]), dtype=ms.float32), "weight1") + + def construct(self, x): + out = self.mul1(x, self.weight) + out = self.mul2(out, self.weight) + return out +``` + +#### all_reduce_fusion_config + +`all_reduce_fusion_config` allows users to customize the AllReduce segmentation policy by gradient aggregation. To reduce resource consumption and operator execution gaps, the framework fusions all the AllReduce operators of reverse gradient aggregation into one by default. However, when the model is large, the iteration smearing time increases. You can manually tune and find the optimal segmentation policy by gradient aggregation by setting this parameter based on the actual network. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(all_reduce_fusion_config=[20, 35]) +context.get_auto_parallel_context("all_reduce_fusion_config") +``` + +In the example, the value range of `all_reduce_fusion_config` is [20,35]. The first 20 AllReduce operators, the 20th to 35th AllReduce operators, and the remaining AllReduce operators are fused into three operators, respectively. + +#### enable_alltoall + +`enable_alltoall` indicates whether to allow the `AllToAll` communication operator to be generated during communication. It is a switch and value is of the Boolean type. The default value is False. When the user's network structure satisfies the conditions generated by the `AllToAll` communication operator, a series of communication operator combinations will be generated instead of `AllToAll` in the default closed state, so turning it on will result in a significant performance improvement. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(enable_alltoall=True) +context.get_auto_parallel_context("enable_alltoall") +``` + +Note that enabling it has specific requirements for your network configuration. + +#### enable_parallel_optimizer + +`enable_parallel_optimizer` is a feature under development. The default value is False. In data parallel, weight update has redundant computation among devices. Parallel optimizer shards the computation of optimizer to each device. For large-scale networks like Bert and GPT, this feature could reduce requirements on memory and improve the performance efficiently. + +The optimizers may parallel in the `data_parallel` mode, and MindSpore will split the parameters that need to be updated into different devices, and then use the Broadcast operator to share weights between clusters after each update. It should be noted that the number of parameters should be greater than the number of machines. Currently, only the `Lamb` and `AdamWeightDecay` optimizers are supported. + +In the `auto_parallel` or `semi_auto_parallel` mode, the optimizer parallel is enabled. If one parameter which has been sliced by shard strategy still has repeated slices among devices, and the highest dimension of the shape can be divided by the number of devices, MindSpore would save parameters and update them by the smallest slice shapes. All optimizers are supported under this two modes. 
+ +No matter which parallel mode is selected, parallel optimizer would not influence the forward and backward graph. Only the computation of weight update would be influenced. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(enable_parallel_optimizer=True) +context.get_auto_parallel_context("enable_parallel_optimizer") +``` + +#### parameter_broadcast + +Parameter broadcast shares the value of data parallel weights among devices, in the purpose of synchronization of weights. The default value is False and only the graph mode is supported. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(parameter_broadcast=True) +context.get_auto_parallel_context("parameter_broadcast") +``` + +#### comm_fusion + +`comm_fusion` allows user to configure the communication fusion for various communication operators, and for now, `allreduce`, `allgather`, `reducescatter` are supported. For `allreduce`, it has three `mode` options: + +- `auto`:automatic `allreduce` communication operators fusion by gradients size, and another parameter `config` is `None`. The gradients fusion size is automatically set by 64 MB. +- `size`:manual communication operators fusion by gradients size, and the type of another parameter `config` is `int` and unit is `MB`. +- `index`:manual communication operators fusion by parameters' index,same as `all_reduce_fusion_config`, and the type of parameter `config` is `list(int)`. + +For `allgather` and `reducescatter`, two `mode` options are supported: + +- `auto`:same as `allreduce`, parameter `config` is `None`. The gradients fusion size is automatically set by 64 MB. +- `size`:same as `allreduce`, and the type of another parameter `config` is `int` and unit is `MB`. + +The following is a code example: + +```python +from mindspore import context + +# allreduce auto +context.set_auto_parallel_context(comm_fusion={"allreduce": {"mode": "auto", "config": None}}) + +# allreduce size +context.set_auto_parallel_context(comm_fusion={"allreduce": {"mode": "size", "config": 32}}) + +# allreduce index +context.set_auto_parallel_context(comm_fusion={"allreduce": {"mode": "index", "config": [20, 35]}}) + +# allgather and reducescatter size +context.set_auto_parallel_context(comm_fusion={"allgather": {"mode": "size", "config": 16}, + "reducescatter": {"mode": "size", "config": 32}}) +``` + +### Automatic Parallel Configuration + +#### gradient_fp32_sync + +`gradient_fp32_sync` indicates whether gradients are aggregated based on the FP32 type. The value is of the Boolean type. The default value is True, indicating that gradients are aggregated based on the FP32 type. Due to the special structure of the `Ascend` AI processor, the speed of aggregating FP32 data is higher than that of aggregating FP16 data, but the precision may be affected. Therefore, MindSpore provides the `gradient_fp32_sync` interface for users to make choices based on the actual situation. + +The following is a code example: + +```python +from mindspore import context + +context.set_auto_parallel_context(gradient_fp32_sync=False) +context.get_auto_parallel_context("gradient_fp32_sync") +``` + +#### search_mode + +`auto_parallel_search_mode` is replaced by `search_mode`. `auto_parallel_search_mode` will be deleted in the future MindSpore version. This attribute indicates the algorithm chosen for searching op-level parallelism: `dynamic_programming`, `recursive_programming`, and `sharding_propagation`. 
+
+`dynamic_programming` can search for the optimal policy depicted by the cost model, but it takes a long time to search for the strategy of a huge network model.
+`recursive_programming` can quickly search out parallel strategies, but the found strategies may not deliver the optimal running performance. `sharding_propagation` requires users to configure sharding strategies for some operators; the algorithm then propagates the strategies from the configured operators to the non-configured operators, with the goal of minimizing the communication incurred by [tensor redistribution](https://www.mindspore.cn/docs/programming_guide/en/master/design/distributed_training_design.html#id10). MindSpore allows users to select a search algorithm. The default value is `dynamic_programming`.
+
+The following is a code example:
+
+```python
+from mindspore import context
+
+context.set_auto_parallel_context(search_mode="dynamic_programming")
+context.get_auto_parallel_context("search_mode")
+```
+
+#### strategy_ckpt_load_file
+
+Specifies a path to load the segmentation information of all operators with weights in automatic parallel mode.
+
+The following is a code example:
+
+```python
+from mindspore import context
+
+context.set_auto_parallel_context(strategy_ckpt_load_file="./")
+context.get_auto_parallel_context("strategy_ckpt_load_file")
+```
+
+#### strategy_ckpt_save_file
+
+Specifies a path for storing the segmentation information of all operators with weights in automatic parallel mode.
+
+The following is a code example:
+
+```python
+from mindspore import context
+
+context.set_auto_parallel_context(strategy_ckpt_save_file="./")
+context.get_auto_parallel_context("strategy_ckpt_save_file")
+```
+
+#### full_batch
+
+`full_batch` allows users to determine whether to import datasets in full mode. The default value is False, that is, datasets are imported in data parallel mode. In special scenarios, the performance of full dataset import is better than that of import in data parallel mode, for example, when the WideDeep network is used in uneven segmentation scenarios. Therefore, MindSpore provides the `full_batch` configurable interface.
+
+The following is a code example:
+
+```python
+from mindspore import context
+
+context.set_auto_parallel_context(full_batch=False)
+context.get_auto_parallel_context("full_batch")
+```
+
+#### pipeline_stages
+
+`pipeline_stages` is used to set the stage information of pipeline parallelism. It indicates how the devices are distributed among the pipeline stages in `auto_parallel` mode. Currently pipeline parallel is still under development.
+
+The following is a code example:
+
+```python
+from mindspore import context
+
+context.set_auto_parallel_context(pipeline_stages=4)
+context.get_auto_parallel_context("pipeline_stages")
+```
+
+#### parallel_optimizer_config
+
+`parallel_optimizer_config` is a dict that contains the keys and values for configuring the parallel optimizer. The configuration provides more detailed behavior control of parallel training when the parallel optimizer is enabled, and it takes effect only when `context.set_auto_parallel_context(enable_parallel_optimizer=True)` is used. It supports the following keys.
+
+- `gradient_accumulation_shard(bool)`: If true, the accumulation gradient parameters will be sharded across the data parallel devices.
This will bring in additional communication(ReduceScatter) at each step when accumulate the gradients, but saves a lot of device memory, thus can make model be trained with larger batch size. This configuration is effective only when the model runs on pipeline training or gradient accumulation with data parallel. The default value is True. + + ```python + from mindspore import context + context.set_auto_parallel_context(parallel_optimizer_config={"gradient_accumulation_shard": True}, enable_parallel_optimizer=True) + ``` + +- `parallel_optimizer_threshold(int)`: This is the threshold of parameter size when applying parallel optimizer. Parameters with size smaller than this value will not be sharded in parallel optimizer. + + ```python + import numpy as np + from mindspore import Parameter, Tensor, context, dtype + param = Parameter(Tensor(np.ones((10, 2)), dtype=dtype.float32), name='weight1') + # float32 data type usually takes 4Bytes: + # param_size = np.prod(list(param.shape)) * 4 = (10 * 2) * 4 = 80B < 24KB, thus this param will not be sharded + context.set_auto_parallel_context(parallel_optimizer_config={"parallel_optimizer_threshold": 24}) + ``` + +#### dataset_strategy + +In the distributed training scenario of semi_auto_parallel/auto_parallel mode, there are rich segmentation strategies for the import method of datasets, such as data parallel import, full batch import and more free hybrid parallel import, which can be configured through `dataset_strategy`. + +The following is a code example: + +```python +from mindspore import context +# Set the input to be split on the first dimension. At this time, the user is required to ensure that the input returned by the dataset is split on the first dimension. +context.set_auto_parallel_context(dataset_strategy=((1, 8), (1, 8))) +# Datasets are imported into each card in data-parallel +context.set_auto_parallel_context(dataset_strategy="data_parallel") +# Datasets are imported into each card in full-bacth +context.set_auto_parallel_context(dataset_strategy="full_batch") +``` + +## Distributed Communication Interface + +`mindspore.communication` encapsulates a collection of communication interfaces used by parallel distributed training, facilitating users to configure distributed information. + +### init + +`init` enables MindSpore communication and initializes distributed training. `init` must be invoked after `context.set_context`. You can transfer the communication backend information to the `init`. The `init` performs initialization based on the backend information. + +- `hccl`: short for `Huawei Collective Communication Library`, used for the `Ascend` processor platform. +- `nccl`: short for `NVIDIA Collective Communication Library`, used for the `GPU` processor platform. + +If you do not configure the communication backend, MindSpore automatically configures it based on the `device_target` information in `context`. + +The following is a code example: + +```python +from mindspore import context +from mindspore.communication import init + +context.set_context(device_target='GPU') +init() +``` + +> On the GPU processor platform, MindSpore also supports starting distributed training without relying on 'OpenMPI', and also uses this interface for distributed training initialization. For specific usage, please refer to [not using OpenMPI training](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_gpu.html#openmpi). 
In this case, when the user does not use 'mpirun' to start the process, but still calls the 'init()' method, MindSpore requires the user to follow [not using OpenMPI training](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/distributed_training_gpu.html#openmpi) to configure several environment variables. If not configured, MindSpore will give a reasonable error prompt. Therefore, it is recommended to call this method only when executing distributed training, and when trying to start distributed training without using 'mpirun', please configure the correct environment variables according to the document. + +### get_group_size + +`get_group_size` allows users to obtain the number of clusters. Invoke `init` before using the `get_group_size` interface. + +The following is a code example: + +```python +from mindspore import context +from mindspore.communication import init, get_group_size + +context.set_context(device_target='GPU') +init() +group_size = get_group_size() +``` + +### get_rank + +`get_rank` allows users to obtain the ID of the current device in the cluster. Invoke `init` before using the `get_rank` interface. + +The following is a code example: + +```python +from mindspore import context +from mindspore.communication import init, get_rank + +context.set_context(device_target='GPU') +init() +rank_id = get_rank() +``` + +## Distributed Attribute Configuration + +### cross_batch + +In specific scenarios, the calculation logic of `data_parallel` is different from that of `stand_alone`. The calculation logic of `auto_parallel` is the same as that of `stand_alone` in any scenario. The convergence effect of `data_parallel` may be better. Therefore, MindSpore provides the `cross_batch` parameter to ensure that the calculation logic of `auto_parallel` is consistent with that of `data_parallel`. You can use the `add_prim_attr` method to configure the logic. The default value is False. + +The following is a code example: + +```python +import mindspore.ops as ops + +mul = ops.Mul().add_prim_attr("cross_batch", True) +``` + +### fusion + +To ensure performance, MindSpore provides the fusion function for the `AllGather` and `AllReduce` operators. Operators of the same type (of the same operator type and in the same communication domain) with the same `fusion` value will be fused together. The value of `fusion` must be greater than or equal to 0. When the value of `fusion` is 0, operators will not be fused together. Only `Ascend` backend is supported. + +There are two ways for configuration. If the communication operators are called explicitly, `add_prim_attr` could be used to configure. The following is a code example: + +```python +import mindspore.ops as ops + +allreduce1 = ops.AllReduce().add_prim_attr("fusion", 1) +allreduce2 = ops.AllReduce().add_prim_attr("fusion", 1) +``` + +`allreduce1` and `allreduce2` will be fused into one operator during execution. + +In `auto_parallel` and `semi_auto_parallel` mode, some communication operators used for parameters or gradients aggregation are inserted automatically. So the attribute should be added on a `Cell` or a `Parameter`. 
For example: + +```python +import numpy as np +from mindspore import ops +import mindspore.nn as nn +from mindspore import Tensor, Parameter +from mindspore import context + +class Net(nn.Cell): + """Net definition""" + def __init__(self): + super(Net, self).__init__() + self.fc1 = ops.MatMul() + self.fc2 = ops.MatMul() + self.p1 = Parameter(Tensor(np.ones([48, 64]).astype(np.float32)), name="weight1") + self.p1.comm_fusion = 2 + self.p2 = Parameter(Tensor(np.ones([64, 16]).astype(np.float32)), name="weight2") + + def construct(self, x, y): + x = self.fc1(x, self.p1) + x = self.fc2(x, self.p2) + return x - y + +context.set_context(mode=context.GRAPH_MODE) +context.set_auto_parallel_context(parallel_mode="auto_parallel", device_num=8) +net = Net().set_comm_fusion(2) +``` + +Here the `comm_fusion` of parameter `Net.p1` is 2, which means the attribute `fusion` is 2 for the communication operators generated for this parameter. When you need to manipulate the parameters in batches, it is recommended to call `set_comm_fusion` to set `comm_fusion` for all the parameters in the Net. The value of attribute will be overwritten when the function is invoked for multiple times. + +> When a parameter is shared, the operators connected with the parameter should have the same data type. Otherwise, fusion would failed. + +### layerwise_parallel + +In `hybrid_parallel` mode, you need to manually split the model. You need to manually add the `layerwise_parallel` flag to the parallel parameters of the model. The framework filters out the gradient aggregation operation for the parallel parameters of the model based on the flag. + +The following is a code example: + +```python +import numpy as np +from mindspore import Parameter, Tensor + +x = Parameter(Tensor(np.ones([2, 2])), name='weight1', layerwise_parallel=True) +``` diff --git a/tutorials/experts/source_en/parallel/distributed_advanced.rst b/tutorials/experts/source_en/parallel/distributed_advanced.rst new file mode 100644 index 0000000000000000000000000000000000000000..47a77c464d646d1c3460e279d7c9a4809052c6b5 --- /dev/null +++ b/tutorials/experts/source_en/parallel/distributed_advanced.rst @@ -0,0 +1,12 @@ +Distributed Parallel Advanced Features +======================================= + +.. toctree:: + :maxdepth: 1 + + pipeline_parallel + host_device_training + parameter_server_training + distributed_inference + recompute + sharding_propagation diff --git a/tutorials/experts/source_en/parallel/distributed_example.rst b/tutorials/experts/source_en/parallel/distributed_example.rst new file mode 100644 index 0000000000000000000000000000000000000000..8b22f7e7fb32c2b1846d6c98dabaffda81209e7a --- /dev/null +++ b/tutorials/experts/source_en/parallel/distributed_example.rst @@ -0,0 +1,9 @@ +Distributed Parallel Usage Example +================================== + +.. toctree:: + :maxdepth: 1 + + distributed_training_ascend + distributed_training_gpu + save_load_model_hybrid_parallel diff --git a/tutorials/experts/source_en/parallel/distributed_inference.md b/tutorials/experts/source_en/parallel/distributed_inference.md new file mode 100644 index 0000000000000000000000000000000000000000..331583ba5020da4c8c56bec4ca85e4bdcf22c40f --- /dev/null +++ b/tutorials/experts/source_en/parallel/distributed_inference.md @@ -0,0 +1,116 @@ +# Distributed Inference + +`Ascend` `Inference Application` + + + +Distributed inference means use multiple devices for prediction. 
 If data parallel or integrated save is used in training, the method of distributed inference is the same as described above. Note that each device should load the same checkpoint file.
+
+## Process of Distributed Inference
+
+This tutorial focuses on the scenario where model slices are saved on each device during distributed training, and the model is then reloaded according to the prediction strategy in the inference stage. Because a super-large-scale neural network model has too many parameters to be fully loaded into a single device for inference, multiple devices can be used for distributed inference.
+
+> Distributed inference sample code:
+>
+>
+
+The process of distributed inference is as follows:
+
+1. Execute training, and generate the checkpoint file and the model strategy file.
+
+    > - The distributed training tutorial and sample code can be found at the link: .
+    > - In the distributed inference scenario, during the training phase, the `integrated_save` attribute of the `CheckpointConfig` interface should be set to `False`, which means that each device saves only its slice of the model instead of the full model.
+    > - The `parallel_mode` of the `set_auto_parallel_context` interface should be set to `auto_parallel` or `semi_auto_parallel`.
+    > - In addition, you need to specify `strategy_ckpt_save_file` to indicate the path of the strategy file.
+    > - If pipeline distributed inference is used, pipeline parallel training must also be used, and the `device_num` and `pipeline_stages` used for pipeline training and inference must be the same. When applying pipeline inference, `micro_size` is 1 and there is no need to use `PipelineCell`. The pipeline distributed training tutorial can be found at the link: .
+
+2. Set the context and infer the prediction strategy according to the prediction data.
+
+    ```python
+    context.set_auto_parallel_context(full_batch=True, parallel_mode='semi_auto_parallel', strategy_ckpt_load_file='./train_strategy.ckpt')
+    network = Net()
+    model = Model(network)
+    predict_data = create_predict_data()
+    predict_strategy = model.infer_predict_layout(predict_data)
+    ```
+
+    In the preceding information:
+
+    - `full_batch`: whether to load the dataset in full or not. When `True`, it indicates full load, and the data of each device is the same. It must be set to `True` in this scenario.
+    - `parallel_mode`: parallel mode, which must be `auto_parallel` or `semi_auto_parallel`.
+    - `strategy_ckpt_load_file`: file path of the strategy generated in the training phase, which must be set in the distributed inference scenario.
+    - `create_predict_data`: user-defined interface that returns prediction data whose type is `Tensor`.
+    - `infer_predict_layout`: generates the prediction strategy based on the prediction data.
+
+3. Load the checkpoint files, and load the corresponding model slice into each device based on the prediction strategy.
+
+    ```python
+    ckpt_file_list = create_ckpt_file_list()
+    load_distributed_checkpoint(network, ckpt_file_list, predict_strategy)
+    ```
+
+    In the preceding information:
+
+    - `create_ckpt_file_list`: user-defined interface that returns a list of checkpoint file paths in order of rank id (a minimal sketch of such a helper is shown near the end of this document).
+    - `load_distributed_checkpoint`: merges the model slices, splits them according to the prediction strategy, and loads them into the network.
+
+    > For pipeline inference, each `stage` only needs to load the checkpoint file of its own stage.
+    >
+    > The `load_distributed_checkpoint` interface also supports setting `predict_strategy` to `None`, which corresponds to single-device inference; that process is different from distributed inference. The detailed usage can be found at the link:
+    > .
+
+4. Execute inference.
+
+    ```python
+    model.predict(predict_data)
+    ```
+
+## Distributed Export MindIR File With Multi Devices
+
+When a super-large-scale neural network model has too many parameters, the MindIR model cannot be completely loaded onto a single card for inference, so multiple cards must be used for distributed inference. In this case, multiple MindIR files need to be exported before the inference task. For multi-card training and distributed inference, the MindIR files need to be exported in a distributed manner. The specific method is as follows:
+
+First, you need to prepare the checkpoint files and the training strategy file.
+
+The checkpoint files are generated during the training process. For the specific usage of checkpoint, please refer to: [checkpoint usage](https://www.mindspore.cn/docs/programming_guide/en/master/save_model.html#checkpoint).
+
+The training strategy file is generated by setting the context during training. The context configuration item is as follows:
+`context.set_auto_parallel_context(strategy_ckpt_save_file='train_strategy.ckpt')`
+
+In this way, after training, a training strategy file named `train_strategy.ckpt` will be generated in the specified directory.
+
+Before exporting the MindIR files, the checkpoint files need to be loaded: the distributed training checkpoint files, the training strategy, and the inference strategy need to be combined, so the inference strategy file must be generated first.
+The code to generate the inference strategy is as follows:
+`predict_strategy = model.infer_predict_layout(predict_data)`
+
+Then, use the method of loading distributed checkpoints to load the previously trained parameters into the network.
+The code is as follows:
+`load_distributed_checkpoint(network, ckpt_file_list, predict_strategy)`
+
+For the specific usage of `load_distributed_checkpoint`, please refer to: [Distributed Inference](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference_ascend_910.html#distributed-inference-with-multi-devices).
+
+Finally, you can export the MindIR files in the distributed inference scenario.
+
+The core code is as follows:
+
+```python
+# Configure the strategy file generated during the training process in the context
+context.set_auto_parallel_context(strategy_ckpt_load_file='train_strategy.ckpt')
+# Define the network structure
+network = Net()
+model = Model(network)
+# Generate the inference strategy
+predict_strategy = model.infer_predict_layout(predict_data)
+# Create the checkpoint file list
+ckpt_file_list = create_ckpt_file_list()
+# Load the distributed parameters into the network
+load_distributed_checkpoint(network, ckpt_file_list, predict_strategy)
+# Export the distributed MindIR file
+export(network, Tensor(input), file_name='net', file_format='MINDIR')
+```
+
+In the case of multi-card training and single-card inference, the usage of exporting MindIR is the same as that on a single machine. For the usage of loading the checkpoint, please refer to: [Distributed Inference](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference_ascend_910.html#ascend-910-ai).
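+
+Both the inference procedure above and the export procedure call a user-defined `create_ckpt_file_list` helper. The snippet below is only a minimal sketch of such a helper; the directory layout (`rank_0`, `rank_1`, ...), the file name prefix and the default rank size are assumptions for illustration and should be adapted to how the training script actually names and stores its checkpoint files:
+
+```python
+import os
+
+def create_ckpt_file_list(ckpt_dir='./src_checkpoints', prefix='net', rank_size=8):
+    """Return one checkpoint file path per device, ordered by rank id."""
+    ckpt_file_list = []
+    for rank_id in range(rank_size):
+        # Assumed layout: ./src_checkpoints/rank_0/net.ckpt, ./src_checkpoints/rank_1/net.ckpt, ...
+        ckpt_file_list.append(os.path.join(ckpt_dir, 'rank_{}'.format(rank_id), '{}.ckpt'.format(prefix)))
+    return ckpt_file_list
+```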
+ +> Distributed scene export MindIR file sample code: +> +> diff --git a/tutorials/experts/source_en/parallel/distributed_training.rst b/tutorials/experts/source_en/parallel/distributed_training.rst new file mode 100644 index 0000000000000000000000000000000000000000..2cffc2af1919c9cf600ca6fbd4f27d66693e9953 --- /dev/null +++ b/tutorials/experts/source_en/parallel/distributed_training.rst @@ -0,0 +1,25 @@ +Distributed Parallel Overview +============================== + +In deep learning, the increasing number of dataset and parameters prolongs the training time and requires more hardware resources, which becomes a training bottleneck. Parallel training is an important optimization method, which reduces requirements of a single device for high memory and performance. Parallelisms are generally classified into the following types: + +- Data parallelism: splits data into many batches and then allocates the batches to each device for model computation. +- Model parallelism: splits the model across multiple devices, which includes op-level model parallelism, pipeline model parallelism and optimizer model parallelism. +- Hybrid parallelism: contains data parallelism and model parallelism. + +MindSpore also provides the parallel training functionality. It supports the following modes: + +- `DATA_PARALLEL`: data parallelism. +- `AUTO_PARALLEL`: automatic parallelism, which integrates data parallelism, model parallelism, and hybrid parallelism. A cost model is built to characterize training time and memory usage. Currently, MindSpore supports searching strategies for op-level model parallelism, which includes three different algorithms as follows: + + - `dynamic_programming`: Dynamic programming search algorithm. The optimal strategy under the cost model description can be found, but it takes a long time to search for parallel strategy of huge network model. Its cost model refers to modeling the training time based on the memory-based computation and communication overheads of the Ascend 910 chip. + - `recursive_programming`: Double recursive programming search algorithm. The optimal strategy can be generated instantly even for a large network. Its symbolic cost model can be flexibly adapted to different accelerator clusters. + - `sharding_propagation`: Sharding Propagation algorithms. This mode requires users to configure sharding strategies for some operators, and then propagates from these operators to other operators, with the goal of minimizing communication cost in tensor redistribution. For definitions of sharding strategy and tensor redistribution, please refer to this [design article](https://www.mindspore.cn/docs/programming_guide/en/master/design/distributed_training_design.html#id10) + +- `HYBRID_PARALLEL`: On MindSpore, users manually split parameters to implement intra-layer model parallelism. + +.. 
toctree:: + :maxdepth: 1 + + auto_parallel + distributed_training_mode \ No newline at end of file diff --git a/tutorials/experts/source_en/parallel/distributed_training_ascend.md b/tutorials/experts/source_en/parallel/distributed_training_ascend.md new file mode 100644 index 0000000000000000000000000000000000000000..934b6fe4c8a008c5a9eb22cb86f72b2049256c45 --- /dev/null +++ b/tutorials/experts/source_en/parallel/distributed_training_ascend.md @@ -0,0 +1,687 @@ +# Parallel Distributed Training Example (Ascend) + +`Ascend` `Distributed Parallel` `Whole Process` + + + +## Overview + +This tutorial describes how to train the ResNet-50 network in data parallel and automatic parallel modes on MindSpore based on the Ascend 910 AI processor. +> Download address of the complete sample code: + +The directory structure is as follow: + +```text +└─sample_code + ├─distributed_training + │ rank_table_16pcs.json + │ rank_table_8pcs.json + │ rank_table_2pcs.json + │ cell_wrapper.py + │ model_accu.py + │ resnet.py + │ resnet50_distributed_training.py + │ resnet50_distributed_training_gpu.py + │ resnet50_distributed_training_grad_accu.py + │ run.sh + │ run_gpu.sh + │ run_grad_accu.sh + │ run_cluster.sh +``` + +`rank_table_16pcs.json`, `rank_table_8pcs.json` and `rank_table_2pcs.json` are the networking information files. `resnet.py`,`resnet50_distributed_training.py` , `resnet50_distributed_training_gpu.py` and `resnet50_distributed_training_grad_accu.py` are the network structure files. `run.sh` , `run_gpu.sh`, `run_grad_accu.sh` and `run_cluster.sh` are the execute scripts. + +Besides, we describe the usages of hybrid parallel and semi-auto parallel modes in the sections [Defining the Network](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_ascend.html#defining-the-network) and [Distributed Training Model Parameters Saving and Loading](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_ascend.html#distributed-training-model-parameters-saving-and-loading). + +## Preparations + +### Downloading the Dataset + +This sample uses the `CIFAR-10` dataset, which consists of color images of 32 x 32 pixels in 10 classes, with 6000 images per class. There are 50,000 images in the training set and 10,000 images in the test set. + +> `CIFAR-10` dataset download address: + +Download the dataset and decompress it to a local path. The folder generated after the decompression is `cifar-10-batches-bin`. + +### Configuring Distributed Environment Variables + +When distributed training is performed in the bare-metal environment (compared with the cloud environment where the Ascend 910 AI processor is deployed on the local host), you need to configure the networking information file for the current multi-device environment. If the HUAWEI CLOUD environment is used, skip this section because the cloud service has been configured. + +The following uses the Ascend 910 AI processor as an example. The JSON configuration file for an environment with eight devices is as follows. In this example, the configuration file is named as `rank_table_8pcs.json`. For details about how to configure the 2-device environment, see the `rank_table_2pcs.json` file in the sample code. 
+ +```json +{ + "version": "1.0", + "server_count": "1", + "server_list": [ + { + "server_id": "10.155.111.140", + "device": [ + {"device_id": "0","device_ip": "192.1.27.6","rank_id": "0"}, + {"device_id": "1","device_ip": "192.2.27.6","rank_id": "1"}, + {"device_id": "2","device_ip": "192.3.27.6","rank_id": "2"}, + {"device_id": "3","device_ip": "192.4.27.6","rank_id": "3"}, + {"device_id": "4","device_ip": "192.1.27.7","rank_id": "4"}, + {"device_id": "5","device_ip": "192.2.27.7","rank_id": "5"}, + {"device_id": "6","device_ip": "192.3.27.7","rank_id": "6"}, + {"device_id": "7","device_ip": "192.4.27.7","rank_id": "7"}], + "host_nic_ip": "reserve" + } + ], + "status": "completed" +} +``` + +The following parameters need to be modified based on the actual training environment: + +- `server_count`: number of hosts. +- `server_id`: IP address of the local host. +- `device_id`: physical sequence number of a device, that is, the actual sequence number of the device on the corresponding host. +- `device_ip`: IP address of the integrated NIC. You can run the `cat /etc/hccn.conf` command on the current host. The key value of `address_x` is the IP address of the NIC. +- `rank_id`: logical sequence number of a device, which starts from 0. + +### Calling the Collective Communication Library + +The Huawei Collective Communication Library (HCCL) is used for the communication of MindSpore parallel distributed training and can be found in the Ascend 310 AI processor software package. In addition, `mindspore.communication.management` encapsulates the collective communication API provided by the HCCL to help users configure distributed information. +> HCCL implements multi-device multi-node communication based on the Ascend AI processor. The common restrictions on using the distributed service are as follows. For details, see the HCCL documentation. +> +> - In a single-node system, a cluster of 1, 2, 4, or 8 devices is supported. In a multi-node system, a cluster of 8 x N devices is supported. +> - Each host has four devices numbered 0 to 3 and four devices numbered 4 to 7 deployed on two different networks. During training of 2 or 4 devices, the devices must be connected and clusters cannot be created across networks. +> - When we create a multi-node system, all nodes should use one same switch. +> - The server hardware architecture and operating system require the symmetrical multi-processing (SMP) mode. +> - Currently only supports global single group communication in PyNative mode. + +The sample code for calling the HCCL is as follows: + +```python +import os +from mindspore import context +from mindspore.communication import init + +if __name__ == "__main__": + context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=int(os.environ["DEVICE_ID"])) + init() + ... +``` + +In the preceding code: + +- `mode=context.GRAPH_MODE`: sets the running mode to graph mode for distributed training. (The PyNative mode only support data parallel running.) +- `device_id`: physical sequence number of a device, that is, the actual sequence number of the device on the corresponding host. +- `init`: enables HCCL communication and completes the distributed training initialization. + +## Loading the Dataset in Data Parallel Mode + +During distributed training, data is imported in data parallel mode. The following takes the CIFAR-10 dataset as an example to describe how to import the CIFAR-10 dataset in data parallel mode. 
`data_path` indicates the dataset path, which is also the path of the `cifar-10-batches-bin` folder. + +```python +from mindspore import dtype as mstype +import mindspore.dataset as ds +import mindspore.dataset.transforms.c_transforms as C +import mindspore.dataset.vision.c_transforms as vision +from mindspore.communication import get_rank, get_group_size + +def create_dataset(data_path, repeat_num=1, batch_size=32, rank_id=0, rank_size=1): + resize_height = 224 + resize_width = 224 + rescale = 1.0 / 255.0 + shift = 0.0 + + # get rank_id and rank_size + rank_id = get_rank() + rank_size = get_group_size() + data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id) + + # define map operations + random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4)) + random_horizontal_op = vision.RandomHorizontalFlip() + resize_op = vision.Resize((resize_height, resize_width)) + rescale_op = vision.Rescale(rescale, shift) + normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023)) + changeswap_op = vision.HWC2CHW() + type_cast_op = C.TypeCast(mstype.int32) + + c_trans = [random_crop_op, random_horizontal_op] + c_trans += [resize_op, rescale_op, normalize_op, changeswap_op] + + # apply map operations on images + data_set = data_set.map(operations=type_cast_op, input_columns="label") + data_set = data_set.map(operations=c_trans, input_columns="image") + + # apply shuffle operations + data_set = data_set.shuffle(buffer_size=10) + + # apply batch operations + data_set = data_set.batch(batch_size=batch_size, drop_remainder=True) + + # apply repeat operations + data_set = data_set.repeat(repeat_num) + + return data_set +``` + +Different from the single-node system, the multi-node system needs to transfer the `num_shards` and `shard_id` parameters to the dataset API. The two parameters correspond to the number of devices and logical sequence numbers of devices, respectively. You are advised to obtain the parameters through the HCCL API. + +- `get_rank`: obtains the ID of the current device in the cluster. +- `get_group_size`: obtains the number of devices. + +> Under data parallel mode, it is recommended to load the same dataset file for each device, or it may cause accuracy problems. + +## Defining the Network + +In data parallel and automatic parallel modes, the network definition method is the same as that in a single-node system. The reference code of ResNet is as follows: + +In this section we focus on how to define a network in hybrid parallel or semi-auto parallel mode. + +### Hybrid Parallel Mode + +Hybrid parallel mode adds the setting `layerwise_parallel` for `parameter` based on the data parallel mode. The `parameter` with the setting would be saved and computed in slice tensor and would not apply gradients aggregation. In this mode, MindSpore would not infer computation and communication for parallel operators automatically. To ensure the consistency of calculation logic, users are required to manually infer extra operations and insert them to networks. Therefore, this parallel mode is suitable for the users with deep understanding of parallel theory. + +In the following example, specify the `self.weight` as the `layerwise_parallel`, that is, the `self.weight` and the output of `MatMul` are sliced on the second dimension. At this time, perform ReduceSum on the second dimension would only get one sliced result. `AllReduce.Sum` is required here to accumulate the results among all devices. 
 For more information about the parallel theory, please refer to the [design document](https://www.mindspore.cn/docs/programming_guide/en/master/design/distributed_training_design.html).
+
+```python
+import numpy as np
+from mindspore import Tensor, Parameter
+import mindspore.ops as ops
+from mindspore import dtype as mstype
+import mindspore.nn as nn
+
+class HybridParallelNet(nn.Cell):
+    def __init__(self):
+        super(HybridParallelNet, self).__init__()
+        # initialize the weight which is sliced at the second dimension
+        weight_init = np.random.rand(512, 128 // 2).astype(np.float32)
+        self.weight = Parameter(Tensor(weight_init), layerwise_parallel=True)
+        self.fc = ops.MatMul()
+        self.reduce = ops.ReduceSum()
+        self.allreduce = ops.AllReduce(op='sum')
+
+    def construct(self, x):
+        x = self.fc(x, self.weight)
+        x = self.reduce(x, -1)
+        x = self.allreduce(x)
+        return x
+```
+
+### Semi Auto Parallel Mode
+
+Compared with the auto parallel mode, the semi auto parallel mode supports manual configuration of shard strategies for network tuning. The definition of shard strategies can be found in this [design document](https://www.mindspore.cn/docs/programming_guide/en/master/design/distributed_training_design.html).
+
+For the above example `HybridParallelNet`, the script in semi auto parallel mode is as follows. The shard strategy of `MatMul` is `((1, 1), (1, 2))`, which means `self.weight` is sliced at the second dimension.
+
+```python
+import numpy as np
+from mindspore import Tensor, Parameter
+import mindspore.ops as ops
+from mindspore import dtype as mstype
+import mindspore.nn as nn
+
+class SemiAutoParallelNet(nn.Cell):
+    def __init__(self):
+        super(SemiAutoParallelNet, self).__init__()
+        # initialize full tensor weight
+        weight_init = np.random.rand(512, 128).astype(np.float32)
+        self.weight = Parameter(Tensor(weight_init))
+        # set shard strategy
+        self.fc = ops.MatMul().shard(((1, 1), (1, 2)))
+        self.reduce = ops.ReduceSum()
+
+    def construct(self, x):
+        x = self.fc(x, self.weight)
+        x = self.reduce(x, -1)
+        return x
+```
+
+> - In the semi auto parallel mode, operators that are not assigned any shard strategy are executed in data parallel mode.
+> - The auto parallel mode not only searches for efficient parallel strategies automatically with strategy searching algorithms, but also enables users to manually assign specific parallel strategies.
+> - If a parameter is used by multiple operators, each operator's shard strategy for this parameter needs to be consistent; otherwise, an error will be reported.
+
+## Defining the Loss Function and Optimizer
+
+### Defining the Loss Function
+
+Automatic parallelism splits models at the operator granularity and obtains the optimal parallel strategy through algorithm search. Therefore, to achieve a better parallel training effect, you are advised to use small operators to implement the loss function.
+
+In the loss function, the `SoftmaxCrossEntropyWithLogits` is expanded into multiple small operators for implementation according to a mathematical formula.
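+
+As a plain NumPy restatement (added here for clarity; it is not part of the original sample), the computation being expanded is the numerically stabilized softmax followed by the cross-entropy averaged over the batch:
+
+```python
+import numpy as np
+
+def softmax_cross_entropy_reference(logits, onehot_labels):
+    """Reference of the expansion: stabilized softmax, then cross-entropy averaged over the batch."""
+    # subtract the row-wise maximum for numerical stability before exponentiating
+    shifted = logits - logits.max(axis=-1, keepdims=True)
+    softmax = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
+    # per-sample cross-entropy, then mean over the batch
+    return -(onehot_labels * np.log(softmax)).sum(axis=-1).mean()
+```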
The sample code is as follows: + +```python +import mindspore.ops as ops +from mindspore import Tensor +from mindspore import dtype as mstype +import mindspore.nn as nn + +class SoftmaxCrossEntropyExpand(nn.Cell): + def __init__(self, sparse=False): + super(SoftmaxCrossEntropyExpand, self).__init__() + self.exp = ops.Exp() + self.sum = ops.ReduceSum(keep_dims=True) + self.onehot = ops.OneHot() + self.on_value = Tensor(1.0, mstype.float32) + self.off_value = Tensor(0.0, mstype.float32) + self.div = ops.Div() + self.log = ops.Log() + self.sum_cross_entropy = ops.ReduceSum(keep_dims=False) + self.mul = ops.Mul() + self.mul2 = ops.Mul() + self.mean = ops.ReduceMean(keep_dims=False) + self.sparse = sparse + self.max = ops.ReduceMax(keep_dims=True) + self.sub = ops.Sub() + + def construct(self, logit, label): + logit_max = self.max(logit, -1) + exp = self.exp(self.sub(logit, logit_max)) + exp_sum = self.sum(exp, -1) + softmax_result = self.div(exp, exp_sum) + if self.sparse: + label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value) + softmax_result_log = self.log(softmax_result) + loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1) + loss = self.mul2(ops.scalar_to_array(-1.0), loss) + loss = self.mean(loss, -1) + + return loss +``` + +### Defining the Optimizer + +The `Momentum` optimizer is used as the parameter update tool. The definition is the same as that in the single-node system. For details, see the implementation in the sample code. + +## Training the Network + +`context.set_auto_parallel_context` is an API for users to set parallel training parameters and must be called before the initialization of networks. The related parameters are as follows: + +- `parallel_mode`: parallel distributed mode. The default value is `ParallelMode.STAND_ALONE`. The other options are `ParallelMode.DATA_PARALLEL` and `ParallelMode.AUTO_PARALLEL`. +- `parameter_broadcast`: the data parallel weights on the first device would be broadcast to other devices. The default value is `False`, +- `gradients_mean`: During backward computation, the framework collects gradients of parameters in data parallel mode across multiple hosts, obtains the global gradient value, and transfers the global gradient value to the optimizer for update. The default value is `False`, which indicates that the `AllReduce.Sum` operation is applied. The value `True` indicates that the `AllReduce.Mean` operation is applied. +- You are advised to set `device_num` and `global_rank` to their default values. The framework calls the HCCL API to obtain the values. + +> More about the distributed training configurations please refer to the [programming guide](https://www.mindspore.cn/docs/programming_guide/en/master/auto_parallel.html). + +If multiple network cases exist in the script, call `context.reset_auto_parallel_context` to restore all parameters to default values before executing the next case. + +In the following sample code, the automatic parallel mode is specified. To switch to the data parallel mode, you only need to change `parallel_mode` to `DATA_PARALLEL` and do not need to specify the strategy search algorithm `auto_parallel_search_mode`. In the sample code, the recursive programming strategy search algorithm is specified for automatic parallel. 
+ +```python +from mindspore import context, Model +from mindspore.nn import Momentum +from mindspore.train.callback import LossMonitor +from mindspore.context import ParallelMode +from resnet import resnet50 + +device_id = int(os.getenv('DEVICE_ID')) +context.set_context(mode=context.GRAPH_MODE, device_target="Ascend") +context.set_context(device_id=device_id) # set device_id + +def test_train_cifar(epoch_size=10): + context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True) + loss_cb = LossMonitor() + dataset = create_dataset(data_path) + batch_size = 32 + num_classes = 10 + net = resnet50(batch_size, num_classes) + loss = SoftmaxCrossEntropyExpand(sparse=True) + opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9) + model = Model(net, loss_fn=loss, optimizer=opt) + model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=True) +``` + +In the preceding code: + +- `dataset_sink_mode=True`: uses the dataset sink mode. That is, the training computing is sunk to the hardware platform for execution. +- `LossMonitor`: returns the loss value through the callback function to monitor the loss function. + +## Running the Script + +### Single-host Training + +After the script required for training is edited, run the corresponding command to call the script. + +Currently, MindSpore distributed execution uses the single-device single-process running mode. That is, one process runs on each device, and the number of total processes is the same as the number of devices that are being used. For device 0, the corresponding process is executed in the foreground. For other devices, the corresponding processes are executed in the background. You need to create a directory for each process to store log information and operator compilation information. The following takes the distributed training script for eight devices as an example to describe how to run the script: + +```bash +#!/bin/bash + +echo "==============================================================================================================" +echo "Please run the script as: " +echo "bash run.sh DATA_PATH RANK_SIZE" +echo "For example: bash run.sh /path/dataset 8" +echo "It is better to use the absolute path." +echo "==============================================================================================================" +DATA_PATH=$1 +export DATA_PATH=${DATA_PATH} +RANK_SIZE=$2 + +EXEC_PATH=$(pwd) + +test_dist_8pcs() +{ + export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json + export RANK_SIZE=8 +} + +test_dist_2pcs() +{ + export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_2pcs.json + export RANK_SIZE=2 +} + +test_dist_${RANK_SIZE}pcs + +for((i=1;i<${RANK_SIZE};i++)) +do + rm -rf device$i + mkdir device$i + cp ./resnet50_distributed_training.py ./resnet.py ./device$i + cd ./device$i + export DEVICE_ID=$i + export RANK_ID=$i + echo "start training for device $i" + env > env$i.log + pytest -s -v ./resnet50_distributed_training.py > train.log$i 2>&1 & + cd ../ +done +rm -rf device0 +mkdir device0 +cp ./resnet50_distributed_training.py ./resnet.py ./device0 +cd ./device0 +export DEVICE_ID=0 +export RANK_ID=0 +echo "start training for device 0" +env > env0.log +pytest -s -v ./resnet50_distributed_training.py > train.log0 2>&1 +if [ $? 
-eq 0 ];then + echo "training success" +else + echo "training failed" + exit 2 +fi +cd ../ +``` + +The variables `DATA_PATH` and `RANK_SIZE` need to be transferred to the script, which indicate the absolute path of the dataset and the number of devices, respectively. + +The distributed related environment variables are as follows: + +- `RANK_TABLE_FILE`: path for storing the network information file. +- `DEVICE_ID`: actual sequence number of the current device on the corresponding host. +- `RANK_ID`: logical sequence number of the current device. + +For details about other environment variables, see configuration items in the installation guide. + +The running time is about 5 minutes, which is mainly occupied by operator compilation. The actual training time is within 20 seconds. You can use `ps -ef | grep pytest` to monitor task processes. + +Log files are saved in the `device0`,`device1`... directory. The `env.log` file records environment variable information. The `train.log` file records the loss function information. The following is an example: + +```text +epoch: 1 step: 156, loss is 2.0084016 +epoch: 2 step: 156, loss is 1.6407638 +epoch: 3 step: 156, loss is 1.6164391 +epoch: 4 step: 156, loss is 1.6838071 +epoch: 5 step: 156, loss is 1.6320667 +epoch: 6 step: 156, loss is 1.3098773 +epoch: 7 step: 156, loss is 1.3515002 +epoch: 8 step: 156, loss is 1.2943741 +epoch: 9 step: 156, loss is 1.2316195 +epoch: 10 step: 156, loss is 1.1533381 +``` + +### Multi-host Training + +The previous chapters introduced the distributed training of MindSpore, which is based on the Ascend environment of a single host with multiple devices. Using multiple hosts for distributed training can greatly improve the training speed. +In the Ascend environment, the communication between NPU units across hosts is the same as the communication between each NPU unit in a single host. It is still communicated through HCCL. The difference is that the NPU units in a single host are naturally interoperable, while cross-host communication needs to be guaranteed that the networks of the two hosts are interoperable. + +Execute the following command on server 1 to configure the target connect IP as the `device ip` on the server 2. For example, configure the target IP of device 0 of server 1 as the IP of device 0 of server 2. Configuration command requires the `hccn_tool` tool. +[HCCL tool](https://support.huawei.com/enterprise/en/ascend-computing/a300t-9000-pid-250702906?category=developer-documents) comes with the CANN package. + +```bash +hccn_tool -i 0 -netdetect -s address 192.98.92.131 +hccn_tool -i 1 -netdetect -s address 192.98.93.131 +hccn_tool -i 2 -netdetect -s address 192.98.94.131 +hccn_tool -i 3 -netdetect -s address 192.98.95.131 +hccn_tool -i 4 -netdetect -s address 192.98.92.141 +hccn_tool -i 5 -netdetect -s address 192.98.93.141 +hccn_tool -i 6 -netdetect -s address 192.98.94.141 +hccn_tool -i 7 -netdetect -s address 192.98.95.141 +``` + +`-i 0` specifies the device ID. `-netdetect` specifies the IP attribute of the network detection. `-s address` means to set the property to an IP address. `192.98.92.131` represents the ip address of device 0 on the server 2. Interface commands can be found [here](https://support.huawei.com/enterprise/en/doc/EDOC1100207443/efde9769/sets-the-ip-address-of-the-network-detection-object). + +After executing the above command on server 1, run the following command to start the detection of the network link status. 
The corresponding command can be found [here](https://support.huawei.com/enterprise/en/doc/EDOC1100207443/f6c5a628/obtains-the-status-of-a-link). + +```bash +hccn_tool -i 0 -net_health -g +hccn_tool -i 1 -net_health -g +hccn_tool -i 2 -net_health -g +hccn_tool -i 3 -net_health -g +hccn_tool -i 4 -net_health -g +hccn_tool -i 5 -net_health -g +hccn_tool -i 6 -net_health -g +hccn_tool -i 7 -net_health -g +``` + +If the connection is normal, the corresponding output is as follows: + +```bash +net health status: Success +``` + +If the connection fails, the corresponding output is as follows: + +```bash +net health status: Fault +``` + +After confirming that the network of the NPU unit between the hosts is connected, configure the json configuration file of multiple hosts. This tutorial takes the configuration file of 16 devices as an example. The detailed configuration file description can refer to the introduction of the single-host multi-device part of this tutorial. It should be noted that in the json file configuration of multiple hosts, the order of rank_id is required to be consistent with the lexicographic order of server_id. + +```json +{ + "version": "1.0", + "server_count": "2", + "server_list": [ + { + "server_id": "10.155.111.140", + "device": [ + {"device_id": "0","device_ip": "192.1.27.6","rank_id": "0"}, + {"device_id": "1","device_ip": "192.2.27.6","rank_id": "1"}, + {"device_id": "2","device_ip": "192.3.27.6","rank_id": "2"}, + {"device_id": "3","device_ip": "192.4.27.6","rank_id": "3"}, + {"device_id": "4","device_ip": "192.1.27.7","rank_id": "4"}, + {"device_id": "5","device_ip": "192.2.27.7","rank_id": "5"}, + {"device_id": "6","device_ip": "192.3.27.7","rank_id": "6"}, + {"device_id": "7","device_ip": "192.4.27.7","rank_id": "7"}], + "host_nic_ip": "reserve" + }, + { + "server_id": "10.155.111.141", + "device": [ + {"device_id": "0","device_ip": "192.1.27.8","rank_id": "8"}, + {"device_id": "1","device_ip": "192.2.27.8","rank_id": "9"}, + {"device_id": "2","device_ip": "192.3.27.8","rank_id": "10"}, + {"device_id": "3","device_ip": "192.4.27.8","rank_id": "11"}, + {"device_id": "4","device_ip": "192.1.27.9","rank_id": "12"}, + {"device_id": "5","device_ip": "192.2.27.9","rank_id": "13"}, + {"device_id": "6","device_ip": "192.3.27.9","rank_id": "14"}, + {"device_id": "7","device_ip": "192.4.27.9","rank_id": "15"}], + "host_nic_ip": "reserve" + } + ], + "status": "completed" +} +``` + +After preparing the configuration file, you can organize distributed multi-host training scripts. Taking 2 hosts with 16 devices as an example, the scripts written on the two hosts are similar to the running scripts of a single host with multiple devices. The difference is that different rank_id variables are specified. + +```bash +#!/bin/bash + +echo "==============================================================================================================" +echo "Please run the script as: " +echo "bash run_cluster.sh DATA_PATH RANK_TABLE_FILE RANK_SIZE RANK_START" +echo "For example: bash run_cluster.sh /path/dataset /path/rank_table.json 16 0" +echo "It is better to use the absolute path." 
+echo "The time interval between multiple hosts to execute the script should not exceed 120s" +echo "==============================================================================================================" + +execute_path=$(pwd) +echo ${execute_path} +script_self=$(readlink -f "$0") +self_path=$(dirname "${script_self}") +echo ${self_path} + +export DATA_PATH=$1 +export RANK_TABLE_FILE=$2 +export RANK_SIZE=$3 +RANK_START=$4 +DEVICE_START=0 +for((i=0;i<=7;i++)); +do + export RANK_ID=$[i+RANK_START] + export DEVICE_ID=$[i+DEVICE_START] + rm -rf ${execute_path}/device_$RANK_ID + mkdir ${execute_path}/device_$RANK_ID + cd ${execute_path}/device_$RANK_ID || exit + pytest -s ${self_path}/resnet50_distributed_training.py >train$RANK_ID.log 2>&1 & +done +``` + +When executing, the two hosts execute the following commands respectively, among which rank_table.json is configured according to the 16-device distributed json file reference configuration shown in this chapter. + +```bash +# server0 +bash run_cluster.sh /path/dataset /path/rank_table.json 16 0 +# server1 +bash run_cluster.sh /path/dataset /path/rank_table.json 16 8 +``` + +### Non-sink Mode Training + +In graph mode, you can specify to train the model in a non-sink mode by setting the environment variable [GRAPH_OP_RUN](https://www.mindspore.cn/docs/note/en/master/env_var_list.html)=1. In this case, you need to set environment variable `HCCL_WHITELIST_DISABLE=1` and train model with OpenMPI `mpirun`. The startup script is consistent with the [GPU's distributed training](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_gpu.html#running-the-script) script. + +## Distributed Training Model Parameters Saving and Loading + +The below content introduced how to save and load models under the four distributed parallel training modes respectively. Before saving model parameters for distributed training, it is necessary to configure distributed environment variables and collective communication library in accordance with this tutorial. + +### Auto Parallel Mode + +It is convenient to save and load the model parameters in auto parallel mode. Just add configuration `CheckpointConfig` and `ModelCheckpoint` to `test_train_cifar` method in the training network steps of this tutorial, and the model parameters can be saved. It should be noted that in parallel mode, you need to specify a different checkpoint save path for the scripts running on each device to prevent conflicts when reading and writing files, The code is as follows: + +```python +from mindspore.train.callback import ModelCheckpoint, CheckpointConfig + +def test_train_cifar(epoch_size=10): + context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True) + loss_cb = LossMonitor() + dataset = create_dataset(data_path) + batch_size = 32 + num_classes = 10 + net = resnet50(batch_size, num_classes) + loss = SoftmaxCrossEntropyExpand(sparse=True) + opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9) + ckpt_config = CheckpointConfig() + ckpt_callback = ModelCheckpoint(prefix='auto_parallel', directory="./ckpt_" + str(get_rank()) + "/", config=ckpt_config) + model = Model(net, loss_fn=loss, optimizer=opt) + model.train(epoch_size, dataset, callbacks=[loss_cb, ckpt_callback], dataset_sink_mode=True) +``` + +After saving the checkpoint file, users can easily load model parameters for reasoning or retraining. 
 For example, the following code can be used for retraining:
+
+```python
+from mindspore import load_checkpoint, load_param_into_net
+
+net = resnet50(batch_size=32, num_classes=10)
+# The parameter for load_checkpoint is a .ckpt file which has been successfully saved
+param_dict = load_checkpoint(pretrain_ckpt_path)
+load_param_into_net(net, param_dict)
+```
+
+For the checkpoint configuration policy and saving method, please refer to [Saving and Loading Model Parameters](https://www.mindspore.cn/docs/programming_guide/en/master/save_model.html#checkpoint-configuration-policies).
+
+By default, sliced parameters are merged automatically before saving. However, for large-scale networks, a large checkpoint file is difficult to transfer and load, so every device can save its sliced parameters separately by setting `integrated_save` to `False` in `CheckpointConfig`. If the shard strategies of retraining or inference are different from those of training, a special loading method is needed.
+
+In multi-device retraining scenarios, users can infer the shard strategy of retraining with `model.infer_train_layout` (only dataset sink mode is supported). The shard strategy is used as the `predict_strategy` of the `load_distributed_checkpoint` function, which restores the sliced parameters from `strategy_ckpt_load_file` (training strategy) to `predict_strategy` (retraining strategy) and loads them into `model.train_network`. If there is only one device in retraining, `predict_strategy` can be `None`. The code is as follows:
+
+```python
+from mindspore import load_distributed_checkpoint, context
+from mindspore.communication import init
+
+context.set_context(mode=context.GRAPH_MODE)
+init()
+context.set_auto_parallel_context(full_batch=True, parallel_mode='semi_auto_parallel', strategy_ckpt_load_file='./train_strategy.ckpt')
+# create model and dataset
+dataset = create_custom_dataset()
+resnet = ResNet50()
+opt = Momentum()
+loss = SoftmaxCrossEntropyWithLogits()
+model = Model(resnet, loss, opt)
+# infer train strategy
+layout_dict = model.infer_train_layout(dataset, True, 100)
+# load the sliced checkpoints into `model.train_network`
+ckpt_file_list = create_ckpt_file_list()
+load_distributed_checkpoint(model.train_network, ckpt_file_list, layout_dict)
+# train the model
+model.train(2, dataset)
+```
+
+> For distributed inference, refer to [Distributed Inference](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference_ascend_910.html#id1).
+
+### Data Parallel Mode
+
+In data parallel mode, checkpoint is used in the same way as in auto parallel mode. You just need to change:
+
+```python
+context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, gradients_mean=True)
+```
+
+to:
+
+```python
+context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True)
+```
+
+> Under data parallel mode, we recommend loading the same checkpoint for each device to avoid accuracy problems. `parameter_broadcast` can also be used for sharing the values of parameters among devices.
+
+### Semi Auto Parallel Mode
+
+In semi auto parallel mode, checkpoint is used in the same way as in auto parallel mode and data parallel mode. The difference lies in the definition of the network; for that, you can refer to [Semi Auto Parallel Mode](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_ascend.html#semi-auto-parallel-mode) under Defining the Network in this tutorial.
+
+To save the model, you can use the following code:
+
+```python
+...
+net = SemiAutoParallelNet()
+...
+ckpt_config = CheckpointConfig()
+ckpt_callback = ModelCheckpoint(prefix='semi_auto_parallel', config=ckpt_config)
+```
+
+To load the model, you can use the following code:
+
+```python
+net = SemiAutoParallelNet()
+# The parameter for load_checkpoint is a .ckpt file which has been successfully saved
+param_dict = load_checkpoint(pretrain_ckpt_path)
+load_param_into_net(net, param_dict)
+```
+
+For the three parallel training modes described above, each device saves a complete checkpoint file. Alternatively, each device can save only its own slice of the checkpoint file. The following uses the semi auto parallel mode as an example.
+
+Only the code that sets the checkpoint saving policy needs to be changed so that each device saves its own checkpoint slice. The specific change is as follows:
+
+Change the checkpoint configuration policy from:
+
+```python
+# config checkpoint
+ckpt_config = CheckpointConfig(keep_checkpoint_max=1)
+```
+
+to:
+
+```python
+# config checkpoint
+ckpt_config = CheckpointConfig(keep_checkpoint_max=1, integrated_save=False)
+```
+
+It should be noted that if users choose this checkpoint saving policy, they need to save and load the sliced checkpoint files for subsequent inference or retraining. For the specific usage, refer to [Integrating the Saved Checkpoint Files](https://www.mindspore.cn/docs/programming_guide/en/master/save_load_model_hybrid_parallel.html#integrating-the-saved-checkpoint-files).
+
+### Hybrid Parallel Mode
+
+For model parameter saving and loading in Hybrid Parallel Mode, please refer to [Saving and Loading Model Parameters in the Hybrid Parallel Scenario](https://www.mindspore.cn/docs/programming_guide/en/master/save_load_model_hybrid_parallel.html).
diff --git a/tutorials/experts/source_en/parallel/distributed_training_gpu.md b/tutorials/experts/source_en/parallel/distributed_training_gpu.md
new file mode 100644
index 0000000000000000000000000000000000000000..edb4a1ad4df4b1fefba26c9b7f0c1cf5fda40bdc
--- /dev/null
+++ b/tutorials/experts/source_en/parallel/distributed_training_gpu.md
@@ -0,0 +1,137 @@
+# Distributed Parallel Training Example (GPU)
+
+`GPU` `Distributed Parallel` `Whole Process`
+
+
+
+## Overview
+
+This tutorial describes how to train the ResNet-50 network using MindSpore data parallelism and automatic parallelism on the GPU hardware platform.
+
+## Preparation
+
+### Downloading the Dataset
+
+The `CIFAR-10` dataset is used as an example. The method of downloading and loading the dataset is the same as that for the Ascend 910 AI processor.
+
+### Configuring Distributed Environment
+
+- `OpenMPI-4.0.3`: multi-process communication library used by MindSpore.
+
+    Download the OpenMPI-4.0.3 source code package `openmpi-4.0.3.tar.gz` from .
+
+    For details about how to install OpenMPI, see the official tutorial: .
+
+- Password-free login between hosts (required for multi-host training). If multiple hosts are involved in the training, you need to configure password-free login between them. The procedure is as follows:
+    1. Ensure that the same user is used to log in to each host. (The root user is not recommended.)
+    2. Run the `ssh-keygen -t rsa -P ""` command to generate a key.
+    3. Run the `ssh-copy-id DEVICE-IP` command to set the IP address of the host that requires password-free login.
+    4. Run the `ssh DEVICE-IP` command.
If you can log in without entering the password, the configuration is successful.
+    5. Run the preceding command on all hosts to ensure that every two hosts can communicate with each other.
+
+### Calling the Collective Communication Library
+
+On the GPU hardware platform, MindSpore parallel distributed training uses NCCL for communication.
+
+> On the GPU platform, MindSpore does not support the following operations:
+>
+> `get_local_rank`, `get_local_size`, `get_world_rank_from_group_rank`, `get_group_rank_from_world_rank` and `create_group`
+
+The sample code for calling the NCCL is as follows:
+
+```python
+from mindspore import context
+from mindspore.communication import init
+
+if __name__ == "__main__":
+    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
+    init("nccl")
+    ...
+```
+
+In the preceding information,
+
+- `mode=context.GRAPH_MODE`: sets the running mode to graph mode for distributed training. (The PyNative mode does not support parallel running.)
+- `init("nccl")`: enables NCCL communication and completes the distributed training initialization.
+
+## Defining the Network
+
+On the GPU hardware platform, the network definition is the same as that for the Ascend 910 AI processor.
+
+For details about the definitions of the network, optimizer, and loss function, see .
+
+## Running the Script
+
+On the GPU hardware platform, MindSpore uses OpenMPI `mpirun` for distributed training.
+
+### Single-host Training
+
+The following takes the distributed training script for eight devices as an example to describe how to run the script:
+
+> Obtain the running script of the example from:
+>
+>
+>
+> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
+
+```bash
+#!/bin/bash
+
+echo "=============================================================================================================="
+echo "Please run the script as: "
+echo "bash run_gpu.sh DATA_PATH"
+echo "For example: bash run_gpu.sh /path/dataset"
+echo "It is better to use the absolute path."
+echo "=============================================================================================================="
+DATA_PATH=$1
+export DATA_PATH=${DATA_PATH}
+
+rm -rf device
+mkdir device
+cp ./resnet50_distributed_training.py ./resnet.py ./device
+cd ./device
+echo "start training"
+mpirun -n 8 pytest -s -v ./resnet50_distributed_training.py > train.log 2>&1 &
+```
+
+The script runs in the background. The log file is saved in the device directory. The training runs for 10 epochs and each epoch contains 234 steps, with the loss values saved in train.log. Part of the loss output is as follows:
+
+```text
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+epoch: 1 step: 1, loss is 2.3025854
+```
+
+### Multi-host Training
+
+If multiple hosts are involved in the training, you need to set the multi-host configuration in the `mpirun` command. You can use the `-H` option in the `mpirun` command. For example, `mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 python hello.py` indicates that eight processes are started on each of the hosts whose IP addresses are DEVICE1_IP and DEVICE2_IP. Alternatively, you can create a hostfile similar to the following and transfer its path to the `--hostfile` option of `mpirun`.
Each line in the hostfile is in the format of `[hostname] slots=[slotnum]`, where hostname can be an IP address or a host name.
+
+```text
+DEVICE1 slots=8
+DEVICE2 slots=8
+```
+
+The following is the execution script of the 16-device two-host cluster. The variables `DATA_PATH` and `HOSTFILE` need to be passed in, indicating the dataset path and the hostfile path. For more mpirun options, see the OpenMPI official website.
+
+```bash
+#!/bin/bash
+
+DATA_PATH=$1
+HOSTFILE=$2
+
+rm -rf device
+mkdir device
+cp ./resnet50_distributed_training.py ./resnet.py ./device
+cd ./device
+echo "start training"
+mpirun -n 16 --hostfile $HOSTFILE -x DATA_PATH=$DATA_PATH -x PATH -mca pml ob1 pytest -s -v ./resnet50_distributed_training.py > train.log 2>&1 &
+```
+
+When running on GPU, the model parameters can be saved and loaded by referring to [Distributed Training Model Parameters Saving and Loading](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_ascend.html#distributed-training-model-parameters-saving-and-loading).
diff --git a/tutorials/experts/source_en/parallel/distributed_training_mode.md b/tutorials/experts/source_en/parallel/distributed_training_mode.md
new file mode 100644
index 0000000000000000000000000000000000000000..69d4439542fc5fd6f24f0dfaf90f4e45845f9e6e
--- /dev/null
+++ b/tutorials/experts/source_en/parallel/distributed_training_mode.md
@@ -0,0 +1,6 @@
+# Distributed Parallel Training Mode
+
+No English version is available right now. Contributions are welcome.
+
+
+
diff --git a/tutorials/experts/source_en/parallel/host_device_training.md b/tutorials/experts/source_en/parallel/host_device_training.md
new file mode 100644
index 0000000000000000000000000000000000000000..4298b22695cdbe425a624531fa9b3df47390ce52
--- /dev/null
+++ b/tutorials/experts/source_en/parallel/host_device_training.md
@@ -0,0 +1,91 @@
+# Host&Device Heterogeneous
+
+`Ascend` `GPU` `Distributed Parallel` `Whole Process`
+
+
+
+## Overview
+
+In deep learning, one usually has to deal with the huge model problem, in which the total size of the parameters in the model is beyond the device memory capacity. To efficiently train a huge model, one solution is to employ homogeneous accelerators (*e.g.*, Ascend 910 AI Accelerator and GPU) for distributed training. When the size of a model reaches hundreds of GBs or several TBs, the number of required accelerators becomes prohibitive, making this solution inapplicable. One alternative is Host+Device hybrid training, which simultaneously leverages the large memory on hosts and the fast computation on accelerators, and is a promising and efficient method for addressing the huge model problem.
+
+In MindSpore, users can easily implement hybrid training by configuring trainable parameters and necessary operators to run on hosts, and other operators to run on accelerators.
+This tutorial introduces how to train [Wide&Deep](https://gitee.com/mindspore/models/tree/master/official/recommend/wide_and_deep) in the Host+Ascend 910 AI Accelerator mode.
+
+## Preliminaries
+
+1. Prepare the model. The Wide&Deep code can be found at: , in which `train_and_eval_auto_parallel.py` is the main function for training, the `src/` directory contains the model definition, data processing and configuration files, and the `script/` directory contains the launch scripts in different modes.
+
+2. Prepare the dataset. Please refer to the link in [1] to download the dataset, and use the script `src/preprocess_data.py` to transform the dataset into MindRecord format.
+
+3.
Configure the device information. When performing training in the bare-metal environment, the network information file needs to be configured. This example only employs one accelerator, thus `rank_table_1p_0.json` containing #0 accelerator is configured (about the rank table file, you can refer to [HCCL_TOOL](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools)). + +## Configuring for Hybrid Training + +1. Configure the flag of hybrid training. In the file `default_config.yaml`, change the default value of `host_device_mix` to be `1`: + + ```python + host_device_mix: 1 + ``` + +2. Check the deployment of necessary operators and optimizers. In class `WideDeepModel` of file `src/wide_and_deep.py`, check the execution of `EmbeddingLookup` is at host: + + ```python + self.deep_embeddinglookup = nn.EmbeddingLookup() + self.wide_embeddinglookup = nn.EmbeddingLookup() + ``` + + In `class TrainStepWrap(nn.Cell)` of file `src/wide_and_deep.py`, check two optimizers are also executed at host: + + ```python + self.optimizer_w.target = "CPU" + self.optimizer_d.target = "CPU" + ``` + +## Training the Model + +In order to save enough log information, use the command `export GLOG_v=1` to set the log level to INFO before executing the script, and add the `-p on` option when compiling MindSpore. For the details about compiling MindSpore, refer to [Compiling MindSpore](https://www.mindspore.cn/install/detail/en?path=install/master/mindspore_ascend_install_source_en.md&highlight=%E7%BC%96%E8%AF%91mindspore). + +Use the script `script/run_auto_parallel_train.sh`. Run the command `bash run_auto_parallel_train.sh 1 1 `, +where the first `1` is the number of accelerators, the second `1` is the number of epochs, `DATASET_PATH` is the path of dataset, +and `RANK_TABLE_FILE` is the path of the above `rank_table_1p_0.json` file. + +The running log is in the directory of `device_0`, where `loss.log` contains every loss value of every step in the epoch. Here is an example: + +```text +epoch: 1 step: 1, wide_loss is 0.6873926, deep_loss is 0.8878349 +epoch: 1 step: 2, wide_loss is 0.6442529, deep_loss is 0.8342661 +epoch: 1 step: 3, wide_loss is 0.6227323, deep_loss is 0.80273706 +epoch: 1 step: 4, wide_loss is 0.6107221, deep_loss is 0.7813441 +epoch: 1 step: 5, wide_loss is 0.5937832, deep_loss is 0.75526017 +epoch: 1 step: 6, wide_loss is 0.5875453, deep_loss is 0.74038756 +epoch: 1 step: 7, wide_loss is 0.5798845, deep_loss is 0.7245408 +epoch: 1 step: 8, wide_loss is 0.57553077, deep_loss is 0.7123517 +epoch: 1 step: 9, wide_loss is 0.5733629, deep_loss is 0.70278376 +epoch: 1 step: 10, wide_loss is 0.566089, deep_loss is 0.6884129 +... +``` + +`test_deep0.log` contains the runtime log. +Search `EmbeddingLookup` in `test_deep0.log`, the following can be found: + +```text +[INFO] DEVICE(109904,python3.7):2020-06-27-12:42:34.928.275 [mindspore/ccsrc/device/cpu/cpu_kernel_runtime.cc:324] Run] cpu kernel: Default/network-VirtualDatasetCellTriple/_backbone-NetWithLossClass/network-WideDeepModel/EmbeddingLookup-op297 costs 3066 us. +[INFO] DEVICE(109904,python3.7):2020-06-27-12:42:34.943.896 [mindspore/ccsrc/device/cpu/cpu_kernel_runtime.cc:324] Run] cpu kernel: Default/network-VirtualDatasetCellTriple/_backbone-NetWithLossClass/network-WideDeepModel/EmbeddingLookup-op298 costs 15521 us. +``` + +The above shows the running time of `EmbeddingLookup` on the host. 
+ +Search `FusedSparseFtrl` and `FusedSparseLazyAdam` in `test_deep0.log`, the following can be found: + +```text +[INFO] DEVICE(109904,python3.7):2020-06-27-12:42:35.422.963 [mindspore/ccsrc/device/cpu/cpu_kernel_runtime.cc:324] Run] cpu kernel: Default/optimizer_w-FTRL/FusedSparseFtrl-op299 costs 54492 us. +[INFO] DEVICE(109904,python3.7):2020-06-27-12:42:35.565.953 [mindspore/ccsrc/device/cpu/cpu_kernel_runtime.cc:324] Run] cpu kernel: Default/optimizer_d-LazyAdam/FusedSparseLazyAdam-op300 costs 142865 us. +``` + +The above shows the running time of two optimizers on the host. + +## Reference + +[1] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. [DeepFM: A Factorization-Machine based Neural Network for CTR Prediction.](https://doi.org/10.24963/ijcai.2017/239) IJCAI 2017. diff --git a/tutorials/experts/source_en/parallel/images/checkpoint_integrate_process.pptx b/tutorials/experts/source_en/parallel/images/checkpoint_integrate_process.pptx new file mode 100644 index 0000000000000000000000000000000000000000..29ecea853306ea5ea769915510ca755037797cf0 Binary files /dev/null and b/tutorials/experts/source_en/parallel/images/checkpoint_integrate_process.pptx differ diff --git a/tutorials/experts/source_en/parallel/images/checkpoint_integration_process.jpg b/tutorials/experts/source_en/parallel/images/checkpoint_integration_process.jpg new file mode 100644 index 0000000000000000000000000000000000000000..39d89bc4a04f0076ac5fa435553e202eb1cc21b3 Binary files /dev/null and b/tutorials/experts/source_en/parallel/images/checkpoint_integration_process.jpg differ diff --git a/tutorials/experts/source_en/parallel/parameter_server_training.md b/tutorials/experts/source_en/parallel/parameter_server_training.md new file mode 100644 index 0000000000000000000000000000000000000000..49444af8b5bf7bccf75854b997c760bc92e9182b --- /dev/null +++ b/tutorials/experts/source_en/parallel/parameter_server_training.md @@ -0,0 +1,156 @@ +# Parameter Server Mode + +`Ascend` `GPU` `Distributed Parallel` `Whole Process` + + + +## Overview + +A parameter server is a widely used architecture in distributed training. Compared with the synchronous AllReduce training method, a parameter server has better flexibility, scalability, and node failover capabilities. Specifically, the parameter server supports both synchronous and asynchronous SGD(Stochastic Gradient Descent) training algorithms. In terms of scalability, model computing and update are separately deployed in the worker and server processes, so that resources of the worker and server can be independently scaled out and in horizontally. In addition, in an environment of a large-scale data center, various failures often occur in a computing device, a network, and a storage device, and consequently some nodes are abnormal. However, in an architecture of a parameter server, such a failure can be relatively easily handled without affecting a training job. + +In the parameter server implementation of MindSpore, the self-developed communication framework (core) is used as the basic architecture. Based on the remote communication capability provided by the core and abstract Send/Broadcast primitives, the distributed training algorithm of the synchronous SGD is implemented. In addition, with the high-performance collective communication library in Ascend and GPU(HCCL and NCCL), MindSpore also provides the hybrid training mode of parameter server and AllReduce. 
Some weights can be stored and updated through the parameter server, and other weights are still trained through the AllReduce algorithm. + +The ps-lite architecture consists of three independent components: server, worker, and scheduler. Their functions are as follows: + +- Server: saves model weights and backward computation gradients, and updates the model using gradients pushed by workers. + +- Worker: performs forward and backward computation on the network. The gradient value for backward computation is uploaded to a server through the `Push` API, and the model updated by the server is downloaded to the worker through the `Pull` API. + +- Scheduler: establishes the communication relationship between the server and worker. + +## Preparations + +The following describes how to use parameter server to train LeNet on Ascend 910: + +### Training Script Preparation + +Learn how to train a LeNet using the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) by referring to . + +### Parameter Setting + +1. First of all, use `mindspore.context.set_ps_context(enable_ps=True)` to enable Parameter Server training mode. + + - This method should be called before `mindspore.communication.init()`. + - If you don't call this method, the [Environment Variable Setting](https://www.mindspore.cn/docs/programming_guide/en/master/apply_parameter_server_training.html#environment-variable-setting) below will not take effect. + - Use `mindspore.context.reset_ps_context()` to disable Parameter Server training mode. + +2. In this training mode, you can use either of the following methods to control whether the training parameters are updated by the Parameter Server and whether the training parameters are initialized on Worker or Server: + + - Use `mindspore.nn.Cell.set_param_ps()` to set all weight recursions of `nn.Cell`. + - Use `mindspore.Parameter.set_param_ps()` to set the weight. + - The size of the weight which is updated by Parameter Server should not exceed INT_MAX(2^31 - 1) bytes. + - The interface `set_param_ps` can receive a `bool` parameter:`init_in_server`, indicating whether this training parameter is initialized on the Server side. `init_in_server` defaults to `False`, indicating that this training parameter is initialized on Worker. Currently, only the training parameter `embedding_table` of the `EmbeddingLookup` operator is supported to be initialized on Server side to solve the problem of insufficient memory caused by the initialization of a large shape `embedding_table` on Worker. The `EmbeddingLookup` operator's `target` attribute needs to be set to 'CPU'. The training parameter initialized on the Server side will no longer be synchronized to Worker. If it involves multi-Server training and saves CheckPoint, each Server will save a CheckPoint after the training. + +3. On the basis of the [original training script](https://gitee.com/mindspore/models/blob/master/official/cv/lenet/train.py), set all LeNet model weights to be trained on the parameter server: + + ```python + context.set_ps_context(enable_ps=True) + network = LeNet5(cfg.num_classes) + network.set_param_ps() + ``` + +4. [optional configuration] For a large shape `embedding_table`, because the device can not store a full amount of `embedding_table`. You can configure the `vocab_cache_size` of [EmbeddingLookup operator](https://www.mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.EmbeddingLookup.html) to enable the cache function of `EmbeddingLookup` in the Parameter Server training mode. 
The `vocab_cache_size` of `embedding_table` is trained on device, and a full amount of `embedding_table` is stored in the Server. The `embedding_table` of the next batch is swapped to the cache in advance, and the expired `embedding_table` is put back to the Server when the cache cannot be placed, to achieve the purpose of improving the training performance. Each Server could save a checkpoint containing the trained `embedding_table` after the training. Detailed network training script can be referred to . + + ```python + context.set_auto_parallel_context(full_batch=True, + parallel_mode=ParallelMode.AUTO_PARALLEL) + context.set_context(enable_sparse=True) + network = Net() + model = Model(network) + model.train(epoch, train_dataset, dataset_sink_mode=True) + ``` + + In the information: + + - `dataset_sink_mode`: whether to enable the sink mode of dataset or not. When `True`, it indicates enabled, and pass the data through the dataset channel. It must be set to `True` in this scenario (The inference during training also needs to enable the sink mode of dataset). + - `full_batch`: whether to load the dataset in full or not. When `True`, it indicates fully load, and data of each device is the same. It must be set to `True` in the multi-workers scenario. + - `parallel_mode`:parallel mode, auto parallel mode must be enabled in the multi-workers scenario, please set `parallel_mode`=`ParallelMode.AUTO_PARALLEL`. + - `enable_sparse`: whether to enable sparse training, default: `False`. `enable_sparse`=`True` indicates enabling sparse training. The parameter `sparse` of all `EmbeddingLookup` kernels which enable cache must be equal to the value of `enable_sparse` in the parameter server mode. + +> In `Parameter Server` mode, control flow is not supported. So we need to change `model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2")` to `model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()})` in `train.py`. This will unset `amp_level` and eliminate the impact of control flow. + +### Environment Variable Setting + +MindSpore reads environment variables to control parameter server training. The environment variables include the following options (all scripts of `MS_SCHED_HOST` and `MS_SCHED_PORT` must be consistent): + +```text +export MS_SERVER_NUM=1 # Server number +export MS_WORKER_NUM=1 # Worker number +export MS_SCHED_HOST=XXX.XXX.XXX.XXX # Scheduler IP address +export MS_SCHED_PORT=XXXX # Scheduler port +export MS_ROLE=MS_SCHED # The role of this process: MS_SCHED represents the scheduler, MS_WORKER represents the worker, MS_PSERVER represents the Server +``` + +## Training + +1. 
Shell scripts + + Provide the shell scripts corresponding to the worker, server, and scheduler roles to start training: + + `Scheduler.sh`: + + ```bash + #!/bin/bash + export MS_SERVER_NUM=1 + export MS_WORKER_NUM=1 + export MS_SCHED_HOST=XXX.XXX.XXX.XXX + export MS_SCHED_PORT=XXXX + export MS_ROLE=MS_SCHED + python train.py --device_target=Ascend --data_path=path/to/dataset + ``` + + `Server.sh`: + + ```bash + #!/bin/bash + export MS_SERVER_NUM=1 + export MS_WORKER_NUM=1 + export MS_SCHED_HOST=XXX.XXX.XXX.XXX + export MS_SCHED_PORT=XXXX + export MS_ROLE=MS_PSERVER + python train.py --device_target=Ascend --data_path=path/to/dataset + ``` + + `Worker.sh`: + + ```bash + #!/bin/bash + export MS_SERVER_NUM=1 + export MS_WORKER_NUM=1 + export MS_SCHED_HOST=XXX.XXX.XXX.XXX + export MS_SCHED_PORT=XXXX + export MS_ROLE=MS_WORKER + python train.py --device_target=Ascend --data_path=path/to/dataset + ``` + + Run the following commands separately: + + ```bash + sh Scheduler.sh > scheduler.log 2>&1 & + sh Server.sh > server.log 2>&1 & + sh Worker.sh > worker.log 2>&1 & + ``` + + Start training. + +2. Viewing result + + Run the following command to view the communication logs between the server and worker in the `scheduler.log` file: + + ```text + The server node id:b5d8a47c-46d7-49a5-aecf-d29d7f8b6124,node ip: 10.90.53.118,node port:46737 assign rank id:0 + The worker node id:55e86d4b-d717-4930-b414-ebd80082f541 assign rank id:1 + Start the scheduler node is successful! + ``` + + The preceding information indicates that the communication between the server, worker, and scheduler is established successfully. + + Check the training result in the `worker.log` file: + + ```text + epoch: 1 step: 1, loss is 2.302287 + epoch: 1 step: 2, loss is 2.304071 + epoch: 1 step: 3, loss is 2.308778 + epoch: 1 step: 4, loss is 2.301943 + ... + ``` diff --git a/tutorials/experts/source_en/parallel/pipeline_parallel.md b/tutorials/experts/source_en/parallel/pipeline_parallel.md new file mode 100644 index 0000000000000000000000000000000000000000..ddafd6149ff09cc2611f36e61ff6c8ac2592ef4a --- /dev/null +++ b/tutorials/experts/source_en/parallel/pipeline_parallel.md @@ -0,0 +1,152 @@ +# Pipeline Parallelism + +`Ascend` `GPU` `Distributed Parallel` `Whole Process` + + + +## Overview + +In recent years, the scale of neural networks has increased exponentially. Limited by the memory on a single device, the +number of devices used for training large models is also increasing. Due to the low communication bandwidth between +servers, the performance of the conventional hybrid parallelism (data parallel + model parallel) is poor. Therefore, +pipeline parallelism needs to be introduced. Pipeline parallelism can divide a model in space based on `stage`. +Each `stage` needs to execute only a part of the network, which greatly reduces memory overheads, shrinks the +communication domain, and shortens the communication time. MindSpore can automatically convert a standalone model to the +pipeline parallel mode based on user configurations. + +> Download address of the complete sample code: +> +> . + +The directory structure is as follows: + +```text +└─sample_code + ├─distributed_training + │ rank_table_16pcs.json + │ rank_table_8pcs.json + │ rank_table_2pcs.json + │ resnet.py + │ resnet50_distributed_training_pipeline.py + │ run_pipeline.sh + ... +``` + +`rank_table_16pcs.json`, `rank_table_8pcs.json` and `rank_table_2pcs.json` are the networking information files. 
`resnet.py` and `resnet50_distributed_training_pipeline.py` are the network structure files. `run_pipeline.sh` are the execute scripts. + +## Preparations + +### Downloading the Dataset + +This example uses the `CIFAR-10` dataset. For details about how to download and load the dataset, +visit . + +### Configuring the Distributed Environment + +> Pipeline parallelism supports Ascend and GPU. + +For details about how to configure the distributed environment and call the HCCL, +visit . + +## Defining the Network + +The network definition is the same as that in the Parallel Distributed Training Example. + +For details about the definitions of the network, optimizer, and loss function, +visit . + +> To implement pipeline parallelism, you need to define the parallel strategy and call the `pipeline_stage` API to specify the stage on which each layer is to be executed. The granularity of the `pipeline_stage` API is `Cell`. `pipeline_stage` must be configured for all `Cells` that contain training parameters. + +```python +class ResNet(nn.Cell): + """ResNet""" + + def __init__(self, block, num_classes=100, batch_size=32): + """init""" + super(ResNet, self).__init__() + self.batch_size = batch_size + self.num_classes = num_classes + + self.head = Head() + self.layer1 = MakeLayer0(block, in_channels=64, out_channels=256, stride=1) + self.layer2 = MakeLayer1(block, in_channels=256, out_channels=512, stride=2) + self.layer3 = MakeLayer2(block, in_channels=512, out_channels=1024, stride=2) + self.layer4 = MakeLayer3(block, in_channels=1024, out_channels=2048, stride=2) + + self.pool = ops.ReduceMean(keep_dims=True) + self.squeeze = ops.Squeeze(axis=(2, 3)) + self.fc = fc_with_initialize(512 * block.expansion, num_classes) + + # pipeline parallel config + self.head.pipeline_stage = 0 + self.layer1.pipeline_stage = 0 + self.layer2.pipeline_stage = 0 + self.layer3.pipeline_stage = 1 + self.layer4.pipeline_stage = 1 + self.fc.pipeline_stage = 1 + + def construct(self, x): + """construct""" + x = self.head(x) + + x = self.layer1(x) + x = self.layer2(x) + x = self.layer3(x) + x = self.layer4(x) + + x = self.pool(x, (2, 3)) + x = self.squeeze(x) + x = self.fc(x) + return x +``` + +## Training the Network + +To enable pipeline parallelism, you need to add the following configurations to the training script: + +- Set `pipeline_stages` in `set_auto_parallel_context` to specify the total number of `stages`. +- Set the `SEMI_AUTO_PARALLEL` mode. Currently, the pipeline parallelism supports only this mode. +- Define the LossCell. In this example, the `nn.WithLossCell` API is called. +- Finally, wrap the LossCell with `PipelineCell`, and specify the Micro_batch size. To improve machine utilization, + MindSpore divides Mini_batch into finer-grained Micro_batch to streamline the entire cluster. The final loss value is + the sum of the loss values computed by all Micro_batch. The size of Micro_batch must be greater than or equal to the + number of `stages`. 
+ +```python +from mindspore import context, Model, nn +from mindspore.nn import Momentum +from mindspore.train.callback import LossMonitor +from mindspore.context import ParallelMode +from resnet import resnet50 + + +def test_train_cifar(epoch_size=10): + context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL, gradients_mean=True) + context.set_auto_parallel_context(pipeline_stages=2, save_graphs=True) + loss_cb = LossMonitor() + data_path = os.getenv('DATA_PATH') + dataset = create_dataset(data_path) + batch_size = 32 + num_classes = 10 + net = resnet50(batch_size, num_classes) + loss = SoftmaxCrossEntropyExpand(sparse=True) + net_with_loss = nn.WithLossCell(net, loss) + net_pipeline = nn.PipelineCell(net_with_loss, 2) + opt = Momentum(net.trainable_params(), 0.01, 0.9) + model = Model(net_pipeline, optimizer=opt) + model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=True) +``` + +## Running the Single-host with 8 devices Script + +Using the sample code, you can run a 2-stage pipeline on 8 Ascend devices using below scripts: + +```bash +bash run_pipeline.sh [DATA_PATH] Ascend +``` + +You can run a 2-stage pipeline on 8 GPU devices using below scripts: + +```bash +bash run_pipeline.sh [DATA_PATH] GPU +``` \ No newline at end of file diff --git a/tutorials/experts/source_en/parallel/recompute.md b/tutorials/experts/source_en/parallel/recompute.md new file mode 100644 index 0000000000000000000000000000000000000000..c59cc7b414894b269d973997c5c60df377d1d75c --- /dev/null +++ b/tutorials/experts/source_en/parallel/recompute.md @@ -0,0 +1,137 @@ +# Recomputation + + `Ascend` `GPU` `CPU` `Model Running` `Intermediate` `Expert` + + + +## Overview + +The automatic differential of MindSpore is in reverse-mode, which derives the backward pass according to the forward pass. Before some backward operators are computed, the results of some forward operators should be ready. It leads to the problem that the memory occupied by these results of the forward operators, can not be reused until the computation of the backward operators are completed. This problem can drive up the peak of memory, which is particularly significant in the large model. + +In order to solve this problem, Mindspore provides the recomputation function. It will recompute the forward operators before computing the backward operators rather than storing the results of forward operators, which can help the memory be reused. This tutorial takes the model ResNet-50 for example to explain how to configure recomputation to train your model in MindSpore. + +## Preliminaries + +1. Prepare the model. The ResNet-50 code can be found at: , in which `train.py` is the main function for training, `src/` directory contains the model definition and configuration files of ResNet-50, `script/` directory contains the training and evaluation scripts. + +2. Prepare the dataset. This example uses the `CIFAR-10` dataset. For details about how to download and load the dataset, visit . + +## Configuring for Recomputation + +We can call two kinds of interface to configure the recomputation. Take `src/resnet.py` for example: + +1. The [recompute api](https://www.mindspore.cn/docs/api/en/master/api_python/ops/mindspore.ops.Primitive.html#mindspore.ops.Primitive.recompute) of `Primitive`. It can set an operator to be recomputed. After setting, the operator will be recomputed in the backward pass. + + ```python + class ResNet(nn.Cell): + ... 
+ def __init__(self, + block, + layer_nums, + in_channels, + out_channels, + strides, + num_classes, + use_se=False, + res_base=False): + super(ResNet, self).__init__() + ... + self.relu = ops.ReLU() + self.relu.recompute() + ... + ``` + +2. Call the [recompute api](https://www.mindspore.cn/docs/api/en/master/api_python/nn/mindspore.nn.Cell.html#mindspore.nn.Cell.recompute) of `Cell`. It can set the whole `Cell` to be recomputed. After setting, except the output of the `Cell`, all the operators in this cell and its sub `Cell` will be recomputed in the backward pass. + + ```python + class ResNet(nn.Cell): + def __init__(self, + block, + layer_nums, + in_channels, + out_channels, + strides, + num_classes, + use_se=False, + res_base=False): + super(ResNet, self).__init__() + ... + self.layer1 = self._make_layer(block, + layer_nums[0], + in_channel=in_channels[0], + out_channel=out_channels[0], + stride=strides[0], + use_se=self.use_se) + + def _make_layer(self, block, layer_num, in_channel, out_channel, stride, use_se=False, se_block=False): + ... + if se_block: + for _ in range(1, layer_num - 1): + resnet_block = block(out_channel, out_channel, stride=1, use_se=use_se) + resnet_block.recompute() + else: + for _ in range(1, layer_num): + resnet_block = block(out_channel, out_channel, stride=1, use_se=use_se) + resnet_block.recompute() + ... + + class ResidualBlock(nn.Cell): + def __init__(self, + in_channel, + out_channel, + stride=1, + use_se=False, se_block=False): + super(ResidualBlock, self).__init__() + ... + + def construct(self, x): + ... + + def resnet50(class_num=10): + return ResNet(ResidualBlock, + [3, 4, 6, 3], + [64, 256, 512, 1024], + [256, 512, 1024, 2048], + [1, 2, 2, 2], + class_num) + ``` + +## Training the Model + +We take the GPU environment for example, use the script `script/run_standalone_train_gpu.sh`. Run the command `bash scripts/run_standalone_train_gpu.sh $date_set_path config/resnet50_cifar10_config.yaml`. +We can set the context: `save_graph=True` in `src/train.py` to print the construction of the computation graph to do comparison. + +The graph before setting recomputation is as follow: + +```text +... +%56(equivoutput) = Conv2D(%53, %55) {instance name: conv2d} primitive_attrs: {pad_list: (0, 0, 0, 0), stride: (1, 1, 1, 1), pad: (0, 0, 0, 0), pad_mode: 1, out_channel: 64, kernel_size: (1, 1), input_names: [x, w], format: NCHW, groups: 1, mode: 1, group: 1, dilation: (1, 1, 1, 1), output_names: [output]} + : (, ) -> () +... +%61(equiv[CNode]707) = BatchNorm(%56, %57, %58, %59, %60) {instance name: bn_train} primitive_attrs: {epsilon: 0.000100, is_training: true, momentum: 0.100000, format: NCHW, output_names: [y, batch_mean, batch_variance, reserve_space_1, reserve_space_2], input_names: [x, scale, offset, mean, variance]} + : (, , , , ) -> () +... +%927(out) = BatchNormGrad(%923, %56, %57, %924, %925, %926) primitive_attrs: {epsilon: 0.000100, format: NCHW, is_training: true} cnode_primal_attrs: {forward_node_name: BatchNorm_102499} + : (, , , , , ) -> () +... +``` + +The graph after setting recomputation is as follow: + +```text +... +%56(equivoutput) = Conv2D(%53, %55) {instance name: conv2d} primitive_attrs: {pad_list: (0, 0, 0, 0), stride: (1, 1, 1, 1), pad: (0, 0, 0, 0), pad_mode: 1, out_channel: 64, kernel_size: (1, 1), input_names: [x, w], format: NCHW, groups: 1, mode: 1, group: 1, dilation: (1, 1, 1, 1), output_names: [output]} cnode_attrs: {need_cse_after_recompute: true, recompute: true} + : (, ) -> () +... 
+%61(equiv[CNode]707) = BatchNorm(%56, %57, %58, %59, %60) {instance name: bn_train} primitive_attrs: {epsilon: 0.000100, is_training: true, momentum: 0.100000, format: NCHW, output_names: [y, batch_mean, batch_variance, reserve_space_1, reserve_space_2], input_names: [x, scale, offset, mean, variance]} + : (, , , , ) -> () +... +%1094([CNode]15682) = Conv2D(%1091, %1093) {instance name: conv2d} primitive_attrs: {pad_list: (1, 1, 1, 1), stride: (1, 1, 1, 1), pad: (0, 0, 0, 0), pad_mode: 1, out_channel: 64, kernel_size: (3, 3), input_names: [x, w], format: NCHW, groups: 1, mode: 1, group: 1, dilation: (1, 1, 1, 1), output_names: [output]} cnode_attrs: {need_cse_after_recompute: true, duplicated: true} + : (, ) -> () +... +%1095([CNode]15681) = BatchNormGrad(%1085, %1094, %98, %1086, %1087, %1088) primitive_attrs: {epsilon: 0.000100, format: NCHW, is_training: true} cnode_attrs: {target_grad: true} cnode_primal_attrs: {forward_node_name: BatchNorm_102499} + : (, , , , , ) -> () +... +``` + +We can see that `Conv2D` is replicated to become the input to `BatchNormGrad`. diff --git a/tutorials/experts/source_en/parallel/save_load_model_hybrid_parallel.md b/tutorials/experts/source_en/parallel/save_load_model_hybrid_parallel.md new file mode 100644 index 0000000000000000000000000000000000000000..248b6a30b610027799d4aa2c8065992d20a6ea3a --- /dev/null +++ b/tutorials/experts/source_en/parallel/save_load_model_hybrid_parallel.md @@ -0,0 +1,531 @@ +# Saving and Loading Models in Hybrid Parallel Mode + +`Ascend` `GPU` `Distributed Parallel` `Model Export` `Model Loading` + + + +## Overview + +### Background + +In the MindSpore model parallel scenario, each instance process stores only the parameter data on the current node. The parameter data of a model parallel Cell on each node is a slice of the complete parameter data. For example, the complete parameter data shape is \[8, 8], and the parameter data on each node is a part of the data, for example, shape \[2, 8]. + +In the auto parallel scenario, MindSpore automatically generates the dividing strategy. The MindSpore checkpoint module supports automatic integrating, saving, and loading. + +In the hybrid parallel scenario, the dividing strategy is implemented by users. MindSpore saves the slice strategy of model, which is the same on each node, and the data corresponding to each node is stored respectively. Users need to integrate, save, and load the checkpoint files by themselves. This tutorial describes how to integrate, save, and load checkpoint files in the hybrid parallel scenario. + +### Application Scenario + +If you encounter the following scenarios, refer to this tutorial to integrate, save, and load checkpoint files: + +Scenario 1: multi-device training and single-device inference + +The following describes the overall process of training on 64 devices and inference on a single device: + +1. Execute the training to automatically generate the checkpoint files and the slice strategy files. + +2. Integrate the saved checkpoint files. + + Integrate the divided model parameters based on the specific dividing strategy to generate a new checkpoint file. + +3. Load the new checkpoint file in the single-GPU environment and call the export API to export the model for inference as required. 
+ +If the number of GPUs in a cluster in the checkpoint saving environment is the same as that in the loading environment, for example, if the checkpoint files are saved and loaded in the same training environment or training and inference is performed on a single device, you do not need to perform integration, saving and loading. + +Scenario 2: The training is divided into multiple stages, and the cluster size in each stage is different. + +For example, in the training stage 1, the training environment with 64 devices is used, and in the training stage 2, the training environment with 56 devices is used. The overall operation process is as follows: + +1. Execute the training in stage 1 to automatically generate the checkpoint files and the slice strategy files. + +2. Integrate the saved checkpoint files. + + Integrate the divided model parameters based on the specific dividing strategy to generate a new checkpoint file. + +3. Load the checkpoint file that is integrated and saved in the stage 2 cluster. + + During the loading, you need to redivide the parameter data in the checkpoint file based on the new training environment configuration. + +4. Perform stage 2 training. + +## Integrating the Saved Checkpoint Files + +### Overall Process + +Import the checkpoint files to be integrated to the network in rank id order and obtain the list of all parameters through the API provided by MindSpore, and then obtain the slice strategy of model. See steps 1 and 2 in the following figure. + +Then, update the parameter list and integrate the model parallel parameters. See step 3 in the following figure. + +Finally, save the updated parameter list to a file through the API provided by MindSpore to generate a new checkpoint file. See step 4 in the following figure. + +![img](./images/checkpoint_integration_process.jpg) + +### Preparations + +#### Importing the Checkpoint Files in rank id order + +Define the network, call the `load_checkpoint` and `load_param_into_net` APIs to import the checkpoint files to the network in rank id order, and then call `parameters_and_names` API to obtain all parameters in this network. + +```python +net = Net() +opt = Momentum(learning_rate=0.01, momentum=0.9, params=net.get_parameters()) +net = TrainOneStepCell(net, opt) +param_dicts = [] +for i in range(rank_size): + file_name = os.path.join("./node"+str(i), "CKP_1-4_32.ckpt") # checkpoint file name of current node + param_dict = load_checkpoint(file_name) + load_param_into_net(net, param_dict) + param_dict = {} + for _, param in net.parameters_and_names(): + param_dict[param.name] = param + param_dicts.append(param_dict) +``` + +In the preceding information: + +- `rank_size`: number of nodes in previous distributed training. +- `load_checkpoint`: loads the checkpoint model parameter file and returns a parameter dictionary. +- `load_param_into_net`: loads model parameter data to the network. + +#### Obtaining a List of All Parameters on the Network + +Call the `build_searched_strategy` API to obtain the slice strategy of model. + +```python +strategy = build_searched_strategy("./strategy_train.ckpt") +``` + +In the preceding information: + +- `strategy_train.ckpt`: name of model slice strategy, set by users calling `set_auto_parallel_context` API and customizing `strategy_ckpt_save_file` parameter before training network. + +### Integrate the Model Parallel Parameters + +The following uses a model parameter as an example to describe a specific integration process. 
+ +The parameter name is weight and the dividing strategy is to perform dividing in a 4-device scenario. + +1. Obtain the data value on all nodes for model parallel parameters. + + ```python + sliced_parameters = [] + for i in range(4): + parameter = param_dicts[i].get("weight") + sliced_parameters.append(parameter) + ``` + + > To ensure that the parameter update speed remains unchanged, you need to integrate the parameters saved in the optimizer, for example, moments.weight. + +2. Call the `merge_sliced_parameter` API to merge the sliced parameters. + + ```python + merged_parameter = merge_sliced_parameter(sliced_parameters, strategy) + ``` + +> If there are multiple model parallel parameters, repeat steps 1 to 2 to process them one by one. + +### Saving the Data and Generating a New Checkpoint File + +1. Convert `param_dict` to `param_list`. + + ```python + param_list = [] + for (key, value) in param_dict.items(): + each_param = {} + each_param["name"] = key + if isinstance(value.data, Tensor): + param_data = value.data + else: + param_data = Tensor(value.data) + each_param["data"] = param_data + param_list.append(each_param) + ``` + +2. Call the `save_checkpoint` API to write the parameter data to a file and generate a new checkpoint file. + + ```python + save_checkpoint(param_list, "./CKP-Integrated_1-4_32.ckpt") + ``` + + In the preceding information: + + - `save_checkpoint`: saves network model parameters to a file. + - `CKP-Integrated_1-4_32.ckpt`: name of the generated checkpoint model parameter file. + +## Loading the Integrated and Saved Checkpoint File + +### Overall Process + +If you need to load the integrated and saved checkpoint file to multi-device training or inference, divide the parallel parameter data based on the new strategy before loading the model parameters to the network. The following steps are implemented in the pre-training script. Steps 1 and 3 are the same as the strategy of checkpoint loading in a single-node system. Step 2 is added to divide model parallel parameters. In the single-device training/inference scenario, data dividing is not involved. In this case, step 2 can be skipped. + +### Step 1: Loading the Checkpoint File + +Call the `load_checkpoint` API to load model parameter data from the checkpoint file. + +```python +param_dict = load_checkpoint("./CKP-Integrated_1-4_32.ckpt") +``` + +- `load_checkpoint`: loads the checkpoint model parameter file and returns a parameter dictionary. +- `CKP-Integrated_1-4_32.ckpt`: name of the checkpoint model parameter file to be loaded. + +### Step 2: Dividing a Model Parallel Parameter + +The following uses a specific model parameter as an example. The parameter name is weight, the data value is Tensor \[\[1, 2, 3, 4], \[5, 6, 7, 8]], and the dividing strategy is to perform dividing in the two-device scenario based on \[2, 1]. Data distribution after dividing is as follows: + +| Device0 | Device1 | +|--------------------|---------------------| +| Value [1, 2, 3, 4] | Value \[5, 6, 7, 8] | + +1. Divide the model parameter data. + + In the following code example, data is divided into two slices in dimension 0. 
+
+   ```python
+   new_param = param_dict["weight"]
+   slice_list = np.split(new_param.data.asnumpy(), 2, axis=0)
+   new_param_moments = param_dict["moments.weight"]
+   slice_moments_list = np.split(new_param_moments.data.asnumpy(), 2, axis=0)
+   ```
+
+   Data after dividing:
+
+   ```text
+   slice_list[0]  --- [1, 2, 3, 4]    Corresponding to device0
+   slice_list[1]  --- [5, 6, 7, 8]    Corresponding to device1
+   ```
+
+   Similar to slice\_list, slice\_moments\_list is divided into two tensors with the shape of \[1, 4].
+
+2. Load the corresponding data slice on each node.
+
+   Obtain rank\_id of the current node and load data based on rank\_id.
+
+   ```python
+   rank = get_rank()
+   tensor_slice = Tensor(slice_list[rank])
+   tensor_slice_moments = Tensor(slice_moments_list[rank])
+   ```
+
+   - `get_rank`: obtains the ID of the current device in the cluster.
+
+3. Modify values of model parameters.
+
+   ```python
+   new_param.set_data(tensor_slice, True)
+   new_param_moments.set_data(tensor_slice_moments, True)
+   ```
+
+   - `set_data`: sets the value of a model parameter. The API parameter type is Tensor or number.
+
+### Step 3: Loading the Modified Parameter Data to the Network
+
+Call the `load_param_into_net` API to load the model parameter data to the network.
+
+```python
+net = Net()
+opt = Momentum(learning_rate=0.01, momentum=0.9, params=net.get_parameters())
+load_param_into_net(net, param_dict)
+load_param_into_net(opt, param_dict)
+```
+
+## Example
+
+### Scenario Description
+
+Overall scenario: The training is divided into two stages. The cluster scales in the two stages are different. The MatMul operator at the FC layer is simulated to run in parallel.
+
+User process:
+
+1. Execute stage 1 training. There are four devices in the stage 1 training environment. The weight shape of the MatMul operator on each device is \[2, 8]. Checkpoint files are automatically exported during the training.
+
+2. Execute the script to integrate the checkpoint files. Based on the specific dividing strategy, integrate the divided model parameters to generate the integrated checkpoint file.
+
+3. Execute stage 2 training: There are two devices in the stage 2 training environment. The weight shape of the MatMul operator on each device is \[4, 8]. Load the initialized model parameter data from the integrated checkpoint file and then perform training.
+
+> For details about the distributed environment configuration and training code, see [Distributed Training](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_ascend.html).
+>
+> This document provides the example code for integrating checkpoint files and loading checkpoint files before distributed training. The code is for reference only.
+
+### Example Code
+
+1.
Run the following script to integrate the checkpoint files:
+
+   ```bash
+   python ./integrate_checkpoint.py "Name of the checkpoint file to be integrated" "Path and name of the checkpoint file generated after integration" "Path and name of the strategy file" "Number of nodes"
+   ```
+
+   integrate\_checkpoint.py:
+
+   ```python
+   import sys
+   import os
+   import numpy as np
+   import mindspore.nn as nn
+   from mindspore import Tensor, Parameter
+   import mindspore.ops as ops
+   from mindspore.nn import Momentum, TrainOneStepCell
+   from mindspore import save_checkpoint, load_checkpoint, load_param_into_net, build_searched_strategy, merge_sliced_parameter
+
+   class Net(nn.Cell):
+       def __init__(self, weight_init):
+           super(Net, self).__init__()
+           self.weight = Parameter(Tensor(weight_init), layerwise_parallel=True)
+           self.fc = ops.MatMul(transpose_b=True)
+
+       def construct(self, x):
+           x = self.fc(x, self.weight)
+           return x
+
+   def integrate_ckpt_file(old_ckpt_file, new_ckpt_file, strategy_file, rank_size):
+       weight = np.ones([2, 8]).astype(np.float32)
+       net = Net(weight)
+       opt = Momentum(learning_rate=0.01, momentum=0.9, params=net.get_parameters())
+       net = TrainOneStepCell(net, opt)
+
+       # load CheckPoint into net in rank id order
+       param_dicts = []
+       for i in range(rank_size):
+           file_name = os.path.join("./node" + str(i), old_ckpt_file)
+           param_dict = load_checkpoint(file_name)
+           load_param_into_net(net, param_dict)
+           param_dict = {}
+           for _, param in net.parameters_and_names():
+               param_dict[param.name] = param
+           param_dicts.append(param_dict)
+
+       strategy = build_searched_strategy(strategy_file)
+       param_dict = {}
+
+       for paramname in ["weight", "moments.weight"]:
+           # get layer wise model parallel parameter
+           sliced_parameters = []
+           for i in range(rank_size):
+               parameter = param_dicts[i].get(paramname)
+               sliced_parameters.append(parameter)
+
+           # merge the parallel parameters of the model
+           merged_parameter = merge_sliced_parameter(sliced_parameters, strategy)
+           param_dict[paramname] = merged_parameter
+
+       # convert param_dict to list type data
+       param_list = []
+       for (key, value) in param_dict.items():
+           each_param = {}
+           each_param["name"] = key
+           if isinstance(value.data, Tensor):
+               param_data = value.data
+           else:
+               param_data = Tensor(value.data)
+           each_param["data"] = param_data
+           param_list.append(each_param)
+
+       # call the API to generate a new CheckPoint file
+       save_checkpoint(param_list, new_ckpt_file)
+
+       return
+
+   if __name__ == "__main__":
+       try:
+           old_ckpt_file = sys.argv[1]
+           new_ckpt_file = sys.argv[2]
+           strategy_file = sys.argv[3]
+           rank_size = int(sys.argv[4])
+           integrate_ckpt_file(old_ckpt_file, new_ckpt_file, strategy_file, rank_size)
+       except:
+           print("Fail to integrate checkpoint file")
+           sys.exit(-1)
+   ```
+
+   The command output is as follows.
+ + Before the script is executed, the parameter values in the checkpoint files are as follows: + + ```text + device0: + name is weight + value is + [[0.87537426 1.0448935 0.86736983 0.8836905 0.77354026 0.69588304 0.9183654 0.7792076] + [0.87224025 0.8726848 0.771446 0.81967723 0.88974726 0.7988162 0.72919345 0.7677011]] + name is learning_rate + value is [0.01] + name is momentum + value is [0.9] + name is moments.weight + value is + [[0.2567724 -0.07485991 0.282002 0.2456022 0.454939 0.619168 0.18964815 0.45714882] + [0.25946522 0.24344791 0.45677605 0.3611395 0.23378398 0.41439137 0.5312468 0.4696194]] + + device1: + name is weight + value is + [[0.9210751 0.9050457 0.9827775 0.920396 0.9240526 0.9750359 1.0275179 1.0819869] + [0.73605865 0.84631145 0.9746683 0.9386582 0.82902765 0.83565056 0.9702136 1.0514659]] + name is learning_rate + value is [0.01] + name is momentum + value is [0.9] + name is moments.weight + value is + [[0.2417504 0.28193963 0.06713893 0.21510397 0.23380603 0.11424308 0.0218009 -0.11969765] + [0.45955992 0.22664294 0.01990281 0.0731914 0.27125207 0.27298513 -0.01716102 -0.15327111]] + + device2: + name is weight + value is + [[1.0108461 0.8689414 0.91719437 0.8805056 0.7994629 0.8999671 0.7585804 1.0287056 ] + [0.90653455 0.60146594 0.7206475 0.8306303 0.8364681 0.89625114 0.7354735 0.8447268]] + name is learning_rate + value is [0.01] + name is momentum + value is [0.9] + name is moments.weight + value is + [[0.03440702 0.41419312 0.24817684 0.30765256 0.48516113 0.24904746 0.57791173 0.00955463] + [0.13458519 0.6690533 0.49259356 0.28319967 0.25951773 0.16777472 0.45696738 0.24933104]] + + device3: + name is weight + value is + [[0.7147005 0.9168278 0.80178416 0.6258351 0.8413766 0.5909515 0.696347 0.71359116] + [0.20506378 0.03691584 0.2454556 0.12978578 0.19065076 0.23904312 0.27509746 0.34614682]] + name is learning_rate + value is [0.01] + name is momentum + value is [0.9] + name is moments.weight + value is + [[0.14152306 0.5040985 0.24455397 0.10907605 0.11319532 0.19538902 0.01208619 0.40430856] + [-0.7773164 -0.47611716 -0.6041424 -0.6144473 -0.2651842 -0.31909415 -0.4510405 -0.12860501]] + ``` + + After the script is executed, the parameter values in the checkpoint files are as follows: + + ```text + name is weight + value is + [[1.1138763 1.0962057 1.3516843 1.0812817 1.1579804 1.1078343 1.0906502 1.3207073] + [0.916671 1.0781671 1.0368758 0.9680898 1.1735439 1.0628364 0.9960786 1.0135143] + [0.8828271 0.7963984 0.90675324 0.9830291 0.89010954 0.897052 0.7890109 0.89784735] + [1.0011744 1.0840297 1.0201758 1.0882459 0.94232416 1.0775206 1.0195118 1.0528734] + [1.0053468 0.98402303 0.99762845 0.97587246 1.0259694 1.0055295 0.99420834 0.9496847] + [1.0851002 1.0295962 1.0999886 1.0958165 0.9765328 1.146529 1.0970603 1.1388365] + [0.7147005 0.9168278 0.80178416 0.6258351 0.8413766 0.5909515 0.696347 0.71359116] + [0.20506378 0.03691584 0.2454556 0.12978578 0.19065076 0.23904312 0.27509746 0.34614682]] + name is learning_rate + value is [0.01] + name is momentum + value is [0.9] + name is moments.weight + value is + [[0.2567724 -0.07485991 0.282002 0.2456022 0.454939 0.619168 0.18964815 0.45714882] + [0.25946522 0.24344791 0.45677605 0.3611395 0.23378398 0.41439137 0.5312468 0.4696194 ] + [0.2417504 0.28193963 0.06713893 0.21510397 0.23380603 0.11424308 0.0218009 -0.11969765] + [0.45955992 0.22664294 0.01990281 0.0731914 0.27125207 0.27298513 -0.01716102 -0.15327111] + [0.03440702 0.41419312 0.24817684 0.30765256 0.48516113 0.24904746 0.57791173 
0.00955463]
+     [0.13458519 0.6690533 0.49259356 0.28319967 0.25951773 0.16777472 0.45696738 0.24933104]
+     [0.14152306 0.5040985 0.24455397 0.10907605 0.11319532 0.19538902 0.01208619 0.40430856]
+     [-0.7773164 -0.47611716 -0.6041424 -0.6144473 -0.2651842 -0.31909415 -0.4510405 -0.12860501]]
+   ```
+
+2. Execute stage 2 training and load the checkpoint file before training. The training code needs to be supplemented based on the site requirements.
+
+   ```python
+   import sys
+   import os
+   import numpy as np
+   import mindspore.nn as nn
+   from mindspore import context
+   from mindspore import Tensor, Parameter
+   import mindspore.ops as ops
+   from mindspore import load_checkpoint, load_param_into_net
+   from mindspore.nn import Momentum
+   from mindspore.communication import init, get_rank
+
+   devid = int(os.getenv('DEVICE_ID'))
+   context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', save_graphs=True, device_id=devid)
+   init()
+
+   class Net(nn.Cell):
+       def __init__(self, weight_init):
+           super(Net, self).__init__()
+           self.weight = Parameter(Tensor(weight_init), layerwise_parallel=True)
+           self.fc = ops.MatMul(transpose_b=True)
+
+       def construct(self, x):
+           x = self.fc(x, self.weight)
+           return x
+
+   def train_mindspore_impl_fc(input, label, ckpt_file):
+       param_dict = load_checkpoint(ckpt_file)
+
+       for paramname in ["weight", "moments.weight"]:
+           # get layer wise model parallel parameter
+           new_param = param_dict[paramname]
+           # split the model parameter data
+           slice_list = np.split(new_param.data.asnumpy(), 2, axis=0)
+           # load the corresponding data slice
+           rank = get_rank()
+           tensor_slice = Tensor(slice_list[rank])
+           # modify model parameter data values
+           new_param.set_data(tensor_slice, True)
+
+       # load the modified parameter data into the network
+       weight = np.ones([4, 8]).astype(np.float32)
+       net = Net(weight)
+       load_param_into_net(net, param_dict)
+       opt = Momentum(learning_rate=0.01, momentum=0.9, params=net.get_parameters())
+       load_param_into_net(opt, param_dict)
+       # train code
+       ...
+
+   if __name__ == "__main__":
+       input = np.random.random((4, 8)).astype(np.float32)
+       print("mean = ", np.mean(input, axis=1, keepdims=True))
+       label = np.random.random((4, 4)).astype(np.float32)
+       ckpt_file = sys.argv[1]
+       train_mindspore_impl_fc(input, label, ckpt_file)
+   ```
+
+   In the preceding information:
+
+   - `mode=context.GRAPH_MODE`: sets the running mode to graph mode for distributed training. (The PyNative mode does not support parallel running.)
+   - `device_id`: physical sequence number of a device, that is, the actual sequence number of the device on a computer where the device is located.
+   - `init`: completes the distributed training initialization.
+
+    Parameter values after loading:
+
+    ```text
+    device0:
+    name is weight
+    value is
+    [[0.87537426 1.0448935 0.86736983 0.8836905 0.77354026 0.69588304 0.9183654 0.7792076]
+    [0.87224025 0.8726848 0.771446 0.81967723 0.88974726 0.7988162 0.72919345 0.7677011]
+    [0.8828271 0.7963984 0.90675324 0.9830291 0.89010954 0.897052 0.7890109 0.89784735]
+    [1.0011744 1.0840297 1.0201758 1.0882459 0.94232416 1.0775206 1.0195118 1.0528734]]
+    name is learning_rate
+    value is [0.01]
+    name is momentum
+    value is [0.9]
+    name is moments.weight
+    value is
+    [[0.2567724 -0.07485991 0.282002 0.2456022 0.454939 0.619168 0.18964815 0.45714882]
+    [0.25946522 0.24344791 0.45677605 0.3611395 0.23378398 0.41439137 0.5312468 0.4696194]
+    [0.2417504 0.28193963 0.06713893 0.21510397 0.23380603 0.11424308 0.0218009 -0.11969765]
+    [0.45955992 0.22664294 0.01990281 0.0731914 0.27125207 0.27298513 -0.01716102 -0.15327111]]
+
+    device1:
+    name is weight
+    value is
+    [[1.0053468 0.98402303 0.99762845 0.97587246 1.0259694 1.0055295 0.99420834 0.9496847]
+    [1.0851002 1.0295962 1.0999886 1.0958165 0.9765328 1.146529 1.0970603 1.1388365]
+    [0.7147005 0.9168278 0.80178416 0.6258351 0.8413766 0.5909515 0.696347 0.71359116]
+    [0.20506378 0.03691584 0.2454556 0.12978578 0.19065076 0.23904312 0.27509746 0.34614682]]
+    name is learning_rate
+    value is [0.01]
+    name is momentum
+    value is [0.9]
+    name is moments.weight
+    value is
+    [[0.03440702 0.41419312 0.24817684 0.30765256 0.48516113 0.24904746 0.57791173 0.00955463]
+    [0.13458519 0.6690533 0.49259356 0.28319967 0.25951773 0.16777472 0.45696738 0.24933104]
+    [0.14152306 0.5040985 0.24455397 0.10907605 0.11319532 0.19538902 0.01208619 0.40430856]
+    [-0.7773164 -0.47611716 -0.6041424 -0.6144473 -0.2651842 -0.31909415 -0.4510405 -0.12860501]]
+    ```
diff --git a/tutorials/experts/source_en/parallel/sharding_propagation.md b/tutorials/experts/source_en/parallel/sharding_propagation.md
new file mode 100644
index 0000000000000000000000000000000000000000..066f4fcaab6e0efa0206b5d62ee98a947f2ae2ba
--- /dev/null
+++ b/tutorials/experts/source_en/parallel/sharding_propagation.md
@@ -0,0 +1,171 @@
+# Sharding Propagation
+
+`Ascend` `GPU` `Parallel Training` `Automatic Parallelization`
+
+
+
+## Background
+
+Distributed operator, Tensor Layout, and Tensor Redistribution are fundamental concepts in op-level parallelism of MindSpore. These concepts are introduced with examples [here](https://www.mindspore.cn/docs/programming_guide/en/master/design/distributed_training_design.html#automatic-parallelism); in this document, we define them formally.
+
+In op-level parallelism, we conduct SPMD (Single Program Multiple Data) style parallelism, that is, a single program is produced for all partitions. MindSpore transforms a stand-alone program into a parallel one. The transformation is fine-grained in the sense that each operator in the stand-alone program is substituted by one or more distributed operators, guaranteeing that the substitution is mathematically equivalent.
+
+### Distributed Operator
+
+Distributed Operator: together, the distributed operators running on multiple devices preserve the same semantics as their stand-alone counterpart. That is, given the same input, the distributed operators produce the same output as the stand-alone counterpart.
+
+Suppose a matrix multiplication (MatMul) operator with two matrices X and W as input, Y = MatMul(X, W), is to be parallelized across 4 devices. If X is replicated on all 4 devices and W is partitioned along the column dimension, then after the transformation, MatMul is the distributed operator on each device. If X is partitioned along the column dimension and W is partitioned along the row dimension, then MatMul followed by AllReduce are the distributed operators on each device.
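+
+To make the claim of mathematical equivalence concrete, the following NumPy-only sketch (an illustration, not part of the tutorial code) checks that column-partitioning W across 4 "devices" and concatenating the partial products reproduces the stand-alone MatMul:
+
+```python
+import numpy as np
+
+X = np.random.rand(2, 8).astype(np.float32)
+W = np.random.rand(8, 4).astype(np.float32)
+
+full = X @ W                                      # stand-alone MatMul
+parts = [X @ w for w in np.split(W, 4, axis=1)]   # each "device" holds one column slice of W
+assert np.allclose(full, np.concatenate(parts, axis=1))
+```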
+
+Besides the SP (Single Program) part, the MD (Multiple Data) part also needs to be specified. Before that, we first define the Sharding Strategy.
+
+### Sharding Strategy
+
+Sharding Strategy: a Sharding Strategy for an operator is a two-dimensional array specifying, for each input tensor of the operator, how many partitions each of its dimensions is split into.
+
+Derived from the Sharding Strategy, the Tensor Layout is defined to specify how a tensor is distributed across devices.
+
+### Tensor Layout
+
+Tensor Layout: given a Sharding Strategy for an operator, the Tensor Layout is inferred to describe the distributions of the operator's input tensors; it includes the **Logical Device Matrix** and the **Tensor Map**. The Logical Device Matrix is a one-dimensional array describing how devices are arranged for the operator. The Tensor Map maps each dimension of an input tensor to a dimension of the Logical Device Matrix, indicating how the input tensors are partitioned across it.
+
+Consider again the MatMul operator Y = MatMul(X, W). We configure the operator with Sharding Strategy [[2, 1], [1, 4]], and the corresponding Tensor Layout information is demonstrated in the following figure. X is partitioned into 2 parts along the row dimension, and W is partitioned into 4 parts along the column dimension (figure (b)). From the Sharding Strategy, the Logical Device Matrix and the Tensor Map are inferred, as shown in figure (c). The coordinates describing the locations of devices in the Logical Device Matrix are also determined, and from them the distributions of the tensors follow. From the ‘2’ column of the coordinate table, Devices 0 to 3 are assigned X0, while Devices 4 to 7 are assigned X1. From the ‘4’ column of the coordinate table, Device 0 and Device 4 are assigned W0, Device 1 and Device 5 are assigned W1, Device 2 and Device 6 are assigned W2, and Device 3 and Device 7 are assigned W3. As a result, the local computation on each device is determined, as shown in figure (d).
+
+![tensor_layout](./images/tensor_layout.png "From Sharding Strategy, Tensor Layout and local computation are inferred.")
+
+For two consecutive, dependent operators, the Tensor Layouts they define for a shared tensor may be inconsistent, due to either the Logical Device Matrix or the Tensor Map. We propose an algorithm, called Tensor Redistribution, that transforms between inconsistent Tensor Layouts. We omit the algorithm here and only give a definition.
+
+### Tensor Redistribution
+
+Tensor Redistribution: given two inconsistent Tensor Layouts of a tensor, Tensor Redistribution is an algorithm that transforms the tensor from the source Tensor Layout to the target Tensor Layout with minimum communication cost.
+
+Here, the communication cost is measured by the number of bytes that each device transmits.
+
+Consider a two-operator example: Z = MatMul(X, W), O = MatMul(Z, Y). To make Tensor Redistribution effective, the two operators are configured with Sharding Strategies such that the Tensor Layouts of Z are inconsistent, as shown in the following figure. In figure (a), the output of the first MatMul is row-partitioned, while the second MatMul requires that Z be full-sized. Therefore, an AllGather is inferred by Tensor Redistribution to perform the transformation[^1]. In figure (b), an AllToAll is inferred to perform the transformation.
+
+![tensor_redistribution](./images/tensor_redistribution.png "The full-sized programs with Sharding Strategy, and their corresponding local computation for each device.")
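+
+As a rough illustration of how such strategies are attached to operators in MindSpore, the sketch below configures the two MatMul operators of figure (a) through the `shard()` interface. The concrete strategy tuples are assumptions chosen for a 2-device setup, not values taken from the figure:
+
+```python
+import mindspore.ops as ops
+
+# Assumed strategies for illustration on 2 devices: the first MatMul row-partitions X,
+# so its output Z is row-partitioned; the second MatMul expects full-sized inputs,
+# so Tensor Redistribution inserts an AllGather on Z between the two operators.
+matmul1 = ops.MatMul()
+matmul1.shard(((2, 1), (1, 1)))
+matmul2 = ops.MatMul()
+matmul2.shard(((1, 1), (1, 1)))
+```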
+
+## Sharding Propagation
+
+Given a computation graph, Sharding Propagation is a functionality that propagates the Sharding Strategies from the configured operators to the whole graph, with the goal of minimizing the communication cost of Tensor Redistribution.
+
+The input of Sharding Propagation is a computation graph, in which nodes represent operators and edges encode the data-dependency relationships between operators. Starting from a model definition in which some operators are configured with Sharding Strategies, Sharding Propagation executes as follows:
+
+1. Generate possible Sharding Strategies for non-configured operators;
+2. Generate the Tensor Redistributions and the associated communication costs for each edge;
+3. Starting from the configured operators, propagate the Sharding Strategies to non-configured operators using BFS, with the goal of minimizing the communication cost along each edge.
+
+The following figure illustrates an example process of applying Sharding Propagation. Given a computation graph with some configured strategies, it first enumerates possible strategies for the non-configured operators, as shown in figure (b). Next, it enumerates the possible strategies and the Tensor Redistribution costs for each edge. As demonstrated in figure (c), the strategy for an edge is defined as a pair [*s_strategy*, *t_strategy*], where *s_strategy* and *t_strategy* denote the Sharding Strategies of the source operator and the target operator, respectively. Finally, starting from the configured operator, it determines the next operator’s Sharding Strategy such that the communication cost of Tensor Redistribution is minimized. The propagation ends when the Sharding Strategies of all operators are settled, as shown in figure (d).
+
+![sharding_propagation](./images/sharding_propagation.png "An example process of applying Sharding Propagation.")
+
+## How to use Sharding Propagation in MindSpore
+
+### Preliminaries
+
+> Download the complete sample code:
+>
+> .
+
+The directory structure is as follows, where `rank_table_8pcs.json` is the IP configuration for Ascend devices (see [here](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_ascend.html#configuring-distributed-environment-variables) for the explanation), `train.py` is the model definition, and `run.sh` is the execution script.
+
+```text
+└─sample_code
+    ├─sharding_propagatinon
+    │      rank_table_8pcs.json
+    │      run.sh
+    │      train.py
+    ...
+```
+
+### Model definition
+
+We use the FeedForward Network (`FFN`) as an example.
+
+```python
+from mindspore.nn import Cell, Dense
+import mindspore.ops as ops
+
+class FFN(Cell):
+    def __init__(self):
+        super().__init__()
+        self.dense1 = Dense(64, 64)
+        self.relu = ops.ReLU()
+        self.dense2 = Dense(64, 64)
+
+    def construct(self, x):
+        x = self.dense1(x)
+        x = self.relu(x)
+        x = self.dense2(x)
+        return x
+```
+
+### Configuring Sharding Propagation
+
+Annotate a Sharding Strategy for a `MatMul` operator in `FFN`:
+
+```python
+self.dense1.matmul.shard(((2, 1), (1, 4)))
+```
+
+Configure `search_mode` as `sharding_propagation` in the auto-parallel mode:
+
+```python
+context.set_auto_parallel_context(parallel_mode="auto_parallel", search_mode="sharding_propagation")
+```
+
+### Training the model and checking the Sharding Strategies
+
+Run the command `bash run.sh 8`. With `save_graphs=True` set in the context, the IR graphs generated during compilation are saved. We examine the IRs corresponding to device 0.
+
+In `step_parallel_begin_xxx.ir`, each computation operator is annotated with a Sharding Strategy:
+
+```text
+...
+  %3(x) = MatMul(%1, %2) {instance name: matmul} primitive_attrs: {input_names: [x1, x2], out_strategy: None, transpose_x2: false, transpose_b: false, in_strategy: ((2, 1), (1, 4)), output_names: [output], transpose_a: false, transpose_x1: false}
+      {in_strategy: ((2, 1), (1, 4))} : (, ) -> ()
+  %4([CNode]453) = Load($(@1_construct_wrapper.298:para4_dense1.bias), %para15_u)
+      : (, ) -> ()
+  %5(x) = Add(%3, %4) {instance name: add} primitive_attrs: {output_names: [output], input_names: [x, y]}
+      {in_strategy: ((2, 4), (4))} : (, ) -> ()
+  %6(x) = ReLU(%5) {instance name: relu} primitive_attrs: {output_names: [output], input_names: [x]}
+      {in_strategy: ((2, 4))} : () -> ()
+  %7([CNode]447) = Load($(@1_construct_wrapper.298:para5_dense2.weight), %para15_u)
+      : (, ) -> ()
+  %8(x) = MatMul(%6, %7) {instance name: matmul} primitive_attrs: {output_names: [output], transpose_a: false, input_names: [x1, x2], transpose_x2: false, transpose_x1: false, transpose_b: false}
+      {in_strategy: ((2, 4), (4, 1))} : (, ) -> ()
+  %9([CNode]449) = Load($(@1_construct_wrapper.298:para6_dense2.bias), %para15_u)
+      : (, ) -> ()
+  %10(x) = Add(%8, %9) {instance name: add} primitive_attrs: {output_names: [output], input_names: [x, y]}
+      {in_strategy: ((2, 4), (4))} : (, ) -> ()
+...
+```
+
+In `xx_validate_xxx.ir`, the input and output tensors of each computation operator are sliced according to the Sharding Strategy.
+ +```text +… + %2(equivx) = MatMul(%0, %1) {instance name: matmul} primitive_attrs: {input_names: [x1, x2], out_strategy: None, transpose_x2: false, transpose_b: false, in_strategy: ((2, 1), (1, 4)), output_names: [output], transpose_a: false, transpose_x1: false} + {in_strategy: ((2, 1), (1, 4))} : (, ) -> () + # In file ./train.py(33)/ x = self.matmul(x, self.weight)/ + %3(equiv[CNode]453) = Load(%para4_dense1.bias, U) + : (, ) -> () + %4(equivx) = Add(%2, %3) {instance name: add} primitive_attrs: {output_names: [output], input_names: [x, y]} + {in_strategy: ((2, 4), (4))} : (, ) -> () + # In file ./train.py(34)/ x = self.add(x, self.bias)/ + %5(equivx) = ReLU(%4) {instance name: relu} primitive_attrs: {output_names: [output], input_names: [x]} + {in_strategy: ((2, 4))} : () -> () + # In file ./train.py(48)/ x = self.relu(x)/ + %6(equiv[CNode]447) = Load(%para5_dense2.weight, U) + : (, ) -> () + %7(equivx) = MatMul(%5, %6) {instance name: matmul} primitive_attrs: {output_names: [output], transpose_a: false, input_names: [x1, x2], transpose_x2: false, transpose_x1: false, transpose_b: false} + {in_strategy: ((2, 4), (4, 1))} : (, ) -> () + # In file ./train.py(33)/ x = self.matmul(x, self.weight)/ + %8(equiv[CNode]493) = AllReduce(%7) {instance name: forward_op_4025687080669949636} primitive_attrs: {group: 4-6301172352641561019, fusion: 0, op: sum, group_ranks: 0-1-2-3, index: 0} + : () -> () + %9(equiv[CNode]492) = StridedSlice(%8, (0, 0), (32, 16), (1, 1)) {instance name: redistribution_op_145462406996255498StridedSlice} primitive_attrs: {new_axis_mask: 0, shrink_axis_mask: 0, end_mask: 0, input_names: [x, begin, end, strides], output_names: [output], keep_value_node_input: true, begin_mask: 0, ellipsis_mask: 0} + : (, , , ) -> () + %10(equiv[CNode]449) = Load(%para6_dense2.bias, U) + : (, ) -> () + %11(equivx) = Add(%9, %10) {instance name: add} primitive_attrs: {output_names: [output], input_names: [x, y]} + {in_strategy: ((2, 4), (4))} : (, ) -> () +… +``` + +[^1]: Note: actually, AllGather+Concat is needed here to perform the transformation. diff --git a/tutorials/experts/source_en/test/test1.md b/tutorials/experts/source_en/test/test1.md deleted file mode 100644 index cab9dc1609fc287bfdc2fc7a905e187faca88e5a..0000000000000000000000000000000000000000 --- a/tutorials/experts/source_en/test/test1.md +++ /dev/null @@ -1,3 +0,0 @@ -# Test1 - -Coming soon. \ No newline at end of file diff --git a/tutorials/experts/source_zh_cn/data_engine/eager.ipynb b/tutorials/experts/source_zh_cn/data_engine/eager.ipynb index c754b69a93aa4f6f4ffc2ced9899097526ff0160..6a0f82ae186dbd3a2e6d17734da518e46b0db625 100644 --- a/tutorials/experts/source_zh_cn/data_engine/eager.ipynb +++ b/tutorials/experts/source_zh_cn/data_engine/eager.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "source": [ - "# 数据处理的Eager模式\n", + "# 轻量化数据处理\n", "\n", "`Ascend` `GPU` `CPU` `数据准备`\n", "\n", diff --git a/tutorials/experts/source_zh_cn/model_infer/offline_inference.rst b/tutorials/experts/source_zh_cn/model_infer/offline_inference.rst index 0c67ed7d0120830384d81dc108a96106ab80ea12..4fa24090a520178f0411a234e773e918f3961a6f 100644 --- a/tutorials/experts/source_zh_cn/model_infer/offline_inference.rst +++ b/tutorials/experts/source_zh_cn/model_infer/offline_inference.rst @@ -1,4 +1,4 @@ -使用离线模型推理 +离线推理 =================== .. 
toctree:: diff --git a/tutorials/experts/source_zh_cn/model_infer/online_inference.md b/tutorials/experts/source_zh_cn/model_infer/online_inference.md index c739436cce2049ef65fba858b27c051ba60fb321..817bb64aaa9a711e41129729e73953ecfe4862b0 100644 --- a/tutorials/experts/source_zh_cn/model_infer/online_inference.md +++ b/tutorials/experts/source_zh_cn/model_infer/online_inference.md @@ -1,4 +1,4 @@ -# 加载Checkpoint在线推理 +# 在线推理 `Ascend` `推理应用` diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_training_comm_fusion.md b/tutorials/experts/source_zh_cn/parallel/comm_fusion.md similarity index 100% rename from tutorials/experts/source_zh_cn/parallel/distributed_training_comm_fusion.md rename to tutorials/experts/source_zh_cn/parallel/comm_fusion.md diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_training_dataset_slice.md b/tutorials/experts/source_zh_cn/parallel/dataset_slice.md similarity index 100% rename from tutorials/experts/source_zh_cn/parallel/distributed_training_dataset_slice.md rename to tutorials/experts/source_zh_cn/parallel/dataset_slice.md diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_advanced.rst b/tutorials/experts/source_zh_cn/parallel/distributed_advanced.rst index 8ac698c80002bf53fd1efc27f3cc3e8a4327d14b..9c4343a51dba133f682a42897e9c9d8a6a326ef9 100644 --- a/tutorials/experts/source_zh_cn/parallel/distributed_advanced.rst +++ b/tutorials/experts/source_zh_cn/parallel/distributed_advanced.rst @@ -5,14 +5,14 @@ :maxdepth: 1 distributed_advanced_overview - apply_operator_parallel - apply_pipeline_parallel - distributed_training_parallel_opt - apply_host_device_training - apply_recompute + operator_parallel + pipeline_parallel + optimizer_parallel + host_device_training + recompute sharding_propagation - apply_parameter_server_training - distributed_training_comm_fusion - distributed_training_dataset_slice + parameter_server_training + comm_fusion + dataset_slice distributed_inference pynative_shard_function_parallel diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_example.rst b/tutorials/experts/source_zh_cn/parallel/distributed_example.rst index 58197b35daeae6c139515b134eef85450d991f63..723b469ddd2065809eb6a8c1e2a90165a95b10a5 100644 --- a/tutorials/experts/source_zh_cn/parallel/distributed_example.rst +++ b/tutorials/experts/source_zh_cn/parallel/distributed_example.rst @@ -7,6 +7,6 @@ distributed_training_ascend distributed_training_gpu save_load_model_hybrid_parallel - distributed_training_transformer - distributed_training_fault_recover + transformer + fault_recover pangu_alpha diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_training_fault_recover.md b/tutorials/experts/source_zh_cn/parallel/fault_recover.md similarity index 100% rename from tutorials/experts/source_zh_cn/parallel/distributed_training_fault_recover.md rename to tutorials/experts/source_zh_cn/parallel/fault_recover.md diff --git a/tutorials/experts/source_zh_cn/parallel/apply_host_device_training.md b/tutorials/experts/source_zh_cn/parallel/host_device_training.md similarity index 100% rename from tutorials/experts/source_zh_cn/parallel/apply_host_device_training.md rename to tutorials/experts/source_zh_cn/parallel/host_device_training.md diff --git a/tutorials/experts/source_zh_cn/parallel/apply_operator_parallel.md b/tutorials/experts/source_zh_cn/parallel/operator_parallel.md similarity index 100% rename from tutorials/experts/source_zh_cn/parallel/apply_operator_parallel.md rename to 
tutorials/experts/source_zh_cn/parallel/operator_parallel.md
diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_training_parallel_opt.md b/tutorials/experts/source_zh_cn/parallel/optimizer_parallel.md
similarity index 100%
rename from tutorials/experts/source_zh_cn/parallel/distributed_training_parallel_opt.md
rename to tutorials/experts/source_zh_cn/parallel/optimizer_parallel.md
diff --git a/tutorials/experts/source_zh_cn/parallel/apply_parameter_server_training.md b/tutorials/experts/source_zh_cn/parallel/parameter_server_training.md
similarity index 100%
rename from tutorials/experts/source_zh_cn/parallel/apply_parameter_server_training.md
rename to tutorials/experts/source_zh_cn/parallel/parameter_server_training.md
diff --git a/tutorials/experts/source_zh_cn/parallel/apply_pipeline_parallel.md b/tutorials/experts/source_zh_cn/parallel/pipeline_parallel.md
similarity index 100%
rename from tutorials/experts/source_zh_cn/parallel/apply_pipeline_parallel.md
rename to tutorials/experts/source_zh_cn/parallel/pipeline_parallel.md
diff --git a/tutorials/experts/source_zh_cn/parallel/apply_recompute.md b/tutorials/experts/source_zh_cn/parallel/recompute.md
similarity index 100%
rename from tutorials/experts/source_zh_cn/parallel/apply_recompute.md
rename to tutorials/experts/source_zh_cn/parallel/recompute.md
diff --git a/tutorials/experts/source_zh_cn/parallel/distributed_training_transformer.md b/tutorials/experts/source_zh_cn/parallel/transformer.md
similarity index 100%
rename from tutorials/experts/source_zh_cn/parallel/distributed_training_transformer.md
rename to tutorials/experts/source_zh_cn/parallel/transformer.md