diff --git a/docs/federated/docs/source_en/image_classification_application.md b/docs/federated/docs/source_en/image_classification_application.md index 5c77e7ac53631c49d6ad2afbb8a6aaf72b294efd..51e7b2935cddf22646908300110f3f420bd30416 100644 --- a/docs/federated/docs/source_en/image_classification_application.md +++ b/docs/federated/docs/source_en/image_classification_application.md @@ -307,7 +307,7 @@ Currently, the `3500_clients_bin` folder contains data of 3500 clients. This scr The following figure shows the accuracy of the test dataset for federated learning on 50 clients (set `server_num` to 16). -![lenet_50_clients_acc](images/lenet_50_clients_acc.png) +![lenet_50_clients_acc](images/lenet_50_clients_acc_en.png) The total number of federated learning iterations is 100, the number of epochs for local training on the client is 20, and the value of batchSize is 32. diff --git a/docs/federated/docs/source_en/images/lenet_50_clients_acc.png b/docs/federated/docs/source_en/images/lenet_50_clients_acc.png deleted file mode 100644 index c1282811f7161d77ec2ea563d96983ef293dbf43..0000000000000000000000000000000000000000 Binary files a/docs/federated/docs/source_en/images/lenet_50_clients_acc.png and /dev/null differ diff --git a/docs/federated/docs/source_en/split_wnd_application.md b/docs/federated/docs/source_en/split_wnd_application.md index a75fa829d9d690ae94bd30765f0b8485b53e4b55..bad52fa81570a8bf532c9d94995df97b9534bee5 100644 --- a/docs/federated/docs/source_en/split_wnd_application.md +++ b/docs/federated/docs/source_en/split_wnd_application.md @@ -10,12 +10,18 @@ Vertical FL model training scenarios: including two stages of forward propagatio Forward propagation: After the data intersection module processes the parameter-side data and aligns the feature information and label information, the Follower participant inputs the local feature information into the precursor network model, and the feature tensor output from the precursor network model is 
encrypted/scrambled by the privacy security module and transmitted to the Leader participant by the communication module. The Leader participant inputs the received feature tensor into the post-level network model, and the predicted values and local label information output from the post-level network model are used as the loss function input to calculate the loss values. +![](./images/vfl_forward_en.png) + Backward propagation: The Leader participant calculates the parameter gradient of the backward network model based on the loss value, trains and updates the parameters of the backward network model, and transmits the gradient tensor associated with the feature tensor to the Follower participant by the communication module after it is encrypted and scrambled by the privacy security module. The Follower participant uses the received gradient tensor to train and update the frontward network model parameters. +![](./images/vfl_backward_en.png) + Vertical FL model inference scenario: similar to the forward propagation phase of the training scenario, but with the predicted values of the backward network model directly as the output, without calculating the loss values. ## Network and Data +![](./images/splitnn_wide_and_deep_en.png) + This sample provides a federated learning training example for recommendation-oriented tasks, using the Wide&Deep network and the Criteo dataset as examples. As shown above, in this case, the vertical federated learning system consists of the Leader participant and the Follower participant. Among them, the Leader participant holds 20×2 dimensional feature information and label information, and the Follower participant holds 19×2 dimensional feature information. The Leader participant and the Follower participant each deploy one Wide&Deep network, and realize the collaborative training of the network model by exchanging embedding vectors and gradient vectors without disclosing the original features and label information.
For a detailed description of the principles and features of Wide&Deep networks, see [MindSpore ModelZoo - Wide&Deep - Wide&Deep Overview](https://gitee.com/mindspore/models/blob/master/official/recommend/wide_and_deep/README.md#widedeep-description) and its [research paper](https://arxiv.org/pdf/1606.07792.pdf). diff --git a/docs/mindspore/source_en/note/static_graph_syntax_support.md b/docs/mindspore/source_en/note/static_graph_syntax_support.md index f6d7f72cd295fa3a9088e9f71f9e6255277c4c88..588871ea84b89e375b05e0a6a2ffd815297fedf4 100644 --- a/docs/mindspore/source_en/note/static_graph_syntax_support.md +++ b/docs/mindspore/source_en/note/static_graph_syntax_support.md @@ -775,9 +775,7 @@ Parameter: `cond` -- Variables of `Bool` type and constants of `Bool`, `List`, ` Restrictions: -- If `cond` is not a constant, the variable or constant assigned to a same sign in different branches should have same data type.If the data type of assigned variables or constants is `Tensor`, the variables and constants should have same shape and element type. - -- The number of `if` cannot exceed 100. +- If `cond` is not a constant, the variable or constant assigned to a same sign in different branches should have same data type. If the data type of assigned variables or constants is `Tensor`, the variables and constants should have same shape and element type. For shape consistency restrictions, please refer to [ShapeJoin Rules](https://www.mindspore.cn/tutorials/experts/en/master/network/control_flow.html#shapejoin-rules). Example 1: @@ -930,14 +928,12 @@ Parameter: `cond` -- Variables of `Bool` type and constants of `Bool`, `List`, ` Restrictions: -- If `cond` is not a constant, the variable or constant assigned to a same sign inside body of `while` and outside body of `while` should have same data type.If the data type of assigned variables or constants is `Tensor`, the variables and constants should have same shape and element type.
+- If `cond` is not a constant, the variable or constant assigned to a same sign inside body of `while` and outside body of `while` should have same data type. If the data type of assigned variables or constants is `Tensor`, the variables and constants should have same shape and element type. For shape consistency restrictions, please refer to [ShapeJoin Rules](https://www.mindspore.cn/tutorials/experts/en/master/network/control_flow.html#shapejoin-rules). - The `while...else...` statement is not supported. - If `cond` is not a constant, in while body, the data with type of `Number`, `List`, `Tuple` are not allowed to update and the shape of `Tensor` data are not allowed to change. -- The number of `while` cannot exceed 100. - Example 1: ```python diff --git a/docs/recommender/docs/source_en/images/offline_training.png b/docs/recommender/docs/source_en/images/offline_training.png index 8d0993a881318d0a5b802973187ac2aad327f7a1..41eac8a2105a981227866823e209cd04e8ccb391 100644 Binary files a/docs/recommender/docs/source_en/images/offline_training.png and b/docs/recommender/docs/source_en/images/offline_training.png differ diff --git a/docs/recommender/docs/source_en/images/online_training.png b/docs/recommender/docs/source_en/images/online_training.png index 40b43f66b44d51d33723a2bb1de4515168ed502a..230b248e36abb4db7acf1a7d42524ccf1a03a0da 100644 Binary files a/docs/recommender/docs/source_en/images/online_training.png and b/docs/recommender/docs/source_en/images/online_training.png differ diff --git a/docs/recommender/docs/source_en/index.rst b/docs/recommender/docs/source_en/index.rst index d36b23bfe34d8e73d9a316f0cb1742f1fdc34485..5b25cd96876d85e5820fbe943c3d874103f1ed0a 100644 --- a/docs/recommender/docs/source_en/index.rst +++ b/docs/recommender/docs/source_en/index.rst @@ -1,11 +1,18 @@ MindSpore Recommender Documents ================================ -MindSpore Recommender是一个构建在MindSpore框架基础上,面向推荐领域的开源训练加速库,通过MindSpore大规模的异构计算加速能力,MindSpore
Recommender支持在线以及离线场景大规模动态特征的高效训练。 +MindSpore Recommender is an open source training acceleration library based on the MindSpore framework for the recommendation domain. With MindSpore's large-scale heterogeneous computing acceleration capability, MindSpore Recommender supports efficient training of large-scale dynamic features for online and offline scenarios. .. raw:: html - +

+ +The MindSpore Recommender acceleration library consists of the following components: + +- online training: supports online training on real-time data and incremental model updates by reading streaming data from real-time data sources (e.g., Kafka) and processing it online, serving business scenarios that require real-time model updates. +- offline training: supports the traditional offline-dataset training scenario, training recommendation models that contain large-scale feature vectors through automatic parallelism, distributed feature caching, heterogeneous acceleration, and other techniques. +- data processing: MindPandas and MindData provide online and offline data reading and processing; full-Python expression support avoids the overhead of mixing multiple languages and frameworks, and opens up an efficient data flow link between data processing and model training. +- model library: includes a continuously enriched set of typical recommendation models that, after rigorous validation of accuracy and performance, can be used out of the box. .. toctree:: :glob: diff --git a/docs/recommender/docs/source_en/offline_learning.md b/docs/recommender/docs/source_en/offline_learning.md index fbfea10950c7b5cba162c43a89d38d5227c5d9fe..4c088d7f03fc31033e76cc6efb9dfe5250576746 100644 --- a/docs/recommender/docs/source_en/offline_learning.md +++ b/docs/recommender/docs/source_en/offline_learning.md @@ -1,3 +1,17 @@ # Offline Training + +## Overview + +One of the main challenges of recommendation model training is the storage and training of large-scale feature vectors. MindSpore Recommender provides a complete solution for training large-scale feature vectors in offline scenarios. + +## Overall Architecture + +The training architecture for large-scale feature vectors in recommendation models is shown in the figure below, whose core adopts a distributed multi-level Embedding Cache scheme.
Multi-machine, multi-card distributed parallelism based on model parallelism enables large-scale, low-cost training of large recommendation models. + +![image.png](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/docs/recommender/docs/source_en/images/offline_training.png) + +## Example + +[Wide&Deep distributed training](https://gitee.com/mindspore/recommender/tree/master/models/wide_deep) diff --git a/docs/recommender/docs/source_zh_cn/images/offline_training.png b/docs/recommender/docs/source_zh_cn/images/offline_training.png index 8d0993a881318d0a5b802973187ac2aad327f7a1..41eac8a2105a981227866823e209cd04e8ccb391 100644 Binary files a/docs/recommender/docs/source_zh_cn/images/offline_training.png and b/docs/recommender/docs/source_zh_cn/images/offline_training.png differ diff --git a/docs/recommender/docs/source_zh_cn/images/online_training.png b/docs/recommender/docs/source_zh_cn/images/online_training.png index 40b43f66b44d51d33723a2bb1de4515168ed502a..230b248e36abb4db7acf1a7d42524ccf1a03a0da 100644 Binary files a/docs/recommender/docs/source_zh_cn/images/online_training.png and b/docs/recommender/docs/source_zh_cn/images/online_training.png differ diff --git a/tutorials/experts/source_en/network/control_flow.md b/tutorials/experts/source_en/network/control_flow.md index 60026c104e542e2b15078df493a0e0a3f4a32649..0a90346861afbfc2c228df4cbdf62372e1badfd3 100644 --- a/tutorials/experts/source_en/network/control_flow.md +++ b/tutorials/experts/source_en/network/control_flow.md @@ -47,7 +47,7 @@ The operator output can be determined only when each step is executed.
Therefore ## if Statement -When defining a network in `GRAPH_MODE` using the `if` statement, pay attention to the following: **When the condition expression is a variable condition, the same variable in different branches must be assigned the same data type.** +When defining a network in `GRAPH_MODE` using the `if` statement, pay attention to the following: **When the condition expression is a variable condition, the same variable in different branches must be assigned the same data type. For example, a `Tensor` type variable requires the same shape and type in each branch. For shape consistency restrictions, please refer to [ShapeJoin Rules](#shapejoin-rules).** ### if Statement Under a Variable Condition @@ -336,7 +336,7 @@ print("output:", output) IndexError: mindspore/core/abstract/prim_structures.cc:127 InferTupleOrListGetItem] list_getitem evaluator index should be in range[-3, 3), but got 3. ``` -2. Constraint 2: **When the condition expression in the while statement is a variable condition, the input shape of the operator cannot be changed in the loop body.** +2. Constraint 2: **When the condition expression in the while statement is a variable condition, the input shape of the operator cannot be changed in the loop body. The data types of variables with the same name inside the loop body and outside the loop body should be the same; for example, `Tensor` type variables require the same shape and type. For shape consistency restrictions, please refer to [ShapeJoin Rules](#shapejoin-rules).** MindSpore requires that the input shape of the same operator on the network be determined during graph build. However, changing the input shape of the operator in the `while` loop body takes effect during graph execution. @@ -382,3 +382,44 @@ print("output:", output) ValueError: mindspore/ccsrc/pipeline/jit/static_analysis/static_analysis.cc:800 ProcessEvalResults] Cannot join the return values of different branches, perhaps you need to make them equal.
Shape Join Failed: shape1 = (1), shape2 = (1, 1). ``` + +## ShapeJoin Rules + +`unknown_shape` indicates that the length of the corresponding dimension is dynamic in the dynamic shape scenario, and `unknown_rank` indicates that the rank of the shape is dynamic in the dynamic rank scenario. `shape1` and `shape2` indicate the shapes of the two branches where the Join is performed, respectively. Shape Join succeeds when any of the following rules is met; otherwise, a `Shape Join Failed` exception is reported. + +- Rule 1: + + The ranks of shape1 and shape2 are both fixed and equal, and shape1[i] equals shape2[i] for every dimension i. + +- Rule 2: + + The ranks of shape1 and shape2 are both fixed and equal, and for every dimension i where shape1[i] and shape2[i] differ, at least one of them is unknown_shape. + +- Rule 3: + + The rank of at least one of shape1 and shape2 is dynamic, i.e., shape1 or shape2 is unknown_rank. + +- Rule 4: + + The ranks of shape1 and shape2 are fixed but unequal, with the smaller rank being m and the larger rank being n. + + In the range of dimensions 0 to m-1, each dimension i satisfies either of the following: + + 1. shape1[i] and shape2[i] are equal. + + 2. Both shape1[i] and shape2[i] are unknown_shape. + + In the range of dimensions m to n-1, each shape[i] of the larger-rank shape is unknown_shape. + +The following table gives examples of the Shape Join rules.
+ +| shape1 | shape2 | Join Results| +| :----- | :----- | :------- | +| (3, 4)| (3, 4)| (3, 4) | +| (3, 5)| (3, 4)| Join Fail | +| (3, 4)| (3, 4, 1)| Join Fail | +| (3, unknown_shape) | (3, 4)| (3, unknown_shape) | +| unknown_rank | (3, 4)| unknown_rank | +| (3, unknown_shape)| (3, unknown_shape, unknown_shape)| unknown_rank | +| (3, unknown_shape)| (4, unknown_shape, unknown_shape)| Join Fail | +| (3, unknown_shape)| (3, 4, unknown_shape)| Join Fail | diff --git a/tutorials/experts/source_zh_cn/network/control_flow.ipynb b/tutorials/experts/source_zh_cn/network/control_flow.ipynb index fc315f3dc0b84899d0343122ea6ed8160a6ecc9f..d1a212d0fdae81b591351ab2378e138dd1416961 100644 --- a/tutorials/experts/source_zh_cn/network/control_flow.ipynb +++ b/tutorials/experts/source_zh_cn/network/control_flow.ipynb @@ -493,7 +493,7 @@ "source": [ "## ShapeJoin规则\n", "\n", - "`unknow_shape`表示动态shape场景下, 对应维度的长度是动态的,`unknown_rank`表示动态rank场景下shape的维度是动态的,`shape1`与`shape2`分别表示进行Join的两个分支的shape。当满足下面任一规则时,Shape Join会成功,否则会出现`Shape Join Failed`异常。\n", + "`unknow_shape`表示动态shape场景下,对应维度的长度是动态的,`unknown_rank`表示动态rank场景下shape的维度是动态的,`shape1`与`shape2`分别表示进行Join的两个分支的shape。当满足下面任一规则时,Shape Join会成功,否则会出现`Shape Join Failed`异常。\n", "\n", "- 规则1:\n", "\n",
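The ShapeJoin rules added above can be checked with a minimal standalone sketch. This is not part of the patch or of MindSpore itself: `shape_join` is a hypothetical pure-Python helper that encodes the four rules and reproduces the example table, with `None` standing in for the `Shape Join Failed` exception.

```python
UNKNOWN_SHAPE = "unknown_shape"  # dynamic length of a single dimension
UNKNOWN_RANK = "unknown_rank"    # dynamic number of dimensions

def shape_join(shape1, shape2):
    """Join two branch shapes per the ShapeJoin rules; return None on failure."""
    # Rule 3: if either shape has a dynamic rank, the join result is unknown_rank.
    if shape1 == UNKNOWN_RANK or shape2 == UNKNOWN_RANK:
        return UNKNOWN_RANK
    if len(shape1) == len(shape2):
        # Rules 1 and 2: equal ranks; each dimension must match,
        # or at least one side must be unknown_shape.
        result = []
        for d1, d2 in zip(shape1, shape2):
            if d1 == d2:
                result.append(d1)
            elif UNKNOWN_SHAPE in (d1, d2):
                result.append(UNKNOWN_SHAPE)
            else:
                return None  # Shape Join Failed
        return tuple(result)
    # Rule 4: fixed but unequal ranks m < n.
    short, long_ = sorted((shape1, shape2), key=len)
    m = len(short)
    for d1, d2 in zip(short, long_):
        # dimensions 0..m-1: equal, or both unknown_shape
        if not (d1 == d2 or (d1 == UNKNOWN_SHAPE and d2 == UNKNOWN_SHAPE)):
            return None
    # dimensions m..n-1 of the larger-rank shape must all be unknown_shape
    if any(d != UNKNOWN_SHAPE for d in long_[m:]):
        return None
    return UNKNOWN_RANK

print(shape_join((3, 4), (3, 4)))                # (3, 4)
print(shape_join((3, UNKNOWN_SHAPE), (3, 4)))    # (3, 'unknown_shape')
print(shape_join((3, 4), (3, 4, 1)))             # None, i.e. Join Fail
```

Running the helper over every row of the table above yields the listed Join Results, which is a quick way to sanity-check the prose rules against the examples.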