From d457e2e6ee6ec5055accd0eb86e04b42d0cd4fb7 Mon Sep 17 00:00:00 2001
From: huanxiaoling <3174348550@qq.com>
Date: Wed, 2 Nov 2022 14:45:56 +0800
Subject: [PATCH] update the en files in reinforcement
---
.../docs/source_en/environment.md | 80 +++++++++++
docs/reinforcement/docs/source_en/index.rst | 2 +
.../docs/source_en/replaybuffer.md | 134 ++++++++++++++++++
.../docs/source_zh_cn/environment.md | 2 +-
.../docs/source_zh_cn/replaybuffer.md | 4 +-
5 files changed, 219 insertions(+), 3 deletions(-)
create mode 100644 docs/reinforcement/docs/source_en/environment.md
create mode 100644 docs/reinforcement/docs/source_en/replaybuffer.md
diff --git a/docs/reinforcement/docs/source_en/environment.md b/docs/reinforcement/docs/source_en/environment.md
new file mode 100644
index 0000000000..eac22b83ab
--- /dev/null
+++ b/docs/reinforcement/docs/source_en/environment.md
@@ -0,0 +1,80 @@
+# Reinforcement Learning Environment Access
+
+
+
+## Overview
+
+In the field of reinforcement learning, an agent learns a policy that maximizes a numerical reward signal while interacting with its environment. The "environment" is an important element in the field of reinforcement learning, as it represents the problem to be solved.
+
+A wide variety of environments are currently used for reinforcement learning: [Mujoco](https://github.com/deepmind/mujoco), [MPE](https://github.com/openai/multiagent-particle-envs), [Atari](https://github.com/gsurma/atari), [PySC2](https://www.github.com/deepmind/pysc2), [SMAC](https://github.com/oxwhirl/smac), [TORCS](https://github.com/ugo-nama-kun/gym_torcs), [Isaac](https://github.com/NVIDIA-Omniverse/IsaacGymEnvs), etc. MindSpore Reinforcement currently supports two environments, Gym and SMAC, and will gradually support more environments as its algorithms are enriched. In this article, we introduce how to access a third-party environment in MindSpore Reinforcement.
+
+## Encapsulating Environmental Python Functions as Operators
+
+Before that, let us introduce the static and dynamic graph modes.
+
+- In dynamic graph mode, the program is executed line by line in the order in which the code is written, and the framework sends the individual operators in the neural network down to the device one by one for computation, making it easy for the user to write and debug the neural network model.
+
+- In static graph mode, the framework compiles the developer-defined algorithm into a computation graph before execution. In this process, the compiler can use graph optimization techniques to reduce resource overhead and achieve better execution performance.
+
+Since the syntax supported by the static graph mode is a subset of the Python language, while commonly used environments generally rely on the Python interface to implement interactions, the syntax differences between the two often result in graph compilation errors. For this problem, developers can use the `PyFunc` operator to encapsulate a Python function as an operator in a MindSpore computation graph.
+
+Next, using gym as an example, encapsulate `env.reset()` as an operator in a MindSpore computation graph.
+
+The following code creates a `CartPole-v0` environment and executes the `env.reset()` method. You can see that the type of `state` is `numpy.ndarray`, and its data type and shape are `np.float64` and `(4,)` respectively.
+
+```python
+import gym
+
+env = gym.make('CartPole-v0')
+state = env.reset()
+print('type: {}, shape: {}, dtype: {}'.format(type(state), state.shape, state.dtype))
+
+# Result:
+# type: <class 'numpy.ndarray'>, shape: (4,), dtype: float64
+```
+
+`env.reset()` can be encapsulated into a MindSpore operator by using the `PyFunc` operator; a short sketch is given after the following parameter description.
+
+- `fn` specifies the name of the Python function to be encapsulated, either as a normal function or as a member function.
+- `in_types` and `in_shapes` specify the data types and shapes of the inputs. `env.reset` takes no input, so both are filled with an empty list.
+- `out_types` and `out_shapes` specify the data types and shapes of the return values. From the previous execution, it can be seen that `env.reset()` returns a numpy array whose data type and shape are `np.float64` and `(4,)` respectively, so `[ms.float64,]` and `[(4,),]` are filled in.
+- `PyFunc` returns tuple(Tensor).
+- For more detailed instructions, refer to the [reference](https://gitee.com/mindspore/mindspore/blob/master/mindspore/python/mindspore/ops/operations/other_ops.py).
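+
+Putting the pieces together, a minimal sketch of the encapsulation is shown below. It assumes that `PyFunc` can be imported from `mindspore.ops.operations` (as suggested by the reference above); the exact import path may differ between MindSpore versions.
+
+```python
+import gym
+import mindspore as ms
+from mindspore.ops.operations import PyFunc
+
+env = gym.make('CartPole-v0')
+
+# Wrap env.reset: no inputs, a single float64 output of shape (4,).
+reset_op = PyFunc(env.reset, [], [], [ms.float64,], [(4,),])
+
+# reset_op can now be used inside a MindSpore computation graph.
+# It returns tuple(Tensor); here the tuple holds one Tensor of shape (4,).
+state = reset_op()[0]
+```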
+
+## Decoupling Environment and Algorithms
+
+Reinforcement learning algorithms should usually have good generalization; for example, an algorithm that solves `HalfCheetah` should also be able to solve `Pendulum`. To achieve such generalization, the environment must be decoupled from the rest of the algorithm, so that the rest of the script needs as little modification as possible after the environment is changed. It is recommended that developers encapsulate the environment by referring to `Environment`.
+
+```python
+from mindspore import nn
+from mindspore_rl.environment import Space
+
+
+class Environment(nn.Cell):
+    def __init__(self):
+        super(Environment, self).__init__(auto_prefix=False)
+
+    def reset(self):
+        pass
+
+    def step(self, action):
+        pass
+
+    @property
+    def action_space(self) -> Space:
+        pass
+
+    @property
+    def observation_space(self) -> Space:
+        pass
+
+    @property
+    def reward_space(self) -> Space:
+        pass
+
+    @property
+    def done_space(self) -> Space:
+        pass
+```
+
+In addition to the interfaces for interacting with the environment, such as `reset` and `step`, `Environment` needs to provide methods such as `action_space` and `observation_space`, which return [Space](https://mindspore.cn/reinforcement/docs/en/master/reinforcement.html#mindspore_rl.environment.Space) types. Based on the `Space` information, the algorithm can:
+
+- obtain the dimensions of the state space and action space in the environment, which are used to construct the neural network.
+- read the range of legal actions, and scale and clip the actions given by the policy network.
+- identify whether the action space of the environment is discrete or continuous, and choose whether to explore the environment with a continuous or discrete distribution (see the sketch below).
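+
+The sketch below illustrates these operations. It assumes an already-constructed environment instance `env`, and it assumes that `Space` exposes `shape`, `is_discrete`, and `num_values` properties; these names are used for illustration only and should be checked against the linked `Space` API.
+
+```python
+def build_policy_head_dims(env):
+    """Derive network input/output widths from the environment's Space information (illustrative)."""
+    # Width of the policy network input, taken from the observation space.
+    obs_dim = env.observation_space.shape[-1]
+    if env.action_space.is_discrete:
+        # Discrete actions: one logit per action, sampled from a categorical distribution.
+        act_dim = env.action_space.num_values
+    else:
+        # Continuous actions: one mean per action dimension, sampled from a Gaussian.
+        act_dim = env.action_space.shape[-1]
+    return obs_dim, act_dim
+```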
diff --git a/docs/reinforcement/docs/source_en/index.rst b/docs/reinforcement/docs/source_en/index.rst
index 48659416f7..72c3f798f3 100644
--- a/docs/reinforcement/docs/source_en/index.rst
+++ b/docs/reinforcement/docs/source_en/index.rst
@@ -48,6 +48,8 @@ Typical MindSpore Reinforcement Application Scenarios
custom_config_info
dqn
+ replaybuffer
+ environment
.. toctree::
:maxdepth: 1
diff --git a/docs/reinforcement/docs/source_en/replaybuffer.md b/docs/reinforcement/docs/source_en/replaybuffer.md
new file mode 100644
index 0000000000..df489c040e
--- /dev/null
+++ b/docs/reinforcement/docs/source_en/replaybuffer.md
@@ -0,0 +1,134 @@
+# ReplayBuffer Usage Introduction
+
+
+
+## Brief Introduction of ReplayBuffer
+
+In reinforcement learning, the ReplayBuffer is a commonly used basic data storage mechanism, whose function is to store the data obtained from the interaction of an agent with its environment.
+
+Using a ReplayBuffer solves the following problems:
+
+1. Stored historical data can be extracted by sampling to break the correlation of the training data, so that the sampled data have approximately independent and identically distributed characteristics.
+2. It provides temporary storage of data and improves data utilization.
+
+## ReplayBuffer Implementation of MindSpore Reinforcement Learning
+
+Typically, algorithm developers use native Python data structures or NumPy data structures to construct a ReplayBuffer, or general reinforcement learning frameworks provide standard API encapsulations. The difference is that MindSpore implements the ReplayBuffer structure on the device side. On the one hand, this structure can reduce the frequent copying of data between host and device when GPU hardware is used; on the other hand, expressing the ReplayBuffer in the form of MindSpore operators makes it possible to build a complete IR graph and enable the various graph optimizations of MindSpore GRAPH_MODE, improving the overall performance.
+
+In MindSpore, two kinds of ReplayBuffer are provided, UniformReplayBuffer and PriorityReplayBuffer, which are used for common FIFO storage and storage with priority, respectively. The following is an example of UniformReplayBuffer implementation and usage.
+
+A ReplayBuffer is represented as a list of Tensors, where each Tensor represents a set of data stored by column (e.g., a set of [state, action, reward]). Data newly put into the UniformReplayBuffer is updated with a FIFO mechanism, and the buffer provides insert, search, and sample functions.
+
+### Parameter Explanation
+
+Create a UniformReplayBuffer with the initialization parameters batch_size, capacity, shapes, and types.
+
+* batch_size indicates the size of the data sampled at a time, an integer value.
+* capacity indicates the total capacity of the created UniformReplayBuffer, an integer value.
+* shapes indicates the shape of each set of data in the buffer, expressed as a list.
+* types indicates the data type corresponding to each set of data in the buffer, expressed as a list.
+
+### Functions Introduction
+
+#### 1 Insert
+
+The insert method takes a set of data as input, and requires that the shape and type of the data be the same as those specified when the UniformReplayBuffer was created. It has no output.
+To simulate the FIFO characteristics of a circular queue, we use two cursors, head and count, to mark the head of the queue and its effective length. The following steps show the process of several insertion operations; a pure-Python sketch of the cursor logic follows the list.
+
+1. The total size of the buffer is 6. In the initial state, the cursor head and count are both 0.
+2. After inserting a batch_size of 2, the current head is unchanged and count is added by 2.
+3. After continuing to insert a batch_size of 4, the queue is full and the count is 6.
+4. Continuing to insert a batch_size of 2 overwrites the oldest data and moves head forward by 2.
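+
+The sketch below is a pure-Python illustration of how the head and count cursors could evolve during these insertions; it only illustrates the mechanism and is not the actual MindSpore operator implementation.
+
+```python
+capacity = 6
+head, count = 0, 0
+
+def insert_cursors(batch_size):
+    """Advance the cursors as a circular FIFO queue would (illustrative only)."""
+    global head, count
+    if count + batch_size <= capacity:
+        # Queue not yet overflowing: only the valid length grows.
+        count += batch_size
+    else:
+        # Queue overflows: old data is overwritten and head moves forward.
+        overflow = count + batch_size - capacity
+        head = (head + overflow) % capacity
+        count = capacity
+
+for batch in (2, 4, 2):
+    insert_cursors(batch)
+    print('head={}, count={}'.format(head, count))
+# head=0, count=2
+# head=0, count=6
+# head=2, count=6
+```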
+
+#### 2 Search
+
+The search method accepts an index as input, indicating the specific location of the data to be found. The output is a set of Tensors, as described below:
+
+1. If the UniformReplayBuffer is exactly full or not yet full, the corresponding data is found directly according to the index.
+2. For data that has been overwritten, the index is remapped by the cursors (the remapping is illustrated in the sketch after the Sample subsection).
+
+
+
+#### 3 Sample
+
+The sample method has no input, and the output is a set of Tensors whose size is the batch_size specified when the UniformReplayBuffer was created, as described below.
+Assuming that batch_size is 3, a random set of indexes is generated inside the operator, and this set of random indexes is handled in one of two ways:
+
+1. Order-preserving: each index represents the real data position (counted in insertion order), so it needs to be remapped to the physical slot by cursor arithmetic.
+2. Non-order-preserving: each index does not represent the real position; the corresponding slot is read directly.
+
+Both approaches have a slight impact on randomness, and the default is the non-order-preserving mode to obtain the best performance; both modes are illustrated in the sketch below.
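+
+The pure-Python sketch below illustrates the index remapping used by search and by order-preserving sampling, continuing the head/count example above; again, it only illustrates the mechanism and is not the operator implementation.
+
+```python
+import numpy as np
+
+capacity, head, count = 6, 2, 6   # buffer state after the insertions above
+batch_size = 3
+
+def remap(index):
+    # Map a logical position (0 = oldest element) to the physical slot.
+    return (head + index) % capacity
+
+# Search: read the element at logical position 0, i.e. the oldest one.
+physical_slot = remap(0)
+
+# Order-preserving sampling: draw logical positions, then remap each of them.
+logical = np.random.randint(0, count, size=batch_size)
+slots_order_preserving = [remap(i) for i in logical]
+
+# Non-order-preserving sampling: draw physical slots directly, no remapping.
+slots_direct = np.random.randint(0, count, size=batch_size)
+```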
+
+## UniformReplayBuffer Introduction of MindSpore Reinforcement Learning
+
+### Creation of UniformReplayBuffer
+
+MindSpore Reinforcement Learning provides a standard ReplayBuffer API. Users can let the framework create the ReplayBuffer by means of a configuration file, such as the configuration file of [dqn](https://gitee.com/mindspore/reinforcement/blob/master/mindspore_rl/algorithm/dqn/config.py):
+
+```python
+'replay_buffer':
+ {'number': 1,
+ 'type': UniformReplayBuffer,
+ 'capacity': 100000,
+ 'data_shape': [(4,), (1,), (1,), (4,)],
+ 'data_type': [ms.float32, ms.int32, ms.float32, ms.float32],
+ 'sample_size': 64}
+```
+
+Alternatively, users can use the interfaces directly to create the required data structures:
+
+```python
+from mindspore_rl.core.uniform_replay_buffer import UniformReplayBuffer
+import mindspore as ms
+sample_size = 2
+capacity = 100000
+shapes = [(4,), (1,), (1,), (4,)]
+types = [ms.float32, ms.int32, ms.float32, ms.float32]
+replaybuffer = UniformReplayBuffer(sample_size, capacity, shapes, types)
+```
+
+### Using the Created UniformReplayBuffer
+
+Take the [UniformReplayBuffer](https://gitee.com/mindspore/reinforcement/blob/master/mindspore_rl/core/uniform_replay_buffer.py) created through the API above as an example to perform data manipulation:
+
+* Insert operation
+
+```python
+state = ms.Tensor([0.1, 0.2, 0.3, 0.4], ms.float32)
+action = ms.Tensor([1], ms.int32)
+reward = ms.Tensor([1], ms.float32)
+new_state = ms.Tensor([0.4, 0.3, 0.2, 0.1], ms.float32)
+replaybuffer.insert([state, action, reward, new_state])
+replaybuffer.insert([state, action, reward, new_state])
+```
+
+* Search operation
+
+```python
+exp = replaybuffer.get_item(0)
+```
+
+* Sample operation
+
+```python
+samples = replaybuffer.sample()
+```
+
+* Reset operation
+
+```python
+replaybuffer.reset()
+```
+
+* Get the size of the data currently stored in the buffer
+
+```python
+size = replaybuffer.size()
+```
+
+* Determine if the current buffer is full
+
+```python
+if replaybuffer.full():
+    print("The buffer is full.")
+```
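+
+Putting these operations together, a minimal usage sketch is shown below. It reuses the `replaybuffer` and `sample_size` created above, and the transition values are purely illustrative.
+
+```python
+# Fill the buffer with a few illustrative transitions ...
+for _ in range(10):
+    state = ms.Tensor([0.1, 0.2, 0.3, 0.4], ms.float32)
+    action = ms.Tensor([1], ms.int32)
+    reward = ms.Tensor([1], ms.float32)
+    new_state = ms.Tensor([0.4, 0.3, 0.2, 0.1], ms.float32)
+    replaybuffer.insert([state, action, reward, new_state])
+
+# ... then draw a batch of sample_size elements for training.
+samples = replaybuffer.sample()
+print(replaybuffer.size(), replaybuffer.full())
+```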
diff --git a/docs/reinforcement/docs/source_zh_cn/environment.md b/docs/reinforcement/docs/source_zh_cn/environment.md
index d7a0d91309..a1ba9cf071 100644
--- a/docs/reinforcement/docs/source_zh_cn/environment.md
+++ b/docs/reinforcement/docs/source_zh_cn/environment.md
@@ -20,7 +20,7 @@
接下来以gym为例,将`env.reset()`封装为一个MindSpore计算图中的算子:
-下面的代码中创建了一个`CartPole-v0`的环境,执行`env.reset()`方法,可以看到`state`的类型是`numpy.ndarray`,数据类型和维度分别是`np.float64`和`(4,)`
+下面的代码中创建了一个`CartPole-v0`的环境,执行`env.reset()`方法,可以看到`state`的类型是`numpy.ndarray`,数据类型和维度分别是`np.float64`和`(4,)`。
```python
import gym
diff --git a/docs/reinforcement/docs/source_zh_cn/replaybuffer.md b/docs/reinforcement/docs/source_zh_cn/replaybuffer.md
index 59539169f3..f3117fdf93 100644
--- a/docs/reinforcement/docs/source_zh_cn/replaybuffer.md
+++ b/docs/reinforcement/docs/source_zh_cn/replaybuffer.md
@@ -12,7 +12,7 @@
## MindSpore Reinforcement Learning 的 ReplayBuffer 实现
-一般情况下,算法人员使用原生的Python数据结构或Numpy的数据结构来构造ReplayBuffer, 或者一般的强化学习框架也提供了标准的API封装。不同的是,MindSpore实现了设备端的ReplayBuffer结构,一方面能在使用GPU硬件时减少数据在Host和Device之间的频繁拷贝,另一方面,以MindSpore算子的形式表达ReplayBuffer,可以构建完整的IR图,使能MindSpore GRAPH_MODE的各种图优化,提升整体的性能。
+一般情况下,算法人员使用原生的Python数据结构或Numpy的数据结构来构造ReplayBuffer,或者一般的强化学习框架也提供了标准的API封装。不同的是,MindSpore实现了设备端的ReplayBuffer结构,一方面能在使用GPU硬件时减少数据在Host和Device之间的频繁拷贝,另一方面,以MindSpore算子的形式表达ReplayBuffer,可以构建完整的IR图,使能MindSpore GRAPH_MODE的各种图优化,提升整体的性能。
在MindSpore中,提供了两种ReplayBuffer,分别是UniformReplayBuffer和PriorityReplayBuffer,分别用于常用的FIFO存储和带有优先级的存储。下面以UniformReplayBuffer为例介绍实现及使用。
以一个List的Tensor表示,每个Tensor代表一组按列存储的数据(如一组[state, action, reward])。新放入UniformReplayBuffer中的数据以FIFO的机制进行内容的更新,具有插入、查找、采样等功能。
@@ -65,7 +65,7 @@
### UniformReplayBuffer的创建
-MindSpore Reinforcement Learning 提供了标准的ReplayBuffer API. 用户可以使用配置文件的方式使用框架创建的ReplayBuffer,形如[dqn](https://gitee.com/mindspore/reinforcement/blob/master/mindspore_rl/algorithm/dqn/config.py)的配置文件:
+MindSpore Reinforcement Learning 提供了标准的ReplayBuffer API。用户可以使用配置文件的方式使用框架创建的ReplayBuffer,形如[dqn](https://gitee.com/mindspore/reinforcement/blob/master/mindspore_rl/algorithm/dqn/config.py)的配置文件:
```python
'replay_buffer':
--
Gitee