diff --git a/assignment-2/submission/17307130331/README.md b/assignment-2/submission/17307130331/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..abd8de5834bacc838e1b813905da469a8d9168c3
--- /dev/null
+++ b/assignment-2/submission/17307130331/README.md
@@ -0,0 +1,343 @@
+# Lab Report
+
+陈疏桐 17307130331
+
+In this assignment I used numpy to implement the forward and backward computation of four operators (Matmul, Log, Softmax and Relu), built a classification model from them, and passed the automated tests. I also implemented the `mini_batch` function, trained and evaluated the model on the MNIST dataset with different learning rates and batch sizes, and discuss how both affect training. Finally, I implemented three optimizers, Momentum, RMSProp and Adam, and compared them with plain gradient descent.
+
+## Backward Propagation and Implementation of the Operators
+### Matmul
+
+Matmul is matrix multiplication; in the model it plays the role of a PyTorch linear layer. The forward computation is:
+
+$$ \mathrm{Y} = \mathrm{X}\mathrm{W} $$
+
+where $\mathrm{X}$ is the input matrix of shape $N \times d$, $\mathrm{W}$ is a weight matrix of shape $d \times d'$, and $\mathrm{Y}$ is the output matrix of shape $N \times d'$. The Matmul operator is therefore equivalent to a fully connected linear layer with input dimension $d$ and output dimension $d'$.
+
+Differentiating $\mathrm{Y}$ with respect to each input gives
+
+$$ \frac{\partial \mathrm{Y}}{\partial \mathrm{X}} = \frac{\partial \mathrm{X}\mathrm{W}}{\partial \mathrm{X}} = \mathrm{W}^T$$
+
+$$ \frac{\partial \mathrm{Y}}{\partial \mathrm{W}} = \frac{\partial \mathrm{X}\mathrm{W}}{\partial \mathrm{W}} = \mathrm{X}^T $$
+
+so by the chain rule the backward computation is:
+
+$$ \nabla{\mathrm{X}} = \nabla{\mathrm{Y}} \times \mathrm{W}^T $$
+$$ \nabla{\mathrm{W}} = \mathrm{X}^T \times \nabla{\mathrm{Y}} $$
+
+### Relu
+
+Relu is applied to every element of the input:
+
+$$ \mathrm{Y}_{ij}=
+\begin{cases}
+\mathrm{X}_{ij} & \mathrm{X}_{ij} \ge 0 \\\\
+0 & \text{otherwise}
+\end{cases}
+$$
+
+Each output $\mathrm{Y}_{ij}$ depends only on the corresponding input $\mathrm{X}_{ij}$, so the derivative of each input element involves only its own output:
+
+$$ \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}} =
+\begin{cases}
+1 & \mathrm{X}_{ij} \ge 0 \\\\
+0 & \text{otherwise}
+\end{cases}$$
+
+By the chain rule, the gradient of the input is:
+
+$$ \nabla{\mathrm{X}_{ij}} = \nabla{\mathrm{Y}_{ij}} \times \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}}$$
+
+### Log
+
+The Log operator computes:
+
+$$ \mathrm{Y}_{ij} = \log(\mathrm{X}_{ij} + \epsilon)$$
+
+$$ \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}} = \frac{1}{\mathrm{X}_{ij} + \epsilon} $$
+
+and, similarly, the backward computation is:
+
+$$ \nabla{\mathrm{X}_{ij}} = \nabla{\mathrm{Y}_{ij}} \times \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}}$$
+
+### Softmax
+
+Softmax is computed over the last dimension of the input $\mathrm{X}$. The forward computation is:
+
+$$ \mathrm{Y}_{ij} = \frac{e^{\mathrm{X}_{ij}}}{\sum_{k} e^{\mathrm{X}_{ik}}}$$
+
+Each row of the output is computed independently of the other rows, while within one row every output depends on every input of that row. Taking row $k$ as an example, the derivative of an output element with respect to an input element is:
+
+$$\frac{\partial Y_{ki}}{\partial X_{kj}} = \begin{cases}
+\frac{e^{X_{kj}} \sum_{t \ne j} e^{X_{kt}}}{(\sum_{t} e^{X_{kt}})^2} = Y_{kj}(1-Y_{kj}) & i = j \\\\
+-\frac{e^{X_{ki}} e^{X_{kj}}}{(\sum_t e^{X_{kt}})^2} = -Y_{ki} Y_{kj} & i \ne j
+\end{cases}$$
+
+This gives, for every row, the Jacobian matrix $\mathrm{J}_{k}$ between the output row $\mathrm{Y}_{k}$ and the input row $\mathrm{X}_{k}$, with $(\mathrm{J}_{k})_{ij} = \frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}}$.
+
+The gradient flowing into an input element $\mathrm{X}_{kj}$ collects the contributions of every output element of the same row, so by the chain rule:
+
+$$ \nabla \mathrm{X}_{kj} = \sum_{i} \frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}} \, \nabla \mathrm{Y}_{ki}$$
+
+which, written per row (using the fact that $\mathrm{J}_{k}$ is symmetric), is:
+
+$$ \nabla \mathrm{X}_{k} = \mathrm{J}_{k} \times \nabla \mathrm{Y}_{k} $$
+
+In the implementation, numpy's `matmul` multiplies over the last two dimensions, so the per-row Jacobians can be stacked and multiplied with the corresponding gradient rows in a single batched call.
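+
+As a minimal illustration (the helper names below exist only for this sketch), the backward rule can be written as a batched Jacobian product, which is the approach taken by the `Softmax` operator in `numpy_fnn.py`:
+
+```
+import numpy as np
+
+def softmax_forward(x):
+    # softmax over the last dimension, computed row by row
+    exp_x = np.exp(x)
+    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
+
+def softmax_backward(out, grad_y):
+    # out: forward output of shape (N, c); grad_y: upstream gradient, same shape
+    # per-row Jacobian: J[k] = diag(out[k]) - outer(out[k], out[k])
+    jacob = np.array([np.diag(r) - np.outer(r, r) for r in out])  # (N, c, c)
+    # batched (1, c) x (c, c) product over the leading dimension
+    return np.matmul(grad_y[:, np.newaxis, :], jacob).squeeze(1)  # (N, c)
+```
+
+Because each row's Jacobian is symmetric, multiplying the gradient row from the left gives the same result as $\mathrm{J}_{k} \times \nabla \mathrm{Y}_{k}$.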
+
+## Model Construction and Training
+### Model Construction
+
+Following `TorchModel` in `torch_mnist.py`, the numpy model is built by simply replacing the PyTorch operators with the ones implemented above:
+```
+def forward(self, x):
+    x = x.reshape(-1, 28 * 28)
+
+    x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
+    x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
+
+    x = self.matmul_3.forward(x, self.W3)
+
+    x = self.softmax.forward(x)
+    x = self.log.forward(x)
+
+    return x
+```
+
+The computation graph of the model is:
+![compu_graph](img/compu_graph.png)
+
+Applying the chain rule along this graph gives the gradient of every leaf variable ($\mathrm{W}_{1}, \mathrm{W}_{2}, \mathrm{W}_{3}, \mathrm{X}$) as well as of the intermediate variables.
+
+The backward computation graph is:
+![backpropagation](img/backgraph.png)
+
+The gradients are computed by following this graph:
+```
+def backward(self, y):
+    self.log_grad = self.log.backward(y)
+    self.softmax_grad = self.softmax.backward(self.log_grad)
+    self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+    self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+    self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+    self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+    self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+```
+
+### MiniBatch
+
+The `mini_batch` function provided in `utils` simply calls PyTorch's `DataLoader`. `DataLoader` is responsible for reading samples from a dataset and assembling them into batches. Using it directly makes it easy to prefetch data in parallel with multiple workers, which speeds up training and saves code. `DataLoader` also accepts a custom `Sampler` to draw samples from the dataset in different ways, and a custom `BatchSampler` to group the drawn samples into batches; this enables operations such as zero-padding samples within a batch or controlling the ratio of positive and negative samples per batch.
+
+Here we reimplement `mini_batch` to mimic the default behaviour of `DataLoader`:
+```
+def mini_batch(dataset, batch_size=128):
+    data = np.array([each[0].numpy() for each in dataset])  # convert the dataset to numpy arrays first
+    label = np.array([each[1] for each in dataset])
+
+    data_size = data.shape[0]
+    idx = np.array([i for i in range(data_size)])
+    np.random.shuffle(idx)  # shuffle the sample order
+
+    # equivalent to DataLoader's BatchSampler, but all batches are built at once
+    return [(data[idx[i: i+batch_size]], label[idx[i:i+batch_size]])
+            for i in range(0, data_size, batch_size)]
+```
+
+### Model Training
+
+The model is trained with `epoch=10`, `learning_rate=0.1` and `batch_size=128`. Each step fits one batch of data: the forward pass computes the output, the loss is computed from the output, `loss.backward` returns the derivative of the loss with respect to the model output (i.e. the gradient at the output), the model's `backward` then backpropagates it through the network, and finally the model's `optimize` updates the parameters.
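+
+A condensed sketch of this training loop, using the `mini_batch` defined above and the helper functions from this submission (`numpy_mnist.py` below follows the same structure, with an Adam optimizer in place of `optimize`):
+
+```
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, get_torch_initialization, one_hot
+
+train_dataset, test_dataset = download_mnist()
+model = NumpyModel()
+numpy_loss = NumpyLoss()
+model.W1, model.W2, model.W3 = get_torch_initialization()
+
+for epoch in range(10):
+    for x, y in mini_batch(train_dataset, batch_size=128):
+        y = one_hot(y)
+        y_pred = model.forward(x)               # forward pass
+        loss = numpy_loss.get_loss(y_pred, y)   # cross-entropy on the log-softmax output
+        model.backward(numpy_loss.backward())   # gradient w.r.t. the output, then backprop
+        model.optimize(0.1)                     # plain gradient-descent update
+```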
+
+The training curve:
+![train10](img/train10.png)
+
+Test accuracy per epoch:
+```
+[0] Test Accuracy: 0.9437
+[1] Test Accuracy: 0.9651
+[2] Test Accuracy: 0.9684
+[3] Test Accuracy: 0.9730
+[4] Test Accuracy: 0.9755
+[5] Test Accuracy: 0.9775
+[6] Test Accuracy: 0.9778
+[7] Test Accuracy: 0.9766
+[8] Test Accuracy: 0.9768
+[9] Test Accuracy: 0.9781
+```
+
+Raising `learning_rate` to 0.2 and retraining:
+![train02](img/train02.png)
+
+Test accuracy per epoch:
+```
+[0] Test Accuracy: 0.9621
+[1] Test Accuracy: 0.9703
+[2] Test Accuracy: 0.9753
+[3] Test Accuracy: 0.9740
+[4] Test Accuracy: 0.9787
+[5] Test Accuracy: 0.9756
+[6] Test Accuracy: 0.9807
+[7] Test Accuracy: 0.9795
+[8] Test Accuracy: 0.9814
+[9] Test Accuracy: 0.9825
+```
+
+With this slightly larger learning rate the parameter updates are larger early in training, so the loss falls faster and the model converges earlier; for the same number of iterations the test accuracy is higher.
+
+Raising `learning_rate` to 0.3 and retraining:
+![train03](img/train03.png)
+
+```
+[0] Test Accuracy: 0.9554
+[1] Test Accuracy: 0.9715
+[2] Test Accuracy: 0.9744
+[3] Test Accuracy: 0.9756
+[4] Test Accuracy: 0.9782
+[5] Test Accuracy: 0.9795
+[6] Test Accuracy: 0.9801
+[7] Test Accuracy: 0.9816
+[8] Test Accuracy: 0.9828
+[9] Test Accuracy: 0.9778
+```
+
+At 0.3 the loss falls about as fast as in the previous run early on, but later the overly large steps make the weights move around the local minimum with too large an amplitude instead of settling into it: the loss oscillates and the model struggles to converge. The test accuracy first climbs to 0.9828 and then drops.
+
+For a batch size of 128, 0.2 therefore appears to be a suitable learning rate.
+
+Next, keeping the learning rate at 0.2 and changing `batch_size` to 256:
+![train256](img/train256.png)
+```
+[0] Test Accuracy: 0.9453
+[1] Test Accuracy: 0.9621
+[2] Test Accuracy: 0.9657
+[3] Test Accuracy: 0.9629
+[4] Test Accuracy: 0.9733
+[5] Test Accuracy: 0.9766
+[6] Test Accuracy: 0.9721
+[7] Test Accuracy: 0.9768
+[8] Test Accuracy: 0.9724
+[9] Test Accuracy: 0.9775
+```
+
+With a larger batch size the parameters are still updated once per batch, so updates happen less often and convergence slows down somewhat; comparing the loss curve with the previous runs, however, the oscillation is smaller.
+
+Reducing `batch_size` to 64:
+![train64](img/train64.png)
+```
+[0] Test Accuracy: 0.9526
+[1] Test Accuracy: 0.9674
+[2] Test Accuracy: 0.9719
+[3] Test Accuracy: 0.9759
+[4] Test Accuracy: 0.9750
+[5] Test Accuracy: 0.9748
+[6] Test Accuracy: 0.9772
+[7] Test Accuracy: 0.9791
+[8] Test Accuracy: 0.9820
+[9] Test Accuracy: 0.9823
+```
+
+The loss now falls faster, but its oscillation grows.
+
+In summary: within a certain range, a larger learning rate speeds up convergence, and a smaller batch size also speeds it up somewhat at the cost of larger oscillation. A learning rate that is too large makes the loss oscillate late in training and prevents convergence, while one that is too small makes the loss fall too slowly and may even leave the model stuck in a local minimum, missing a better one.
+
+## Other Optimization Methods
+
+### Momentum
+
+With plain gradient descent every update depends only on the gradient of the current batch, so the update direction can be swayed by a few unusual inputs. Momentum introduces a velocity term so that the current update depends not only on the current gradient but also on earlier ones, preserving the recent trend for a while. The update rule is:
+
+$$
+\begin{align}
+& v = \alpha v - \gamma \frac{\partial L}{\partial W} \\\\
+& W = W + v
+\end{align}
+$$
+
+We implement Momentum in the model in `numpy_fnn.py`. Setting the learning rate to 0.02 and the batch size to 128:
+![momentum](img/momentum.png)
+```
+[0] Test Accuracy: 0.9586
+[1] Test Accuracy: 0.9717
+[2] Test Accuracy: 0.9743
+[3] Test Accuracy: 0.9769
+[4] Test Accuracy: 0.9778
+[5] Test Accuracy: 0.9786
+[6] Test Accuracy: 0.9782
+[7] Test Accuracy: 0.9809
+[8] Test Accuracy: 0.9790
+[9] Test Accuracy: 0.9818
+```
+
+Compared with plain gradient descent, momentum does not necessarily give a better final result. When the current gradient points in the same direction as the accumulated velocity, the parameters receive a larger adjustment, so the loss falls faster; since the velocity mostly builds up early in training, an overly large learning rate easily makes the updates blow up, which is why a suitable learning rate for momentum is roughly an order of magnitude smaller than for plain gradient descent. Conversely, when the gradient direction is wrong, the accumulated velocity delays the correction and the parameters can overshoot the minimum.
+
+### RMSProp
+
+RMSProp introduces an adaptive learning rate. Early in training the learning rate should be large so that the loss drops quickly, but as the iterations go on it should keep shrinking so that the model can converge well. The basic idea is to adapt the rate according to the gradients: the larger the gradients, the faster the effective learning rate decays; later, when the gradients become small, the decay slows down.
+
+To avoid decaying the learning rate too quickly early on, RMSProp uses an exponential moving average that slowly forgets the older gradient history:
+
+$$
+\begin{align}
+& h = \rho h + (1-\rho) \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W} \\\\
+& W = W - \gamma \frac{1}{\sqrt{h + \delta}} \frac{\partial L}{\partial W}
+\end{align}$$
+
+With a learning rate of 0.001 and `weight_decay` set to 0.01, training and testing again:
+![rmsprop](img/rmsprop.png)
+
+```
+[0] Test Accuracy: 0.9663
+[1] Test Accuracy: 0.9701
+[2] Test Accuracy: 0.9758
+[3] Test Accuracy: 0.9701
+[4] Test Accuracy: 0.9748
+[5] Test Accuracy: 0.9813
+[6] Test Accuracy: 0.9813
+[7] Test Accuracy: 0.9819
+[8] Test Accuracy: 0.9822
+[9] Test Accuracy: 0.9808
+```
+
+In the middle of training the loss oscillates less than with plain gradient descent, and the model converges faster early on, but towards the end there is no clear advantage over plain gradient descent.
+
+### Adam
+
+Adam combines momentum with adaptive learning-rate scaling. It first computes estimates of the first and second moments of the gradient, corresponding to the momentum part and the adaptive part respectively:
+
+$$
+\begin{align}
+& \mathrm{m} = \beta_1 \mathrm{m} + (1-\beta_1) \frac{\partial L}{\partial W} \\\\
+& \mathrm{v} = \beta_2 \mathrm{v} + (1-\beta_2) \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}
+\end{align}
+$$
+
+Both estimates are then bias-corrected:
+
+$$
+\begin{align}
+& \mathrm{\hat{m}} = \frac{\mathrm{m}}{1-\beta_1 ^ t }\\\\
+& \mathrm{\hat{v}} = \frac{\mathrm{v}}{1-\beta_2 ^ t}
+\end{align}
+$$
+
+Finally, the parameters are updated as:
+$$ W = W - \gamma \frac{\mathrm{\hat m}}{\sqrt{\mathrm{\hat v}} + \delta}$$
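+
+A per-parameter sketch of this update; it mirrors the `Adam` helper class defined in `numpy_mnist.py` below:
+
+```
+class Adam:
+    def __init__(self, param, learning_rate=0.001, beta_1=0.9, beta_2=0.999):
+        self.param = param          # the weight matrix, updated in place
+        self.iter = 0               # timestep t, used for bias correction
+        self.m = 0                  # first-moment (momentum) estimate
+        self.v = 0                  # second-moment (adaptive) estimate
+        self.beta1, self.beta2 = beta_1, beta_2
+        self.lr = learning_rate
+
+    def optimize(self, grad):
+        self.iter += 1
+        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
+        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
+        m_hat = self.m / (1 - self.beta1 ** self.iter)   # bias-corrected m
+        v_hat = self.v / (1 - self.beta2 ** self.iter)   # bias-corrected v
+        self.param -= self.lr * m_hat / (v_hat ** 0.5 + 1e-8)
+        return self.param
+```
+
+In `numpy_mnist.py`, one `Adam` instance is created for each of `W1`, `W2` and `W3`, and `optimize` is called with the corresponding gradient after `model.backward`.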
+
+With a learning rate of 0.001 and a batch size of 128, training gives:
+![adam](img/train_adam.png)
+```
+[0] Test Accuracy: 0.9611
+[1] Test Accuracy: 0.9701
+[2] Test Accuracy: 0.9735
+[3] Test Accuracy: 0.9752
+[4] Test Accuracy: 0.9787
+[5] Test Accuracy: 0.9788
+[6] Test Accuracy: 0.9763
+[7] Test Accuracy: 0.9790
+[8] Test Accuracy: 0.9752
+[9] Test Accuracy: 0.9806
+```
+
+Compared with plain gradient descent, the loss oscillation is slightly smaller and the loss falls slightly faster early on, but the final convergence speed is about the same.
\ No newline at end of file
diff --git a/assignment-2/submission/17307130331/img/backgraph.png b/assignment-2/submission/17307130331/img/backgraph.png new file mode 100644 index 0000000000000000000000000000000000000000..c4a70b28e869708641bd01dba83730ed62ab9c4d Binary files /dev/null and b/assignment-2/submission/17307130331/img/backgraph.png differ
diff --git a/assignment-2/submission/17307130331/img/compu_graph.png b/assignment-2/submission/17307130331/img/compu_graph.png new file mode 100644 index 0000000000000000000000000000000000000000..74f02ff1b4c4795c99600fb2e358d23a170f11c1 Binary files /dev/null and b/assignment-2/submission/17307130331/img/compu_graph.png differ
diff --git a/assignment-2/submission/17307130331/img/momentum.png b/assignment-2/submission/17307130331/img/momentum.png new file mode 100644 index 0000000000000000000000000000000000000000..152bfe4eda8bf98cb271e9e3af3801f223273ec2 Binary files /dev/null and b/assignment-2/submission/17307130331/img/momentum.png differ
diff --git a/assignment-2/submission/17307130331/img/rmsprop.png b/assignment-2/submission/17307130331/img/rmsprop.png new file mode 100644 index 0000000000000000000000000000000000000000..d4c9f6d651ea0dcac312c3a7dcb38266a477679c Binary files /dev/null and b/assignment-2/submission/17307130331/img/rmsprop.png differ
diff --git a/assignment-2/submission/17307130331/img/train.png b/assignment-2/submission/17307130331/img/train.png new file mode 100644 index 0000000000000000000000000000000000000000..618816332b78c4f0498444a42dd2a5028df91ef1 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train.png differ
diff --git a/assignment-2/submission/17307130331/img/train02.png b/assignment-2/submission/17307130331/img/train02.png new file mode 100644 index 0000000000000000000000000000000000000000..a2cbc7b9ccbf2f28955902b86881d7a640f50fa7 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train02.png differ
diff --git a/assignment-2/submission/17307130331/img/train03.png b/assignment-2/submission/17307130331/img/train03.png new file mode 100644 index 0000000000000000000000000000000000000000..41dd8fd9060e6774b983375f3b025ee6335b9f66 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train03.png differ
diff --git a/assignment-2/submission/17307130331/img/train10.png b/assignment-2/submission/17307130331/img/train10.png new file mode 100644 index 0000000000000000000000000000000000000000..a2056ba0d21f8f40fc0279e532fd6b9f1ff79cef Binary files /dev/null and b/assignment-2/submission/17307130331/img/train10.png differ
diff --git a/assignment-2/submission/17307130331/img/train256.png b/assignment-2/submission/17307130331/img/train256.png new file mode 100644 index 0000000000000000000000000000000000000000..81aa1b2bcc7f708607f8c402f9f41d579793f9e1 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train256.png differ
diff --git a/assignment-2/submission/17307130331/img/train64.png b/assignment-2/submission/17307130331/img/train64.png new file mode 100644 index 0000000000000000000000000000000000000000..8f34749c6fda428437ff3fe11292b0213eca0d7a Binary files /dev/null and b/assignment-2/submission/17307130331/img/train64.png differ
diff --git 
a/assignment-2/submission/17307130331/img/train_adam.png b/assignment-2/submission/17307130331/img/train_adam.png new file mode 100644 index 0000000000000000000000000000000000000000..eefa8b27deb6485f895033add750f018fd14e293 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train_adam.png differ diff --git a/assignment-2/submission/17307130331/img/trainloss.png b/assignment-2/submission/17307130331/img/trainloss.png new file mode 100644 index 0000000000000000000000000000000000000000..b845297f03d5d6e6ae2b026b25554519a77f471b Binary files /dev/null and b/assignment-2/submission/17307130331/img/trainloss.png differ diff --git a/assignment-2/submission/17307130331/numpy_fnn.py b/assignment-2/submission/17307130331/numpy_fnn.py new file mode 100644 index 0000000000000000000000000000000000000000..7b32d95b7825b4787f5d226ac058c0039aee4bba --- /dev/null +++ b/assignment-2/submission/17307130331/numpy_fnn.py @@ -0,0 +1,208 @@ +import numpy as np + + +class NumpyOp: + + def __init__(self): + self.memory = {} + self.epsilon = 1e-12 + + +class Matmul(NumpyOp): + + def forward(self, x, W): + """ + x: shape(N, d) + w: shape(d, d') + """ + self.memory['x'] = x + self.memory['W'] = W + h = np.matmul(x, W) + return h + + def backward(self, grad_y): + """ + grad_y: shape(N, d') + """ + + #################### + # code 1 # + grad_W = np.matmul(self.memory['x'].T, grad_y) + grad_x = np.matmul(grad_y, self.memory['W'].T) + #################### + + return grad_x, grad_W + + +class Relu(NumpyOp): + + def forward(self, x): + self.memory['x'] = x + return np.where(x > 0, x, np.zeros_like(x)) + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 2 # + #################### + grad_x = np.where(self.memory['x'] > 0, np.ones_like(self.memory['x']), np.zeros_like(self.memory['x'])) * grad_y # 元素乘积 + + return grad_x + + +class Log(NumpyOp): + + def forward(self, x): + """ + x: shape(N, c) + """ + + out = np.log(x + self.epsilon) + self.memory['x'] = x + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 3 # + #################### + grad_x = (1/(self.memory['x'] + self.epsilon)) * grad_y + return grad_x + + +class Softmax(NumpyOp): + """ + softmax over last dimension + """ + + def forward(self, x): + """ + x: shape(N, c) + """ + + #################### + # code 4 # + #################### + exp_x = np.exp(x) + out = exp_x/np.sum(exp_x, axis=1, keepdims=True) + self.memory['x'] = x + self.memory['out'] = out + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + o = self.memory['out'] + Jacob = np.array([np.diag(r) - np.outer(r, r) for r in o]) + # i!=j - oi* oj + # i==j oi*(1-oi) + grad_y = grad_y[:, np.newaxis, :] + grad_x = np.matmul(grad_y, Jacob).squeeze(1) + #print(grad_x.shape) + #print(grad_x) + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 
中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + # 以下变量是在 momentum\rmsprop中使用的 + self.v1 = np.zeros_like(self.W1) + self.v2 = np.zeros_like(self.W2) + self.v3 = np.zeros_like(self.W3) + + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + x = self.relu_1.forward(self.matmul_1.forward(x, self.W1)) + x = self.relu_2.forward(self.matmul_2.forward(x, self.W2)) + + x = self.matmul_3.forward(x, self.W3) + + x = self.softmax.forward(x) + x = self.log.forward(x) + + return x + + def backward(self, y): + self.log_grad = self.log.backward(y) + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad) + + + def optimize(self, learning_rate): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def momentum(self, learning_rate, alpha=0.9): + self.v1 = self.v1 * alpha - learning_rate * self.W1_grad + self.v2 = self.v2 * alpha - learning_rate * self.W2_grad + self.v3 = self.v3 * alpha - learning_rate * self.W3_grad + + self.W1 += self.v1 + self.W2 += self.v2 + self.W3 += self.v3 + + def RMSProp(self, learning_rate, weight_decay = 0.99): + self.v1 = self.v1 * weight_decay + (1-weight_decay) * self.W1_grad * self.W1_grad + self.v2 = self.v2 * weight_decay + (1-weight_decay) * self.W2_grad * self.W2_grad + self.v3 = self.v3 * weight_decay + (1-weight_decay) * self.W3_grad * self.W3_grad + + self.W1 = self.W1 - learning_rate * self.W1_grad / np.sqrt( self.v1 + 1e-7) + self.W2 = self.W2 - learning_rate * self.W2_grad / np.sqrt( self.v2 + 1e-7) + self.W3 = self.W3 - learning_rate * self.W3_grad / np.sqrt( self.v3 + 1e-7) + + + + + + + \ No newline at end of file diff --git a/assignment-2/submission/17307130331/numpy_mnist.py b/assignment-2/submission/17307130331/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..4187f01eeebbbcd6ab48bfacf8dedc37085e46e2 --- /dev/null +++ b/assignment-2/submission/17307130331/numpy_mnist.py @@ -0,0 +1,70 @@ +import numpy as np +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot + +def mini_batch(dataset, batch_size=128): + data = np.array([each[0].numpy() for each in dataset]) + label = np.array([each[1] for each in dataset]) + + data_size = data.shape[0] + idx = np.array([i for i in range(data_size)]) + np.random.shuffle(idx) + + return [(data[idx[i: i+batch_size]], label[idx[i:i+batch_size]]) for i in range(0, data_size, batch_size)] + +class Adam(): + def __init__(self, param, learning_rate=0.001, beta_1=0.9, beta_2=0.999): + self.param = param + self.iter = 0 + self.m = 0 + self.v = 0 + self.beta1 = beta_1 + self.beta2 = beta_2 + self.lr = learning_rate + def optimize(self, grad): + self.iter+=1 + self.m = self.beta1 * self.m + (1 - self.beta1) * grad + self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad + m_hat = self.m / (1 - self.beta1 ** self.iter) + v_hat = self.v / (1 - self.beta2 ** self.iter) + self.param 
-= self.lr * m_hat / (v_hat ** 0.5 + 1e-8) + return self.param + +def numpy_run(): + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = NumpyLoss() + model.W1, model.W2, model.W3 = get_torch_initialization() + + W1_opt, W2_opt, W3_opt = Adam(model.W1), Adam(model.W2), Adam(model.W3) + + train_loss = [] + + epoch_number = 10 + learning_rate = 0.0015 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, batch_size=128): + y = one_hot(y) + + y_pred = model.forward(x) + loss = numpy_loss.get_loss(y_pred, y) + + model.backward(numpy_loss.backward()) + #model.Adam(learning_rate) + W1_opt.optimize(model.W1_grad) + W2_opt.optimize(model.W2_grad) + W3_opt.optimize(model.W3_grad) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset)[0] + accuracy = np.mean((model.forward(x).argmax(axis=1) == y)) + print('[{}] Test Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + numpy_run() diff --git a/assignment-2/submission/17307130331/tester_demo.py b/assignment-2/submission/17307130331/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..515b86c1240eebad83287461548530c944f23bc8 --- /dev/null +++ b/assignment-2/submission/17307130331/tester_demo.py @@ -0,0 +1,182 @@ +import numpy as np +import torch +from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log + +from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss +from torch_mnist import TorchModel +from utils import get_torch_initialization, one_hot + +err_epsilon = 1e-6 +err_p = 0.4 + + +def check_result(numpy_result, torch_result=None): + if isinstance(numpy_result, list) and torch_result is None: + flag = True + for (n, t) in numpy_result: + flag = flag and check_result(n, t) + return flag + # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item()) + T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item() + direction = T / torch_result.numel() < err_p + return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item() + + +def case_1(): + x = np.random.normal(size=[5, 6]) + W = np.random.normal(size=[6, 4]) + + numpy_matmul = Matmul() + numpy_out = numpy_matmul.forward(x, W) + numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + torch_W = torch.from_numpy(W).clone().requires_grad_() + + torch_out = torch_matmul(torch_x, torch_W) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + (numpy_W_grad, torch_W.grad) + ]) + + +def case_2(): + x = np.random.normal(size=[5, 6]) + + numpy_relu = Relu() + numpy_out = numpy_relu.forward(x) + numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_relu(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def case_3(): + x = np.random.uniform(low=0.0, high=1.0, size=[3, 4]) + + numpy_log = Log() + numpy_out = numpy_log.forward(x) + numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_log(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + + (numpy_x_grad, torch_x.grad), + ]) + + +def case_4(): + x = 
np.random.normal(size=[4, 5]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + + return check_result(numpy_out, torch_out) + + +def case_5(): + x = np.random.normal(size=[20, 25]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def test_model(): + try: + numpy_loss = NumpyLoss() + numpy_model = NumpyModel() + torch_model = TorchModel() + torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False) + numpy_model.W1 = torch_model.W1.detach().clone().numpy() + numpy_model.W2 = torch_model.W2.detach().clone().numpy() + numpy_model.W3 = torch_model.W3.detach().clone().numpy() + + x = torch.randn((10000, 28, 28)) + y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000) + + y = one_hot(y, numpy=False) + x2 = x.numpy() + y_pred = torch_model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + + y_pred_numpy = numpy_model.forward(x2) + numpy_loss.get_loss(y_pred_numpy, y.numpy()) + + check_flag_1 = check_result(y_pred_numpy, y_pred) + print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10)) + except: + print("[Runtime Error in forward]") + print("+ {:12} {}/{}".format("forward", 0, 10)) + return 0 + + try: + + numpy_model.backward(numpy_loss.backward()) + + check_flag_2 = [ + check_result(numpy_model.log_grad, torch_model.log_input.grad), + check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad), + check_result(numpy_model.W3_grad, torch_model.W3.grad), + check_result(numpy_model.W2_grad, torch_model.W2.grad), + check_result(numpy_model.W1_grad, torch_model.W1.grad) + ] + check_flag_2 = sum(check_flag_2) >= 4 + print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20)) + except: + print("[Runtime Error in backward]") + print("+ {:12} {}/{}".format("backward", 0, 20)) + check_flag_2 = False + + return 10 * check_flag_1 + 20 * check_flag_2 + + +if __name__ == "__main__": + testcases = [ + ["matmul", case_1, 5], + ["relu", case_2, 5], + ["log", case_3, 5], + ["softmax_1", case_4, 5], + ["softmax_2", case_5, 10], + ] + score = 0 + for case in testcases: + try: + res = case[2] if case[1]() else 0 + except: + print("[Runtime Error in {}]".format(case[0])) + res = 0 + score += res + print("+ {:12} {}/{}".format(case[0], res, case[2])) + score += test_model() + print("{:14} {}/60".format("FINAL SCORE", score)) diff --git a/assignment-2/submission/17307130331/torch_mnist.py b/assignment-2/submission/17307130331/torch_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..6d3e214c7606e3d43dac4b94554f942508afffb3 --- /dev/null +++ b/assignment-2/submission/17307130331/torch_mnist.py @@ -0,0 +1,73 @@ +import torch +from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve + + +class TorchModel: + + def __init__(self): + self.W1 = torch.randn((28 * 28, 256), requires_grad=True) + self.W2 = torch.randn((256, 64), requires_grad=True) + self.W3 = torch.randn((64, 10), requires_grad=True) + self.softmax_input = None + self.log_input = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + x = torch.relu(torch.matmul(x, self.W1)) + x 
= torch.relu(torch.matmul(x, self.W2)) + x = torch.matmul(x, self.W3) + + self.softmax_input = x + self.softmax_input.retain_grad() + + x = torch.softmax(x, 1) + + self.log_input = x + self.log_input.retain_grad() + + x = torch.log(x) + + return x + + def optimize(self, learning_rate): + with torch.no_grad(): + self.W1 -= learning_rate * self.W1.grad + self.W2 -= learning_rate * self.W2.grad + self.W3 -= learning_rate * self.W3.grad + + self.W1.grad = None + self.W2.grad = None + self.W3.grad = None + + +def torch_run(): + train_dataset, test_dataset = download_mnist() + + model = TorchModel() + model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False) + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, numpy=False): + y = one_hot(y, numpy=False) + + y_pred = model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset, numpy=False)[0] + accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item() + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + torch_run() diff --git a/assignment-2/submission/17307130331/utils.py b/assignment-2/submission/17307130331/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..709220cfa7a924d914ec1c098c505f864bcd4cfc --- /dev/null +++ b/assignment-2/submission/17307130331/utils.py @@ -0,0 +1,71 @@ +import torch +import numpy as np +from matplotlib import pyplot as plt + + +def plot_curve(data): + plt.plot(range(len(data)), data, color='blue') + plt.legend(['loss_value'], loc='upper right') + plt.xlabel('step') + plt.ylabel('value') + plt.show() + + +def download_mnist(): + from torchvision import datasets, transforms + + transform = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean=(0.1307,), std=(0.3081,)) + ]) + + train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True) + test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True) + + return train_dataset, test_dataset + + +def one_hot(y, numpy=True): + if numpy: + y_ = np.zeros((y.shape[0], 10)) + y_[np.arange(y.shape[0], dtype=np.int32), y] = 1 + return y_ + else: + y_ = torch.zeros((y.shape[0], 10)) + y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1 + return y_ + + +def batch(dataset, numpy=True): + data = [] + label = [] + for each in dataset: + data.append(each[0]) + label.append(each[1]) + data = torch.stack(data) + label = torch.LongTensor(label) + if numpy: + return [(data.numpy(), label.numpy())] + else: + return [(data, label)] + + +def mini_batch(dataset, batch_size=128, numpy=False): + return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) + + +def get_torch_initialization(numpy=True): + fc1 = torch.nn.Linear(28 * 28, 256) + fc2 = torch.nn.Linear(256, 64) + fc3 = torch.nn.Linear(64, 10) + + if numpy: + W1 = fc1.weight.T.detach().clone().numpy() + W2 = fc2.weight.T.detach().clone().numpy() + W3 = fc3.weight.T.detach().clone().numpy() + else: + W1 = fc1.weight.T.detach().clone().data + W2 = fc2.weight.T.detach().clone().data + W3 = fc3.weight.T.detach().clone().data + + return W1, W2, W3