diff --git a/assignment-2/submission/18307130341/README.md b/assignment-2/submission/18307130341/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a859dde38f467fbbb852f830eda2ded657e48952
--- /dev/null
+++ b/assignment-2/submission/18307130341/README.md
@@ -0,0 +1,441 @@
+# Lab Report: Assignment 2, Topic 1 - FNN
+
+18307130341 黄韵澄
+
+[toc]
+
+### 1. Overview
+
+Implement a feed-forward neural network (FNN) and train it on the MNIST dataset to solve the handwritten-digit classification problem.
+
+### 2. Forward and backward passes of the operators
+
+The network is built from the Matmul, Relu, Log, and Softmax operators.
+
+#### 2.1 Matmul backward pass
+
+$$
+loss = f(X\times W) = f(Y)
+$$
+
+By the chain rule:
+$$
+\frac{\partial loss}{\partial X_{p,q}} = \sum_{i,j}{\frac{\partial loss}{\partial Y_{i,j}}\frac{\partial Y_{i,j}}{\partial X_{p,q}}}
+$$
+By the definition of matrix multiplication:
+$$
+Y_{i,j} = \sum_{k}{X_{i,k}W_{k,j}}
+$$
+Therefore, when $i\neq p$, $Y_{i,j}$ does not depend on $X_{p,q}$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{p,q}} =\begin{cases}W_{q,j}\quad i=p \\\\ 0 \quad i\neq p\end{cases}
+$$
+Substituting back:
+$$
+\frac{\partial loss}{\partial X_{p,q}} = \sum_{i,j}\frac{\partial loss}{\partial Y_{i,j}}\frac{\partial Y_{i,j}}{\partial X_{p,q}}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}\frac{\partial Y_{p,j}}{\partial X_{p,q}}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}W_{q,j}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}W_{j,q}^{T}
+$$
+
+Hence:
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}W^{T}
+$$
+Similarly:
+$$
+\frac{\partial loss}{\partial W} = X^{T}\frac{\partial loss}{\partial Y}
+$$
+
+#### 2.2 Relu backward pass
+
+$$
+loss = f(Y) = f(Relu(X))
+$$
+
+where:
+$$
+Relu(x) = \begin{cases}0 \quad x < 0 \\\\ x \quad x\geq 0\end{cases}
+$$
+Then:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = \begin{cases}1 \quad i=k\quad and\quad j=l\quad and\quad X_{k,l}>0 \\\\
+0\quad else \end{cases}
+$$
+By the chain rule (elementwise):
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}\frac{\partial Y}{\partial X}
+$$
+Code:
+
+```python
+grad_x = grad_y * np.where(x > 0, 1, 0)
+```
+
+#### 2.3 Log backward pass
+
+$$
+loss = f(Y) = f(ln(X))
+$$
+
+where:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = \begin{cases}\frac{1}{X_{k,l}} \quad i=k\quad and\quad j=l\quad \\\\
+0\quad else \end{cases}
+$$
+By the chain rule (elementwise division):
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}\cdot \frac{\partial Y}{\partial X}=\frac{\partial loss}{\partial Y}\cdot \frac{1}{X}
+$$
+Code:
+
+```python
+grad_x = grad_y / (x + self.epsilon)
+```
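+
+The derivations in 2.1-2.3 can be spot-checked against torch autograd, which is essentially what the provided `tester_demo.py` does. Below is a minimal sketch for the Log operator, assuming it is run from the submission directory so that `numpy_fnn` is importable:
+
+```python
+import numpy as np
+import torch
+from numpy_fnn import Log
+
+# Compare the numpy Log backward pass with torch autograd
+# (same pattern as case_3 in tester_demo.py).
+x = np.random.uniform(low=0.1, high=1.0, size=[3, 4])
+
+numpy_log = Log()
+numpy_out = numpy_log.forward(x)
+numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out))
+
+torch_x = torch.from_numpy(x).clone().requires_grad_()
+torch.log(torch_x).sum().backward()
+
+# The difference should be tiny (up to the epsilon added in Log.forward).
+print(np.abs(numpy_x_grad - torch_x.grad.numpy()).max())
+```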
+
+#### 2.4 Softmax forward and backward passes
+
+Forward pass (each row is normalized independently):
+
+$$
+loss = f(Y) \\
+Y_{i,j} = \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}
+$$
+Backward pass:
+
+(1) When $j=l$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{i,l}} = \frac{\partial Y_{i,j}}{\partial X_{i,j}}=\frac{\partial \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}}{\partial X_{i,j}} = \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}-(\frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}})^{2} = Y_{i,j}-Y_{i,j}^2 \\
+$$
+(2) When $j \neq l$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{i,l}} = \frac{\partial \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}}{\partial X_{i,l}} = -\frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}\cdot \frac{e^{X_{i,l}}}{\sum_{k}e^{X_{i,k}}} = -Y_{i,j}\cdot Y_{i,l} \\
+$$
+(3) When $i \neq k$, different rows are independent, so the gradient is 0:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = 0
+$$
+By the chain rule (here $\cdot$ denotes elementwise multiplication):
+$$
+\frac{\partial loss}{\partial X_{k,l}} = \sum_{j}\frac{\partial loss}{\partial Y_{k,j}}\cdot \frac{\partial Y_{k,j}}{\partial X_{k,l}} = (\sum_{j}-\frac{\partial loss}{\partial Y_{k,j}}\cdot Y_{k,j}\cdot Y_{k,l})+ \frac{\partial loss}{\partial Y_{k,l}}\cdot Y_{k,l} \\\\
+=Y_{k,l}\cdot( \frac{\partial loss}{\partial Y_{k,l}}-\sum_{j}\frac{\partial loss}{\partial Y_{k,j}}\cdot Y_{k,j})
+$$
+After simplifying to the expression above, the backward pass can be written in a single line of numpy:
+
+```python
+grad_x = y * (grad_y - (y * grad_y).sum(axis = 1).reshape(len(y),1))
+```
+
+### 3. Building the FNN model
+
+#### 3.1 FNN model
+
+The FNN is structured as shown below:
+
+
+
+- Input layer ($N\times28^2$), fully connected to the next layer with weights $W_1$ ($28^2\times256$).
+- Hidden layer 1 ($N\times256$, Relu activation), fully connected to the next layer with weights $W_2$ ($256\times64$).
+- Hidden layer 2 ($N\times64$, Relu activation), fully connected to the next layer with weights $W_3$ ($64\times10$).
+- Hidden layer 3 ($N\times10$, Softmax activation), passed directly to the next layer.
+- Output layer ($N\times10$, Log activation).
+
+In formulas (the shapes above imply multiplying the activations by the weights on the right):
+$$
+a^{(0)} = X \\\\
+z^{(1)} = a^{(0)}\times W_1 ,\quad a^{(1)} = Relu(z^{(1)}) \\\\
+z^{(2)} = a^{(1)}\times W_2 ,\quad a^{(2)} = Relu(z^{(2)}) \\\\
+z^{(3)} = a^{(2)}\times W_3 ,\quad a^{(3)} = Softmax(z^{(3)}) \\\\
+z^{(4)} = a^{(3)},\quad a^{(4)} = Log(z^{(4)}) \\\\
+Y = a^{(4)}
+$$
+
+The loss is the cross-entropy between the one-hot target $\hat Y$ and the log-probability output $Y$, averaged over the batch:
+$$
+loss = \frac{1}{N}\sum_{i}\sum_{j}-\hat Y_{i,j}\cdot Y_{i,j}
+$$
+
+#### 3.2 FNN backward pass
+
+Backpropagation applies the chain rule through the model in reverse order; the gradient computed by each operator is fed into the operator before it:
+
+```python
+self.log_grad = self.log.backward(y)
+
+self.softmax_grad = self.softmax.backward(self.log_grad)
+self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+
+self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+
+self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+```
+
+#### 3.3 Test results
+
+Running `numpy_mnist.py` downloads the MNIST handwritten-digit dataset automatically and runs the test.
+
+Loss curve:
+
+
+
+Model accuracy:
+
+```
+[0] Accuracy: 0.9350
+[1] Accuracy: 0.9674
+[2] Accuracy: 0.9693
+```
+
+The numpy run uses only 3 epochs, yet the accuracy is already fairly high. However, the loss curve still fluctuates visibly and has not converged; more epochs and a tuned learning rate would be needed for the loss to settle.
+
+### 4. Implementation of the mini_batch function
+
+#### 4.1 The mini_batch function
+
+The `mini_batch` function splits `dataset` into batches of size `batch_size`.
+
+Steps:
+
+- Extract the data and labels from `dataset` with a for loop and convert them to ndarrays:
+
+```python
+data = []
+label = []
+for x in dataset:
+    data.append(np.array(x[0]))
+    label.append(x[1])
+data = np.array(data)
+label = np.array(label)
+```
+
+- There are many ways to shuffle; `np.random.permutation` is used here:
+
+```python
+idx = np.random.permutation(len(dataset))
+data = data[idx]
+label = label[idx]
+```
+
+- Split data and label with `np.split`. Since `np.split` requires equal-sized chunks, the remainder left when the dataset size is not divisible by `batch_size` is handled as a separate final batch (the slice must start at `split_pos`, otherwise one sample is dropped):
+
+```python
+split_num = len(dataset) // batch_size  # number of full batches
+split_pos = split_num * batch_size      # end position of the full batches
+# split the data
+ret_data = np.split(data[:split_pos], split_num)
+ret_data.append(data[split_pos:])
+# split the labels
+ret_label = np.split(label[:split_pos], split_num)
+ret_label.append(label[split_pos:])
+```
+
+- Finally, `zip` combines data and labels into tuples:
+
+```python
+ret = list(zip(ret_data, ret_label))
+```
+
+#### 4.2 Testing the mini_batch function
+
+With the torch-based mini_batch:
+
+```
+[0] Accuracy: 0.9473
+[1] Accuracy: 0.9648
+[2] Accuracy: 0.9680
+time = 73.32 s
+```
+
+With the numpy-only mini_batch:
+
+```
+[0] Accuracy: 0.9474
+[1] Accuracy: 0.9556
+[2] Accuracy: 0.9678
+time = 66.24 s
+```
+
+In principle this has no effect on accuracy; the numpy version runs about 7 s faster than the torch version.
+
+### 5. Optimization methods
+
+#### 5.1 Momentum
+
+Momentum is also known as momentum gradient descent. Plain gradient descent has the following problem:
+
+> Gradient descent oscillates perpendicular to the descent direction; because of this oscillation only a small learning rate can be used, otherwise the oscillation grows. With momentum the oscillations are averaged out and nearly cancel, so a somewhat larger learning rate can be used to speed up training.
+
+Momentum update:
+$$
+V_{dW}= \beta \cdot V_{dW} + (1-\beta)\cdot dW \\\\
+W = W - \alpha \cdot V_{dW}
+$$
+where $\alpha$ is the learning rate and $\beta$ the momentum coefficient; $\beta = 0.9$ in the experiments.
+
+Training with plain gradient descent (green) and with Momentum (blue), the accuracy-vs-epoch curves are:
+
+
+
+Momentum learns more slowly than the plain method in the first epochs, but as momentum accumulates its accuracy soon exceeds the baseline and converges to a higher final value.
+
+#### 5.2 Adam
+
+Adam is essentially a combination of RMSProp and Momentum:
+
+> **Root Mean Square Propagation (RMSProp)**: maintains a per-parameter learning rate that is adapted according to a running average of recent squared gradients, which makes the algorithm work well on online and non-stationary problems.
+
+Adam update:
+$$
+V_{dW} = \beta_1\cdot V_{dW} + (1-\beta_1)\cdot dW \\\\
+V_{dW}^{corrected} = \frac{V_{dW}}{1-\beta_1^t} \\\\
+S_{dW} = \beta_2\cdot S_{dW} + (1-\beta_2)\cdot dW^2 \\\\
+S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t} \\\\
+W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}
+$$
+where $\alpha$ is the learning rate, $\beta_1$ the momentum coefficient, and $\beta_2$ the coefficient of the adaptive second moment.
+
+Using the typical settings $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and running 20 epochs, the results are shown below (compared with the plain optimizer and Momentum):
+
+
+
+Adam's loss fluctuates noticeably and has not converged after 20 epochs. Comparing the per-epoch accuracy of the three optimizers (green: plain, blue: Momentum, purple: Adam):
+
+
+
+With this small number of epochs Adam does not reach higher accuracy, and its accuracy is still rising when training stops, probably because its lower learning rate slows convergence. Momentum is a better fit for this model.
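+
+For reference, a minimal self-contained sketch of this update. It mirrors `optimize_Adam` in `numpy_fnn.py`, except that here `t` counts individual update steps, whereas `numpy_mnist.py` accumulates the powers $\beta_1^t,\beta_2^t$ once per epoch and passes them in:
+
+```python
+import numpy as np
+
+def adam_step(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
+    # One Adam update for a single weight matrix, following the formulas above.
+    v = beta1 * v + (1 - beta1) * dW        # first-moment estimate
+    s = beta2 * s + (1 - beta2) * dW ** 2   # second-moment estimate
+    v_corr = v / (1 - beta1 ** t)           # bias correction
+    s_corr = s / (1 - beta2 ** t)
+    W = W - lr * v_corr / (np.sqrt(s_corr) + eps)
+    return W, v, s
+
+# Toy usage: minimize ||W||^2, whose gradient is 2W.
+W = np.ones((2, 2))
+v, s = np.zeros_like(W), np.zeros_like(W)
+for t in range(1, 101):
+    W, v, s = adam_step(W, 2 * W, v, s, t)
+```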
+
+### 6. Weight initialization
+
+The weights cannot all be initialized to 0: if every parameter is 0, all hidden units compute the same activation in the first forward pass, so deeper neurons cannot be distinguished from one another. This is known as the symmetric-weights problem.
+
+The straightforward fix is to give every parameter a random initial value. However, if the initial values are too small, the inputs to the neurons are tiny and the signal vanishes after a few layers; if they are too large, the activations quickly saturate, the gradients approach 0, and training is again difficult.
+
+In general, the initialization range should therefore be chosen according to the properties of the layer.
+
+The Xavier and Kaiming initialization schemes are described below.
+
+#### 6.1 Xavier initialization
+
+Xavier Glorot observed in his paper that the variance of the activations shrinks layer by layer, which makes the gradients in backpropagation shrink layer by layer as well. To avoid vanishing gradients, the decay of the activation variance must be avoided; ideally the outputs (activations) of every layer keep the same distribution.
+
+To approximate this ideal, Xavier initialization comes in a uniform and a normal variant; the parameters of the distribution are computed from a gain value.
+
+`torch.nn.init.calculate_gain(nonlinearity, param=None)` computes the gain for each nonlinearity:
+
+
+
+Uniform initialization $U(-a,a)$, where:
+$$
+a = gain\times \sqrt{\frac{6}{fan\_in+fan\_out}}
+$$
+
+Normal initialization $N(0,std^2)$, where:
+$$
+std = gain\times \sqrt{\frac{2}{fan\_in+fan\_out}}
+$$
+
+$fan\_in$ and $fan\_out$ are the input and output sizes of the layer.
+
+#### 6.2 Kaiming initialization
+
+> The limitation of Xavier initialization is that its derivation assumes a linear activation, while deep networks need nonlinear activations to build complex nonlinear systems; today relu is the most common choice. Kaiming He therefore proposed Kaiming initialization for relu in his paper.
+>
+> Because relu discards values below 0, roughly half of the values of zero-mean data are cut off; the mean of the output then grows, and the assumption $E(x)=mean=0$ used in the Xavier derivation no longer holds. Re-deriving under this condition gives the new rescale factor $\sqrt{2/n}$.
+
+Uniform initialization $U(-bound,bound)$, where:
+$$
+bound = \sqrt{\frac{6}{(1+a^2)\times fan\_in}}
+$$
+Normal initialization $N(0,std^2)$, where:
+$$
+std = \sqrt{\frac{2}{(1+a^2)\times fan\_in}}
+$$
+$a$ is a configurable parameter (the negative slope of the leaky relu), and $fan\_in$ is the input size.
+
+#### 6.3 Implementing the initialization function
+
+First, look at which initialization `get_torch_initialization` actually uses:
+
+```python
+fc1 = torch.nn.Linear(28 * 28, 256)
+W1 = fc1.weight.T.detach().clone().data
+```
+
+It creates a `Linear` layer and takes its weight directly. In the definition of the `Linear` class:
+
+```python
+init.kaiming_uniform_(self.weight, a=math.sqrt(5))
+```
+
+So Kaiming uniform initialization is used, with $a$ set to $\sqrt 5$.
+
+```python
+def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
+```
+
+Since `nonlinearity` is not passed, it keeps its default value `leaky_relu`.
+
+```python
+gain = calculate_gain(nonlinearity, a)
+```
+
+The `a` passed in is the `negative_slope` of the leaky relu.
+
+```python
+gain = math.sqrt(2.0 / (1 + a ** 2))
+std = gain / math.sqrt(fan)
+bound = math.sqrt(3.0) * std
+return tensor.uniform_(-bound, bound)
+```
+
+Reading this code confirms that the tensor is filled with Kaiming uniform initialization. With $a=\sqrt 5$, $gain = \sqrt{2/6} = 1/\sqrt 3$, so $bound = \sqrt 3\cdot gain/\sqrt{fan\_in} = 1/\sqrt{fan\_in}$, which is exactly the bound used below.
+
+The final numpy implementation of Kaiming uniform initialization with $a=\sqrt 5$:
+
+```python
+def get_torch_initialization_numpy(numpy=True):
+    fan_in_1 = 28 * 28
+    fan_in_2 = 256
+    fan_in_3 = 64
+
+    bound1 = 1 / np.sqrt(fan_in_1)  # bound1 = np.sqrt(6) / np.sqrt(1+np.sqrt(5)**2) / np.sqrt(fan_in_1)
+    bound2 = 1 / np.sqrt(fan_in_2)
+    bound3 = 1 / np.sqrt(fan_in_3)
+
+    W1 = np.random.uniform(-bound1, bound1, (28*28, 256))
+    W2 = np.random.uniform(-bound2, bound2, (256, 64))
+    W3 = np.random.uniform(-bound3, bound3, (64, 10))
+
+    if numpy == False:
+        W1 = torch.Tensor(W1)
+        W2 = torch.Tensor(W2)
+        W3 = torch.Tensor(W3)
+
+    return W1, W2, W3
+```
+
+Results of `torch_mnist.py` with this initialization:
+
+
+
+```
+[0] Accuracy: 0.9503
+[1] Accuracy: 0.9639
+[2] Accuracy: 0.9711
+```
+
+### 7. Description of the submitted code
+
+- 
`numpy_fnn.py`:算子和FNN模型正向传播和反向传播的实现。`optimize_Momentum`方法实现Momentum优化,`optimize_Adam`方法实现Adam优化。可在`numpy_mnist.py`中修改optimize的调用改变优化方法。 +- `numpy_mnist.py`:`mini_batch_numpy`方法用numpy实现了mini_batch。 +- `utils.py`:`get_torch_initialization_numpy`方法用numpy实现了均匀分布的kaiming初始化。 + +### 8.参考文献 + +[1] [神经网络常见优化算法(Momentum, RMSprop, Adam)的原理及公式理解, 学习率衰减](https://blog.csdn.net/weixin_42561002/article/details/88036777) + +[2] [深度之眼【Pytorch】-Xavier、Kaiming初始化(附keras实现)](https://blog.csdn.net/weixin_42147780/article/details/103238195) + diff --git a/assignment-2/submission/18307130341/img/Fig1.png b/assignment-2/submission/18307130341/img/Fig1.png new file mode 100644 index 0000000000000000000000000000000000000000..50b42797b50f8b745d7707a86e2644d84843d228 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig1.png differ diff --git a/assignment-2/submission/18307130341/img/Fig2.png b/assignment-2/submission/18307130341/img/Fig2.png new file mode 100644 index 0000000000000000000000000000000000000000..f5dd7bdc2c2712953ddd2b990232d3a7a71b655b Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig2.png differ diff --git a/assignment-2/submission/18307130341/img/Fig3.png b/assignment-2/submission/18307130341/img/Fig3.png new file mode 100644 index 0000000000000000000000000000000000000000..c440ff99663169e8b636ed9e3fc8b7cbdb6008f1 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig3.png differ diff --git a/assignment-2/submission/18307130341/img/Fig4.png b/assignment-2/submission/18307130341/img/Fig4.png new file mode 100644 index 0000000000000000000000000000000000000000..05196c545d1d6da186cd8e301eff8fec10110060 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig4.png differ diff --git a/assignment-2/submission/18307130341/img/Fig5.png b/assignment-2/submission/18307130341/img/Fig5.png new file mode 100644 index 0000000000000000000000000000000000000000..31658f4aa8db641225bf56c4ef54fb8c079d7ae2 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig5.png differ diff --git a/assignment-2/submission/18307130341/img/Fig6.png b/assignment-2/submission/18307130341/img/Fig6.png new file mode 100644 index 0000000000000000000000000000000000000000..54721b598f39cfb996bfde8986077ca64836eb76 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig6.png differ diff --git a/assignment-2/submission/18307130341/img/Fig7.png b/assignment-2/submission/18307130341/img/Fig7.png new file mode 100644 index 0000000000000000000000000000000000000000..1a3f1b1c91d8767838bd464ad291da558006c941 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig7.png differ diff --git a/assignment-2/submission/18307130341/img/Fig8.png b/assignment-2/submission/18307130341/img/Fig8.png new file mode 100644 index 0000000000000000000000000000000000000000..081717ced38314a1b250daf2f527bce99313bc71 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig8.png differ diff --git a/assignment-2/submission/18307130341/numpy_fnn.py b/assignment-2/submission/18307130341/numpy_fnn.py new file mode 100644 index 0000000000000000000000000000000000000000..bacac809951ecb357664479fa2f8f69e956fd8b8 --- /dev/null +++ b/assignment-2/submission/18307130341/numpy_fnn.py @@ -0,0 +1,240 @@ +import numpy as np + + +class NumpyOp: + + def __init__(self): + self.memory = {} + self.epsilon = 1e-12 + + +class Matmul(NumpyOp): + + def forward(self, x, W): + """ + x: shape(N, d) + w: shape(d, d') + """ + self.memory['x'] = x + self.memory['W'] = 
W + h = np.matmul(x, W) + return h + + def backward(self, grad_y): + """ + grad_y: shape(N, d') + """ + + #################### + # code 1 # + #################### + xT = np.transpose(self.memory['x']) + WT = np.transpose(self.memory['W']) + + grad_x = np.matmul(grad_y, WT) + grad_W = np.matmul(xT, grad_y) + + return grad_x, grad_W + + +class Relu(NumpyOp): + + def forward(self, x): + self.memory['x'] = x + return np.where(x > 0, x, np.zeros_like(x)) + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 2 # + #################### + x = self.memory['x'] + grad_x = grad_y * np.where(x > 0, 1, 0) + + return grad_x + + +class Log(NumpyOp): + + def forward(self, x): + """ + x: shape(N, c) + """ + + out = np.log(x + self.epsilon) + self.memory['x'] = x + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 3 # + #################### + x = self.memory['x'] + grad_x = grad_y / (x + self.epsilon) + + return grad_x + + +class Softmax(NumpyOp): + """ + softmax over last dimension + """ + + def forward(self, x): + """ + x: shape(N, c) + """ + + #################### + # code 4 # + #################### + sum = np.exp(x).sum(axis = 1) + sum = sum.reshape(x.shape[0], 1) + out = np.exp(x) / sum + + self.memory['y'] = out + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 5 # + #################### + y = self.memory['y'] + + grad_x = y * (grad_y - (y * grad_y).sum(axis = 1).reshape(len(y),1)) + + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + # Momentum优化 + self.v_W1_grad = 0 + self.v_W2_grad = 0 + self.v_W3_grad = 0 + + # Adam优化 + self.s_W1_grad = 0 + self.s_W2_grad = 0 + self.s_W3_grad = 0 + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + #################### + # code 6 # + #################### + x = self.matmul_1.forward(x, self.W1) + x = self.relu_1.forward(x) + + x = self.matmul_2.forward(x, self.W2) + x = self.relu_2.forward(x) + + x = self.matmul_3.forward(x, self.W3) + x = self.softmax.forward(x) + + x = self.log.forward(x) + + return x + + def backward(self, y): + + #################### + # code 7 # + ################### + + self.log_grad = self.log.backward(y) + + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = 
self.matmul_1.backward(self.relu_1_grad) + + def optimize(self, learning_rate): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def optimize_Momentum(self, learning_rate, belta): + self.v_W1_grad = belta * self.v_W1_grad + (1 - belta) * self.W1_grad + self.v_W2_grad = belta * self.v_W2_grad + (1 - belta) * self.W2_grad + self.v_W3_grad = belta * self.v_W3_grad + (1 - belta) * self.W3_grad + + self.W1 -= learning_rate * self.v_W1_grad + self.W2 -= learning_rate * self.v_W2_grad + self.W3 -= learning_rate * self.v_W3_grad + + def optimize_Adam(self, learning_rate, beta1, beta2, beta1_t, beta2_t, eps): + + self.v_W1_grad = beta1 * self.v_W1_grad + (1 - beta1) * self.W1_grad + self.v_W2_grad = beta1 * self.v_W2_grad + (1 - beta1) * self.W2_grad + self.v_W3_grad = beta1 * self.v_W3_grad + (1 - beta1) * self.W3_grad + + v_W1_corr = self.v_W1_grad / (1 - beta1_t) + v_W2_corr = self.v_W2_grad / (1 - beta1_t) + v_W3_corr = self.v_W3_grad / (1 - beta1_t) + + self.s_W1_grad = beta2 * self.s_W1_grad + (1 - beta2) * (self.W1_grad ** 2) + self.s_W2_grad = beta2 * self.s_W2_grad + (1 - beta2) * (self.W2_grad ** 2) + self.s_W3_grad = beta2 * self.s_W3_grad + (1 - beta2) * (self.W3_grad ** 2) + + s_W1_corr = self.s_W1_grad / (1 - beta2_t) + s_W2_corr = self.s_W2_grad / (1 - beta2_t) + s_W3_corr = self.s_W3_grad / (1 - beta2_t) + + self.W1 -= learning_rate * v_W1_corr / (np.sqrt(s_W1_corr) + eps) + self.W2 -= learning_rate * v_W2_corr / (np.sqrt(s_W2_corr) + eps) + self.W3 -= learning_rate * v_W3_corr / (np.sqrt(s_W3_corr) + eps) + diff --git a/assignment-2/submission/18307130341/numpy_mnist.py b/assignment-2/submission/18307130341/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..87aa55fdfd8a765cbad3203b3a9082c28c5c6502 --- /dev/null +++ b/assignment-2/submission/18307130341/numpy_mnist.py @@ -0,0 +1,91 @@ +import numpy as np +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, mini_batch, batch, get_torch_initialization, plot_curve, one_hot + +def mini_batch_numpy(dataset, batch_size=128): + data = [] + label = [] + + for x in dataset: + data.append(np.array(x[0])) + label.append(x[1]) + + data = np.array(data) + label = np.array(label) + + idx = np.random.permutation(len(dataset)) + data = data[idx] + label = label[idx] + + split_num = len(dataset) // batch_size + split_pos = split_num * batch_size + + ret_data = np.split(data[:split_pos], split_num) + ret_data.append(data[split_pos+1:]) + + ret_label = np.split(label[:split_pos], split_num) + ret_label.append(label[split_pos+1:]) + + ret = list(zip(ret_data, ret_label)) + return ret + +def numpy_run(): + + import time + start = time.time() + + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = NumpyLoss() + model.W1, model.W2, model.W3 = get_torch_initialization() + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + #Adam 优化 + beta1 = 0.9 + beta2 = 0.999 + beta1_t = 1 + beta2_t = 1 + + for epoch in range(epoch_number): + #Adam 优化 + beta1_t *= beta1 + beta2_t *= beta2 + + # for x, y in mini_batch_numpy(train_dataset): # mini_batch_numpy + for x, y in mini_batch(train_dataset): + y = one_hot(y) + + # y_pred = model.forward(x) # mini_batch_numpy + y_pred = model.forward(x.numpy()) + loss = numpy_loss.get_loss(y_pred, y) + + model.backward(numpy_loss.backward()) + + #原始optimize + model.optimize(learning_rate) + + #Momentum 优化 + # model.optimize_Momentum(learning_rate, 
0.9) + + #Adam 优化 + # model.optimize_Adam(learning_rate, beta1, beta2, beta1_t, beta2_t, 1e-8) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset)[0] + accuracy = np.mean((model.forward(x).argmax(axis=1) == y)) + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + end = time.time() + print("time = %.2f s"%(end-start)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + numpy_run() diff --git a/assignment-2/submission/18307130341/tester_demo.py b/assignment-2/submission/18307130341/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..504b3eef50a6df4d0aa433113136add50835e420 --- /dev/null +++ b/assignment-2/submission/18307130341/tester_demo.py @@ -0,0 +1,182 @@ +import numpy as np +import torch +from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log + +from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss +from torch_mnist import TorchModel +from utils import get_torch_initialization, one_hot + +err_epsilon = 1e-6 +err_p = 0.4 + + +def check_result(numpy_result, torch_result=None): + if isinstance(numpy_result, list) and torch_result is None: + flag = True + for (n, t) in numpy_result: + flag = flag and check_result(n, t) + return flag + # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item()) + T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item() + direction = T / torch_result.numel() < err_p + return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item() + + +def case_1(): + x = np.random.normal(size=[5, 6]) + W = np.random.normal(size=[6, 4]) + + numpy_matmul = Matmul() + numpy_out = numpy_matmul.forward(x, W) + numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + torch_W = torch.from_numpy(W).clone().requires_grad_() + + torch_out = torch_matmul(torch_x, torch_W) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + (numpy_W_grad, torch_W.grad) + ]) + + +def case_2(): + x = np.random.normal(size=[5, 6]) + + numpy_relu = Relu() + numpy_out = numpy_relu.forward(x) + numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_relu(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def case_3(): + x = np.random.uniform(low=0.0, high=1.0, size=[3, 4]) + + numpy_log = Log() + numpy_out = numpy_log.forward(x) + numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_log(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + + (numpy_x_grad, torch_x.grad), + ]) + + +def case_4(): + x = np.random.normal(size=[4, 5]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + + return check_result(numpy_out, torch_out) + + +def case_5(): + x = np.random.normal(size=[20, 25]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + torch_out.sum().backward() + + return check_result([ + 
(numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def test_model(): + try: + numpy_loss = NumpyLoss() + numpy_model = NumpyModel() + torch_model = TorchModel() + torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False) + numpy_model.W1 = torch_model.W1.detach().clone().numpy() + numpy_model.W2 = torch_model.W2.detach().clone().numpy() + numpy_model.W3 = torch_model.W3.detach().clone().numpy() + + x = torch.randn((10000, 28, 28)) + y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000) + + y = one_hot(y, numpy=False) + x2 = x.numpy() + y_pred = torch_model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + + y_pred_numpy = numpy_model.forward(x2) + numpy_loss.get_loss(y_pred_numpy, y.numpy()) + + check_flag_1 = check_result(y_pred_numpy, y_pred) + print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10)) + except: + print("[Runtime Error in forward]") + print("+ {:12} {}/{}".format("forward", 0, 10)) + return 0 + + try: + + numpy_model.backward(numpy_loss.backward()) + + check_flag_2 = [ + check_result(numpy_model.log_grad, torch_model.log_input.grad), + check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad), + check_result(numpy_model.W3_grad, torch_model.W3.grad), + check_result(numpy_model.W2_grad, torch_model.W2.grad), + check_result(numpy_model.W1_grad, torch_model.W1.grad) + ] + check_flag_2 = sum(check_flag_2) >= 4 + print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20)) + except: + print("[Runtime Error in backward]") + print("+ {:12} {}/{}".format("backward", 0, 20)) + check_flag_2 = False + + return 10 * check_flag_1 + 20 * check_flag_2 + + +if __name__ == "__main__": + testcases = [ + ["matmul", case_1, 5], + ["relu", case_2, 5], + ["log", case_3, 5], + ["softmax_1", case_4, 5], + ["softmax_2", case_5, 10], + ] + score = 0 + for case in testcases: + try: + res = case[2] if case[1]() else 0 + except: + print("[Runtime Error in {}]".format(case[0])) + res = 0 + score += res + print("+ {:12} {}/{}".format(case[0], res, case[2])) + score += test_model() + print("{:14} {}/60".format("FINAL SCORE", score)) diff --git a/assignment-2/submission/18307130341/torch_mnist.py b/assignment-2/submission/18307130341/torch_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..7bbcedcf8108c227e09c861a761c18e99a7f9429 --- /dev/null +++ b/assignment-2/submission/18307130341/torch_mnist.py @@ -0,0 +1,75 @@ +import torch +from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve + + +class TorchModel: + + def __init__(self): + self.W1 = torch.randn((28 * 28, 256), requires_grad=True) + self.W2 = torch.randn((256, 64), requires_grad=True) + self.W3 = torch.randn((64, 10), requires_grad=True) + self.softmax_input = None + self.log_input = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + x = torch.relu(torch.matmul(x, self.W1)) + x = torch.relu(torch.matmul(x, self.W2)) + x = torch.matmul(x, self.W3) + + self.softmax_input = x + self.softmax_input.retain_grad() + + x = torch.softmax(x, 1) + + self.log_input = x + self.log_input.retain_grad() + + x = torch.log(x) + + return x + + def optimize(self, learning_rate): + with torch.no_grad(): + self.W1 -= learning_rate * self.W1.grad + self.W2 -= learning_rate * self.W2.grad + self.W3 -= learning_rate * self.W3.grad + + self.W1.grad = None + self.W2.grad = None + self.W3.grad = None + + +def torch_run(): + train_dataset, test_dataset = download_mnist() + + model = 
TorchModel() + # model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False) + from utils import get_torch_initialization_numpy + model.W1.data, model.W2.data, model.W3.data = get_torch_initialization_numpy(numpy=False) + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, numpy=False): + y = one_hot(y, numpy=False) + + y_pred = model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset, numpy=False)[0] + accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item() + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + torch_run() diff --git a/assignment-2/submission/18307130341/utils.py b/assignment-2/submission/18307130341/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5154f4970843623204198206ff0df1438bbee5df --- /dev/null +++ b/assignment-2/submission/18307130341/utils.py @@ -0,0 +1,91 @@ +import torch +import numpy as np +from matplotlib import pyplot as plt + + +def plot_curve(data): + plt.plot(range(len(data)), data, color='blue') + plt.legend(['loss_value'], loc='upper right') + plt.xlabel('step') + plt.ylabel('value') + plt.show() + + +def download_mnist(): + from torchvision import datasets, transforms + + transform = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean=(0.1307,), std=(0.3081,)) + ]) + + train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True) + test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True) + + return train_dataset, test_dataset + + +def one_hot(y, numpy=True): + if numpy: + y_ = np.zeros((y.shape[0], 10)) + y_[np.arange(y.shape[0], dtype=np.int32), y] = 1 + return y_ + else: + y_ = torch.zeros((y.shape[0], 10)) + y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1 + return y_ + + +def batch(dataset, numpy=True): + data = [] + label = [] + for each in dataset: + data.append(each[0]) + label.append(each[1]) + data = torch.stack(data) + label = torch.LongTensor(label) + if numpy: + return [(data.numpy(), label.numpy())] + else: + return [(data, label)] + + +def mini_batch(dataset, batch_size=128, numpy=False): + return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) + + +def get_torch_initialization(numpy=True): + fc1 = torch.nn.Linear(28 * 28, 256) + fc2 = torch.nn.Linear(256, 64) + fc3 = torch.nn.Linear(64, 10) + + if numpy: + W1 = fc1.weight.T.detach().clone().numpy() + W2 = fc2.weight.T.detach().clone().numpy() + W3 = fc3.weight.T.detach().clone().numpy() + else: + W1 = fc1.weight.T.detach().clone().data + W2 = fc2.weight.T.detach().clone().data + W3 = fc3.weight.T.detach().clone().data + + return W1, W2, W3 + +def get_torch_initialization_numpy(numpy=True): + fan_in_1 = 28 * 28 + fan_in_2 = 256 + fan_in_3 = 64 + + bound1 = 1 / np.sqrt(fan_in_1) + bound2 = 1 / np.sqrt(fan_in_2) + bound3 = 1 / np.sqrt(fan_in_3) + + W1 = np.random.uniform(-bound1, bound1, (28*28, 256)) + W2 = np.random.uniform(-bound2, bound2, (256, 64)) + W3 = np.random.uniform(-bound3, bound3, (64, 10)) + + if numpy == False: + W1 = torch.Tensor(W1) + W2 = torch.Tensor(W2) + W3 = torch.Tensor(W3) + + return W1, W2, W3 \ No newline at end of file