diff --git a/assignment-2/submission/18307130341/README.md b/assignment-2/submission/18307130341/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a859dde38f467fbbb852f830eda2ded657e48952
--- /dev/null
+++ b/assignment-2/submission/18307130341/README.md
@@ -0,0 +1,441 @@
+# Lab Report: Assignment 2, Topic 1 (FNN)
+
+18307130341 黄韵澄
+
+[toc]
+
+### 1. Overview
+
+Implement a feedforward neural network (FNN) and use the MNIST dataset to solve the handwritten-digit classification task.
+
+### 2. Deriving the Operators' Forward and Backward Passes
+
+The FNN is built from four operators: Matmul, Relu, Log, and Softmax.
+
+#### 2.1 Matmul backward pass
+
+$$
+loss = f(X\times W) =f(Y)
+$$
+
+By the chain rule:
+$$
+\frac{\partial loss}{\partial X_{p,q}} = \sum_{i,j}{\frac{\partial loss}{\partial Y_{i,j}}\frac{\partial Y_{i,j}}{\partial X_{p,q}}}
+$$
+By the definition of matrix multiplication:
+$$
+Y_{i,j} = \sum_{k}{X_{i,k}W_{k,j}}
+$$
+Therefore, when $i\neq p$, $Y_{i,j}$ does not depend on $X_{p,q}$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{p,q}} =\begin{cases}W_{q,j}\quad i=p \\\\ 0 \quad i\neq p\end{cases}
+$$
+Substituting into the expression above:
+$$
+\frac{\partial loss}{\partial X_{p,q}} = \sum_{i,j}\frac{\partial loss}{\partial Y_{i,j}}\frac{\partial Y_{i,j}}{\partial X_{p,q}}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}\frac{\partial Y_{p,j}}{\partial X_{p,q}}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}W_{q,j}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}W_{j,q}^{T}
+$$
+
+
+Hence:
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}W^{T}
+$$
+Similarly:
+$$
+\frac{\partial loss}{\partial W} = X^{T}\frac{\partial loss}{\partial Y}
+$$
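+
+These two formulas map directly onto numpy. A minimal standalone sketch (the helper name `matmul_backward` is only for illustration; the submitted code wraps the same two lines in the `Matmul` operator of `numpy_fnn.py`):
+
+```python
+import numpy as np
+
+def matmul_backward(x, W, grad_y):
+    # dloss/dX = dloss/dY @ W^T,  dloss/dW = X^T @ dloss/dY
+    grad_x = np.matmul(grad_y, W.T)
+    grad_W = np.matmul(x.T, grad_y)
+    return grad_x, grad_W
+```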
+
+#### 2.2 Relu backward pass
+
+$$
+loss = f(Y) = f(Relu(X))
+$$
+
+where:
+$$
+Relu(x) = \begin{cases}0 \quad x < 0 \\\\ x \quad x\geq 0\end{cases}
+$$
+Then:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = \begin{cases}1 \quad i=k,\ j=l,\ X_{k,l}>0 \\\\
+0\quad \text{otherwise} \end{cases}
+$$
+By the chain rule (the product is taken element-wise):
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}\frac{\partial Y}{\partial X}
+$$
+Code:
+
+```python
+grad_x = grad_y * np.where(x > 0, 1, 0)
+```
+
+#### 2.3 Log backward pass
+
+$$
+loss = f(Y) = f(ln(X))
+$$
+
+where:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = \begin{cases}\frac{1}{X_{k,l}} \quad i=k,\ j=l \\\\
+0\quad \text{otherwise} \end{cases}
+$$
+By the chain rule (the division is element-wise):
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}\cdot \frac{\partial Y}{\partial X}=\frac{\partial loss}{\partial Y}\cdot \frac{1}{X}
+$$
+Code (`epsilon` guards against division by zero):
+
+```python
+grad_x = grad_y / (x + self.epsilon)
+```
+
+#### 2.4 Softmax forward and backward pass
+
+Forward pass (each row is normalized independently):
+
+$$
+loss = f(Y) \\
+Y_{i,j} = \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}
+$$
+Backward pass:
+
+(1) When $j=l$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{i,l}} = \frac{\partial Y_{i,j}}{\partial X_{i,j}}=\frac{\partial \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}}{\partial X_{i,j}} = \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}-(\frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}})^{2} = Y_{i,j}-Y_{i,j}^2 \\
+$$
+(2) When $j \neq l$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{i,l}} = \frac{\partial \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}}{\partial X_{i,l}} = -\frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}\cdot \frac{e^{X_{i,l}}}{\sum_{k}e^{X_{i,k}}} = -Y_{i,j}\cdot Y_{i,l} \\
+$$
+(3) When $i \neq k$, the rows are independent, so the gradient is 0:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = 0
+$$
+By the chain rule (the $\cdot$ below denotes plain multiplication of scalar entries):
+$$
+\frac{\partial loss}{\partial X_{k,l}} = \sum_{j}\frac{\partial loss}{\partial Y_{k,j}}\cdot \frac{\partial Y_{k,j}}{\partial X_{k,l}} = \Big(\sum_{j}-\frac{\partial loss}{\partial Y_{k,j}}\cdot Y_{k,j}\cdot Y_{k,l}\Big)+ \frac{\partial loss}{\partial Y_{k,l}}\cdot Y_{k,l} \\\\
+=Y_{k,l}\cdot\Big( \frac{\partial loss}{\partial Y_{k,l}}-\sum_{j}\frac{\partial loss}{\partial Y_{k,j}}\cdot Y_{k,j}\Big)
+After this simplification, the backward pass can be written in a single line of numpy:
+
+```python
+grad_x = y * (grad_y - (y * grad_y).sum(axis = 1).reshape(len(y),1))
+```
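+
+As a quick sanity check (not part of the submission), the analytic gradient can be compared against a finite-difference approximation; the sketch below assumes the `Softmax` operator from `numpy_fnn.py`:
+
+```python
+import numpy as np
+from numpy_fnn import Softmax
+
+x = np.random.normal(size=(4, 5))
+grad_y = np.random.normal(size=(4, 5))
+
+sm = Softmax()
+sm.forward(x)
+analytic = sm.backward(grad_y)            # the formula above
+
+# finite differences for loss = sum(grad_y * softmax(x))
+eps, numeric = 1e-6, np.zeros_like(x)
+for i in range(x.shape[0]):
+    for j in range(x.shape[1]):
+        x_pos, x_neg = x.copy(), x.copy()
+        x_pos[i, j] += eps
+        x_neg[i, j] -= eps
+        numeric[i, j] = ((Softmax().forward(x_pos) * grad_y).sum()
+                         - (Softmax().forward(x_neg) * grad_y).sum()) / (2 * eps)
+
+print(np.abs(analytic - numeric).max())   # should be close to zero
+```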
+
+### 3. Building the FNN Model
+
+#### 3.1 The FNN model
+
+The FNN is structured as shown in the figure:
+
+
+
+- Input layer ($N\times 28^2$), fully connected to the next layer with weights W1 ($28^2\times 256$).
+- Hidden layer 1 ($N\times 256$, Relu activation), fully connected to the next layer with weights W2 ($256\times 64$).
+- Hidden layer 2 ($N\times 64$, Relu activation), fully connected to the next layer with weights W3 ($64\times 10$).
+- Hidden layer 3 ($N\times 10$, Softmax activation), passed directly to the next layer.
+- Output layer ($N\times 10$, Log activation).
+
+Written as formulas:
+$$
+a^{(0)} = X \\\\
+z^{(1)} = a^{(0)} W_1 ,\quad a^{(1)} = Relu(z^{(1)}) \\\\
+z^{(2)} = a^{(1)} W_2 ,\quad a^{(2)} = Relu(z^{(2)}) \\\\
+z^{(3)} = a^{(2)} W_3 ,\quad a^{(3)} = Softmax(z^{(3)}) \\\\
+z^{(4)} = a^{(3)},\quad a^{(4)} = Log(z^{(4)}) \\\\
+Y = a^{(4)}
+$$
+
+The loss is the cross entropy between the one-hot target $\hat Y$ and the model output $Y$ (which already holds log-probabilities), averaged over the batch:
+$$
+loss = -\frac{1}{N}\sum_{i}\sum_{j}\hat Y_{i,j} \, Y_{i,j}
+$$
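+
+This is exactly what `NumpyLoss.get_loss` computes in `numpy_fnn.py`; a standalone version for reference (the function name is only for illustration):
+
+```python
+import numpy as np
+
+def cross_entropy_from_log_probs(pred, target):
+    # pred:   (N, 10) log-probabilities, the output of the Log layer
+    # target: (N, 10) one-hot labels
+    return (-pred * target).sum(axis=1).mean()
+```
+
+Its gradient with respect to the prediction is simply $-\hat Y / N$, which is what `NumpyLoss.backward` returns.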
+
+#### 3.2 FNN backward pass
+
+Following the model structure in reverse, the chain rule is applied layer by layer: the gradient returned by each operator's backward call is fed into the next operator's backward call:
+
+```python
+self.log_grad = self.log.backward(y)
+
+self.softmax_grad = self.softmax.backward(self.log_grad)
+self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+
+self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+
+self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+```
+
+#### 3.3 Test results of the FNN model
+
+Running `numpy_mnist.py` directly downloads the MNIST handwritten-digit dataset and runs the test automatically.
+
+Plot of the loss:
+
+
+
+Model accuracy:
+
+```
+[0] Accuracy: 0.9350
+[1] Accuracy: 0.9674
+[2] Accuracy: 0.9693
+```
+
+The numpy run uses only 3 epochs, yet the accuracy is already fairly high. However, the loss curve is still fluctuating and has clearly not converged; more epochs and a tuned learning rate would be needed for the loss to converge.
+
+### 4. Implementing mini_batch
+
+#### 4.1 The mini_batch function
+
+The `mini_batch` function splits `dataset` into batches of size `batch_size`.
+
+Processing steps:
+
+- Loop over `dataset`, collect the data and labels, and convert them to ndarrays:
+
+```python
+data = []
+label = []
+for x in dataset:
+ data.append(np.array(x[0]))
+ label.append(x[1])
+data = np.array(data)
+label = np.array(label)
+```
+
+- There are many ways to shuffle; here `np.random.permutation` is used to generate a random permutation of the indices:
+
+```python
+idx = np.random.permutation(len(dataset))
+data = data[idx]
+label = label[idx]
+```
+
+- Use `np.split` to divide data and label into batches. Since `np.split` requires equal-sized chunks, the remaining samples (when the dataset size is not a multiple of batch_size) are appended as a final, smaller batch:
+
+```python
+split_num = len(dataset) // batch_size  # number of full batches
+split_pos = split_num * batch_size      # end position of the full batches
+# split data and labels into the full batches
+ret_data = np.split(data[:split_pos], split_num)
+ret_label = np.split(label[:split_pos], split_num)
+# append the remaining (smaller) batch, if any
+if split_pos < len(dataset):
+    ret_data.append(data[split_pos:])
+    ret_label.append(label[split_pos:])
+```
+
+- Finally, `zip` pairs each data batch with its label batch into tuples:
+
+```python
+ret = list(zip(ret_data, ret_label))
+```
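+
+The function is then a drop-in replacement for `utils.mini_batch`; a minimal usage sketch (mirroring the commented-out lines in `numpy_mnist.py`):
+
+```python
+from numpy_mnist import mini_batch_numpy
+from numpy_fnn import NumpyModel
+from utils import download_mnist, one_hot
+
+train_dataset, _ = download_mnist()
+model = NumpyModel()
+
+for x, y in mini_batch_numpy(train_dataset, batch_size=128):
+    y = one_hot(y)              # x and y are already numpy arrays
+    y_pred = model.forward(x)   # no .numpy() conversion needed
+    break                       # just inspect the first batch
+
+print(y_pred.shape)             # (128, 10): per-class log-probabilities
+```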
+
+#### 4.2 Testing mini_batch
+
+Using the torch-based mini_batch from `utils.py`:
+
+```
+[0] Accuracy: 0.9473
+[1] Accuracy: 0.9648
+[2] Accuracy: 0.9680
+time = 73.32 s
+```
+
+Using the numpy-only mini_batch:
+
+```
+[0] Accuracy: 0.9474
+[1] Accuracy: 0.9556
+[2] Accuracy: 0.9678
+time = 66.24 s
+```
+
+In theory the choice of mini_batch implementation does not affect accuracy. The numpy version is about 7 s faster than the torch-based one.
+
+### 5. Optimization Methods
+
+#### 5.1 Momentum
+
+Momentum, also known as momentum gradient descent, addresses the following issue of plain gradient descent:
+
+> Plain gradient descent oscillates perpendicular to the descent direction; because of this oscillation only a small learning rate can be used, otherwise the oscillation grows. Momentum averages recent gradients, so the oscillation largely cancels out, which allows a somewhat larger learning rate and speeds up the descent.
+
+Momentum update rule:
+$$
+V_{dW}= \beta \cdot V_{dW} + (1-\beta)\cdot dW \\\\
+W = W - \alpha \cdot V_{dW}
+$$
+where $\alpha$ is the learning rate and $\beta$ is the momentum coefficient; in the experiments $\beta = 0.9$.
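+
+In the submission this update is implemented in `optimize_Momentum` in `numpy_fnn.py`; a minimal per-matrix sketch of the same update (the helper name is only for illustration):
+
+```python
+def momentum_step(W, grad_W, v_W, learning_rate=0.1, beta=0.9):
+    """One Momentum update for a single weight matrix; returns (new_W, new_v_W)."""
+    v_W = beta * v_W + (1 - beta) * grad_W   # exponential moving average of the gradients
+    W = W - learning_rate * v_W
+    return W, v_W
+
+# usage: v_W1 starts at 0 and is carried across steps
+# W1, v_W1 = momentum_step(W1, W1_grad, v_W1)
+```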
+
+Training with plain gradient descent (green line) and with Momentum (blue line) and plotting accuracy against epoch gives:
+
+
+
+Momentum learns more slowly than the plain method in the early stage, but as momentum accumulates its accuracy quickly overtakes the plain method and it converges to a higher final accuracy.
+
+#### 5.2 Adam
+
+Adam is essentially a combination of RMSProp and Momentum:
+
+> **Root Mean Square Propagation (RMSProp)**: maintains a per-parameter learning rate that is adapted based on the average of recent gradient magnitudes for that weight. This makes the algorithm perform well on online and non-stationary problems.
+
+Adam update rule:
+$$
+V_{dW} = \beta_1\cdot V_{dW} + (1-\beta_1)\cdot dW \\\\
+V_{dW}^{corrected} = \frac{V_{dW}}{1-\beta_1^t} \\\\
+S_{dW} = \beta_2\cdot S_{dW} + (1-\beta_2)\cdot dW^2 \\\\
+S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t} \\\\
+W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}
+$$
+where $\alpha$ is the learning rate, $\beta_1$ the momentum coefficient, and $\beta_2$ the coefficient of the adaptive learning-rate (RMSProp) term.
+
+Following the typical settings, $\alpha$ is set to 0.001, $\beta_1$ to 0.9, and $\beta_2$ to 0.999. Running for 20 epochs gives the results shown below (compared with the plain optimize and with Momentum):
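+
+In the submission, `optimize_Adam` in `numpy_fnn.py` applies this update to W1, W2, and W3, with the bias-correction terms $\beta_1^t,\beta_2^t$ maintained in `numpy_mnist.py`. A per-matrix sketch of one step (helper name for illustration only):
+
+```python
+import numpy as np
+
+def adam_step(W, grad_W, v_W, s_W, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
+    """One Adam update for a single weight matrix; t is the 1-based step count."""
+    v_W = beta1 * v_W + (1 - beta1) * grad_W        # first moment
+    s_W = beta2 * s_W + (1 - beta2) * grad_W ** 2   # second moment
+    v_corr = v_W / (1 - beta1 ** t)                 # bias correction
+    s_corr = s_W / (1 - beta2 ** t)
+    W = W - lr * v_corr / (np.sqrt(s_corr) + eps)
+    return W, v_W, s_W
+```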
+
+
+
+The loss under Adam fluctuates considerably and has not converged after 20 epochs. Comparing the per-epoch accuracy of the three optimizers (green: plain method, blue: Momentum, purple: Adam):
+
+
+
+At small epoch counts Adam does not achieve higher accuracy, and its accuracy is still rising when training ends, probably because Adam's small learning rate slows convergence. Momentum is the better fit for the model in this experiment.
+
+### 6. Weight Initialization
+
+Weights cannot all be initialized to zero: if every parameter is 0, all hidden units in a layer compute identical activations in the first forward pass, so neurons in deeper layers cannot be distinguished from one another. This is known as the symmetric-weight problem.
+
+An obvious fix is to initialize every parameter with a random value. However, if the initial values are too small, the neurons' inputs become tiny and the signal vanishes after a few layers; if they are too large, the activations saturate quickly, the gradients approach zero, and the network is again hard to train.
+
+Therefore the initialization range should in general be chosen according to the properties of each layer's neurons.
+
+The Xavier and Kaiming initialization schemes are introduced below.
+
+#### 6.1 Xavier initialization
+
+Xavier Glorot observed in his paper that the variance of the activations decreases layer by layer, which makes the back-propagated gradients decrease layer by layer as well. To avoid vanishing gradients, this decay of the activation variance must be prevented; ideally, the outputs (activations) of every layer keep a Gaussian distribution.
+
+To approximate this ideal, Xavier initialization comes in two variants, one based on a uniform distribution and one on a Gaussian distribution; the parameters of the distribution are computed from a gain value.
+
+`torch.nn.init.calculate_gain(nonlinearity, param=None)` computes the gain for a given nonlinearity:
+
+
+
+Uniform initialization $U(-a,a)$, where:
+$$
+a = gain\times \sqrt{\frac{6}{fan\_in+fan\_out}}
+$$
+
+
+Normal initialization $N(0,std^2)$, where:
+$$
+std = gain\times \sqrt{\frac{2}{fan\_in+fan\_out}}
+$$
+
+
+$fan\_in$ and $fan\_out$ denote the input and output sizes of the layer.
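+
+Both variants are straightforward to express in numpy; a sketch (the helper names are only for illustration and are not part of the submission):
+
+```python
+import numpy as np
+
+def xavier_uniform(fan_in, fan_out, gain=1.0):
+    a = gain * np.sqrt(6.0 / (fan_in + fan_out))
+    return np.random.uniform(-a, a, (fan_in, fan_out))
+
+def xavier_normal(fan_in, fan_out, gain=1.0):
+    std = gain * np.sqrt(2.0 / (fan_in + fan_out))
+    return np.random.normal(0.0, std, (fan_in, fan_out))
+```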
+
+#### 6.2 Kaiming initialization
+
+> The problem with Xavier initialization is that it is derived for linear activation functions, but for deep networks linear activations are of little value; nonlinear activations are needed to build complex nonlinear systems, and today's networks mostly use relu. Kaiming He, after whom the method is named, proposed an initialization tailored to relu in his paper.
+>
+> Because relu discards all values below 0, for zero-mean data this effectively removes half of the values, so the mean increases and the assumption E(x)=0 used in the Xavier derivation no longer holds. Redoing the derivation yields a new rescale coefficient, which gives the formulas below.
+
+Uniform initialization $U(-bound,bound)$, where:
+$$
+bound = \sqrt{\frac{6}{(1+a^2)\times fan\_in}}
+$$
+Normal initialization $N(0,std^2)$, where:
+$$
+std = \sqrt{\frac{2}{(1+a^2)\times fan\_in}}
+$$
+$a$ is a configurable parameter (the negative slope of the leaky relu) and $fan\_in$ is the input size of the layer.
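+
+Analogously to the Xavier sketch above, both Kaiming variants can be written in a few lines of numpy (again, the helper names are only for illustration; section 6.3 shows the variant actually used in the submission):
+
+```python
+import numpy as np
+
+def kaiming_uniform(fan_in, fan_out, a=0.0):
+    bound = np.sqrt(6.0 / ((1 + a ** 2) * fan_in))
+    return np.random.uniform(-bound, bound, (fan_in, fan_out))
+
+def kaiming_normal(fan_in, fan_out, a=0.0):
+    std = np.sqrt(2.0 / ((1 + a ** 2) * fan_in))
+    return np.random.normal(0.0, std, (fan_in, fan_out))
+```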
+
+#### 6.3 Implementing the initialization
+
+First, let us check which initialization scheme `get_torch_initialization` actually uses:
+
+```python
+fc1 = torch.nn.Linear(28 * 28, 256)
+W1 = fc1.weight.T.detach().clone().data
+```
+
+It creates a Linear layer and takes its weight values directly. Looking at the definition of the Linear class:
+
+```python
+init.kaiming_uniform_(self.weight, a=math.sqrt(5))
+```
+
+So Kaiming uniform initialization is used, with a set to $\sqrt 5$.
+
+```python
+def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
+```
+
+Since nonlinearity is not specified, it defaults to leaky_relu.
+
+```python
+gain = calculate_gain(nonlinearity, a)
+```
+
+The a that is passed in is the `negative_slope` of the Leaky ReLU.
+
+```python
+gain = math.sqrt(2.0 / (1 + a ** 2))
+std = gain / math.sqrt(fan)
+bound = math.sqrt(3.0) * std
+return tensor.uniform_(-bound, bound)
+```
+
+Reading this code confirms that the tensor is filled with Kaiming uniform initialization.
+
+The same Kaiming uniform initialization with a set to $\sqrt 5$ is finally reproduced in numpy (for $a=\sqrt 5$ the bound simplifies to $1/\sqrt{fan\_in}$, as the comment in the code notes):
+
+```python
+def get_torch_initialization_numpy(numpy=True):
+ fan_in_1 = 28 * 28
+ fan_in_2 = 256
+ fan_in_3 = 64
+
+ bound1 = 1 / np.sqrt(fan_in_1) #bound1 = np.sqrt(6) / np.sqrt(1+np.sqrt(5)**2) /np.sqrt(fan_in_1)
+ bound2 = 1 / np.sqrt(fan_in_2)
+ bound3 = 1 / np.sqrt(fan_in_3)
+
+ W1 = np.random.uniform(-bound1, bound1, (28*28, 256))
+ W2 = np.random.uniform(-bound2, bound2, (256, 64))
+ W3 = np.random.uniform(-bound3, bound3, (64, 10))
+
+ if numpy == False:
+ W1 = torch.Tensor(W1)
+ W2 = torch.Tensor(W2)
+ W3 = torch.Tensor(W3)
+
+ return W1, W2, W3
+```
+
+Result of running `torch_mnist.py` with this numpy initialization:
+
+
+
+```
+[0] Accuracy: 0.9503
+[1] Accuracy: 0.9639
+[2] Accuracy: 0.9711
+```
+
+### 7. Description of the Submitted Code
+
+- `numpy_fnn.py`: forward and backward passes of the operators and of the FNN model. `optimize_Momentum` implements the Momentum optimizer and `optimize_Adam` implements Adam; switch optimizers by changing which optimize call is used in `numpy_mnist.py`.
+- `numpy_mnist.py`: `mini_batch_numpy` re-implements mini_batch using numpy only.
+- `utils.py`: `get_torch_initialization_numpy` implements Kaiming uniform initialization using numpy.
+
+### 8. References
+
+[1] [神经网络常见优化算法(Momentum, RMSprop, Adam)的原理及公式理解, 学习率衰减](https://blog.csdn.net/weixin_42561002/article/details/88036777)
+
+[2] [深度之眼【Pytorch】-Xavier、Kaiming初始化(附keras实现)](https://blog.csdn.net/weixin_42147780/article/details/103238195)
+
diff --git a/assignment-2/submission/18307130341/img/Fig1.png b/assignment-2/submission/18307130341/img/Fig1.png
new file mode 100644
index 0000000000000000000000000000000000000000..50b42797b50f8b745d7707a86e2644d84843d228
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig1.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig2.png b/assignment-2/submission/18307130341/img/Fig2.png
new file mode 100644
index 0000000000000000000000000000000000000000..f5dd7bdc2c2712953ddd2b990232d3a7a71b655b
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig2.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig3.png b/assignment-2/submission/18307130341/img/Fig3.png
new file mode 100644
index 0000000000000000000000000000000000000000..c440ff99663169e8b636ed9e3fc8b7cbdb6008f1
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig3.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig4.png b/assignment-2/submission/18307130341/img/Fig4.png
new file mode 100644
index 0000000000000000000000000000000000000000..05196c545d1d6da186cd8e301eff8fec10110060
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig4.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig5.png b/assignment-2/submission/18307130341/img/Fig5.png
new file mode 100644
index 0000000000000000000000000000000000000000..31658f4aa8db641225bf56c4ef54fb8c079d7ae2
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig5.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig6.png b/assignment-2/submission/18307130341/img/Fig6.png
new file mode 100644
index 0000000000000000000000000000000000000000..54721b598f39cfb996bfde8986077ca64836eb76
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig6.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig7.png b/assignment-2/submission/18307130341/img/Fig7.png
new file mode 100644
index 0000000000000000000000000000000000000000..1a3f1b1c91d8767838bd464ad291da558006c941
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig7.png differ
diff --git a/assignment-2/submission/18307130341/img/Fig8.png b/assignment-2/submission/18307130341/img/Fig8.png
new file mode 100644
index 0000000000000000000000000000000000000000..081717ced38314a1b250daf2f527bce99313bc71
Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig8.png differ
diff --git a/assignment-2/submission/18307130341/numpy_fnn.py b/assignment-2/submission/18307130341/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..bacac809951ecb357664479fa2f8f69e956fd8b8
--- /dev/null
+++ b/assignment-2/submission/18307130341/numpy_fnn.py
@@ -0,0 +1,240 @@
+import numpy as np
+
+
+class NumpyOp:
+
+ def __init__(self):
+ self.memory = {}
+ self.epsilon = 1e-12
+
+
+class Matmul(NumpyOp):
+
+ def forward(self, x, W):
+ """
+ x: shape(N, d)
+        W: shape(d, d')
+ """
+ self.memory['x'] = x
+ self.memory['W'] = W
+ h = np.matmul(x, W)
+ return h
+
+ def backward(self, grad_y):
+ """
+ grad_y: shape(N, d')
+ """
+
+ ####################
+ # code 1 #
+ ####################
+ xT = np.transpose(self.memory['x'])
+ WT = np.transpose(self.memory['W'])
+
+ grad_x = np.matmul(grad_y, WT)
+ grad_W = np.matmul(xT, grad_y)
+
+ return grad_x, grad_W
+
+
+class Relu(NumpyOp):
+
+ def forward(self, x):
+ self.memory['x'] = x
+ return np.where(x > 0, x, np.zeros_like(x))
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 2 #
+ ####################
+ x = self.memory['x']
+ grad_x = grad_y * np.where(x > 0, 1, 0)
+
+ return grad_x
+
+
+class Log(NumpyOp):
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+
+ out = np.log(x + self.epsilon)
+ self.memory['x'] = x
+
+ return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 3 #
+ ####################
+ x = self.memory['x']
+ grad_x = grad_y / (x + self.epsilon)
+
+ return grad_x
+
+
+class Softmax(NumpyOp):
+ """
+ softmax over last dimension
+ """
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+
+ ####################
+ # code 4 #
+ ####################
+        exp_x = np.exp(x)                               # compute the exponentials once
+        out = exp_x / exp_x.sum(axis=1, keepdims=True)  # normalise each row
+
+ self.memory['y'] = out
+
+ return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 5 #
+ ####################
+ y = self.memory['y']
+
+ grad_x = y * (grad_y - (y * grad_y).sum(axis = 1).reshape(len(y),1))
+
+ return grad_x
+
+
+class NumpyLoss:
+
+ def __init__(self):
+ self.target = None
+
+ def get_loss(self, pred, target):
+ self.target = target
+ return (-pred * target).sum(axis=1).mean()
+
+ def backward(self):
+ return -self.target / self.target.shape[0]
+
+
+class NumpyModel:
+ def __init__(self):
+ self.W1 = np.random.normal(size=(28 * 28, 256))
+ self.W2 = np.random.normal(size=(256, 64))
+ self.W3 = np.random.normal(size=(64, 10))
+
+        # operators used in both forward and backward
+ self.matmul_1 = Matmul()
+ self.relu_1 = Relu()
+ self.matmul_2 = Matmul()
+ self.relu_2 = Relu()
+ self.matmul_3 = Matmul()
+ self.softmax = Softmax()
+ self.log = Log()
+
+        # gradients updated in backward; softmax_grad, log_grad, etc. hold each operator's backward output (partial derivative of the loss w.r.t. that operator's input)
+ self.x1_grad, self.W1_grad = None, None
+ self.relu_1_grad = None
+ self.x2_grad, self.W2_grad = None, None
+ self.relu_2_grad = None
+ self.x3_grad, self.W3_grad = None, None
+ self.softmax_grad = None
+ self.log_grad = None
+
+        # Momentum state (first-moment estimates of the gradients)
+ self.v_W1_grad = 0
+ self.v_W2_grad = 0
+ self.v_W3_grad = 0
+
+        # Adam state (second-moment estimates; Adam also reuses the v_W*_grad terms above)
+ self.s_W1_grad = 0
+ self.s_W2_grad = 0
+ self.s_W3_grad = 0
+
+ def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+
+ ####################
+ # code 6 #
+ ####################
+ x = self.matmul_1.forward(x, self.W1)
+ x = self.relu_1.forward(x)
+
+ x = self.matmul_2.forward(x, self.W2)
+ x = self.relu_2.forward(x)
+
+ x = self.matmul_3.forward(x, self.W3)
+ x = self.softmax.forward(x)
+
+ x = self.log.forward(x)
+
+ return x
+
+ def backward(self, y):
+
+ ####################
+ # code 7 #
+ ###################
+
+ self.log_grad = self.log.backward(y)
+
+ self.softmax_grad = self.softmax.backward(self.log_grad)
+ self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+
+ self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+ self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+
+ self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+ self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+
+ def optimize(self, learning_rate):
+ self.W1 -= learning_rate * self.W1_grad
+ self.W2 -= learning_rate * self.W2_grad
+ self.W3 -= learning_rate * self.W3_grad
+
+    def optimize_Momentum(self, learning_rate, beta):
+        self.v_W1_grad = beta * self.v_W1_grad + (1 - beta) * self.W1_grad
+        self.v_W2_grad = beta * self.v_W2_grad + (1 - beta) * self.W2_grad
+        self.v_W3_grad = beta * self.v_W3_grad + (1 - beta) * self.W3_grad
+
+ self.W1 -= learning_rate * self.v_W1_grad
+ self.W2 -= learning_rate * self.v_W2_grad
+ self.W3 -= learning_rate * self.v_W3_grad
+
+ def optimize_Adam(self, learning_rate, beta1, beta2, beta1_t, beta2_t, eps):
+
+ self.v_W1_grad = beta1 * self.v_W1_grad + (1 - beta1) * self.W1_grad
+ self.v_W2_grad = beta1 * self.v_W2_grad + (1 - beta1) * self.W2_grad
+ self.v_W3_grad = beta1 * self.v_W3_grad + (1 - beta1) * self.W3_grad
+
+ v_W1_corr = self.v_W1_grad / (1 - beta1_t)
+ v_W2_corr = self.v_W2_grad / (1 - beta1_t)
+ v_W3_corr = self.v_W3_grad / (1 - beta1_t)
+
+ self.s_W1_grad = beta2 * self.s_W1_grad + (1 - beta2) * (self.W1_grad ** 2)
+ self.s_W2_grad = beta2 * self.s_W2_grad + (1 - beta2) * (self.W2_grad ** 2)
+ self.s_W3_grad = beta2 * self.s_W3_grad + (1 - beta2) * (self.W3_grad ** 2)
+
+ s_W1_corr = self.s_W1_grad / (1 - beta2_t)
+ s_W2_corr = self.s_W2_grad / (1 - beta2_t)
+ s_W3_corr = self.s_W3_grad / (1 - beta2_t)
+
+ self.W1 -= learning_rate * v_W1_corr / (np.sqrt(s_W1_corr) + eps)
+ self.W2 -= learning_rate * v_W2_corr / (np.sqrt(s_W2_corr) + eps)
+ self.W3 -= learning_rate * v_W3_corr / (np.sqrt(s_W3_corr) + eps)
+
diff --git a/assignment-2/submission/18307130341/numpy_mnist.py b/assignment-2/submission/18307130341/numpy_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..87aa55fdfd8a765cbad3203b3a9082c28c5c6502
--- /dev/null
+++ b/assignment-2/submission/18307130341/numpy_mnist.py
@@ -0,0 +1,91 @@
+import numpy as np
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, mini_batch, batch, get_torch_initialization, plot_curve, one_hot
+
+def mini_batch_numpy(dataset, batch_size=128):
+ data = []
+ label = []
+
+ for x in dataset:
+ data.append(np.array(x[0]))
+ label.append(x[1])
+
+ data = np.array(data)
+ label = np.array(label)
+
+ idx = np.random.permutation(len(dataset))
+ data = data[idx]
+ label = label[idx]
+
+    split_num = len(dataset) // batch_size   # number of full batches
+    split_pos = split_num * batch_size       # end position of the full batches
+
+    ret_data = np.split(data[:split_pos], split_num)
+    ret_label = np.split(label[:split_pos], split_num)
+
+    # append the remaining (smaller) batch, if any
+    if split_pos < len(dataset):
+        ret_data.append(data[split_pos:])
+        ret_label.append(label[split_pos:])
+
+ ret = list(zip(ret_data, ret_label))
+ return ret
+
+def numpy_run():
+
+ import time
+ start = time.time()
+
+ train_dataset, test_dataset = download_mnist()
+
+ model = NumpyModel()
+ numpy_loss = NumpyLoss()
+ model.W1, model.W2, model.W3 = get_torch_initialization()
+
+ train_loss = []
+
+ epoch_number = 3
+ learning_rate = 0.1
+
+    # Adam optimizer state: running bias-correction factors beta1^t and beta2^t
+ beta1 = 0.9
+ beta2 = 0.999
+ beta1_t = 1
+ beta2_t = 1
+
+ for epoch in range(epoch_number):
+        # Adam optimizer: update the bias-correction factors once per epoch
+ beta1_t *= beta1
+ beta2_t *= beta2
+
+ # for x, y in mini_batch_numpy(train_dataset): # mini_batch_numpy
+ for x, y in mini_batch(train_dataset):
+ y = one_hot(y)
+
+ # y_pred = model.forward(x) # mini_batch_numpy
+ y_pred = model.forward(x.numpy())
+ loss = numpy_loss.get_loss(y_pred, y)
+
+ model.backward(numpy_loss.backward())
+
+            # plain gradient descent
+ model.optimize(learning_rate)
+
+            # Momentum (uncomment to use instead of the plain update)
+ # model.optimize_Momentum(learning_rate, 0.9)
+
+            # Adam (uncomment to use instead of the plain update)
+ # model.optimize_Adam(learning_rate, beta1, beta2, beta1_t, beta2_t, 1e-8)
+
+ train_loss.append(loss.item())
+
+ x, y = batch(test_dataset)[0]
+ accuracy = np.mean((model.forward(x).argmax(axis=1) == y))
+ print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+ end = time.time()
+ print("time = %.2f s"%(end-start))
+
+ plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+ numpy_run()
diff --git a/assignment-2/submission/18307130341/tester_demo.py b/assignment-2/submission/18307130341/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..504b3eef50a6df4d0aa433113136add50835e420
--- /dev/null
+++ b/assignment-2/submission/18307130341/tester_demo.py
@@ -0,0 +1,182 @@
+import numpy as np
+import torch
+from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log
+
+from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss
+from torch_mnist import TorchModel
+from utils import get_torch_initialization, one_hot
+
+err_epsilon = 1e-6
+err_p = 0.4
+
+
+def check_result(numpy_result, torch_result=None):
+ if isinstance(numpy_result, list) and torch_result is None:
+ flag = True
+ for (n, t) in numpy_result:
+ flag = flag and check_result(n, t)
+ return flag
+ # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item())
+ T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item()
+ direction = T / torch_result.numel() < err_p
+ return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item()
+
+
+def case_1():
+ x = np.random.normal(size=[5, 6])
+ W = np.random.normal(size=[6, 4])
+
+ numpy_matmul = Matmul()
+ numpy_out = numpy_matmul.forward(x, W)
+ numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+ torch_W = torch.from_numpy(W).clone().requires_grad_()
+
+ torch_out = torch_matmul(torch_x, torch_W)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+ (numpy_x_grad, torch_x.grad),
+ (numpy_W_grad, torch_W.grad)
+ ])
+
+
+def case_2():
+ x = np.random.normal(size=[5, 6])
+
+ numpy_relu = Relu()
+ numpy_out = numpy_relu.forward(x)
+ numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_relu(torch_x)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+ (numpy_x_grad, torch_x.grad),
+ ])
+
+
+def case_3():
+ x = np.random.uniform(low=0.0, high=1.0, size=[3, 4])
+
+ numpy_log = Log()
+ numpy_out = numpy_log.forward(x)
+ numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_log(torch_x)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+
+ (numpy_x_grad, torch_x.grad),
+ ])
+
+
+def case_4():
+ x = np.random.normal(size=[4, 5])
+
+ numpy_softmax = Softmax()
+ numpy_out = numpy_softmax.forward(x)
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_softmax(torch_x, 1)
+
+ return check_result(numpy_out, torch_out)
+
+
+def case_5():
+ x = np.random.normal(size=[20, 25])
+
+ numpy_softmax = Softmax()
+ numpy_out = numpy_softmax.forward(x)
+ numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_softmax(torch_x, 1)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+ (numpy_x_grad, torch_x.grad),
+ ])
+
+
+def test_model():
+ try:
+ numpy_loss = NumpyLoss()
+ numpy_model = NumpyModel()
+ torch_model = TorchModel()
+ torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False)
+ numpy_model.W1 = torch_model.W1.detach().clone().numpy()
+ numpy_model.W2 = torch_model.W2.detach().clone().numpy()
+ numpy_model.W3 = torch_model.W3.detach().clone().numpy()
+
+ x = torch.randn((10000, 28, 28))
+ y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000)
+
+ y = one_hot(y, numpy=False)
+ x2 = x.numpy()
+ y_pred = torch_model.forward(x)
+ loss = (-y_pred * y).sum(dim=1).mean()
+ loss.backward()
+
+ y_pred_numpy = numpy_model.forward(x2)
+ numpy_loss.get_loss(y_pred_numpy, y.numpy())
+
+ check_flag_1 = check_result(y_pred_numpy, y_pred)
+ print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10))
+ except:
+ print("[Runtime Error in forward]")
+ print("+ {:12} {}/{}".format("forward", 0, 10))
+ return 0
+
+ try:
+
+ numpy_model.backward(numpy_loss.backward())
+
+ check_flag_2 = [
+ check_result(numpy_model.log_grad, torch_model.log_input.grad),
+ check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad),
+ check_result(numpy_model.W3_grad, torch_model.W3.grad),
+ check_result(numpy_model.W2_grad, torch_model.W2.grad),
+ check_result(numpy_model.W1_grad, torch_model.W1.grad)
+ ]
+ check_flag_2 = sum(check_flag_2) >= 4
+ print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20))
+ except:
+ print("[Runtime Error in backward]")
+ print("+ {:12} {}/{}".format("backward", 0, 20))
+ check_flag_2 = False
+
+ return 10 * check_flag_1 + 20 * check_flag_2
+
+
+if __name__ == "__main__":
+ testcases = [
+ ["matmul", case_1, 5],
+ ["relu", case_2, 5],
+ ["log", case_3, 5],
+ ["softmax_1", case_4, 5],
+ ["softmax_2", case_5, 10],
+ ]
+ score = 0
+ for case in testcases:
+ try:
+ res = case[2] if case[1]() else 0
+ except:
+ print("[Runtime Error in {}]".format(case[0]))
+ res = 0
+ score += res
+ print("+ {:12} {}/{}".format(case[0], res, case[2]))
+ score += test_model()
+ print("{:14} {}/60".format("FINAL SCORE", score))
diff --git a/assignment-2/submission/18307130341/torch_mnist.py b/assignment-2/submission/18307130341/torch_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..7bbcedcf8108c227e09c861a761c18e99a7f9429
--- /dev/null
+++ b/assignment-2/submission/18307130341/torch_mnist.py
@@ -0,0 +1,75 @@
+import torch
+from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve
+
+
+class TorchModel:
+
+ def __init__(self):
+ self.W1 = torch.randn((28 * 28, 256), requires_grad=True)
+ self.W2 = torch.randn((256, 64), requires_grad=True)
+ self.W3 = torch.randn((64, 10), requires_grad=True)
+ self.softmax_input = None
+ self.log_input = None
+
+ def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+ x = torch.relu(torch.matmul(x, self.W1))
+ x = torch.relu(torch.matmul(x, self.W2))
+ x = torch.matmul(x, self.W3)
+
+ self.softmax_input = x
+ self.softmax_input.retain_grad()
+
+ x = torch.softmax(x, 1)
+
+ self.log_input = x
+ self.log_input.retain_grad()
+
+ x = torch.log(x)
+
+ return x
+
+ def optimize(self, learning_rate):
+ with torch.no_grad():
+ self.W1 -= learning_rate * self.W1.grad
+ self.W2 -= learning_rate * self.W2.grad
+ self.W3 -= learning_rate * self.W3.grad
+
+ self.W1.grad = None
+ self.W2.grad = None
+ self.W3.grad = None
+
+
+def torch_run():
+ train_dataset, test_dataset = download_mnist()
+
+ model = TorchModel()
+ # model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False)
+ from utils import get_torch_initialization_numpy
+ model.W1.data, model.W2.data, model.W3.data = get_torch_initialization_numpy(numpy=False)
+
+ train_loss = []
+
+ epoch_number = 3
+ learning_rate = 0.1
+
+ for epoch in range(epoch_number):
+ for x, y in mini_batch(train_dataset, numpy=False):
+ y = one_hot(y, numpy=False)
+
+ y_pred = model.forward(x)
+ loss = (-y_pred * y).sum(dim=1).mean()
+ loss.backward()
+ model.optimize(learning_rate)
+
+ train_loss.append(loss.item())
+
+ x, y = batch(test_dataset, numpy=False)[0]
+ accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item()
+ print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+ plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+ torch_run()
diff --git a/assignment-2/submission/18307130341/utils.py b/assignment-2/submission/18307130341/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..5154f4970843623204198206ff0df1438bbee5df
--- /dev/null
+++ b/assignment-2/submission/18307130341/utils.py
@@ -0,0 +1,91 @@
+import torch
+import numpy as np
+from matplotlib import pyplot as plt
+
+
+def plot_curve(data):
+ plt.plot(range(len(data)), data, color='blue')
+ plt.legend(['loss_value'], loc='upper right')
+ plt.xlabel('step')
+ plt.ylabel('value')
+ plt.show()
+
+
+def download_mnist():
+ from torchvision import datasets, transforms
+
+ transform = transforms.Compose([
+ transforms.ToTensor(),
+ transforms.Normalize(mean=(0.1307,), std=(0.3081,))
+ ])
+
+ train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True)
+ test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True)
+
+ return train_dataset, test_dataset
+
+
+def one_hot(y, numpy=True):
+ if numpy:
+ y_ = np.zeros((y.shape[0], 10))
+ y_[np.arange(y.shape[0], dtype=np.int32), y] = 1
+ return y_
+ else:
+ y_ = torch.zeros((y.shape[0], 10))
+ y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1
+ return y_
+
+
+def batch(dataset, numpy=True):
+ data = []
+ label = []
+ for each in dataset:
+ data.append(each[0])
+ label.append(each[1])
+ data = torch.stack(data)
+ label = torch.LongTensor(label)
+ if numpy:
+ return [(data.numpy(), label.numpy())]
+ else:
+ return [(data, label)]
+
+
+def mini_batch(dataset, batch_size=128, numpy=False):
+ return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
+
+
+def get_torch_initialization(numpy=True):
+ fc1 = torch.nn.Linear(28 * 28, 256)
+ fc2 = torch.nn.Linear(256, 64)
+ fc3 = torch.nn.Linear(64, 10)
+
+ if numpy:
+ W1 = fc1.weight.T.detach().clone().numpy()
+ W2 = fc2.weight.T.detach().clone().numpy()
+ W3 = fc3.weight.T.detach().clone().numpy()
+ else:
+ W1 = fc1.weight.T.detach().clone().data
+ W2 = fc2.weight.T.detach().clone().data
+ W3 = fc3.weight.T.detach().clone().data
+
+ return W1, W2, W3
+
+def get_torch_initialization_numpy(numpy=True):
+ fan_in_1 = 28 * 28
+ fan_in_2 = 256
+ fan_in_3 = 64
+
+ bound1 = 1 / np.sqrt(fan_in_1)
+ bound2 = 1 / np.sqrt(fan_in_2)
+ bound3 = 1 / np.sqrt(fan_in_3)
+
+ W1 = np.random.uniform(-bound1, bound1, (28*28, 256))
+ W2 = np.random.uniform(-bound2, bound2, (256, 64))
+ W3 = np.random.uniform(-bound3, bound3, (64, 10))
+
+ if numpy == False:
+ W1 = torch.Tensor(W1)
+ W2 = torch.Tensor(W2)
+ W3 = torch.Tensor(W3)
+
+ return W1, W2, W3
\ No newline at end of file