diff --git a/assignment-2/submission/17307130331/README.md b/assignment-2/submission/17307130331/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..abd8de5834bacc838e1b813905da469a8d9168c3
--- /dev/null
+++ b/assignment-2/submission/17307130331/README.md
@@ -0,0 +1,343 @@
+# Lab Report
+
+陈疏桐 17307130331
+
+In this assignment I used numpy to implement the forward and backward computation of four operators (Matmul, Log, Softmax and Relu), built a classification model from them, and passed the automated tests. I also implemented the `mini_batch` function, trained and evaluated the model on the MNIST dataset with different learning rates and batch sizes, and discuss how both affect training. Finally, I implemented three optimizers, Momentum, RMSProp and Adam, and compared them with plain gradient descent.
+
+## Backward Propagation and Implementation of the Operators
+### Matmul
+
+Matmul is matrix multiplication; in the model it plays the role of a PyTorch linear layer. The forward computation is:
+
+$$ \mathrm{Y} = \mathrm{X}\mathrm{W} $$
+
+where $\mathrm{X}$ is the input matrix of shape $N \times d$, $\mathrm{W}$ is a weight matrix of shape $d \times d'$, and $\mathrm{Y}$ is the output matrix of shape $N \times d'$. The Matmul operator is therefore equivalent to a fully connected linear layer with input dimension $d$ and output dimension $d'$.
+
+Differentiating $\mathrm{Y}$ with respect to each input gives
+
+$$ \frac{\partial \mathrm{Y}}{\partial \mathrm{X}} = \frac{\partial \mathrm{X}\mathrm{W}}{\partial \mathrm{X}} = \mathrm{W}^T$$
+
+$$ \frac{\partial \mathrm{Y}}{\partial \mathrm{W}} = \frac{\partial \mathrm{X}\mathrm{W}}{\partial \mathrm{W}} = \mathrm{X}^T $$
+
+so by the chain rule the backward computation is:
+
+$$ \nabla{\mathrm{X}} = \nabla{\mathrm{Y}} \times \mathrm{W}^T $$
+$$ \nabla{\mathrm{W}} = \mathrm{X}^T \times \nabla{\mathrm{Y}} $$
+
+### Relu
+
+Relu is applied to every element of the input:
+
+$$ \mathrm{Y}_{ij}=
+\begin{cases}
+\mathrm{X}_{ij} & \mathrm{X}_{ij} \ge 0 \\\\
+0 & \text{otherwise}
+\end{cases}
+$$
+
+Each output $\mathrm{Y}_{ij}$ depends only on the corresponding input $\mathrm{X}_{ij}$, so the derivative of each input element involves only its own output:
+
+$$ \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}} =
+\begin{cases}
+1 & \mathrm{X}_{ij} \ge 0 \\\\
+0 & \text{otherwise}
+\end{cases}$$
+
+By the chain rule, the gradient of the input is:
+
+$$ \nabla{\mathrm{X}_{ij}} = \nabla{\mathrm{Y}_{ij}} \times \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}}$$
+
+### Log
+
+The Log operator computes:
+
+$$ \mathrm{Y}_{ij} = \log(\mathrm{X}_{ij} + \epsilon)$$
+
+$$ \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}} = \frac{1}{\mathrm{X}_{ij} + \epsilon} $$
+
+and, similarly, the backward computation is:
+
+$$ \nabla{\mathrm{X}_{ij}} = \nabla{\mathrm{Y}_{ij}} \times \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}}$$
+
+### Softmax
+
+Softmax is computed over the last dimension of the input $\mathrm{X}$. The forward computation is:
+
+$$ \mathrm{Y}_{ij} = \frac{e^{\mathrm{X}_{ij}}}{\sum_{k} e^{\mathrm{X}_{ik}}}$$
+
+Each row of the output is computed independently of the other rows, while within one row every output depends on every input of that row. Taking row $k$ as an example, the derivative of an output element with respect to an input element is:
+
+$$\frac{\partial Y_{ki}}{\partial X_{kj}} = \begin{cases}
+\frac{e^{X_{kj}} \sum_{t \ne j} e^{X_{kt}}}{(\sum_{t} e^{X_{kt}})^2} = Y_{kj}(1-Y_{kj}) & i = j \\\\
+-\frac{e^{X_{ki}} e^{X_{kj}}}{(\sum_t e^{X_{kt}})^2} = -Y_{ki} Y_{kj} & i \ne j
+\end{cases}$$
+
+This gives, for every row, the Jacobian matrix $\mathrm{J}_{k}$ between the output row $\mathrm{Y}_{k}$ and the input row $\mathrm{X}_{k}$, with $(\mathrm{J}_{k})_{ij} = \frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}}$.
+
+The gradient flowing into an input element $\mathrm{X}_{kj}$ collects the contributions of every output element of the same row, so by the chain rule:
+
+$$ \nabla \mathrm{X}_{kj} = \sum_{i} \frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}} \, \nabla \mathrm{Y}_{ki}$$
+
+which, written per row (using the fact that $\mathrm{J}_{k}$ is symmetric), is:
+
+$$ \nabla \mathrm{X}_{k} = \mathrm{J}_{k} \times \nabla \mathrm{Y}_{k} $$
+
+In the implementation, numpy's `matmul` multiplies over the last two dimensions, so the per-row Jacobians can be stacked and multiplied with the corresponding gradient rows in a single batched call.
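+
+As a minimal illustration (the helper names below exist only for this sketch), the backward rule can be written as a batched Jacobian product, which is the approach taken by the `Softmax` operator in `numpy_fnn.py`:
+
+```
+import numpy as np
+
+def softmax_forward(x):
+    # softmax over the last dimension, computed row by row
+    exp_x = np.exp(x)
+    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
+
+def softmax_backward(out, grad_y):
+    # out: forward output of shape (N, c); grad_y: upstream gradient, same shape
+    # per-row Jacobian: J[k] = diag(out[k]) - outer(out[k], out[k])
+    jacob = np.array([np.diag(r) - np.outer(r, r) for r in out])  # (N, c, c)
+    # batched (1, c) x (c, c) product over the leading dimension
+    return np.matmul(grad_y[:, np.newaxis, :], jacob).squeeze(1)  # (N, c)
+```
+
+Because each row's Jacobian is symmetric, multiplying the gradient row from the left gives the same result as $\mathrm{J}_{k} \times \nabla \mathrm{Y}_{k}$.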
+
+## Model Construction and Training
+### Model Construction
+
+Following `TorchModel` in `torch_mnist.py`, the numpy model is built by simply replacing the PyTorch operators with the ones implemented above:
+```
+def forward(self, x):
+    x = x.reshape(-1, 28 * 28)
+
+    x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
+    x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
+
+    x = self.matmul_3.forward(x, self.W3)
+
+    x = self.softmax.forward(x)
+    x = self.log.forward(x)
+
+    return x
+```
+
+The computation graph of the model is:
+![compu_graph](img/compu_graph.png)
+
+Applying the chain rule along this graph gives the gradient of every leaf variable ($\mathrm{W}_{1}, \mathrm{W}_{2}, \mathrm{W}_{3}, \mathrm{X}$) as well as of the intermediate variables.
+
+The backward computation graph is:
+![backpropagation](img/backgraph.png)
+
+The gradients are computed by following this graph:
+```
+def backward(self, y):
+    self.log_grad = self.log.backward(y)
+    self.softmax_grad = self.softmax.backward(self.log_grad)
+    self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+    self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+    self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+    self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+    self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+```
+
+### MiniBatch
+
+The `mini_batch` function provided in `utils` simply calls PyTorch's `DataLoader`. `DataLoader` is responsible for reading samples from a dataset and assembling them into batches. Using it directly makes it easy to prefetch data in parallel with multiple workers, which speeds up training and saves code. `DataLoader` also accepts a custom `Sampler` to draw samples from the dataset in different ways, and a custom `BatchSampler` to group the drawn samples into batches; this enables operations such as zero-padding samples within a batch or controlling the ratio of positive and negative samples per batch.
+
+Here we reimplement `mini_batch` to mimic the default behaviour of `DataLoader`:
+```
+def mini_batch(dataset, batch_size=128):
+    data = np.array([each[0].numpy() for each in dataset])  # convert the dataset to numpy arrays first
+    label = np.array([each[1] for each in dataset])
+
+    data_size = data.shape[0]
+    idx = np.array([i for i in range(data_size)])
+    np.random.shuffle(idx)  # shuffle the sample order
+
+    # equivalent to DataLoader's BatchSampler, but all batches are built at once
+    return [(data[idx[i: i+batch_size]], label[idx[i:i+batch_size]])
+            for i in range(0, data_size, batch_size)]
+```
+
+### Model Training
+
+The model is trained with `epoch=10`, `learning_rate=0.1` and `batch_size=128`. Each step fits one batch of data: the forward pass computes the output, the loss is computed from the output, `loss.backward` returns the derivative of the loss with respect to the model output (i.e. the gradient at the output), the model's `backward` then backpropagates it through the network, and finally the model's `optimize` updates the parameters.
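+
+A condensed sketch of this training loop, using the `mini_batch` defined above and the helper functions from this submission (`numpy_mnist.py` below follows the same structure, with an Adam optimizer in place of `optimize`):
+
+```
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, get_torch_initialization, one_hot
+
+train_dataset, test_dataset = download_mnist()
+model = NumpyModel()
+numpy_loss = NumpyLoss()
+model.W1, model.W2, model.W3 = get_torch_initialization()
+
+for epoch in range(10):
+    for x, y in mini_batch(train_dataset, batch_size=128):
+        y = one_hot(y)
+        y_pred = model.forward(x)               # forward pass
+        loss = numpy_loss.get_loss(y_pred, y)   # cross-entropy on the log-softmax output
+        model.backward(numpy_loss.backward())   # gradient w.r.t. the output, then backprop
+        model.optimize(0.1)                     # plain gradient-descent update
+```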
+
+The training curve:
+![train10](img/train10.png)
+
+Test accuracy per epoch:
+```
+[0] Test Accuracy: 0.9437
+[1] Test Accuracy: 0.9651
+[2] Test Accuracy: 0.9684
+[3] Test Accuracy: 0.9730
+[4] Test Accuracy: 0.9755
+[5] Test Accuracy: 0.9775
+[6] Test Accuracy: 0.9778
+[7] Test Accuracy: 0.9766
+[8] Test Accuracy: 0.9768
+[9] Test Accuracy: 0.9781
+```
+
+Raising `learning_rate` to 0.2 and retraining:
+![train02](img/train02.png)
+
+Test accuracy per epoch:
+```
+[0] Test Accuracy: 0.9621
+[1] Test Accuracy: 0.9703
+[2] Test Accuracy: 0.9753
+[3] Test Accuracy: 0.9740
+[4] Test Accuracy: 0.9787
+[5] Test Accuracy: 0.9756
+[6] Test Accuracy: 0.9807
+[7] Test Accuracy: 0.9795
+[8] Test Accuracy: 0.9814
+[9] Test Accuracy: 0.9825
+```
+
+With this slightly larger learning rate the parameter updates are larger early in training, so the loss falls faster and the model converges earlier; for the same number of iterations the test accuracy is higher.
+
+Raising `learning_rate` to 0.3 and retraining:
+![train03](img/train03.png)
+
+```
+[0] Test Accuracy: 0.9554
+[1] Test Accuracy: 0.9715
+[2] Test Accuracy: 0.9744
+[3] Test Accuracy: 0.9756
+[4] Test Accuracy: 0.9782
+[5] Test Accuracy: 0.9795
+[6] Test Accuracy: 0.9801
+[7] Test Accuracy: 0.9816
+[8] Test Accuracy: 0.9828
+[9] Test Accuracy: 0.9778
+```
+
+At 0.3 the loss falls about as fast as in the previous run early on, but later the overly large steps make the weights move around the local minimum with too large an amplitude instead of settling into it: the loss oscillates and the model struggles to converge. The test accuracy first climbs to 0.9828 and then drops.
+
+For a batch size of 128, 0.2 therefore appears to be a suitable learning rate.
+
+Next, keeping the learning rate at 0.2 and changing `batch_size` to 256:
+![train256](img/train256.png)
+```
+[0] Test Accuracy: 0.9453
+[1] Test Accuracy: 0.9621
+[2] Test Accuracy: 0.9657
+[3] Test Accuracy: 0.9629
+[4] Test Accuracy: 0.9733
+[5] Test Accuracy: 0.9766
+[6] Test Accuracy: 0.9721
+[7] Test Accuracy: 0.9768
+[8] Test Accuracy: 0.9724
+[9] Test Accuracy: 0.9775
+```
+
+With a larger batch size the parameters are still updated once per batch, so updates happen less often and convergence slows down somewhat; comparing the loss curve with the previous runs, however, the oscillation is smaller.
+
+Reducing `batch_size` to 64:
+![train64](img/train64.png)
+```
+[0] Test Accuracy: 0.9526
+[1] Test Accuracy: 0.9674
+[2] Test Accuracy: 0.9719
+[3] Test Accuracy: 0.9759
+[4] Test Accuracy: 0.9750
+[5] Test Accuracy: 0.9748
+[6] Test Accuracy: 0.9772
+[7] Test Accuracy: 0.9791
+[8] Test Accuracy: 0.9820
+[9] Test Accuracy: 0.9823
+```
+
+The loss now falls faster, but its oscillation grows.
+
+In summary: within a certain range, a larger learning rate speeds up convergence, and a smaller batch size also speeds it up somewhat at the cost of larger oscillation. A learning rate that is too large makes the loss oscillate late in training and prevents convergence, while one that is too small makes the loss fall too slowly and may even leave the model stuck in a local minimum, missing a better one.
+
+## Other Optimization Methods
+
+### Momentum
+
+With plain gradient descent every update depends only on the gradient of the current batch, so the update direction can be swayed by a few unusual inputs. Momentum introduces a velocity term so that the current update depends not only on the current gradient but also on earlier ones, preserving the recent trend for a while. The update rule is:
+
+$$
+\begin{align}
+& v = \alpha v - \gamma \frac{\partial L}{\partial W} \\\\
+& W = W + v
+\end{align}
+$$
+
+We implement Momentum in the model in `numpy_fnn.py`. Setting the learning rate to 0.02 and the batch size to 128:
+![momentum](img/momentum.png)
+```
+[0] Test Accuracy: 0.9586
+[1] Test Accuracy: 0.9717
+[2] Test Accuracy: 0.9743
+[3] Test Accuracy: 0.9769
+[4] Test Accuracy: 0.9778
+[5] Test Accuracy: 0.9786
+[6] Test Accuracy: 0.9782
+[7] Test Accuracy: 0.9809
+[8] Test Accuracy: 0.9790
+[9] Test Accuracy: 0.9818
+```
+
+Compared with plain gradient descent, momentum does not necessarily give a better final result. When the current gradient points in the same direction as the accumulated velocity, the parameters receive a larger adjustment, so the loss falls faster; since the velocity mostly builds up early in training, an overly large learning rate easily makes the updates blow up, which is why a suitable learning rate for momentum is roughly an order of magnitude smaller than for plain gradient descent. Conversely, when the gradient direction is wrong, the accumulated velocity delays the correction and the parameters can overshoot the minimum.
+
+### RMSProp
+
+RMSProp introduces an adaptive learning rate. Early in training the learning rate should be large so that the loss drops quickly, but as the iterations go on it should keep shrinking so that the model can converge well. The basic idea is to adapt the rate according to the gradients: the larger the gradients, the faster the effective learning rate decays; later, when the gradients become small, the decay slows down.
+
+To avoid decaying the learning rate too quickly early on, RMSProp uses an exponential moving average that slowly forgets the older gradient history:
+
+$$
+\begin{align}
+& h = \rho h + (1-\rho) \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W} \\\\
+& W = W - \gamma \frac{1}{\sqrt{h + \delta}} \frac{\partial L}{\partial W}
+\end{align}$$
+
+With a learning rate of 0.001 and `weight_decay` set to 0.01, training and testing again:
+![rmsprop](img/rmsprop.png)
+
+```
+[0] Test Accuracy: 0.9663
+[1] Test Accuracy: 0.9701
+[2] Test Accuracy: 0.9758
+[3] Test Accuracy: 0.9701
+[4] Test Accuracy: 0.9748
+[5] Test Accuracy: 0.9813
+[6] Test Accuracy: 0.9813
+[7] Test Accuracy: 0.9819
+[8] Test Accuracy: 0.9822
+[9] Test Accuracy: 0.9808
+```
+
+In the middle of training the loss oscillates less than with plain gradient descent, and the model converges faster early on, but towards the end there is no clear advantage over plain gradient descent.
+
+### Adam
+
+Adam combines momentum with adaptive learning-rate scaling. It first computes estimates of the first and second moments of the gradient, corresponding to the momentum part and the adaptive part respectively:
+
+$$
+\begin{align}
+& \mathrm{m} = \beta_1 \mathrm{m} + (1-\beta_1) \frac{\partial L}{\partial W} \\\\
+& \mathrm{v} = \beta_2 \mathrm{v} + (1-\beta_2) \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}
+\end{align}
+$$
+
+Both estimates are then bias-corrected:
+
+$$
+\begin{align}
+& \mathrm{\hat{m}} = \frac{\mathrm{m}}{1-\beta_1 ^ t }\\\\
+& \mathrm{\hat{v}} = \frac{\mathrm{v}}{1-\beta_2 ^ t}
+\end{align}
+$$
+
+Finally, the parameters are updated as:
+$$ W = W - \gamma \frac{\mathrm{\hat m}}{\sqrt{\mathrm{\hat v}} + \delta}$$
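+
+A per-parameter sketch of this update; it mirrors the `Adam` helper class defined in `numpy_mnist.py` below:
+
+```
+class Adam:
+    def __init__(self, param, learning_rate=0.001, beta_1=0.9, beta_2=0.999):
+        self.param = param          # the weight matrix, updated in place
+        self.iter = 0               # timestep t, used for bias correction
+        self.m = 0                  # first-moment (momentum) estimate
+        self.v = 0                  # second-moment (adaptive) estimate
+        self.beta1, self.beta2 = beta_1, beta_2
+        self.lr = learning_rate
+
+    def optimize(self, grad):
+        self.iter += 1
+        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
+        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
+        m_hat = self.m / (1 - self.beta1 ** self.iter)   # bias-corrected m
+        v_hat = self.v / (1 - self.beta2 ** self.iter)   # bias-corrected v
+        self.param -= self.lr * m_hat / (v_hat ** 0.5 + 1e-8)
+        return self.param
+```
+
+In `numpy_mnist.py`, one `Adam` instance is created for each of `W1`, `W2` and `W3`, and `optimize` is called with the corresponding gradient after `model.backward`.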
+
+With a learning rate of 0.001 and a batch size of 128, training gives:
+![adam](img/train_adam.png)
+```
+[0] Test Accuracy: 0.9611
+[1] Test Accuracy: 0.9701
+[2] Test Accuracy: 0.9735
+[3] Test Accuracy: 0.9752
+[4] Test Accuracy: 0.9787
+[5] Test Accuracy: 0.9788
+[6] Test Accuracy: 0.9763
+[7] Test Accuracy: 0.9790
+[8] Test Accuracy: 0.9752
+[9] Test Accuracy: 0.9806
+```
+
+Compared with plain gradient descent, the loss oscillation is slightly smaller and the loss falls slightly faster early on, but the final convergence speed is about the same.
\ No newline at end of file
diff --git a/assignment-2/submission/17307130331/img/backgraph.png b/assignment-2/submission/17307130331/img/backgraph.png new file mode 100644 index 0000000000000000000000000000000000000000..c4a70b28e869708641bd01dba83730ed62ab9c4d Binary files /dev/null and b/assignment-2/submission/17307130331/img/backgraph.png differ
diff --git a/assignment-2/submission/17307130331/img/compu_graph.png b/assignment-2/submission/17307130331/img/compu_graph.png new file mode 100644 index 0000000000000000000000000000000000000000..74f02ff1b4c4795c99600fb2e358d23a170f11c1 Binary files /dev/null and b/assignment-2/submission/17307130331/img/compu_graph.png differ
diff --git a/assignment-2/submission/17307130331/img/momentum.png b/assignment-2/submission/17307130331/img/momentum.png new file mode 100644 index 0000000000000000000000000000000000000000..152bfe4eda8bf98cb271e9e3af3801f223273ec2 Binary files /dev/null and b/assignment-2/submission/17307130331/img/momentum.png differ
diff --git a/assignment-2/submission/17307130331/img/rmsprop.png b/assignment-2/submission/17307130331/img/rmsprop.png new file mode 100644 index 0000000000000000000000000000000000000000..d4c9f6d651ea0dcac312c3a7dcb38266a477679c Binary files /dev/null and b/assignment-2/submission/17307130331/img/rmsprop.png differ
diff --git a/assignment-2/submission/17307130331/img/train.png b/assignment-2/submission/17307130331/img/train.png new file mode 100644 index 0000000000000000000000000000000000000000..618816332b78c4f0498444a42dd2a5028df91ef1 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train.png differ
diff --git a/assignment-2/submission/17307130331/img/train02.png b/assignment-2/submission/17307130331/img/train02.png new file mode 100644 index 0000000000000000000000000000000000000000..a2cbc7b9ccbf2f28955902b86881d7a640f50fa7 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train02.png differ
diff --git a/assignment-2/submission/17307130331/img/train03.png b/assignment-2/submission/17307130331/img/train03.png new file mode 100644 index 0000000000000000000000000000000000000000..41dd8fd9060e6774b983375f3b025ee6335b9f66 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train03.png differ
diff --git a/assignment-2/submission/17307130331/img/train10.png b/assignment-2/submission/17307130331/img/train10.png new file mode 100644 index 0000000000000000000000000000000000000000..a2056ba0d21f8f40fc0279e532fd6b9f1ff79cef Binary files /dev/null and b/assignment-2/submission/17307130331/img/train10.png differ
diff --git a/assignment-2/submission/17307130331/img/train256.png b/assignment-2/submission/17307130331/img/train256.png new file mode 100644 index 0000000000000000000000000000000000000000..81aa1b2bcc7f708607f8c402f9f41d579793f9e1 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train256.png differ
diff --git a/assignment-2/submission/17307130331/img/train64.png b/assignment-2/submission/17307130331/img/train64.png new file mode 100644 index 0000000000000000000000000000000000000000..8f34749c6fda428437ff3fe11292b0213eca0d7a Binary files /dev/null and b/assignment-2/submission/17307130331/img/train64.png differ
diff --git 
a/assignment-2/submission/17307130331/img/train_adam.png b/assignment-2/submission/17307130331/img/train_adam.png new file mode 100644 index 0000000000000000000000000000000000000000..eefa8b27deb6485f895033add750f018fd14e293 Binary files /dev/null and b/assignment-2/submission/17307130331/img/train_adam.png differ diff --git a/assignment-2/submission/17307130331/img/trainloss.png b/assignment-2/submission/17307130331/img/trainloss.png new file mode 100644 index 0000000000000000000000000000000000000000..b845297f03d5d6e6ae2b026b25554519a77f471b Binary files /dev/null and b/assignment-2/submission/17307130331/img/trainloss.png differ diff --git a/assignment-2/submission/17307130331/numpy_fnn.py b/assignment-2/submission/17307130331/numpy_fnn.py new file mode 100644 index 0000000000000000000000000000000000000000..7b32d95b7825b4787f5d226ac058c0039aee4bba --- /dev/null +++ b/assignment-2/submission/17307130331/numpy_fnn.py @@ -0,0 +1,208 @@ +import numpy as np + + +class NumpyOp: + + def __init__(self): + self.memory = {} + self.epsilon = 1e-12 + + +class Matmul(NumpyOp): + + def forward(self, x, W): + """ + x: shape(N, d) + w: shape(d, d') + """ + self.memory['x'] = x + self.memory['W'] = W + h = np.matmul(x, W) + return h + + def backward(self, grad_y): + """ + grad_y: shape(N, d') + """ + + #################### + # code 1 # + grad_W = np.matmul(self.memory['x'].T, grad_y) + grad_x = np.matmul(grad_y, self.memory['W'].T) + #################### + + return grad_x, grad_W + + +class Relu(NumpyOp): + + def forward(self, x): + self.memory['x'] = x + return np.where(x > 0, x, np.zeros_like(x)) + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 2 # + #################### + grad_x = np.where(self.memory['x'] > 0, np.ones_like(self.memory['x']), np.zeros_like(self.memory['x'])) * grad_y # 元素乘积 + + return grad_x + + +class Log(NumpyOp): + + def forward(self, x): + """ + x: shape(N, c) + """ + + out = np.log(x + self.epsilon) + self.memory['x'] = x + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 3 # + #################### + grad_x = (1/(self.memory['x'] + self.epsilon)) * grad_y + return grad_x + + +class Softmax(NumpyOp): + """ + softmax over last dimension + """ + + def forward(self, x): + """ + x: shape(N, c) + """ + + #################### + # code 4 # + #################### + exp_x = np.exp(x) + out = exp_x/np.sum(exp_x, axis=1, keepdims=True) + self.memory['x'] = x + self.memory['out'] = out + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + o = self.memory['out'] + Jacob = np.array([np.diag(r) - np.outer(r, r) for r in o]) + # i!=j - oi* oj + # i==j oi*(1-oi) + grad_y = grad_y[:, np.newaxis, :] + grad_x = np.matmul(grad_y, Jacob).squeeze(1) + #print(grad_x.shape) + #print(grad_x) + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 
中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + # 以下变量是在 momentum\rmsprop中使用的 + self.v1 = np.zeros_like(self.W1) + self.v2 = np.zeros_like(self.W2) + self.v3 = np.zeros_like(self.W3) + + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + x = self.relu_1.forward(self.matmul_1.forward(x, self.W1)) + x = self.relu_2.forward(self.matmul_2.forward(x, self.W2)) + + x = self.matmul_3.forward(x, self.W3) + + x = self.softmax.forward(x) + x = self.log.forward(x) + + return x + + def backward(self, y): + self.log_grad = self.log.backward(y) + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad) + + + def optimize(self, learning_rate): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def momentum(self, learning_rate, alpha=0.9): + self.v1 = self.v1 * alpha - learning_rate * self.W1_grad + self.v2 = self.v2 * alpha - learning_rate * self.W2_grad + self.v3 = self.v3 * alpha - learning_rate * self.W3_grad + + self.W1 += self.v1 + self.W2 += self.v2 + self.W3 += self.v3 + + def RMSProp(self, learning_rate, weight_decay = 0.99): + self.v1 = self.v1 * weight_decay + (1-weight_decay) * self.W1_grad * self.W1_grad + self.v2 = self.v2 * weight_decay + (1-weight_decay) * self.W2_grad * self.W2_grad + self.v3 = self.v3 * weight_decay + (1-weight_decay) * self.W3_grad * self.W3_grad + + self.W1 = self.W1 - learning_rate * self.W1_grad / np.sqrt( self.v1 + 1e-7) + self.W2 = self.W2 - learning_rate * self.W2_grad / np.sqrt( self.v2 + 1e-7) + self.W3 = self.W3 - learning_rate * self.W3_grad / np.sqrt( self.v3 + 1e-7) + + + + + + + \ No newline at end of file diff --git a/assignment-2/submission/17307130331/numpy_mnist.py b/assignment-2/submission/17307130331/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..4187f01eeebbbcd6ab48bfacf8dedc37085e46e2 --- /dev/null +++ b/assignment-2/submission/17307130331/numpy_mnist.py @@ -0,0 +1,70 @@ +import numpy as np +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot + +def mini_batch(dataset, batch_size=128): + data = np.array([each[0].numpy() for each in dataset]) + label = np.array([each[1] for each in dataset]) + + data_size = data.shape[0] + idx = np.array([i for i in range(data_size)]) + np.random.shuffle(idx) + + return [(data[idx[i: i+batch_size]], label[idx[i:i+batch_size]]) for i in range(0, data_size, batch_size)] + +class Adam(): + def __init__(self, param, learning_rate=0.001, beta_1=0.9, beta_2=0.999): + self.param = param + self.iter = 0 + self.m = 0 + self.v = 0 + self.beta1 = beta_1 + self.beta2 = beta_2 + self.lr = learning_rate + def optimize(self, grad): + self.iter+=1 + self.m = self.beta1 * self.m + (1 - self.beta1) * grad + self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad + m_hat = self.m / (1 - self.beta1 ** self.iter) + v_hat = self.v / (1 - self.beta2 ** self.iter) + self.param 
-= self.lr * m_hat / (v_hat ** 0.5 + 1e-8) + return self.param + +def numpy_run(): + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = NumpyLoss() + model.W1, model.W2, model.W3 = get_torch_initialization() + + W1_opt, W2_opt, W3_opt = Adam(model.W1), Adam(model.W2), Adam(model.W3) + + train_loss = [] + + epoch_number = 10 + learning_rate = 0.0015 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, batch_size=128): + y = one_hot(y) + + y_pred = model.forward(x) + loss = numpy_loss.get_loss(y_pred, y) + + model.backward(numpy_loss.backward()) + #model.Adam(learning_rate) + W1_opt.optimize(model.W1_grad) + W2_opt.optimize(model.W2_grad) + W3_opt.optimize(model.W3_grad) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset)[0] + accuracy = np.mean((model.forward(x).argmax(axis=1) == y)) + print('[{}] Test Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + numpy_run() diff --git a/assignment-2/submission/17307130331/tester_demo.py b/assignment-2/submission/17307130331/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..515b86c1240eebad83287461548530c944f23bc8 --- /dev/null +++ b/assignment-2/submission/17307130331/tester_demo.py @@ -0,0 +1,182 @@ +import numpy as np +import torch +from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log + +from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss +from torch_mnist import TorchModel +from utils import get_torch_initialization, one_hot + +err_epsilon = 1e-6 +err_p = 0.4 + + +def check_result(numpy_result, torch_result=None): + if isinstance(numpy_result, list) and torch_result is None: + flag = True + for (n, t) in numpy_result: + flag = flag and check_result(n, t) + return flag + # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item()) + T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item() + direction = T / torch_result.numel() < err_p + return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item() + + +def case_1(): + x = np.random.normal(size=[5, 6]) + W = np.random.normal(size=[6, 4]) + + numpy_matmul = Matmul() + numpy_out = numpy_matmul.forward(x, W) + numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + torch_W = torch.from_numpy(W).clone().requires_grad_() + + torch_out = torch_matmul(torch_x, torch_W) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + (numpy_W_grad, torch_W.grad) + ]) + + +def case_2(): + x = np.random.normal(size=[5, 6]) + + numpy_relu = Relu() + numpy_out = numpy_relu.forward(x) + numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_relu(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def case_3(): + x = np.random.uniform(low=0.0, high=1.0, size=[3, 4]) + + numpy_log = Log() + numpy_out = numpy_log.forward(x) + numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_log(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + + (numpy_x_grad, torch_x.grad), + ]) + + +def case_4(): + x = 
np.random.normal(size=[4, 5]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + + return check_result(numpy_out, torch_out) + + +def case_5(): + x = np.random.normal(size=[20, 25]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def test_model(): + try: + numpy_loss = NumpyLoss() + numpy_model = NumpyModel() + torch_model = TorchModel() + torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False) + numpy_model.W1 = torch_model.W1.detach().clone().numpy() + numpy_model.W2 = torch_model.W2.detach().clone().numpy() + numpy_model.W3 = torch_model.W3.detach().clone().numpy() + + x = torch.randn((10000, 28, 28)) + y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000) + + y = one_hot(y, numpy=False) + x2 = x.numpy() + y_pred = torch_model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + + y_pred_numpy = numpy_model.forward(x2) + numpy_loss.get_loss(y_pred_numpy, y.numpy()) + + check_flag_1 = check_result(y_pred_numpy, y_pred) + print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10)) + except: + print("[Runtime Error in forward]") + print("+ {:12} {}/{}".format("forward", 0, 10)) + return 0 + + try: + + numpy_model.backward(numpy_loss.backward()) + + check_flag_2 = [ + check_result(numpy_model.log_grad, torch_model.log_input.grad), + check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad), + check_result(numpy_model.W3_grad, torch_model.W3.grad), + check_result(numpy_model.W2_grad, torch_model.W2.grad), + check_result(numpy_model.W1_grad, torch_model.W1.grad) + ] + check_flag_2 = sum(check_flag_2) >= 4 + print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20)) + except: + print("[Runtime Error in backward]") + print("+ {:12} {}/{}".format("backward", 0, 20)) + check_flag_2 = False + + return 10 * check_flag_1 + 20 * check_flag_2 + + +if __name__ == "__main__": + testcases = [ + ["matmul", case_1, 5], + ["relu", case_2, 5], + ["log", case_3, 5], + ["softmax_1", case_4, 5], + ["softmax_2", case_5, 10], + ] + score = 0 + for case in testcases: + try: + res = case[2] if case[1]() else 0 + except: + print("[Runtime Error in {}]".format(case[0])) + res = 0 + score += res + print("+ {:12} {}/{}".format(case[0], res, case[2])) + score += test_model() + print("{:14} {}/60".format("FINAL SCORE", score)) diff --git a/assignment-2/submission/17307130331/torch_mnist.py b/assignment-2/submission/17307130331/torch_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..6d3e214c7606e3d43dac4b94554f942508afffb3 --- /dev/null +++ b/assignment-2/submission/17307130331/torch_mnist.py @@ -0,0 +1,73 @@ +import torch +from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve + + +class TorchModel: + + def __init__(self): + self.W1 = torch.randn((28 * 28, 256), requires_grad=True) + self.W2 = torch.randn((256, 64), requires_grad=True) + self.W3 = torch.randn((64, 10), requires_grad=True) + self.softmax_input = None + self.log_input = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + x = torch.relu(torch.matmul(x, self.W1)) + x 
= torch.relu(torch.matmul(x, self.W2)) + x = torch.matmul(x, self.W3) + + self.softmax_input = x + self.softmax_input.retain_grad() + + x = torch.softmax(x, 1) + + self.log_input = x + self.log_input.retain_grad() + + x = torch.log(x) + + return x + + def optimize(self, learning_rate): + with torch.no_grad(): + self.W1 -= learning_rate * self.W1.grad + self.W2 -= learning_rate * self.W2.grad + self.W3 -= learning_rate * self.W3.grad + + self.W1.grad = None + self.W2.grad = None + self.W3.grad = None + + +def torch_run(): + train_dataset, test_dataset = download_mnist() + + model = TorchModel() + model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False) + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, numpy=False): + y = one_hot(y, numpy=False) + + y_pred = model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset, numpy=False)[0] + accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item() + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + torch_run() diff --git a/assignment-2/submission/17307130331/utils.py b/assignment-2/submission/17307130331/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..709220cfa7a924d914ec1c098c505f864bcd4cfc --- /dev/null +++ b/assignment-2/submission/17307130331/utils.py @@ -0,0 +1,71 @@ +import torch +import numpy as np +from matplotlib import pyplot as plt + + +def plot_curve(data): + plt.plot(range(len(data)), data, color='blue') + plt.legend(['loss_value'], loc='upper right') + plt.xlabel('step') + plt.ylabel('value') + plt.show() + + +def download_mnist(): + from torchvision import datasets, transforms + + transform = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean=(0.1307,), std=(0.3081,)) + ]) + + train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True) + test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True) + + return train_dataset, test_dataset + + +def one_hot(y, numpy=True): + if numpy: + y_ = np.zeros((y.shape[0], 10)) + y_[np.arange(y.shape[0], dtype=np.int32), y] = 1 + return y_ + else: + y_ = torch.zeros((y.shape[0], 10)) + y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1 + return y_ + + +def batch(dataset, numpy=True): + data = [] + label = [] + for each in dataset: + data.append(each[0]) + label.append(each[1]) + data = torch.stack(data) + label = torch.LongTensor(label) + if numpy: + return [(data.numpy(), label.numpy())] + else: + return [(data, label)] + + +def mini_batch(dataset, batch_size=128, numpy=False): + return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) + + +def get_torch_initialization(numpy=True): + fc1 = torch.nn.Linear(28 * 28, 256) + fc2 = torch.nn.Linear(256, 64) + fc3 = torch.nn.Linear(64, 10) + + if numpy: + W1 = fc1.weight.T.detach().clone().numpy() + W2 = fc2.weight.T.detach().clone().numpy() + W3 = fc3.weight.T.detach().clone().numpy() + else: + W1 = fc1.weight.T.detach().clone().data + W2 = fc2.weight.T.detach().clone().data + W3 = fc3.weight.T.detach().clone().data + + return W1, W2, W3