diff --git a/assignment-2/submission/18307130341/README.md b/assignment-2/submission/18307130341/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a859dde38f467fbbb852f830eda2ded657e48952
--- /dev/null
+++ b/assignment-2/submission/18307130341/README.md
@@ -0,0 +1,441 @@
+# Lab Report: Assignment 2, Topic 1 - FNN
+
+18307130341 黄韵澄
+
+[toc]
+
+### 1. Overview
+
+Implement a feed-forward neural network (FNN) and train it on the MNIST dataset to solve the handwritten-digit classification problem.
+
+### 2. Forward and backward passes of the operators
+
+The network is built from the Matmul, Relu, Log, and Softmax operators.
+
+#### 2.1 Matmul backward pass
+
+$$
+loss = f(X\times W) = f(Y)
+$$
+
+By the chain rule:
+$$
+\frac{\partial loss}{\partial X_{p,q}} = \sum_{i,j}{\frac{\partial loss}{\partial Y_{i,j}}\frac{\partial Y_{i,j}}{\partial X_{p,q}}}
+$$
+By the definition of matrix multiplication:
+$$
+Y_{i,j} = \sum_{k}{X_{i,k}W_{k,j}}
+$$
+Therefore, when $i\neq p$, $Y_{i,j}$ does not depend on $X_{p,q}$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{p,q}} =\begin{cases}W_{q,j}\quad i=p \\\\ 0 \quad i\neq p\end{cases}
+$$
+Substituting back:
+$$
+\frac{\partial loss}{\partial X_{p,q}} = \sum_{i,j}\frac{\partial loss}{\partial Y_{i,j}}\frac{\partial Y_{i,j}}{\partial X_{p,q}}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}\frac{\partial Y_{p,j}}{\partial X_{p,q}}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}W_{q,j}=\sum_{j}\frac{\partial loss}{\partial Y_{p,j}}W_{j,q}^{T}
+$$
+
+Hence:
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}W^{T}
+$$
+Similarly:
+$$
+\frac{\partial loss}{\partial W} = X^{T}\frac{\partial loss}{\partial Y}
+$$
+
+#### 2.2 Relu backward pass
+
+$$
+loss = f(Y) = f(Relu(X))
+$$
+
+where:
+$$
+Relu(x) = \begin{cases}0 \quad x < 0 \\\\ x \quad x\geq 0\end{cases}
+$$
+Then:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = \begin{cases}1 \quad i=k\quad and\quad j=l\quad and\quad X_{k,l}>0 \\\\
+0\quad else \end{cases}
+$$
+By the chain rule (elementwise):
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}\frac{\partial Y}{\partial X}
+$$
+Code:
+
+```python
+grad_x = grad_y * np.where(x > 0, 1, 0)
+```
+
+#### 2.3 Log backward pass
+
+$$
+loss = f(Y) = f(ln(X))
+$$
+
+where:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = \begin{cases}\frac{1}{X_{k,l}} \quad i=k\quad and\quad j=l\quad \\\\
+0\quad else \end{cases}
+$$
+By the chain rule (elementwise division):
+$$
+\frac{\partial loss}{\partial X} = \frac{\partial loss}{\partial Y}\cdot \frac{\partial Y}{\partial X}=\frac{\partial loss}{\partial Y}\cdot \frac{1}{X}
+$$
+Code:
+
+```python
+grad_x = grad_y / (x + self.epsilon)
+```
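+
+The derivations in 2.1-2.3 can be spot-checked against torch autograd, which is essentially what the provided `tester_demo.py` does. Below is a minimal sketch for the Log operator, assuming it is run from the submission directory so that `numpy_fnn` is importable:
+
+```python
+import numpy as np
+import torch
+from numpy_fnn import Log
+
+# Compare the numpy Log backward pass with torch autograd
+# (same pattern as case_3 in tester_demo.py).
+x = np.random.uniform(low=0.1, high=1.0, size=[3, 4])
+
+numpy_log = Log()
+numpy_out = numpy_log.forward(x)
+numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out))
+
+torch_x = torch.from_numpy(x).clone().requires_grad_()
+torch.log(torch_x).sum().backward()
+
+# The difference should be tiny (up to the epsilon added in Log.forward).
+print(np.abs(numpy_x_grad - torch_x.grad.numpy()).max())
+```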
+
+#### 2.4 Softmax forward and backward passes
+
+Forward pass (each row is normalized independently):
+
+$$
+loss = f(Y) \\
+Y_{i,j} = \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}
+$$
+Backward pass:
+
+(1) When $j=l$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{i,l}} = \frac{\partial Y_{i,j}}{\partial X_{i,j}}=\frac{\partial \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}}{\partial X_{i,j}} = \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}-(\frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}})^{2} = Y_{i,j}-Y_{i,j}^2 \\
+$$
+(2) When $j \neq l$:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{i,l}} = \frac{\partial \frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}}{\partial X_{i,l}} = -\frac{e^{X_{i,j}}}{\sum_{k}e^{X_{i,k}}}\cdot \frac{e^{X_{i,l}}}{\sum_{k}e^{X_{i,k}}} = -Y_{i,j}\cdot Y_{i,l} \\
+$$
+(3) When $i \neq k$, different rows are independent, so the gradient is 0:
+$$
+\frac{\partial Y_{i,j}}{\partial X_{k,l}} = 0
+$$
+By the chain rule (here $\cdot$ denotes elementwise multiplication):
+$$
+\frac{\partial loss}{\partial X_{k,l}} = \sum_{j}\frac{\partial loss}{\partial Y_{k,j}}\cdot \frac{\partial Y_{k,j}}{\partial X_{k,l}} = (\sum_{j}-\frac{\partial loss}{\partial Y_{k,j}}\cdot Y_{k,j}\cdot Y_{k,l})+ \frac{\partial loss}{\partial Y_{k,l}}\cdot Y_{k,l} \\\\
+=Y_{k,l}\cdot( \frac{\partial loss}{\partial Y_{k,l}}-\sum_{j}\frac{\partial loss}{\partial Y_{k,j}}\cdot Y_{k,j})
+$$
+After simplifying to the expression above, the backward pass can be written in a single line of numpy:
+
+```python
+grad_x = y * (grad_y - (y * grad_y).sum(axis = 1).reshape(len(y),1))
+```
+
+### 3. Building the FNN model
+
+#### 3.1 FNN model
+
+The FNN is structured as shown below:
+
+
+
+- Input layer ($N\times28^2$), fully connected to the next layer with weights $W_1$ ($28^2\times256$).
+- Hidden layer 1 ($N\times256$, Relu activation), fully connected to the next layer with weights $W_2$ ($256\times64$).
+- Hidden layer 2 ($N\times64$, Relu activation), fully connected to the next layer with weights $W_3$ ($64\times10$).
+- Hidden layer 3 ($N\times10$, Softmax activation), passed directly to the next layer.
+- Output layer ($N\times10$, Log activation).
+
+In formulas (the shapes above imply multiplying the activations by the weights on the right):
+$$
+a^{(0)} = X \\\\
+z^{(1)} = a^{(0)}\times W_1 ,\quad a^{(1)} = Relu(z^{(1)}) \\\\
+z^{(2)} = a^{(1)}\times W_2 ,\quad a^{(2)} = Relu(z^{(2)}) \\\\
+z^{(3)} = a^{(2)}\times W_3 ,\quad a^{(3)} = Softmax(z^{(3)}) \\\\
+z^{(4)} = a^{(3)},\quad a^{(4)} = Log(z^{(4)}) \\\\
+Y = a^{(4)}
+$$
+
+The loss is the cross-entropy between the one-hot target $\hat Y$ and the log-probability output $Y$, averaged over the batch:
+$$
+loss = \frac{1}{N}\sum_{i}\sum_{j}-\hat Y_{i,j}\cdot Y_{i,j}
+$$
+
+#### 3.2 FNN backward pass
+
+Backpropagation applies the chain rule through the model in reverse order; the gradient computed by each operator is fed into the operator before it:
+
+```python
+self.log_grad = self.log.backward(y)
+
+self.softmax_grad = self.softmax.backward(self.log_grad)
+self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+
+self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+
+self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+```
+
+#### 3.3 Test results
+
+Running `numpy_mnist.py` downloads the MNIST handwritten-digit dataset automatically and runs the test.
+
+Loss curve:
+
+
+
+Model accuracy:
+
+```
+[0] Accuracy: 0.9350
+[1] Accuracy: 0.9674
+[2] Accuracy: 0.9693
+```
+
+The numpy run uses only 3 epochs, yet the accuracy is already fairly high. However, the loss curve still fluctuates visibly and has not converged; more epochs and a tuned learning rate would be needed for the loss to settle.
+
+### 4. Implementation of the mini_batch function
+
+#### 4.1 The mini_batch function
+
+The `mini_batch` function splits `dataset` into batches of size `batch_size`.
+
+Steps:
+
+- Extract the data and labels from `dataset` with a for loop and convert them to ndarrays:
+
+```python
+data = []
+label = []
+for x in dataset:
+    data.append(np.array(x[0]))
+    label.append(x[1])
+data = np.array(data)
+label = np.array(label)
+```
+
+- There are many ways to shuffle; `np.random.permutation` is used here:
+
+```python
+idx = np.random.permutation(len(dataset))
+data = data[idx]
+label = label[idx]
+```
+
+- Split data and label with `np.split`. Since `np.split` requires equal-sized chunks, the remainder left when the dataset size is not divisible by `batch_size` is handled as a separate final batch (the slice must start at `split_pos`, otherwise one sample is dropped):
+
+```python
+split_num = len(dataset) // batch_size  # number of full batches
+split_pos = split_num * batch_size      # end position of the full batches
+# split the data
+ret_data = np.split(data[:split_pos], split_num)
+ret_data.append(data[split_pos:])
+# split the labels
+ret_label = np.split(label[:split_pos], split_num)
+ret_label.append(label[split_pos:])
+```
+
+- Finally, `zip` combines data and labels into tuples:
+
+```python
+ret = list(zip(ret_data, ret_label))
+```
+
+#### 4.2 Testing the mini_batch function
+
+With the torch-based mini_batch:
+
+```
+[0] Accuracy: 0.9473
+[1] Accuracy: 0.9648
+[2] Accuracy: 0.9680
+time = 73.32 s
+```
+
+With the numpy-only mini_batch:
+
+```
+[0] Accuracy: 0.9474
+[1] Accuracy: 0.9556
+[2] Accuracy: 0.9678
+time = 66.24 s
+```
+
+In principle this has no effect on accuracy; the numpy version runs about 7 s faster than the torch version.
+
+### 5. Optimization methods
+
+#### 5.1 Momentum
+
+Momentum is also known as momentum gradient descent. Plain gradient descent has the following problem:
+
+> Gradient descent oscillates perpendicular to the descent direction; because of this oscillation only a small learning rate can be used, otherwise the oscillation grows. With momentum the oscillations are averaged out and nearly cancel, so a somewhat larger learning rate can be used to speed up training.
+
+Momentum update:
+$$
+V_{dW}= \beta \cdot V_{dW} + (1-\beta)\cdot dW \\\\
+W = W - \alpha \cdot V_{dW}
+$$
+where $\alpha$ is the learning rate and $\beta$ the momentum coefficient; $\beta = 0.9$ in the experiments.
+
+Training with plain gradient descent (green) and with Momentum (blue), the accuracy-vs-epoch curves are:
+
+
+
+Momentum learns more slowly than the plain method in the first epochs, but as momentum accumulates its accuracy soon exceeds the baseline and converges to a higher final value.
+
+#### 5.2 Adam
+
+Adam is essentially a combination of RMSProp and Momentum:
+
+> **Root Mean Square Propagation (RMSProp)**: maintains a per-parameter learning rate that is adapted according to a running average of recent squared gradients, which makes the algorithm work well on online and non-stationary problems.
+
+Adam update:
+$$
+V_{dW} = \beta_1\cdot V_{dW} + (1-\beta_1)\cdot dW \\\\
+V_{dW}^{corrected} = \frac{V_{dW}}{1-\beta_1^t} \\\\
+S_{dW} = \beta_2\cdot S_{dW} + (1-\beta_2)\cdot dW^2 \\\\
+S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t} \\\\
+W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}
+$$
+where $\alpha$ is the learning rate, $\beta_1$ the momentum coefficient, and $\beta_2$ the coefficient of the adaptive second moment.
+
+Using the typical settings $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and running 20 epochs, the results are shown below (compared with the plain optimizer and Momentum):
+
+
+
+Adam's loss fluctuates noticeably and has not converged after 20 epochs. Comparing the per-epoch accuracy of the three optimizers (green: plain, blue: Momentum, purple: Adam):
+
+
+
+With this small number of epochs Adam does not reach higher accuracy, and its accuracy is still rising when training stops, probably because its lower learning rate slows convergence. Momentum is a better fit for this model.
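+
+For reference, a minimal self-contained sketch of this update. It mirrors `optimize_Adam` in `numpy_fnn.py`, except that here `t` counts individual update steps, whereas `numpy_mnist.py` accumulates the powers $\beta_1^t,\beta_2^t$ once per epoch and passes them in:
+
+```python
+import numpy as np
+
+def adam_step(W, dW, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
+    # One Adam update for a single weight matrix, following the formulas above.
+    v = beta1 * v + (1 - beta1) * dW        # first-moment estimate
+    s = beta2 * s + (1 - beta2) * dW ** 2   # second-moment estimate
+    v_corr = v / (1 - beta1 ** t)           # bias correction
+    s_corr = s / (1 - beta2 ** t)
+    W = W - lr * v_corr / (np.sqrt(s_corr) + eps)
+    return W, v, s
+
+# Toy usage: minimize ||W||^2, whose gradient is 2W.
+W = np.ones((2, 2))
+v, s = np.zeros_like(W), np.zeros_like(W)
+for t in range(1, 101):
+    W, v, s = adam_step(W, 2 * W, v, s, t)
+```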
+
+### 6. Weight initialization
+
+The weights cannot all be initialized to 0: if every parameter is 0, all hidden units compute the same activation in the first forward pass, so deeper neurons cannot be distinguished from one another. This is known as the symmetric-weights problem.
+
+The straightforward fix is to give every parameter a random initial value. However, if the initial values are too small, the inputs to the neurons are tiny and the signal vanishes after a few layers; if they are too large, the activations quickly saturate, the gradients approach 0, and training is again difficult.
+
+In general, the initialization range should therefore be chosen according to the properties of the layer.
+
+The Xavier and Kaiming initialization schemes are described below.
+
+#### 6.1 Xavier initialization
+
+Xavier Glorot observed in his paper that the variance of the activations shrinks layer by layer, which makes the gradients in backpropagation shrink layer by layer as well. To avoid vanishing gradients, the decay of the activation variance must be avoided; ideally the outputs (activations) of every layer keep the same distribution.
+
+To approximate this ideal, Xavier initialization comes in a uniform and a normal variant; the parameters of the distribution are computed from a gain value.
+
+`torch.nn.init.calculate_gain(nonlinearity, param=None)` computes the gain for each nonlinearity:
+
+
+
+Uniform initialization $U(-a,a)$, where:
+$$
+a = gain\times \sqrt{\frac{6}{fan\_in+fan\_out}}
+$$
+
+Normal initialization $N(0,std^2)$, where:
+$$
+std = gain\times \sqrt{\frac{2}{fan\_in+fan\_out}}
+$$
+
+$fan\_in$ and $fan\_out$ are the input and output sizes of the layer.
+
+#### 6.2 Kaiming initialization
+
+> The limitation of Xavier initialization is that its derivation assumes a linear activation, while deep networks need nonlinear activations to build complex nonlinear systems; today relu is the most common choice. Kaiming He therefore proposed Kaiming initialization for relu in his paper.
+>
+> Because relu discards values below 0, roughly half of the values of zero-mean data are cut off; the mean of the output then grows, and the assumption $E(x)=mean=0$ used in the Xavier derivation no longer holds. Re-deriving under this condition gives the new rescale factor $\sqrt{2/n}$.
+
+Uniform initialization $U(-bound,bound)$, where:
+$$
+bound = \sqrt{\frac{6}{(1+a^2)\times fan\_in}}
+$$
+Normal initialization $N(0,std^2)$, where:
+$$
+std = \sqrt{\frac{2}{(1+a^2)\times fan\_in}}
+$$
+$a$ is a configurable parameter (the negative slope of the leaky relu), and $fan\_in$ is the input size.
+
+#### 6.3 Implementing the initialization function
+
+First, look at which initialization `get_torch_initialization` actually uses:
+
+```python
+fc1 = torch.nn.Linear(28 * 28, 256)
+W1 = fc1.weight.T.detach().clone().data
+```
+
+It creates a `Linear` layer and takes its weight directly. In the definition of the `Linear` class:
+
+```python
+init.kaiming_uniform_(self.weight, a=math.sqrt(5))
+```
+
+So Kaiming uniform initialization is used, with $a$ set to $\sqrt 5$.
+
+```python
+def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
+```
+
+Since `nonlinearity` is not passed, it keeps its default value `leaky_relu`.
+
+```python
+gain = calculate_gain(nonlinearity, a)
+```
+
+The `a` passed in is the `negative_slope` of the leaky relu.
+
+```python
+gain = math.sqrt(2.0 / (1 + a ** 2))
+std = gain / math.sqrt(fan)
+bound = math.sqrt(3.0) * std
+return tensor.uniform_(-bound, bound)
+```
+
+Reading this code confirms that the tensor is filled with Kaiming uniform initialization. With $a=\sqrt 5$, $gain = \sqrt{2/6} = 1/\sqrt 3$, so $bound = \sqrt 3\cdot gain/\sqrt{fan\_in} = 1/\sqrt{fan\_in}$, which is exactly the bound used below.
+
+The final numpy implementation of Kaiming uniform initialization with $a=\sqrt 5$:
+
+```python
+def get_torch_initialization_numpy(numpy=True):
+    fan_in_1 = 28 * 28
+    fan_in_2 = 256
+    fan_in_3 = 64
+
+    bound1 = 1 / np.sqrt(fan_in_1)  # bound1 = np.sqrt(6) / np.sqrt(1+np.sqrt(5)**2) / np.sqrt(fan_in_1)
+    bound2 = 1 / np.sqrt(fan_in_2)
+    bound3 = 1 / np.sqrt(fan_in_3)
+
+    W1 = np.random.uniform(-bound1, bound1, (28*28, 256))
+    W2 = np.random.uniform(-bound2, bound2, (256, 64))
+    W3 = np.random.uniform(-bound3, bound3, (64, 10))
+
+    if numpy == False:
+        W1 = torch.Tensor(W1)
+        W2 = torch.Tensor(W2)
+        W3 = torch.Tensor(W3)
+
+    return W1, W2, W3
+```
+
+Results of `torch_mnist.py` with this initialization:
+
+
+
+```
+[0] Accuracy: 0.9503
+[1] Accuracy: 0.9639
+[2] Accuracy: 0.9711
+```
+
+### 7. Description of the submitted code
+
+- 
`numpy_fnn.py`:算子和FNN模型正向传播和反向传播的实现。`optimize_Momentum`方法实现Momentum优化,`optimize_Adam`方法实现Adam优化。可在`numpy_mnist.py`中修改optimize的调用改变优化方法。 +- `numpy_mnist.py`:`mini_batch_numpy`方法用numpy实现了mini_batch。 +- `utils.py`:`get_torch_initialization_numpy`方法用numpy实现了均匀分布的kaiming初始化。 + +### 8.参考文献 + +[1] [神经网络常见优化算法(Momentum, RMSprop, Adam)的原理及公式理解, 学习率衰减](https://blog.csdn.net/weixin_42561002/article/details/88036777) + +[2] [深度之眼【Pytorch】-Xavier、Kaiming初始化(附keras实现)](https://blog.csdn.net/weixin_42147780/article/details/103238195) + diff --git a/assignment-2/submission/18307130341/img/Fig1.png b/assignment-2/submission/18307130341/img/Fig1.png new file mode 100644 index 0000000000000000000000000000000000000000..50b42797b50f8b745d7707a86e2644d84843d228 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig1.png differ diff --git a/assignment-2/submission/18307130341/img/Fig2.png b/assignment-2/submission/18307130341/img/Fig2.png new file mode 100644 index 0000000000000000000000000000000000000000..f5dd7bdc2c2712953ddd2b990232d3a7a71b655b Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig2.png differ diff --git a/assignment-2/submission/18307130341/img/Fig3.png b/assignment-2/submission/18307130341/img/Fig3.png new file mode 100644 index 0000000000000000000000000000000000000000..c440ff99663169e8b636ed9e3fc8b7cbdb6008f1 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig3.png differ diff --git a/assignment-2/submission/18307130341/img/Fig4.png b/assignment-2/submission/18307130341/img/Fig4.png new file mode 100644 index 0000000000000000000000000000000000000000..05196c545d1d6da186cd8e301eff8fec10110060 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig4.png differ diff --git a/assignment-2/submission/18307130341/img/Fig5.png b/assignment-2/submission/18307130341/img/Fig5.png new file mode 100644 index 0000000000000000000000000000000000000000..31658f4aa8db641225bf56c4ef54fb8c079d7ae2 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig5.png differ diff --git a/assignment-2/submission/18307130341/img/Fig6.png b/assignment-2/submission/18307130341/img/Fig6.png new file mode 100644 index 0000000000000000000000000000000000000000..54721b598f39cfb996bfde8986077ca64836eb76 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig6.png differ diff --git a/assignment-2/submission/18307130341/img/Fig7.png b/assignment-2/submission/18307130341/img/Fig7.png new file mode 100644 index 0000000000000000000000000000000000000000..1a3f1b1c91d8767838bd464ad291da558006c941 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig7.png differ diff --git a/assignment-2/submission/18307130341/img/Fig8.png b/assignment-2/submission/18307130341/img/Fig8.png new file mode 100644 index 0000000000000000000000000000000000000000..081717ced38314a1b250daf2f527bce99313bc71 Binary files /dev/null and b/assignment-2/submission/18307130341/img/Fig8.png differ diff --git a/assignment-2/submission/18307130341/numpy_fnn.py b/assignment-2/submission/18307130341/numpy_fnn.py new file mode 100644 index 0000000000000000000000000000000000000000..bacac809951ecb357664479fa2f8f69e956fd8b8 --- /dev/null +++ b/assignment-2/submission/18307130341/numpy_fnn.py @@ -0,0 +1,240 @@ +import numpy as np + + +class NumpyOp: + + def __init__(self): + self.memory = {} + self.epsilon = 1e-12 + + +class Matmul(NumpyOp): + + def forward(self, x, W): + """ + x: shape(N, d) + w: shape(d, d') + """ + self.memory['x'] = x + self.memory['W'] = 
W + h = np.matmul(x, W) + return h + + def backward(self, grad_y): + """ + grad_y: shape(N, d') + """ + + #################### + # code 1 # + #################### + xT = np.transpose(self.memory['x']) + WT = np.transpose(self.memory['W']) + + grad_x = np.matmul(grad_y, WT) + grad_W = np.matmul(xT, grad_y) + + return grad_x, grad_W + + +class Relu(NumpyOp): + + def forward(self, x): + self.memory['x'] = x + return np.where(x > 0, x, np.zeros_like(x)) + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 2 # + #################### + x = self.memory['x'] + grad_x = grad_y * np.where(x > 0, 1, 0) + + return grad_x + + +class Log(NumpyOp): + + def forward(self, x): + """ + x: shape(N, c) + """ + + out = np.log(x + self.epsilon) + self.memory['x'] = x + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 3 # + #################### + x = self.memory['x'] + grad_x = grad_y / (x + self.epsilon) + + return grad_x + + +class Softmax(NumpyOp): + """ + softmax over last dimension + """ + + def forward(self, x): + """ + x: shape(N, c) + """ + + #################### + # code 4 # + #################### + sum = np.exp(x).sum(axis = 1) + sum = sum.reshape(x.shape[0], 1) + out = np.exp(x) / sum + + self.memory['y'] = out + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 5 # + #################### + y = self.memory['y'] + + grad_x = y * (grad_y - (y * grad_y).sum(axis = 1).reshape(len(y),1)) + + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + # Momentum优化 + self.v_W1_grad = 0 + self.v_W2_grad = 0 + self.v_W3_grad = 0 + + # Adam优化 + self.s_W1_grad = 0 + self.s_W2_grad = 0 + self.s_W3_grad = 0 + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + #################### + # code 6 # + #################### + x = self.matmul_1.forward(x, self.W1) + x = self.relu_1.forward(x) + + x = self.matmul_2.forward(x, self.W2) + x = self.relu_2.forward(x) + + x = self.matmul_3.forward(x, self.W3) + x = self.softmax.forward(x) + + x = self.log.forward(x) + + return x + + def backward(self, y): + + #################### + # code 7 # + ################### + + self.log_grad = self.log.backward(y) + + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = 
self.matmul_1.backward(self.relu_1_grad) + + def optimize(self, learning_rate): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def optimize_Momentum(self, learning_rate, belta): + self.v_W1_grad = belta * self.v_W1_grad + (1 - belta) * self.W1_grad + self.v_W2_grad = belta * self.v_W2_grad + (1 - belta) * self.W2_grad + self.v_W3_grad = belta * self.v_W3_grad + (1 - belta) * self.W3_grad + + self.W1 -= learning_rate * self.v_W1_grad + self.W2 -= learning_rate * self.v_W2_grad + self.W3 -= learning_rate * self.v_W3_grad + + def optimize_Adam(self, learning_rate, beta1, beta2, beta1_t, beta2_t, eps): + + self.v_W1_grad = beta1 * self.v_W1_grad + (1 - beta1) * self.W1_grad + self.v_W2_grad = beta1 * self.v_W2_grad + (1 - beta1) * self.W2_grad + self.v_W3_grad = beta1 * self.v_W3_grad + (1 - beta1) * self.W3_grad + + v_W1_corr = self.v_W1_grad / (1 - beta1_t) + v_W2_corr = self.v_W2_grad / (1 - beta1_t) + v_W3_corr = self.v_W3_grad / (1 - beta1_t) + + self.s_W1_grad = beta2 * self.s_W1_grad + (1 - beta2) * (self.W1_grad ** 2) + self.s_W2_grad = beta2 * self.s_W2_grad + (1 - beta2) * (self.W2_grad ** 2) + self.s_W3_grad = beta2 * self.s_W3_grad + (1 - beta2) * (self.W3_grad ** 2) + + s_W1_corr = self.s_W1_grad / (1 - beta2_t) + s_W2_corr = self.s_W2_grad / (1 - beta2_t) + s_W3_corr = self.s_W3_grad / (1 - beta2_t) + + self.W1 -= learning_rate * v_W1_corr / (np.sqrt(s_W1_corr) + eps) + self.W2 -= learning_rate * v_W2_corr / (np.sqrt(s_W2_corr) + eps) + self.W3 -= learning_rate * v_W3_corr / (np.sqrt(s_W3_corr) + eps) + diff --git a/assignment-2/submission/18307130341/numpy_mnist.py b/assignment-2/submission/18307130341/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..87aa55fdfd8a765cbad3203b3a9082c28c5c6502 --- /dev/null +++ b/assignment-2/submission/18307130341/numpy_mnist.py @@ -0,0 +1,91 @@ +import numpy as np +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, mini_batch, batch, get_torch_initialization, plot_curve, one_hot + +def mini_batch_numpy(dataset, batch_size=128): + data = [] + label = [] + + for x in dataset: + data.append(np.array(x[0])) + label.append(x[1]) + + data = np.array(data) + label = np.array(label) + + idx = np.random.permutation(len(dataset)) + data = data[idx] + label = label[idx] + + split_num = len(dataset) // batch_size + split_pos = split_num * batch_size + + ret_data = np.split(data[:split_pos], split_num) + ret_data.append(data[split_pos+1:]) + + ret_label = np.split(label[:split_pos], split_num) + ret_label.append(label[split_pos+1:]) + + ret = list(zip(ret_data, ret_label)) + return ret + +def numpy_run(): + + import time + start = time.time() + + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = NumpyLoss() + model.W1, model.W2, model.W3 = get_torch_initialization() + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + #Adam 优化 + beta1 = 0.9 + beta2 = 0.999 + beta1_t = 1 + beta2_t = 1 + + for epoch in range(epoch_number): + #Adam 优化 + beta1_t *= beta1 + beta2_t *= beta2 + + # for x, y in mini_batch_numpy(train_dataset): # mini_batch_numpy + for x, y in mini_batch(train_dataset): + y = one_hot(y) + + # y_pred = model.forward(x) # mini_batch_numpy + y_pred = model.forward(x.numpy()) + loss = numpy_loss.get_loss(y_pred, y) + + model.backward(numpy_loss.backward()) + + #原始optimize + model.optimize(learning_rate) + + #Momentum 优化 + # model.optimize_Momentum(learning_rate, 
0.9) + + #Adam 优化 + # model.optimize_Adam(learning_rate, beta1, beta2, beta1_t, beta2_t, 1e-8) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset)[0] + accuracy = np.mean((model.forward(x).argmax(axis=1) == y)) + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + end = time.time() + print("time = %.2f s"%(end-start)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + numpy_run() diff --git a/assignment-2/submission/18307130341/tester_demo.py b/assignment-2/submission/18307130341/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..504b3eef50a6df4d0aa433113136add50835e420 --- /dev/null +++ b/assignment-2/submission/18307130341/tester_demo.py @@ -0,0 +1,182 @@ +import numpy as np +import torch +from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log + +from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss +from torch_mnist import TorchModel +from utils import get_torch_initialization, one_hot + +err_epsilon = 1e-6 +err_p = 0.4 + + +def check_result(numpy_result, torch_result=None): + if isinstance(numpy_result, list) and torch_result is None: + flag = True + for (n, t) in numpy_result: + flag = flag and check_result(n, t) + return flag + # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item()) + T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item() + direction = T / torch_result.numel() < err_p + return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item() + + +def case_1(): + x = np.random.normal(size=[5, 6]) + W = np.random.normal(size=[6, 4]) + + numpy_matmul = Matmul() + numpy_out = numpy_matmul.forward(x, W) + numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + torch_W = torch.from_numpy(W).clone().requires_grad_() + + torch_out = torch_matmul(torch_x, torch_W) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + (numpy_W_grad, torch_W.grad) + ]) + + +def case_2(): + x = np.random.normal(size=[5, 6]) + + numpy_relu = Relu() + numpy_out = numpy_relu.forward(x) + numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_relu(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def case_3(): + x = np.random.uniform(low=0.0, high=1.0, size=[3, 4]) + + numpy_log = Log() + numpy_out = numpy_log.forward(x) + numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_log(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + + (numpy_x_grad, torch_x.grad), + ]) + + +def case_4(): + x = np.random.normal(size=[4, 5]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + + return check_result(numpy_out, torch_out) + + +def case_5(): + x = np.random.normal(size=[20, 25]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + torch_out.sum().backward() + + return check_result([ + 
(numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def test_model(): + try: + numpy_loss = NumpyLoss() + numpy_model = NumpyModel() + torch_model = TorchModel() + torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False) + numpy_model.W1 = torch_model.W1.detach().clone().numpy() + numpy_model.W2 = torch_model.W2.detach().clone().numpy() + numpy_model.W3 = torch_model.W3.detach().clone().numpy() + + x = torch.randn((10000, 28, 28)) + y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000) + + y = one_hot(y, numpy=False) + x2 = x.numpy() + y_pred = torch_model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + + y_pred_numpy = numpy_model.forward(x2) + numpy_loss.get_loss(y_pred_numpy, y.numpy()) + + check_flag_1 = check_result(y_pred_numpy, y_pred) + print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10)) + except: + print("[Runtime Error in forward]") + print("+ {:12} {}/{}".format("forward", 0, 10)) + return 0 + + try: + + numpy_model.backward(numpy_loss.backward()) + + check_flag_2 = [ + check_result(numpy_model.log_grad, torch_model.log_input.grad), + check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad), + check_result(numpy_model.W3_grad, torch_model.W3.grad), + check_result(numpy_model.W2_grad, torch_model.W2.grad), + check_result(numpy_model.W1_grad, torch_model.W1.grad) + ] + check_flag_2 = sum(check_flag_2) >= 4 + print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20)) + except: + print("[Runtime Error in backward]") + print("+ {:12} {}/{}".format("backward", 0, 20)) + check_flag_2 = False + + return 10 * check_flag_1 + 20 * check_flag_2 + + +if __name__ == "__main__": + testcases = [ + ["matmul", case_1, 5], + ["relu", case_2, 5], + ["log", case_3, 5], + ["softmax_1", case_4, 5], + ["softmax_2", case_5, 10], + ] + score = 0 + for case in testcases: + try: + res = case[2] if case[1]() else 0 + except: + print("[Runtime Error in {}]".format(case[0])) + res = 0 + score += res + print("+ {:12} {}/{}".format(case[0], res, case[2])) + score += test_model() + print("{:14} {}/60".format("FINAL SCORE", score)) diff --git a/assignment-2/submission/18307130341/torch_mnist.py b/assignment-2/submission/18307130341/torch_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..7bbcedcf8108c227e09c861a761c18e99a7f9429 --- /dev/null +++ b/assignment-2/submission/18307130341/torch_mnist.py @@ -0,0 +1,75 @@ +import torch +from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve + + +class TorchModel: + + def __init__(self): + self.W1 = torch.randn((28 * 28, 256), requires_grad=True) + self.W2 = torch.randn((256, 64), requires_grad=True) + self.W3 = torch.randn((64, 10), requires_grad=True) + self.softmax_input = None + self.log_input = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + x = torch.relu(torch.matmul(x, self.W1)) + x = torch.relu(torch.matmul(x, self.W2)) + x = torch.matmul(x, self.W3) + + self.softmax_input = x + self.softmax_input.retain_grad() + + x = torch.softmax(x, 1) + + self.log_input = x + self.log_input.retain_grad() + + x = torch.log(x) + + return x + + def optimize(self, learning_rate): + with torch.no_grad(): + self.W1 -= learning_rate * self.W1.grad + self.W2 -= learning_rate * self.W2.grad + self.W3 -= learning_rate * self.W3.grad + + self.W1.grad = None + self.W2.grad = None + self.W3.grad = None + + +def torch_run(): + train_dataset, test_dataset = download_mnist() + + model = 
TorchModel() + # model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False) + from utils import get_torch_initialization_numpy + model.W1.data, model.W2.data, model.W3.data = get_torch_initialization_numpy(numpy=False) + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, numpy=False): + y = one_hot(y, numpy=False) + + y_pred = model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset, numpy=False)[0] + accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item() + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + torch_run() diff --git a/assignment-2/submission/18307130341/utils.py b/assignment-2/submission/18307130341/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5154f4970843623204198206ff0df1438bbee5df --- /dev/null +++ b/assignment-2/submission/18307130341/utils.py @@ -0,0 +1,91 @@ +import torch +import numpy as np +from matplotlib import pyplot as plt + + +def plot_curve(data): + plt.plot(range(len(data)), data, color='blue') + plt.legend(['loss_value'], loc='upper right') + plt.xlabel('step') + plt.ylabel('value') + plt.show() + + +def download_mnist(): + from torchvision import datasets, transforms + + transform = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean=(0.1307,), std=(0.3081,)) + ]) + + train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True) + test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True) + + return train_dataset, test_dataset + + +def one_hot(y, numpy=True): + if numpy: + y_ = np.zeros((y.shape[0], 10)) + y_[np.arange(y.shape[0], dtype=np.int32), y] = 1 + return y_ + else: + y_ = torch.zeros((y.shape[0], 10)) + y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1 + return y_ + + +def batch(dataset, numpy=True): + data = [] + label = [] + for each in dataset: + data.append(each[0]) + label.append(each[1]) + data = torch.stack(data) + label = torch.LongTensor(label) + if numpy: + return [(data.numpy(), label.numpy())] + else: + return [(data, label)] + + +def mini_batch(dataset, batch_size=128, numpy=False): + return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) + + +def get_torch_initialization(numpy=True): + fc1 = torch.nn.Linear(28 * 28, 256) + fc2 = torch.nn.Linear(256, 64) + fc3 = torch.nn.Linear(64, 10) + + if numpy: + W1 = fc1.weight.T.detach().clone().numpy() + W2 = fc2.weight.T.detach().clone().numpy() + W3 = fc3.weight.T.detach().clone().numpy() + else: + W1 = fc1.weight.T.detach().clone().data + W2 = fc2.weight.T.detach().clone().data + W3 = fc3.weight.T.detach().clone().data + + return W1, W2, W3 + +def get_torch_initialization_numpy(numpy=True): + fan_in_1 = 28 * 28 + fan_in_2 = 256 + fan_in_3 = 64 + + bound1 = 1 / np.sqrt(fan_in_1) + bound2 = 1 / np.sqrt(fan_in_2) + bound3 = 1 / np.sqrt(fan_in_3) + + W1 = np.random.uniform(-bound1, bound1, (28*28, 256)) + W2 = np.random.uniform(-bound2, bound2, (256, 64)) + W3 = np.random.uniform(-bound3, bound3, (64, 10)) + + if numpy == False: + W1 = torch.Tensor(W1) + W2 = torch.Tensor(W2) + W3 = torch.Tensor(W3) + + return W1, W2, W3 \ No newline at end of file