diff --git a/assignment-2/submission/19307130062/README.md b/assignment-2/submission/19307130062/README.md
index 02a93eb112820dff16d39f05398e6a648fea011b..6615e86187728bd84f94ff46c3ef55d6595e1b00 100644
--- a/assignment-2/submission/19307130062/README.md
+++ b/assignment-2/submission/19307130062/README.md
@@ -9,31 +9,31 @@
 
 ### Matmul
 
-Consider $Y = XW$, where $Y \in \R^{n\times d_2},\ X \in \R^{n \times d_1},\ W \in \R^{d_1 \times d_2}$
+Consider $Y = XW$, where $Y \in \mathbb R^{n\times d\_2},\ X \in \mathbb R^{n \times d\_1},\ W \in \mathbb R^{d\_1 \times d\_2}$
 
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta_X,\ \Delta_W$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta\_X,\ \Delta\_W$
 
 The derivation is as follows:
 
-#### Derivation of $\Delta_X$
+#### Derivation of $\Delta\_X$
 
-We consider the partial derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\partial Y_{ij}}{\partial X}$
+We consider the partial derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\partial Y\_{ij}}{\partial X}$
 
-Since $Y_{ij} = \sum_{k = 1}^{d_1}X_{ik}W_{kj}$ and the entries of $X$ are independent of one another, and since
+Since $Y\_{ij} = \sum\_{k = 1}^{d\_1}X\_{ik}W\_{kj}$ and the entries of $X$ are independent of one another, and since
 
 $$
-\frac{\partial Y_{ij}}{\partial X} =
+\frac{\partial Y\_{ij}}{\partial X} =
 \begin{bmatrix}
-\frac{\partial Y_{ij}}{\partial X_{11}} & \frac{\partial Y_{ij}}{\partial X_{12}} & \cdots & \frac{\partial Y_{ij}}{\partial X_{1d_1}} \\
-\frac{\partial Y_{ij}}{\partial X_{21}} & \frac{\partial Y_{ij}}{\partial X_{22}} & \cdots & \frac{\partial Y_{ij}}{\partial X_{2d_1}} \\
-\vdots & \vdots & \ddots & \vdots \\
-\frac{\partial Y_{ij}}{\partial X_{n1}} & \frac{\partial Y_{ij}}{\partial X_{n2}} & \cdots & \frac{\partial Y_{ij}}{\partial X_{nd_1}} \\
+\frac{\partial Y\_{ij}}{\partial X\_{11}} & \frac{\partial Y\_{ij}}{\partial X\_{12}} & \cdots & \frac{\partial Y\_{ij}}{\partial X\_{1d\_1}} \\\\
+\frac{\partial Y\_{ij}}{\partial X\_{21}} & \frac{\partial Y\_{ij}}{\partial X\_{22}} & \cdots & \frac{\partial Y\_{ij}}{\partial X\_{2d\_1}} \\\\
+\vdots & \vdots & \ddots & \vdots \\\\
+\frac{\partial Y\_{ij}}{\partial X\_{n1}} & \frac{\partial Y\_{ij}}{\partial X\_{n2}} & \cdots & \frac{\partial Y\_{ij}}{\partial X\_{nd\_1}} \\\\
 \end{bmatrix}
 $$
 
-it follows that $\left[\frac{\partial Y_{ij}}{\partial X}\right]_{ik} = W_{kj},\ k \in [1,\ d_1] \cap\Z$, while all other entries are $0$
+it follows that $\left[\frac{\partial Y\_{ij}}{\partial X}\right]\_{ik} = W\_{kj},\ k \in [1,\ d\_1] \cap\mathbb Z$, while all other entries are $0$
 
-Since $\Delta_Y$ is known, i.e. $\frac{\partial{\mathcal L}}{\partial Y_{ij}}$ is known, we have
+Since $\Delta\_Y$ is known, i.e. $\frac{\partial{\mathcal L}}{\partial Y\_{ij}}$ is known, we have
 $$
-\frac{\partial{\mathcal L}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{st}}\frac{\partial{Y_{st}}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{st}}\left[\frac{\partial{Y_{st}}}{\partial X}\right]_{ij} = \sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{it}}\left[\frac{\partial{Y_{it}}}{\partial X}\right]_{ij} = \sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{it}}W_{jt} = \sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{it}}W^T_{tj}
+\frac{\partial{\mathcal L}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\frac{\partial{Y\_{st}}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\left[\frac{\partial{Y\_{st}}}{\partial X}\right]\_{ij} = \sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{it}}\left[\frac{\partial{Y\_{it}}}{\partial X}\right]\_{ij} = \sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{it}}W\_{jt} = \sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{it}}W^T\_{tj}
 $$
 That is,
 $$
@@ -41,14 +41,14 @@ $$
 $$
 It follows that
 $$
-\Delta_X = \Delta_YW^T
+\Delta\_X = \Delta\_YW^T
 $$
 
-#### Derivation of $\Delta_W$
+#### Derivation of $\Delta\_W$
 
-Next, for $\Delta_W$ we proceed in the same way: $\left[\frac{\partial Y_{ij}}{\partial W}\right]_{kj} = X_{ik},\ k \in [1,\ d_1] \cap\Z$, while all other entries are $0$, so
+Next, for $\Delta\_W$ we proceed in the same way: $\left[\frac{\partial Y\_{ij}}{\partial W}\right]\_{kj} = X\_{ik},\ k \in [1,\ d\_1] \cap\mathbb Z$, while all other entries are $0$, so
 $$
-\frac{\partial{\mathcal L}}{\partial W_{ij}} = \sum_{s = 1}^{n}\sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{st}}\left[\frac{\partial{Y_{st}}}{\partial W}\right]_{ij} = \sum_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y_{sj}}\left[\frac{\partial{Y_{sj}}}{\partial W}\right]_{ij} = \sum_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y_{sj}}X_{si} = \sum_{s = 1}^{n} X_{is}^T\frac{\partial{\mathcal L}}{\partial Y_{sj}}
+\frac{\partial{\mathcal L}}{\partial W\_{ij}} = \sum\_{s = 1}^{n}\sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\left[\frac{\partial{Y\_{st}}}{\partial W}\right]\_{ij} = \sum\_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y\_{sj}}\left[\frac{\partial{Y\_{sj}}}{\partial W}\right]\_{ij} = \sum\_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y\_{sj}}X\_{si} = \sum\_{s = 1}^{n} X\_{is}^T\frac{\partial{\mathcal L}}{\partial Y\_{sj}}
 $$
 That is,
 $$
@@ -56,7 +56,7 @@ $$
 $$
 It follows that
 $$
-\Delta_W = X^T\Delta_Y
+\Delta\_W = X^T\Delta\_Y
 $$
 
 $\square$
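As a quick sanity check of the two rules just derived ($\Delta_X = \Delta_Y W^T$ and $\Delta_W = X^T\Delta_Y$), here is a minimal NumPy sketch; the helper name `matmul_backward`, the toy shapes, and the finite-difference check are illustrative additions and are not taken from the submission's code:

```python
import numpy as np

def matmul_backward(grad_y, x, w):
    """Backward pass of Y = X @ W.

    grad_y : dL/dY, shape (n, d2)
    x      : input, shape (n, d1)
    w      : weights, shape (d1, d2)
    Returns (dL/dX, dL/dW) with shapes (n, d1) and (d1, d2).
    """
    grad_x = grad_y @ w.T   # Delta_X = Delta_Y W^T
    grad_w = x.T @ grad_y   # Delta_W = X^T Delta_Y
    return grad_x, grad_w

# Numerical check of Delta_X on the linear loss L = sum(grad_y * (X @ W)),
# whose exact gradient with respect to X is grad_y @ w.T.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w = rng.normal(size=(3, 2))
grad_y = rng.normal(size=(4, 2))          # stands in for dL/dY
grad_x, grad_w = matmul_backward(grad_y, x, w)

eps = 1e-6
num_grad_x = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp, xm = x.copy(), x.copy()
        xp[i, j] += eps
        xm[i, j] -= eps
        num_grad_x[i, j] = (np.sum(grad_y * (xp @ w)) - np.sum(grad_y * (xm @ w))) / (2 * eps)
assert np.allclose(grad_x, num_grad_x, atol=1e-4)
```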
 
@@ -64,25 +64,25 @@ $\square$
 
 ### ReLU
 
-Consider $Y = \mathrm{ReLU}(X)$, where $Y,\ X \in \R^{n\times m}$
+Consider $Y = \mathrm{ReLU}(X)$, where $Y,\ X \in \mathbb R^{n\times m}$
 
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta_X$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta\_X$
 
 The derivation is as follows:
 
-#### Derivation of $\Delta_X$
+#### Derivation of $\Delta\_X$
 
 The method is the same as above; just note that $\mathrm{ReLU}$ here is an element-wise function
 
-Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y_{ij}}{\mathrm{d} X}$
+Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y\_{ij}}{\mathrm{d} X}$
 
-Since $Y_{ij} = \mathrm{ReLU}(X_{ij})$, we have $\left[\frac{\mathrm{d}Y_{ij}}{\mathrm{d}X}\right]_{ij} = \mathrm{ReLU}'(X_{ij})$, while all other entries are $0$
+Since $Y\_{ij} = \mathrm{ReLU}(X\_{ij})$, we have $\left[\frac{\mathrm{d}Y\_{ij}}{\mathrm{d}X}\right]\_{ij} = \mathrm{ReLU}'(X\_{ij})$, while all other entries are $0$
 
 Clearly,
 $$
 \mathrm{ReLU}'(x) =
 \begin{cases}
-0, & x < 0 \\
-1, & x > 0 \\
+0, & x < 0 \\\\
+1, & x > 0 \\\\
 \mathrm{Undefined}, & x = 0
 \end{cases}
 $$
@@ -90,7 +90,7 @@ $$
 
 We then have
 $$
-\frac{\partial{\mathcal L}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y_{st}}\frac{\mathrm{d}{Y_{st}}}{\mathrm{d} X_{ij}} = \frac{\partial{\mathcal L}}{\partial Y_{ij}}\left[\frac{\partial{Y_{ij}}}{\partial X}\right]_{ij} =\frac{\partial{\mathcal L}}{\partial Y_{ij}}\mathrm{ReLU}'(X_{ij})
+\frac{\partial{\mathcal L}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\frac{\mathrm{d}{Y\_{st}}}{\mathrm{d} X\_{ij}} = \frac{\partial{\mathcal L}}{\partial Y\_{ij}}\left[\frac{\partial{Y\_{ij}}}{\partial X}\right]\_{ij} =\frac{\partial{\mathcal L}}{\partial Y\_{ij}}\mathrm{ReLU}'(X\_{ij})
 $$
 That is (here $\odot$ denotes the Hadamard product, i.e. element-wise multiplication),
 $$
@@ -98,7 +98,7 @@ $$
 $$
 It follows that
 $$
-\Delta_X = \Delta_Y\odot\mathrm{ReLU}'(X)
+\Delta\_X = \Delta\_Y\odot\mathrm{ReLU}'(X)
 $$
 
 $\square$
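A minimal NumPy sketch of this element-wise rule; the helper names are illustrative, and the undefined derivative at $x = 0$ is assigned the value $0$ here, which is one common convention rather than something fixed by the derivation:

```python
import numpy as np

def relu_forward(x):
    # Y = ReLU(X), applied element-wise
    return np.maximum(x, 0)

def relu_backward(grad_y, x):
    # Delta_X = Delta_Y ⊙ ReLU'(X); the undefined point x == 0 is given slope 0
    return grad_y * (x > 0)

x = np.array([[-1.0, 2.0], [0.5, -3.0]])
grad_y = np.ones_like(x)          # stands in for dL/dY
print(relu_backward(grad_y, x))   # [[0. 1.], [1. 0.]]
```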
 
@@ -106,23 +106,23 @@ $\square$
 
 ### Log
 
-Consider $Y = \mathrm{Log}(X)$, where $Y,\ X \in \R^{n\times m}$
+Consider $Y = \mathrm{Log}(X)$, where $Y,\ X \in \mathbb R^{n\times m}$
 
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta_X$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta\_X$
 
 The derivation is as follows:
 
-#### Derivation of $\Delta_X$
+#### Derivation of $\Delta\_X$
 
 The method is the same as above; just note that $\mathrm{Log}$ here is an element-wise function
 
-Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y_{ij}}{\mathrm{d} X}$
+Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y\_{ij}}{\mathrm{d} X}$
 
-Since $Y_{ij} = \mathrm{Log}(X_{ij})$, we have $\left[\frac{\mathrm{d}Y_{ij}}{\mathrm{d}X}\right]_{ij} = \mathrm{Log}'(X_{ij}) = \frac{1}{X_{ij}}$, while all other entries are $0$
+Since $Y\_{ij} = \mathrm{Log}(X\_{ij})$, we have $\left[\frac{\mathrm{d}Y\_{ij}}{\mathrm{d}X}\right]\_{ij} = \mathrm{Log}'(X\_{ij}) = \frac{1}{X\_{ij}}$, while all other entries are $0$
 
 We then have
 $$
-\frac{\partial{\mathcal L}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y_{st}}\frac{\mathrm{d}{Y_{st}}}{\mathrm{d} X_{ij}} = \frac{\partial{\mathcal L}}{\partial Y_{ij}}\left[\frac{\partial{Y_{ij}}}{\partial X}\right]_{ij} =\frac{\partial{\mathcal L}}{\partial Y_{ij}}\frac{1}{X_{ij}}
+\frac{\partial{\mathcal L}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\frac{\mathrm{d}{Y\_{st}}}{\mathrm{d} X\_{ij}} = \frac{\partial{\mathcal L}}{\partial Y\_{ij}}\left[\frac{\partial{Y\_{ij}}}{\partial X}\right]\_{ij} =\frac{\partial{\mathcal L}}{\partial Y\_{ij}}\frac{1}{X\_{ij}}
 $$
 That is (where $\frac{1}{X}$ denotes the matrix obtained by taking the reciprocal of each entry of $X$),
 $$
@@ -130,7 +130,7 @@ $$
 $$
 It follows that
 $$
-\Delta_X = \Delta_Y\odot\frac{1}{X}
+\Delta\_X = \Delta\_Y\odot\frac{1}{X}
 $$
 
 $\square$
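The same pattern for the element-wise $\mathrm{Log}$ operator, again as an illustrative sketch (helper names are assumptions) that requires strictly positive inputs so that $\frac{1}{X}$ is well defined:

```python
import numpy as np

def log_forward(x):
    # Y = Log(X), element-wise natural logarithm (requires x > 0)
    return np.log(x)

def log_backward(grad_y, x):
    # Delta_X = Delta_Y ⊙ (1 / X)
    return grad_y / x

x = np.array([[1.0, 2.0], [4.0, 0.5]])
grad_y = np.ones_like(x)         # stands in for dL/dY
print(log_backward(grad_y, x))   # [[1.   0.5 ], [0.25 2.  ]]
```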
 
@@ -138,9 +138,9 @@ $\square$
 
 ### Softmax
 
-Consider $\boldsymbol y = \mathrm{Softmax}(\boldsymbol x)$, where $\boldsymbol y,\ \boldsymbol x \in \R^{1 \times c}$
+Consider $\boldsymbol y = \mathrm{Softmax}(\boldsymbol x)$, where $\boldsymbol y,\ \boldsymbol x \in \mathbb R^{1 \times c}$
 
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_{\boldsymbol y}$ (the gradient of the loss with respect to $\boldsymbol y$) is known; we want the expression for $\boldsymbol y$ (the forward computation) and $\Delta_{\boldsymbol x}$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_{\boldsymbol y}$ (the gradient of the loss with respect to $\boldsymbol y$) is known; we want the expression for $\boldsymbol y$ (the forward computation) and $\Delta\_{\boldsymbol x}$
 
 The derivation is as follows:
 
@@ -148,36 +148,36 @@
 From the definition of $\mathrm{Softmax}$ we obtain
 $$
-\boldsymbol y_i = \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}
+\boldsymbol y\_i = \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}
 $$
 
-#### Derivation of $\Delta_{\boldsymbol x}$
+#### Derivation of $\Delta\_{\boldsymbol x}$
 
 Since
 $$
-\boldsymbol y_i = \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}
+\boldsymbol y\_i = \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}
 $$
 and
 $$
 \frac{\partial \boldsymbol y}{\partial\boldsymbol x} =
 \begin{bmatrix}
-\frac{\partial \boldsymbol y_1}{\partial \boldsymbol x_1} & \frac{\partial \boldsymbol y_1}{\partial \boldsymbol x_2} & \cdots & \frac{\partial \boldsymbol y_1}{\partial \boldsymbol x_c} \\
-\frac{\partial \boldsymbol y_2}{\partial \boldsymbol x_1} & \frac{\partial \boldsymbol y_2}{\partial \boldsymbol x_2} & \cdots & \frac{\partial \boldsymbol y_2}{\partial \boldsymbol x_c} \\
-\vdots & \vdots & \ddots & \vdots \\
-\frac{\partial \boldsymbol y_c}{\partial \boldsymbol x_1} & \frac{\partial \boldsymbol y_c}{\partial \boldsymbol x_2} & \cdots & \frac{\partial \boldsymbol y_c}{\partial \boldsymbol x_c} \\
+\frac{\partial \boldsymbol y\_1}{\partial \boldsymbol x\_1} & \frac{\partial \boldsymbol y\_1}{\partial \boldsymbol x\_2} & \cdots & \frac{\partial \boldsymbol y\_1}{\partial \boldsymbol x\_c} \\\\
+\frac{\partial \boldsymbol y\_2}{\partial \boldsymbol x\_1} & \frac{\partial \boldsymbol y\_2}{\partial \boldsymbol x\_2} & \cdots & \frac{\partial \boldsymbol y\_2}{\partial \boldsymbol x\_c} \\\\
+\vdots & \vdots & \ddots & \vdots \\\\
+\frac{\partial \boldsymbol y\_c}{\partial \boldsymbol x\_1} & \frac{\partial \boldsymbol y\_c}{\partial \boldsymbol x\_2} & \cdots & \frac{\partial \boldsymbol y\_c}{\partial \boldsymbol x\_c} \\\\
 \end{bmatrix}
 $$
 we have, when $i = j$,
 $$
-\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]_{ii} = \frac{\partial \boldsymbol y_i}{\partial \boldsymbol x_i} = \frac{\partial\left( \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}\right)}{\partial \boldsymbol x_i} = \frac{e^{\boldsymbol x_i}(\sum_{j = 1}^ce^{\boldsymbol x_j}) - e^{\boldsymbol x_i}e^{\boldsymbol x_i}}{\left(\sum_{j = 1}^ce^{\boldsymbol x_j}\right)^2} = \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}\frac{\left(\sum_{j = 1}^ce^{\boldsymbol x_j}\right) - e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}} = \boldsymbol y_i(1 - \boldsymbol y_i)
+\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]\_{ii} = \frac{\partial \boldsymbol y\_i}{\partial \boldsymbol x\_i} = \frac{\partial\left( \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}\right)}{\partial \boldsymbol x\_i} = \frac{e^{\boldsymbol x\_i}(\sum\_{j = 1}^ce^{\boldsymbol x\_j}) - e^{\boldsymbol x\_i}e^{\boldsymbol x\_i}}{\left(\sum\_{j = 1}^ce^{\boldsymbol x\_j}\right)^2} = \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}\frac{\left(\sum\_{j = 1}^ce^{\boldsymbol x\_j}\right) - e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}} = \boldsymbol y\_i(1 - \boldsymbol y\_i)
 $$
 and, when $i \neq j$,
 $$
-\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]_{ij} = \frac{\partial \boldsymbol y_i}{\partial \boldsymbol x_j} = \frac{\partial\left( \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}\right)}{\partial \boldsymbol x_j} = \frac{-e^{\boldsymbol x_i}e^{\boldsymbol x_j}}{\left(\sum_{j = 1}^ce^{\boldsymbol x_j}\right)^2} = -\boldsymbol y_i\boldsymbol y_j
+\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]\_{ij} = \frac{\partial \boldsymbol y\_i}{\partial \boldsymbol x\_j} = \frac{\partial\left( \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}\right)}{\partial \boldsymbol x\_j} = \frac{-e^{\boldsymbol x\_i}e^{\boldsymbol x\_j}}{\left(\sum\_{j = 1}^ce^{\boldsymbol x\_j}\right)^2} = -\boldsymbol y\_i\boldsymbol y\_j
 $$
 We then have
 $$
-\frac{\partial{\mathcal L}}{\partial\boldsymbol x_{j}} = \sum_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y_{i}}\frac{\partial\boldsymbol y_i}{\partial \boldsymbol x_j} = \sum_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y_{i}}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]_{ij}
+\frac{\partial{\mathcal L}}{\partial\boldsymbol x\_{j}} = \sum\_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y\_{i}}\frac{\partial\boldsymbol y\_i}{\partial \boldsymbol x\_j} = \sum\_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y\_{i}}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]\_{ij}
 $$
 That is,
 $$
@@ -185,7 +185,7 @@ $$
 $$
 It follows that (where $\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]$ has already been derived above)
 $$
-\Delta_{\boldsymbol x} = \Delta_{\boldsymbol y}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]
+\Delta\_{\boldsymbol x} = \Delta\_{\boldsymbol y}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]
 $$
 
 Since the original code works with a matrix whose rows are such row vectors, we can apply the above procedure to each row independently; the rows do not interfere with one another, which completes the backward pass of $\mathrm{Softmax}$
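A NumPy sketch of the row-wise forward and backward pass described above. The helper names are illustrative; subtracting the row maximum before exponentiating is a standard numerical-stability trick and is not part of the derivation. The backward pass uses the closed form $\Delta_{\boldsymbol x} = \boldsymbol y \odot \left(\Delta_{\boldsymbol y} - (\Delta_{\boldsymbol y}\boldsymbol y^T)\boldsymbol 1\right)$, which is what $\Delta_{\boldsymbol y}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]$ expands to once the Jacobian entries computed above are substituted:

```python
import numpy as np

def softmax_forward(x):
    # x: (n, c); softmax is applied to every row independently
    shifted = x - x.max(axis=1, keepdims=True)    # numerical-stability shift
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def softmax_backward(grad_y, y):
    # Per row: dL/dx_j = sum_i dL/dy_i * (y_i * (delta_ij - y_j))
    #                  = y_j * (dL/dy_j - sum_i dL/dy_i * y_i)
    dot = np.sum(grad_y * y, axis=1, keepdims=True)   # one scalar per row
    return y * (grad_y - dot)

x = np.random.default_rng(0).normal(size=(2, 5))
y = softmax_forward(x)
grad_y = np.ones_like(y)
# With grad_y = 1 the gradient is exactly zero, since every softmax row sums to 1
print(np.allclose(softmax_backward(grad_y, y), 0))    # True
```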
 
@@ -203,27 +203,27 @@ $\square$
 
 #### Derivation of $F$ (forward computation)
 
-$F(X) = \mathrm{Log}(\mathrm{Softmax}(\mathrm{ReLU}(\mathrm{ReLU}(X\cdot W_1)\cdot W_2)\cdot W_3))$
+$F(X) = \mathrm{Log}(\mathrm{Softmax}(\mathrm{ReLU}(\mathrm{ReLU}(X\cdot W\_1)\cdot W\_2)\cdot W\_3))$
 
-where $X \in \R^{n \times 784},\ W_1 \in \R^{784\times 256},\ W_2 \in \R^{256\times 64},\ W_3 \in \R^{64\times 10},\ F(X) \in \R^{n \times 10}$, and $n$ is the number of samples
+where $X \in \mathbb R^{n \times 784},\ W\_1 \in \mathbb R^{784\times 256},\ W\_2 \in \mathbb R^{256\times 64},\ W\_3 \in \mathbb R^{64\times 10},\ F(X) \in \mathbb R^{n \times 10}$, and $n$ is the number of samples
 
 #### Derivation of the FNN backward pass
 
 From the code, the loss function defined by the model is $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y}) = -\boldsymbol {\hat y}\boldsymbol y^T$ (where $\boldsymbol {\hat y}$ denotes the prediction vector), and over the whole dataset it is defined as
-$\mathcal L(Y,\ \hat Y) = -\frac{1}{n}\sum_{i = 1}^n{\hat Y_{i:}}Y_{i:}^T = -\frac{1}{n}\sum_{i = 1}^n\sum_{j = 1}^d{\hat Y_{ij}}Y_{ij}$
+$\mathcal L(Y,\ \hat Y) = -\frac{1}{n}\sum\_{i = 1}^n{\hat Y\_{i:}}Y\_{i:}^T = -\frac{1}{n}\sum\_{i = 1}^n\sum\_{j = 1}^d{\hat Y\_{ij}}Y\_{ij}$
 
-From this we need to compute $\Delta_{\hat Y}$ (the gradient of this loss with respect to $\hat Y$, where $\hat Y = F(X)$)
+From this we need to compute $\Delta\_{\hat Y}$ (the gradient of this loss with respect to $\hat Y$, where $\hat Y = F(X)$)
 
 The derivation is as follows:
 
-#### Derivation of $\Delta_{\hat Y}$
+#### Derivation of $\Delta\_{\hat Y}$
 
 From the definition of $\mathcal L$ above, we immediately obtain
 $$
-\frac{\partial\mathcal L}{\partial\hat Y_{ij}} = -\frac{1}{n}Y_{ij}
+\frac{\partial\mathcal L}{\partial\hat Y\_{ij}} = -\frac{1}{n}Y\_{ij}
 $$
 
-This gives the gradient matrix $\Delta_{\hat Y}$ that first enters the backward pass; the gradients at the other layers are then obtained by backpropagating layer by layer
+This gives the gradient matrix $\Delta\_{\hat Y}$ that first enters the backward pass; the gradients at the other layers are then obtained by backpropagating layer by layer
 
 
 
@@ -252,7 +252,7 @@
 
 [^1]: This comes from the identity $\mathrm{d}f = \mathrm{tr}\left(\frac{\partial f}{\partial X}^T \mathrm{d}X\right)$, used for differentiating a scalar with respect to a matrix. Starting from the scalar to be differentiated, wrap it in a trace and use trace identities to bring it into this form; the transpose of the factor to the left of $\mathrm{d}X$ inside the trace is then the desired derivative
 
-[^2]:This comes from the identity $\mathrm{vec}(\mathrm{d}F) = \frac{\partial F}{\partial X}^T \mathrm{vec}(\mathrm{d}X)$, used for differentiating a matrix with respect to a matrix. Similarly, starting from the matrix to be differentiated, apply vectorization and use vectorization identities to bring it into this form; the transpose of the factor to the left of $\mathrm{vec}(\mathrm{d}X)$ is then the desired derivative
+[^2]: This comes from the identity $\mathrm{vec}(\mathrm{d}F) = \frac{\partial F}{\partial X}^T \mathrm{vec}(\mathrm{d}X)$, used for differentiating a matrix with respect to a matrix. Similarly, starting from the matrix to be differentiated, apply vectorization and use vectorization identities to bring it into this form; the transpose of the factor to the left of $\mathrm{vec}(\mathrm{d}X)$ is then the desired derivative
 
 [^3]: For this point, consider for example a function $f(x,\ y,\ z)$ with $x = g(u),\ y = h(u),\ z = t(u)$. If there is no obstacle on the differentiability side, then to obtain $\frac{\partial f}{\partial u}$ we must compute $\frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial u}$, i.e. account for every path along which $u$ acts on $f$.
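A small NumPy sketch of this dataset-level loss and of the gradient matrix that seeds backpropagation; the names `y_hat` (for $\hat Y = F(X)$, i.e. the log-softmax outputs) and `y_onehot` (for the one-hot label matrix $Y$) are illustrative and not taken from the submission's code:

```python
import numpy as np

def nll_loss(y_hat, y_onehot):
    # L(Y, Y_hat) = -(1/n) * sum_ij Y_hat_ij * Y_ij
    n = y_hat.shape[0]
    return -np.sum(y_hat * y_onehot) / n

def nll_loss_grad(y_hat, y_onehot):
    # dL/dY_hat_ij = -(1/n) * Y_ij  -> the matrix that enters the backward pass
    n = y_hat.shape[0]
    return -y_onehot / n

# Toy example: 2 samples, 3 classes; y_hat plays the role of log-softmax outputs
y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
y_hat = np.log(np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6]]))
print(nll_loss(y_hat, y_onehot))       # mean negative log-likelihood
print(nll_loss_grad(y_hat, y_onehot))  # -(1/n) * Y, nonzero only at label positions
```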
 
@@ -452,7 +452,7 @@ def mini_batch(dataset, batch_size = 128, numpy = False):
 | $1$ | $96.28\%$ | $96.45\%$ | $86.86\%$ |
 | $2$ | $97.24\%$ | $97.43\%$ | $89.22\%$ |
 
-![5.1](img/5.2.png)
+5
 
 *I actually forgot to include RMSprop at this point, but the comparison is still quite informative...
 
@@ -480,7 +480,7 @@ def mini_batch(dataset, batch_size = 128, numpy = False):
 | $1$ | $95.47\%$ | $95.17\%$ | $87.09\%$ |
 | $2$ | $96.49\%$ | $96.71\%$ | $89.76\%$ |
 
-![5.3](img/5.3.png)
+5
 
 *RMSprop was still left out at this point...
 
@@ -502,13 +502,13 @@ def mini_batch(dataset, batch_size = 128, numpy = False):
 | $1$ | $45.39\%$ | $21.94\%$ | $29.31\%$ | $89.67\%$ |
 | $2$ | $50.02\%$ | $21.94\%$ | $29.31\%$ | $89.85\%$ |
 
-| Epoch | AdaGrad | RMSprop | AdaDelta |
-| :---: | :-------: | :-----: | :-------: |
-| $0$ | $89.44\%$ | $89.56\%$ | $77.58\%$ |
-| $1$ | $91.83\%$ | $91.85\%$ | $86.84\%$ |
-| $2$ | $92.76\%$ | $92.90\%$ | $89.26\%$ |
+| Epoch | AdaGrad | RMSprop | AdaDelta |
+| :---: | :-------: | :-------: | :-------: |
+| $0$ | $89.44\%$ | $89.56\%$ | $77.58\%$ |
+| $1$ | $91.83\%$ | $91.85\%$ | $86.84\%$ |
+| $2$ | $92.76\%$ | $92.90\%$ | $89.26\%$ |
 
-![5.4](img/5.4.png)
+5
 
 #### Summary