diff --git a/assignment-2/submission/19307130062/README.md b/assignment-2/submission/19307130062/README.md
index 02a93eb112820dff16d39f05398e6a648fea011b..6615e86187728bd84f94ff46c3ef55d6595e1b00 100644
--- a/assignment-2/submission/19307130062/README.md
+++ b/assignment-2/submission/19307130062/README.md
@@ -9,31 +9,31 @@
### Matmul
-Consider $Y = XW$, where $Y \in \R^{n\times d_2},\ X \in \R^{n \times d_1},\ W \in \R^{d_1 \times d_2}$
+Consider $Y = XW$, where $Y \in \mathbb R^{n\times d\_2},\ X \in \mathbb R^{n \times d\_1},\ W \in \mathbb R^{d\_1 \times d\_2}$
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta_X,\ \Delta_W$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta\_X,\ \Delta\_W$
The derivation is as follows:
-#### Derivation of $\Delta_X$
+#### Derivation of $\Delta\_X$
-We consider the partial derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\partial Y_{ij}}{\partial X}$
+We consider the partial derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\partial Y\_{ij}}{\partial X}$
-Since $Y_{ij} = \sum_{k = 1}^{d_1}X_{ik}W_{kj}$, the entries of $X$ are independent of one another, and
+Since $Y\_{ij} = \sum\_{k = 1}^{d\_1}X\_{ik}W\_{kj}$, the entries of $X$ are independent of one another, and
$$
-\frac{\partial Y_{ij}}{\partial X} =
+\frac{\partial Y\_{ij}}{\partial X} =
\begin{bmatrix}
-\frac{\partial Y_{ij}}{\partial X_{11}} & \frac{\partial Y_{ij}}{\partial X_{12}} & \cdots & \frac{\partial Y_{ij}}{\partial X_{1d_1}} \\
-\frac{\partial Y_{ij}}{\partial X_{21}} & \frac{\partial Y_{ij}}{\partial X_{22}} & \cdots & \frac{\partial Y_{ij}}{\partial X_{2d_1}} \\
-\vdots & \vdots & \ddots & \vdots \\
-\frac{\partial Y_{ij}}{\partial X_{n1}} & \frac{\partial Y_{ij}}{\partial X_{n2}} & \cdots & \frac{\partial Y_{ij}}{\partial X_{nd_1}} \\
+\frac{\partial Y\_{ij}}{\partial X\_{11}} & \frac{\partial Y\_{ij}}{\partial X\_{12}} & \cdots & \frac{\partial Y\_{ij}}{\partial X\_{1d\_1}} \\\\
+\frac{\partial Y\_{ij}}{\partial X\_{21}} & \frac{\partial Y\_{ij}}{\partial X\_{22}} & \cdots & \frac{\partial Y\_{ij}}{\partial X\_{2d\_1}} \\\\
+\vdots & \vdots & \ddots & \vdots \\\\
+\frac{\partial Y\_{ij}}{\partial X\_{n1}} & \frac{\partial Y\_{ij}}{\partial X\_{n2}} & \cdots & \frac{\partial Y\_{ij}}{\partial X\_{nd\_1}} \\\\
\end{bmatrix}
$$
-Hence $\left[\frac{\partial Y_{ij}}{\partial X}\right]_{ik} = W_{kj},\ k \in [1,\ d_1] \cap\Z$, and all other entries are $0$
+Hence $\left[\frac{\partial Y\_{ij}}{\partial X}\right]\_{ik} = W\_{kj},\ k \in [1,\ d\_1] \cap\mathbb Z$, and all other entries are $0$
-Since $\Delta_Y$ is known, i.e. $\frac{\partial{\mathcal L}}{\partial Y_{ij}}$ is known, we have
+Since $\Delta\_Y$ is known, i.e. $\frac{\partial{\mathcal L}}{\partial Y\_{ij}}$ is known, we have
$$
-\frac{\partial{\mathcal L}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{st}}\frac{\partial{Y_{st}}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{st}}\left[\frac{\partial{Y_{st}}}{\partial X}\right]_{ij} = \sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{it}}\left[\frac{\partial{Y_{it}}}{\partial X}\right]_{ij} = \sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{it}}W_{jt} = \sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{it}}W^T_{tj}
+\frac{\partial{\mathcal L}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\frac{\partial{Y\_{st}}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\left[\frac{\partial{Y\_{st}}}{\partial X}\right]\_{ij} = \sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{it}}\left[\frac{\partial{Y\_{it}}}{\partial X}\right]\_{ij} = \sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{it}}W\_{jt} = \sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{it}}W^T\_{tj}
$$
That is,
$$
@@ -41,14 +41,14 @@ $$
$$
It follows that
$$
-\Delta_X = \Delta_YW^T
+\Delta\_X = \Delta\_YW^T
$$
-#### Derivation of $\Delta_W$
+#### Derivation of $\Delta\_W$
-Next, for $\Delta_W$, the same approach gives $\left[\frac{\partial Y_{ij}}{\partial W}\right]_{kj} = X_{ik},\ k \in [1,\ d_1] \cap\Z$, with all other entries $0$; then we have
+Next, for $\Delta\_W$, the same approach gives $\left[\frac{\partial Y\_{ij}}{\partial W}\right]\_{kj} = X\_{ik},\ k \in [1,\ d\_1] \cap\mathbb Z$, with all other entries $0$; then we have
$$
-\frac{\partial{\mathcal L}}{\partial W_{ij}} = \sum_{s = 1}^{n}\sum_{t = 1}^{d_2} \frac{\partial{\mathcal L}}{\partial Y_{st}}\left[\frac{\partial{Y_{st}}}{\partial W}\right]_{ij} = \sum_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y_{sj}}\left[\frac{\partial{Y_{sj}}}{\partial W}\right]_{ij} = \sum_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y_{sj}}X_{si} = \sum_{s = 1}^{n} X_{is}^T\frac{\partial{\mathcal L}}{\partial Y_{sj}}
+\frac{\partial{\mathcal L}}{\partial W\_{ij}} = \sum\_{s = 1}^{n}\sum\_{t = 1}^{d\_2} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\left[\frac{\partial{Y\_{st}}}{\partial W}\right]\_{ij} = \sum\_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y\_{sj}}\left[\frac{\partial{Y\_{sj}}}{\partial W}\right]\_{ij} = \sum\_{s = 1}^{n} \frac{\partial{\mathcal L}}{\partial Y\_{sj}}X\_{si} = \sum\_{s = 1}^{n} X\_{is}^T\frac{\partial{\mathcal L}}{\partial Y\_{sj}}
$$
That is,
$$
@@ -56,7 +56,7 @@ $$
$$
It follows that
$$
-\Delta_W = X^T\Delta_Y
+\Delta\_W = X^T\Delta\_Y
$$
$\square$
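+
+Below is a minimal NumPy sketch of the two backward formulas derived above. It is an illustration added to this write-up (the shapes and the names `grad_Y`, `grad_X`, `grad_W` are placeholders), not the submission's actual `Matmul` layer.
+
+```python
+import numpy as np
+
+# Forward: Y = X @ W
+n, d1, d2 = 4, 3, 5
+X = np.random.randn(n, d1)
+W = np.random.randn(d1, d2)
+Y = X @ W
+
+# Suppose the upstream gradient dL/dY is known (random placeholder here).
+grad_Y = np.random.randn(n, d2)
+
+# Backward, exactly as derived above:
+grad_X = grad_Y @ W.T   # Delta_X = Delta_Y W^T, shape (n, d1)
+grad_W = X.T @ grad_Y   # Delta_W = X^T Delta_Y, shape (d1, d2)
+```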
@@ -64,25 +64,25 @@ $\square$
### ReLU
-Consider $Y = \mathrm{ReLU}(X)$, where $Y,\ X \in \R^{n\times m}$
+Consider $Y = \mathrm{ReLU}(X)$, where $Y,\ X \in \mathbb R^{n\times m}$
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta_X$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta\_X$
The derivation is as follows:
-#### Derivation of $\Delta_X$
+#### Derivation of $\Delta\_X$
The approach is the same as above; just note that $\mathrm{ReLU}$ here is an element-wise function
-Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y_{ij}}{\mathrm{d} X}$
+Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y\_{ij}}{\mathrm{d} X}$
-Since $Y_{ij} = \mathrm{ReLU}(X_{ij})$, we have $\left[\frac{\mathrm{d}Y_{ij}}{\mathrm{d}X}\right]_{ij} = \mathrm{ReLU}'(X_{ij})$, with all other entries $0$
+Since $Y\_{ij} = \mathrm{ReLU}(X\_{ij})$, we have $\left[\frac{\mathrm{d}Y\_{ij}}{\mathrm{d}X}\right]\_{ij} = \mathrm{ReLU}'(X\_{ij})$, with all other entries $0$
Clearly,
$$
\mathrm{ReLU}'(x) = \begin{cases}
-0, & n < 0 \\
-1, & n > 0 \\
+0, & x < 0 \\\\
+1, & x > 0 \\\\
\mathrm{Undefined}, & x = 0
\end{cases}
$$
@@ -90,7 +90,7 @@ $$
Then we have
$$
-\frac{\partial{\mathcal L}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y_{st}}\frac{\mathrm{d}{Y_{st}}}{\mathrm{d} X_{ij}} = \frac{\partial{\mathcal L}}{\partial Y_{ij}}\left[\frac{\partial{Y_{ij}}}{\partial X}\right]_{ij} =\frac{\partial{\mathcal L}}{\partial Y_{ij}}\mathrm{ReLU}'(X_{ij})
+\frac{\partial{\mathcal L}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\frac{\mathrm{d}{Y\_{st}}}{\mathrm{d} X\_{ij}} = \frac{\partial{\mathcal L}}{\partial Y\_{ij}}\left[\frac{\partial{Y\_{ij}}}{\partial X}\right]\_{ij} =\frac{\partial{\mathcal L}}{\partial Y\_{ij}}\mathrm{ReLU}'(X\_{ij})
$$
That is (where $\odot$ denotes the Hadamard product, i.e. element-wise multiplication),
$$
@@ -98,7 +98,7 @@ $$
$$
It follows that
$$
-\Delta_X = \Delta_Y\odot\mathrm{ReLU}'(X)
+\Delta\_X = \Delta\_Y\odot\mathrm{ReLU}'(X)
$$
$\square$
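+
+A minimal NumPy sketch of this element-wise backward rule (an added illustration, not the submission's layer). The derivative at exactly $0$ is undefined in the derivation above; treating it as $0$, as done below, is a common implementation convention and an assumption on my part.
+
+```python
+import numpy as np
+
+X = np.random.randn(4, 6)
+Y = np.maximum(X, 0)                # forward: element-wise ReLU
+
+grad_Y = np.random.randn(*X.shape)  # upstream gradient dL/dY (placeholder)
+grad_X = grad_Y * (X > 0)           # backward: Delta_X = Delta_Y (Hadamard) ReLU'(X)
+```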
@@ -106,23 +106,23 @@ $\square$
### Log
-Consider $Y = \mathrm{Log}(X)$, where $Y,\ X \in \R^{n\times m}$
+Consider $Y = \mathrm{Log}(X)$, where $Y,\ X \in \mathbb R^{n\times m}$
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta_X$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_Y$ (the gradient of the loss with respect to $Y$) is known; we want to obtain $\Delta\_X$
The derivation is as follows:
-#### Derivation of $\Delta_X$
+#### Derivation of $\Delta\_X$
The approach is the same as above; just note that $\mathrm{Log}$ here is an element-wise function
-Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y_{ij}}{\mathrm{d} X}$
+Consider the derivative contributed by each entry of $Y$ with respect to $X$, i.e. $\frac{\mathrm{d}Y\_{ij}}{\mathrm{d} X}$
-Since $Y_{ij} = \mathrm{Log}(X_{ij})$, we have $\left[\frac{\mathrm{d}Y_{ij}}{\mathrm{d}X}\right]_{ij} = \mathrm{Log}'(X_{ij}) = \frac{1}{X_{ij}}$, with all other entries $0$
+Since $Y\_{ij} = \mathrm{Log}(X\_{ij})$, we have $\left[\frac{\mathrm{d}Y\_{ij}}{\mathrm{d}X}\right]\_{ij} = \mathrm{Log}'(X\_{ij}) = \frac{1}{X\_{ij}}$, with all other entries $0$
Then we have
$$
-\frac{\partial{\mathcal L}}{\partial X_{ij}} = \sum_{s = 1}^n\sum_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y_{st}}\frac{\mathrm{d}{Y_{st}}}{\mathrm{d} X_{ij}} = \frac{\partial{\mathcal L}}{\partial Y_{ij}}\left[\frac{\partial{Y_{ij}}}{\partial X}\right]_{ij} =\frac{\partial{\mathcal L}}{\partial Y_{ij}}\frac{1}{X_{ij}}
+\frac{\partial{\mathcal L}}{\partial X\_{ij}} = \sum\_{s = 1}^n\sum\_{t = 1}^{m} \frac{\partial{\mathcal L}}{\partial Y\_{st}}\frac{\mathrm{d}{Y\_{st}}}{\mathrm{d} X\_{ij}} = \frac{\partial{\mathcal L}}{\partial Y\_{ij}}\left[\frac{\partial{Y\_{ij}}}{\partial X}\right]\_{ij} =\frac{\partial{\mathcal L}}{\partial Y\_{ij}}\frac{1}{X\_{ij}}
$$
That is (where $\frac{1}{X}$ denotes the element-wise reciprocal of $X$),
$$
@@ -130,7 +130,7 @@ $$
$$
It follows that
$$
-\Delta_X = \Delta_Y\odot\frac{1}{X}
+\Delta\_X = \Delta\_Y\odot\frac{1}{X}
$$
$\square$
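+
+A minimal NumPy sketch of the Log backward rule (an added illustration; keeping the inputs positive so the logarithm is defined is my assumption, not something stated above).
+
+```python
+import numpy as np
+
+X = np.random.rand(4, 6) + 1e-3     # positive entries so log is defined
+Y = np.log(X)                       # forward: element-wise natural logarithm
+
+grad_Y = np.random.randn(*X.shape)  # upstream gradient dL/dY (placeholder)
+grad_X = grad_Y / X                 # backward: Delta_X = Delta_Y (Hadamard) (1 / X)
+```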
@@ -138,9 +138,9 @@ $\square$
### Softmax
-Consider $\boldsymbol y = \mathrm{Softmax}(\boldsymbol x)$, where $\boldsymbol y,\ \boldsymbol x \in \R^{1 \times c}$
+Consider $\boldsymbol y = \mathrm{Softmax}(\boldsymbol x)$, where $\boldsymbol y,\ \boldsymbol x \in \mathbb R^{1 \times c}$
-Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta_{\boldsymbol y}$ (the gradient of the loss with respect to $\boldsymbol y$) is known; we want the expression for $\boldsymbol y$ (forward computation) and $\Delta_{\boldsymbol x}$
+Let the loss function be $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y})$, and suppose $\Delta\_{\boldsymbol y}$ (the gradient of the loss with respect to $\boldsymbol y$) is known; we want the expression for $\boldsymbol y$ (forward computation) and $\Delta\_{\boldsymbol x}$
The derivation is as follows:
@@ -148,36 +148,36 @@ $\square$
From the definition of $\mathrm{Softmax}$, we get
$$
-\boldsymbol y_i = \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}
+\boldsymbol y\_i = \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}
$$
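+
+For the forward computation, a common numerically stable implementation subtracts the row-wise maximum before exponentiating (Softmax is shift-invariant, so the result is unchanged). This is a standard trick and an assumption on my part; it may or may not match the submission's code.
+
+```python
+import numpy as np
+
+def softmax(x):
+    # x: (n, c) matrix of row vectors; each row is normalized independently.
+    e = np.exp(x - x.max(axis=1, keepdims=True))  # shift for numerical stability
+    return e / e.sum(axis=1, keepdims=True)
+```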
-#### Derivation of $\Delta_{\boldsymbol x}$
+#### Derivation of $\Delta\_{\boldsymbol x}$
Since
$$
-\boldsymbol y_i = \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}
+\boldsymbol y\_i = \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}
$$
and
$$
\frac{\partial \boldsymbol y}{\partial\boldsymbol x} =
\begin{bmatrix}
-\frac{\partial \boldsymbol y_1}{\partial \boldsymbol x_1} & \frac{\partial \boldsymbol y_1}{\partial \boldsymbol x_2} & \cdots & \frac{\partial \boldsymbol y_1}{\partial \boldsymbol x_c} \\
-\frac{\partial \boldsymbol y_2}{\partial \boldsymbol x_1} & \frac{\partial \boldsymbol y_2}{\partial \boldsymbol x_2} & \cdots & \frac{\partial \boldsymbol y_2}{\partial \boldsymbol x_c} \\
-\vdots & \vdots & \ddots & \vdots \\
-\frac{\partial \boldsymbol y_c}{\partial \boldsymbol x_1} & \frac{\partial \boldsymbol y_c}{\partial \boldsymbol x_2} & \cdots & \frac{\partial \boldsymbol y_c}{\partial \boldsymbol x_c} \\
+\frac{\partial \boldsymbol y\_1}{\partial \boldsymbol x\_1} & \frac{\partial \boldsymbol y\_1}{\partial \boldsymbol x\_2} & \cdots & \frac{\partial \boldsymbol y\_1}{\partial \boldsymbol x\_c} \\\\
+\frac{\partial \boldsymbol y\_2}{\partial \boldsymbol x\_1} & \frac{\partial \boldsymbol y\_2}{\partial \boldsymbol x\_2} & \cdots & \frac{\partial \boldsymbol y\_2}{\partial \boldsymbol x\_c} \\\\
+\vdots & \vdots & \ddots & \vdots \\\\
+\frac{\partial \boldsymbol y\_c}{\partial \boldsymbol x\_1} & \frac{\partial \boldsymbol y\_c}{\partial \boldsymbol x\_2} & \cdots & \frac{\partial \boldsymbol y\_c}{\partial \boldsymbol x\_c} \\\\
\end{bmatrix}
$$
Hence, when $i = j$, we have
$$
-\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]_{ii} = \frac{\partial \boldsymbol y_i}{\partial \boldsymbol x_i} = \frac{\partial\left( \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}\right)}{\partial \boldsymbol x_i} = \frac{e^{\boldsymbol x_i}(\sum_{j = 1}^ce^{\boldsymbol x_j}) - e^{\boldsymbol x_i}e^{\boldsymbol x_i}}{\left(\sum_{j = 1}^ce^{\boldsymbol x_j}\right)^2} = \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}\frac{\left(\sum_{j = 1}^ce^{\boldsymbol x_j}\right) - e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}} = \boldsymbol y_i(1 - \boldsymbol y_i)
+\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]\_{ii} = \frac{\partial \boldsymbol y\_i}{\partial \boldsymbol x\_i} = \frac{\partial\left( \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}\right)}{\partial \boldsymbol x\_i} = \frac{e^{\boldsymbol x\_i}(\sum\_{j = 1}^ce^{\boldsymbol x\_j}) - e^{\boldsymbol x\_i}e^{\boldsymbol x\_i}}{\left(\sum\_{j = 1}^ce^{\boldsymbol x\_j}\right)^2} = \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}\frac{\left(\sum\_{j = 1}^ce^{\boldsymbol x\_j}\right) - e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}} = \boldsymbol y\_i(1 - \boldsymbol y\_i)
$$
and when $i \neq j$, we have
$$
-\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]_{ij} = \frac{\partial \boldsymbol y_i}{\partial \boldsymbol x_j} = \frac{\partial\left( \frac{e^{\boldsymbol x_i}}{\sum_{j = 1}^ce^{\boldsymbol x_j}}\right)}{\partial \boldsymbol x_j} = \frac{-e^{\boldsymbol x_i}e^{\boldsymbol x_j}}{\left(\sum_{j = 1}^ce^{\boldsymbol x_j}\right)^2} = -\boldsymbol y_i\boldsymbol y_j
+\left[\frac{\partial \boldsymbol y}{\partial\boldsymbol x}\right]\_{ij} = \frac{\partial \boldsymbol y\_i}{\partial \boldsymbol x\_j} = \frac{\partial\left( \frac{e^{\boldsymbol x\_i}}{\sum\_{j = 1}^ce^{\boldsymbol x\_j}}\right)}{\partial \boldsymbol x\_j} = \frac{-e^{\boldsymbol x\_i}e^{\boldsymbol x\_j}}{\left(\sum\_{j = 1}^ce^{\boldsymbol x\_j}\right)^2} = -\boldsymbol y\_i\boldsymbol y\_j
$$
Then we have
$$
-\frac{\partial{\mathcal L}}{\partial\boldsymbol x_{j}} = \sum_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y_{i}}\frac{\partial\boldsymbol y_i}{\partial \boldsymbol x_j} = \sum_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y_{i}}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]_{ij}
+\frac{\partial{\mathcal L}}{\partial\boldsymbol x\_{j}} = \sum\_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y\_{i}}\frac{\partial\boldsymbol y\_i}{\partial \boldsymbol x\_j} = \sum\_{i = 1}^c\frac{\partial{\mathcal L}}{\partial\boldsymbol y\_{i}}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]\_{ij}
$$
That is,
$$
@@ -185,7 +185,7 @@ $$
$$
It follows that (where $\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]$ has already been computed above)
$$
-\Delta_{\boldsymbol x} = \Delta_{\boldsymbol y}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]
+\Delta\_{\boldsymbol x} = \Delta\_{\boldsymbol y}\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]
$$
Since the original code works with a matrix made up of row vectors, we can apply the procedure above to each row separately; the rows do not interfere with one another, which completes the backward pass of $\mathrm{Softmax}$
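+
+A minimal NumPy sketch of this row-wise backward pass (an added illustration; `softmax_backward` and its arguments are placeholder names). It builds the Jacobian $\left[\frac{\partial\boldsymbol y}{\partial \boldsymbol x}\right]$ for each row exactly as derived above.
+
+```python
+import numpy as np
+
+def softmax_backward(y, grad_y):
+    # y, grad_y: (n, c); each row is handled independently, as noted above.
+    grad_x = np.empty_like(grad_y)
+    for i in range(y.shape[0]):
+        yi = y[i]                             # one row of Softmax outputs
+        jac = np.diag(yi) - np.outer(yi, yi)  # [dy/dx]_{ij} = y_i (delta_ij - y_j)
+        grad_x[i] = grad_y[i] @ jac           # Delta_x = Delta_y [dy/dx]
+    return grad_x
+```
+
+Equivalently, the Jacobian never has to be materialized: `grad_x = y * (grad_y - (grad_y * y).sum(axis=1, keepdims=True))` gives the same result in fully vectorized form.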
@@ -203,27 +203,27 @@ $\square$
#### Derivation of $F$ (forward computation)
-$F(X) = \mathrm{Log}(\mathrm{Softmax}(\mathrm{ReLU}(\mathrm{ReLU}(X\cdot W_1)\cdot W_2)\cdot W_3))$
+$F(X) = \mathrm{Log}(\mathrm{Softmax}(\mathrm{ReLU}(\mathrm{ReLU}(X\cdot W\_1)\cdot W\_2)\cdot W\_3))$
-where $X \in \R^{n \times 784},\ W_1 \in \R^{784\times 256},\ W_2 \in \R^{256\times 64},\ W_3 \in \R^{64\times 10},\ F(X) \in \R^{n \times 10}$, and $n$ is the number of data samples
+where $X \in \mathbb R^{n \times 784},\ W\_1 \in \mathbb R^{784\times 256},\ W\_2 \in \mathbb R^{256\times 64},\ W\_3 \in \mathbb R^{64\times 10},\ F(X) \in \mathbb R^{n \times 10}$, and $n$ is the number of data samples
#### Derivation of the FNN backward pass
From the code, the loss function defined by the model is $\mathcal L(\boldsymbol y,\ \boldsymbol {\hat y}) = -\boldsymbol {\hat y}\boldsymbol y^T$ (where $\boldsymbol {\hat y}$ denotes the prediction vector), and over the whole dataset it is defined as
-$\mathcal L(Y,\ \hat Y) = -\frac{1}{n}\sum_{i = 1}^n{\hat Y_{i:}}Y_{i:}^T = -\frac{1}{n}\sum_{i = 1}^n\sum_{j = 1}^d{\hat Y_{ij}}Y_{ij}$
+$\mathcal L(Y,\ \hat Y) = -\frac{1}{n}\sum\_{i = 1}^n{\hat Y\_{i:}}Y\_{i:}^T = -\frac{1}{n}\sum\_{i = 1}^n\sum\_{j = 1}^d{\hat Y\_{ij}}Y\_{ij}$
-From this we need to compute $\Delta_{\hat Y}$ (the gradient of this loss with respect to $\hat Y$, where $\hat Y = F(X)$)
+From this we need to compute $\Delta\_{\hat Y}$ (the gradient of this loss with respect to $\hat Y$, where $\hat Y = F(X)$)
The derivation is as follows:
-#### Derivation of $\Delta_{\hat Y}$
+#### Derivation of $\Delta\_{\hat Y}$
From the definition of $\mathcal L$ above, we can easily obtain
$$
-\frac{\partial\mathcal L}{\partial\hat Y_{ij}} = -\frac{1}{n}Y_{ij}
+\frac{\partial\mathcal L}{\partial\hat Y\_{ij}} = -\frac{1}{n}Y\_{ij}
$$
-This yields the gradient matrix $\Delta_{\hat Y}$ that first enters the backward pass; the gradients at the other layers are then obtained by propagating backwards layer by layer
+This yields the gradient matrix $\Delta\_{\hat Y}$ that first enters the backward pass; the gradients at the other layers are then obtained by propagating backwards layer by layer
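+
+A minimal NumPy sketch of the forward pass $F$ and of the initial gradient $\Delta\_{\hat Y}$ derived above (an added illustration; the function names and the one-hot label matrix `Y_onehot` are placeholders, not the submission's classes).
+
+```python
+import numpy as np
+
+def forward(X, W1, W2, W3):
+    # F(X) = Log(Softmax(ReLU(ReLU(X @ W1) @ W2) @ W3))
+    h1 = np.maximum(X @ W1, 0)
+    h2 = np.maximum(h1 @ W2, 0)
+    logits = h2 @ W3
+    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable Softmax
+    return np.log(e / e.sum(axis=1, keepdims=True))
+
+def initial_grad(Y_onehot):
+    # dL/dYhat_ij = -(1/n) * Y_ij: the gradient that first enters the backward pass.
+    n = Y_onehot.shape[0]
+    return -Y_onehot / n
+```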
@@ -252,7 +252,7 @@ $$
[^1]: This comes from the identity $\mathrm{d}f = \mathrm{tr}\left(\frac{\partial f}{\partial X}^T \mathrm{d}X\right)$, used for differentiating a scalar with respect to a matrix. Starting from the scalar to be differentiated, wrap it in a trace and apply trace identities until it reaches this form; the transpose of the factor to the left of $\mathrm{d}X$ inside the trace is then the desired derivative
-[^2]:This comes from the identity $\mathrm{vec}(\mathrm{d}F) = \frac{\partial F}{\partial X}^T \mathrm{vec}(\mathrm{d}X)$, used for differentiating a matrix with respect to a matrix. Similarly, starting from the matrix to be differentiated, apply vectorization and use vectorization identities until it reaches this form; the transpose of the factor to the left of $\mathrm{vec}(\mathrm{d}X)$ is then the desired derivative
+[^2]: This comes from the identity $\mathrm{vec}(\mathrm{d}F) = \frac{\partial F}{\partial X}^T \mathrm{vec}(\mathrm{d}X)$, used for differentiating a matrix with respect to a matrix. Similarly, starting from the matrix to be differentiated, apply vectorization and use vectorization identities until it reaches this form; the transpose of the factor to the left of $\mathrm{vec}(\mathrm{d}X)$ is then the desired derivative
[^3]: For this point, consider for example a function $f(x,\ y,\ z)$ with $x = g(u),\ y = h(u),\ z = t(u)$. If there are no obstructions regarding differentiability, then to obtain $\frac{\partial f}{\partial u}$ we must compute $\frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial u}$, that is, account for every path through which $u$ acts on $f$.
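+
+To make footnote [^1] concrete, here is the trace trick applied to the Matmul case above; it re-derives $\Delta\_X = \Delta\_YW^T$ and is added purely as an illustration:
+
+$$
+\mathrm{d}\mathcal L = \mathrm{tr}\left(\Delta\_Y^T\mathrm{d}Y\right) = \mathrm{tr}\left(\Delta\_Y^T\mathrm{d}X\ W\right) = \mathrm{tr}\left(W\Delta\_Y^T\mathrm{d}X\right) \implies \frac{\partial \mathcal L}{\partial X} = \left(W\Delta\_Y^T\right)^T = \Delta\_YW^T
+$$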
@@ -452,7 +452,7 @@ def mini_batch(dataset, batch_size = 128, numpy = False):
| $1$ | $96.28\%$ | $96.45\%$ | $86.86\%$ |
| $2$ | $97.24\%$ | $97.43\%$ | $89.22\%$ |
-
+
*Actually, I forgot to include RMSprop in this run, but the comparison is still quite rich...
@@ -480,7 +480,7 @@ def mini_batch(dataset, batch_size = 128, numpy = False):
| $1$ | $95.47\%$ | $95.17\%$ | $87.09\%$ |
| $2$ | $96.49\%$ | $96.71\%$ | $89.76\%$ |
-
+
*RMSprop was still left out at this point...
@@ -502,13 +502,13 @@ def mini_batch(dataset, batch_size = 128, numpy = False):
| $1$ | $45.39\%$ | $21.94\%$ | $29.31\%$ | $89.67\%$ |
| $2$ | $50.02\%$ | $21.94\%$ | $29.31\%$ | $89.85\%$ |
-| Epoch | AdaGrad | RMSprop | AdaDelta |
-| :---: | :-------: | :-----: | :-------: |
-| $0$ | $89.44\%$ | $89.56\%$ | $77.58\%$ |
-| $1$ | $91.83\%$ | $91.85\%$ | $86.84\%$ |
-| $2$ | $92.76\%$ | $92.90\%$ | $89.26\%$ |
+| Epoch | AdaGrad | RMSprop | AdaDelta |
+| :---: | :-------: | :-------: | :-------: |
+| $0$ | $89.44\%$ | $89.56\%$ | $77.58\%$ |
+| $1$ | $91.83\%$ | $91.85\%$ | $86.84\%$ |
+| $2$ | $92.76\%$ | $92.90\%$ | $89.26\%$ |
-
+
#### Summary