diff --git a/assignment-1/submission/18300110042/README.md b/assignment-1/submission/18300110042/README.md new file mode 100644 index 0000000000000000000000000000000000000000..869015d3335cf5552420b6755ff47b7cd076e1c0 --- /dev/null +++ b/assignment-1/submission/18300110042/README.md @@ -0,0 +1,688 @@ +# 课程报告 +这是`prml-21-spring/assignment-1`的课程报告,我的代码在 [source.py](source.py) 中,[knn_lab.dat](knn_lab.dat)中可以设置每次实验时的参数,包括数据参数和模型参数。数据参数有:分几组,每组有多少个样本,数据的均值及方差;模型参数有:k值,weights,距离计算方法和数据归一化/标准化方法。 + +## KNN Classifier +k近邻法是一种监督学习的算法,可以用于分类或回归问题,本次作业中用`python`实现了k近邻的分类器。 + +算法的输入是训练数据集 $$\\{ (x_1, y_1), (x_2, y_2) ,\cdots , (x_N, y_N) \\}, $$ 其中 $x_i \in X$ 是某一样本的特征向量, $y_i \in Y = \\{ c_1, \cdots c_K \\} $ 是该样本的标签;以及某一需要判断的实例 $x.$ + +而输出则是某实例 $x$ 的标签 $y.$ + +将 $x$ 映射为 $y$ 时,需要 +- 通过某种距离算法,找出训练集中与 $x$ 最近的 $k$ 个样本; +- 根据某种规则(e.g., 投票或按照距离加权)决定 $x$ 的标签 $y$. + +作为一种基于实例的、非参的学习算法,k-NN需要存储整个数据集,并在计算的时候对整个数据集进行迭代。 +对于 $N$ 个样本,每个样本特征维度为 $D$ ,则对于一个目标样本的预测,需要的时间复杂度就是 $O(N*D)$ ,因而在数据量或特征维度较大的时候,k-NN的效率就会偏低。 + +### KNN类实现 +#### 初始化 +根据`KNN`的性质,本次试验中设计了如下几个初始化参数: +- `k`,决定k近邻的数量; +实验中设定 $k\in [1, 50]$ ; +- `weights` 决定距离在投票中所占的比重; +交叉验证时的可选项为 $ \\{0, 0.1, 0.2, 0.5, 1, 2 \\}$ +- `norm`,设置数据归一化/标准化的方法; +- `dist`,设定距离的计算方法; +可选项为 `Manhattan` 或 `Euclidean`. + +#### fit() 函数 +fit() 函数主要进行: +设置超参数; +需要交叉验证时,通过10折交叉验证、网格搜索,选取平均准确率最高的超参数组合,如果传入数据小于 $10$ ,则使用 `leave-one-out cross validation`. + +#### predict() 函数 +predict() 函数根据已有的数据推断测试数据的标签。 +在 `predict` 的过程中,需要进行距离的计算以及最终的决策,实验中选择的距离的算法,以及确定测试数据标签的规则都在这一步实现。 + +## 实验探究 +实验探究主要分为两个部分: +1. 探究数据对于kNN模型的影响; +2. 探究kNN模型本身的优化方法. + +决定kNN模型的有三个基本要素: +- 距离度量; +- k值的选择; +- 分类决策规定; + +而进一步,kNN的模型确定下来后,就完全基于训练数据进行预测,因而数据的分布对于模型的表现非常重要,甚至可以说数据进一步规定了模型。 + +我们首先通过固定这三个基本要素,探究数据对于一个固定的kNN模型的影响;而后再进一步从这三个基本要素开始,探究模型可能的优化方法。 + +### 1. 数据对于kNN模型的影响 +先固定距离度量方式为 `Euclidean distance`;分类决策规定为投票方法;在每一次实验中固定 `k` 值;通过改变数据的分布进行探究。 + +#### 初步实验 +1. 
生成三组数据 +通过以下分布生成了三组数据,每组400个样本,共1200个: + +| | $1$ | $2$ | $3$ | +| :----: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 50 \end{bmatrix}$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ | + +这是生成的数据集: +![总的数据](img/data_plotted_data_1.png "all data (test_1)") + +这是训练集: +![训练集](img/data_plotted_train_data_1.png "training data (test_1)") + +这是测试集: +![测试集](img/data_plotted_test_data_1.png "test data (test_1)") + +以下是不同k时的准确率: +![accs_test_1](img/accs_test_1.png "accs_test_1") + + +2. 修改数据的均值,重新生成三组数据,每组400个,共1200个: + +| | $1$ | $2$ | $3$ | +| :----: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 20 \end{bmatrix}$ | $\begin{bmatrix} 2 & 20 \end{bmatrix}$ | $\begin{bmatrix} 2 & 25 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ | + +这是生成的数据集: +![总的数据](img/data_plotted_data_2.png "all data (test_2)") + +这是训练集: +![训练集](img/data_plotted_train_data_2.png "training data (test_2)") + +这是测试集: +![测试集](img/data_plotted_test_data_2.png "test data (test_2)") + +以下是不同k时的准确率: +![accs_test_2](img/accs_test_2.png "accs_test_2") + +3. 
修改数据的协方差矩阵(增大数据的方差),重新生成三组数据,每组400个,共1200个: + +| | 1 | 2 | 3 | +| :----: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 50 \end{bmatrix}$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 40 \end{bmatrix}$ | $\begin{bmatrix} 40 & 15 \\\\ 15 & 80 \end{bmatrix}$ | $\begin{bmatrix} 30 & 0 \\\\ 0 & 50 \end{bmatrix}$ | + +这是生成的数据集: +![总的数据](img/data_plotted_data_3.png "all data (test_3)") + +这是训练集: +![训练集](img/data_plotted_train_data_3.png "training data (test_3)") + +这是测试集: +![测试集](img/data_plotted_test_data_3.png "test data (test_3)") + +以下是不同k时的准确率: +![accs_test_3](img/accs_test_3.png "accs_test_3") + + +4. 生成5组数据,每组240个样本,共1200个: + +| | $1$ | $2$ | $3$ | $4$ | $5$ | +| :----: | :------------: | :------------: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 50 \end{bmatrix}$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 25 & 25 \end{bmatrix}$ | $\begin{bmatrix} 40 & 5 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $\begin{bmatrix} 10 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 0 \\\\ 0 & 10 \end{bmatrix}$ | + +这是生成的数据集: +![总的数据](img/data_plotted_data_4.png "all data (test_4)") + +这是训练集: +![训练集](img/data_plotted_train_data_4.png "training data (test_4)") + +这是测试集: +![测试集](img/data_plotted_test_data_4.png "test data (test_4)") + +以下是不同k时的准确率: +![accs_test_4](img/accs_test_4.png "accs_test_4") + +5. 
用(4)中同样的分布生成数据,每组400个样本,共2000个: + +这是生成的数据集: +![总的数据](img/data_plotted_data_5.png "all data (test_5)") + +这是训练集: +![训练集](img/data_plotted_train_data_5.png "training data (test_5)") + +这是测试集: +![测试集](img/data_plotted_test_data_5.png "test data (test_5)") + +以下是不同k时的准确率: +![accs_test_5](img/accs_test_5.png "accs_test_5") + + +以下是五次实验中准确率和对应的 `k` 的汇总: + +|k | test_1 | test_2 | test_3 | test_4 | test_5 | +| :---: | :------: | :------: | :------: | :------: | :------: | +| 1 | 0.9500 | 0.5792 | 0.8250 | 0.9125 | 0.9325 | +| 3 | 0.9583 | 0.6083 | 0.8708 | 0.9083 | 0.9450 | +| 5 | 0.9542 | 0.6500 | 0.8583 | 0.9167 | 0.9625 | +| 9 | 0.9583 | 0.6792 | 0.8792 | 0.9208 | 0.9625 | +| 15 | 0.9583 | 0.6625 | 0.8833 | 0.9333 | 0.9650 | +| 20 | 0.9500 | 0.6208 | 0.8833 | 0.9333 | 0.9650 | + +可以看到,当将均值调得非常接近时,kNN的准确率是最低的,`test_2` 中的准确率非常的低;当方差放大时,准确率也有所下降(`test_3`),但前面表现不好的较大的 `k` 的准确率有所提升。当类别扩展为5类,且分布“距离”较大时,准确率没有明显下降(`test_4`,`test_5`);`test_4` 中每一类别的数量(`240`)比 `test_5` 中(`400`)少了一些,当每一类别的样本数量变大时,准确率有一定的提升,但这可能和分布的“距离”等别的因素有关。 + +通过以上观察,我们发现数据中影响kNN准确率的因素可能有两个: +- 类别的样本数量 +- 分布的“距离” + +#### 进一步探究 +- 首先,对于类别中的样本数量问题,我们进行新的实验. + +6. 
取实验(3)中的样本分布,即:

| | $1$ | $2$ | $3$ |
| :----: | :------------: | :------------: | :------------: |
| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 50 \end{bmatrix}$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ |
| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 40 \end{bmatrix}$ | $\begin{bmatrix} 40 & 15 \\\\ 15 & 80 \end{bmatrix}$ | $\begin{bmatrix} 30 & 0 \\\\ 0 & 50 \end{bmatrix}$ |

重新进行实验,调整样本个数,得到准确率结果如下(表头为每一类的样本个数,各类别数量相等):

|k | 100 | 300 | 400 | 700 | 1000 |
| :---: | :------: | :------: | :------: | :------: | :------: |
| 1 | 0.8667 | 0.8556 | 0.8292 | 0.8048 | 0.7733 |
| 3 | 0.8500 | 0.8111 | 0.8750 | 0.8190 | 0.7967 |
| 5 | 0.9167 | 0.8500 | 0.8542 | 0.8238 | 0.8067 |
| 9 | 0.9000 | 0.8611 | 0.8583 | 0.8285 | 0.7983 |
| 15 | 0.8833 | 0.8611 | 0.8708 | 0.8238 | 0.8200 |
| 20 | 0.8833 | 0.8778 | 0.8792 | 0.8357 | 0.8233 |


7. 取实验(1)中的样本分布,即:

| | $1$ | $2$ | $3$ |
| :----: | :------------: | :------------: | :------------: |
| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 50 \end{bmatrix}$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ |
| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ |

重新进行实验,调整样本个数,得到准确率结果如下(表头为每一类的样本个数,各类别数量相等):

|k | 100 | 300 | 400 | 700 | 1000 |
| :---: | :------: | :------: | :------: | :------: | :------: |
| 1 | 0.8500 | 0.9222 | 0.9125 | 0.9095 | 0.9050 |
| 3 | 0.8833 | 0.9278 | 0.9250 | 0.9214 | 0.9250 |
| 5 | 0.8833 | 0.9278 | 0.9333 | 0.9262 | 0.9300 |
| 9 | 0.8833 | 0.9278 | 0.9208 | 0.9238 | 0.9400 |
| 15 | 0.8667 | 0.9389 | 0.9208 | 0.9286 | 0.9383 |
| 20 | 0.8667 | 0.9333 | 0.9250 | 0.9286 | 0.9417 |

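上面(6)、(7)两组实验的流程,可以用如下的 `numpy` 草图近似复现。这里用一个极简的投票式kNN代替报告中的 `KNN` 类;`knn_predict`、`run_experiment` 等函数名仅为示意,并非 [source.py](source.py) 中的实现:

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=5):
    """极简的投票式kNN:欧氏距离 + 多数表决(仅作示意)。"""
    preds = []
    for x in test_x:
        d = np.linalg.norm(train_x - x, axis=1)        # 到所有训练样本的欧氏距离
        nn_labels = train_y[np.argsort(d)[:k]]         # 最近的 k 个样本的标签
        preds.append(np.bincount(nn_labels).argmax())  # 多数表决
    return np.array(preds)

def run_experiment(means, covs, n_per_class, k=5, test_ratio=0.2, seed=0):
    """按给定的高斯参数生成各类数据,按 8:2 划分训练/测试集,返回测试准确率。"""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for label, (mu, sigma) in enumerate(zip(means, covs)):
        xs.append(rng.multivariate_normal(mu, sigma, n_per_class))
        ys.append(np.full(n_per_class, label))
    x, y = np.concatenate(xs), np.concatenate(ys)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_ratio)
    test_i, train_i = idx[:n_test], idx[n_test:]
    preds = knn_predict(x[train_i], y[train_i], x[test_i], k=k)
    return (preds == y[test_i]).mean()
```

调整 `n_per_class` 即可观察每一类样本数量对准确率的影响。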
从(6)和(7)中可以看出,样本数量的影响在不同的分布下有所不同:当几组数据的分布本身“距离”较小时,样本数量的增加对准确率没有明显的提升,从某种程度上可以理解为kNN对该任务本身性能不足;而当数据分布本身“距离”较大时,样本数量未超过一定值的时候,kNN的表现会较弱,类似对数据的欠拟合;当样本数量超过一定值后,其数量的增加对模型的准确率就没有很大的影响了。


- 分布的“距离”
在以上的实验中,我们发现分布的一些参数可以直接影响kNN的准确率。回顾kNN的模型,当模型通过三个基本要素确立了以后,其决策边界就由训练数据集直接决定。在实验中,训练集和测试集来自相同的总体,准确率就与数据的分布密切相关。

用直觉判断,我们可以很容易地理解这个现象:均值接近时,不同组的样本更容易混杂在一起,模型很难根据最近的点做出准确的预测;而当方差放大时,不同均值的数据也更容易接近,导致准确率下降。进一步说,分布之间的“距离”从某种程度上决定了kNN的准确率:对于直观上较为“接近”的分布,kNN的分类效果较差;而对于“距离”较远的分布,kNN的分类效果较好。

那我们可以猜想,如果掌握了关于不同组数据的分布的信息,是否就可以直接对kNN的准确率进行预测?

在本次试验中,训练数据与测试数据来自同样的服从高斯分布的总体,定义该总体的参数就是 $\mu$ 和 $\Sigma$ 。实验中,这两个参数对于准确率都有一定的影响,那么我们是否有可能把这两者结合起来考察?或者通过这两个参数获得某种关于分布之间“距离”的度量?

对于不同分布之间的“距离”,有很多计算方法,但限于知识水平,这里只尝试了三种:
1. KL散度(Kullback-Leibler divergence);
2. 最大均值差异(Maximum Mean Discrepancy, MMD);
3. Wasserstein Distance.

因为这些方法都可以用来衡量两个分布之间的距离,为了简化问题,我们在以下实验中随机产生两组数量相同、服从不同高斯分布的二维数据,分别通过以上几种度量方法计算分布之间的距离,而后观察其与kNN算法准确率之间的关系。

`随机` 指的是 $\mu_i \sim U [ -50, 50 ], i\in \\{ 1, 2 \\}$ ;且 $\Sigma_{tr_{i}} \sim U [ 0, 100 ], i\in \\{ 1, 2 \\}$ ,其中 $tr_{i}$ 指协方差矩阵的对角元素;而对于非对角元素,我们采取设置为 `0` 或 $\Sigma_{\tilde{tr}} \sim U [-\sqrt{\Sigma_{tr_1}\times\Sigma_{tr_2}}, \sqrt{\Sigma_{tr_1}\times\Sigma_{tr_2}} ]$ 的方法:因为要确保生成的是半正定矩阵,这里简单地采用对称阵,即两个非对角元素相等。


##### 1. KL Divergence
KL散度从信息论的角度衡量两个分布之间的信息差距。对于连续随机变量 $x$,以及两个概率分布 $p ( x )$ 和 $q ( x )$,它们之间的KL散度为: $$\begin{equation}
\begin{aligned}
D_{KL} ( p ( x ) || q ( x ) ) & = E_{p}[\ln{p ( x )} - \ln{q ( x )}]
\\\\ & = \int_{-\infty}^{\infty} {p ( x ) \ln{\frac{p ( x ) }{q ( x ) } } dx}
\end{aligned}
\end{equation}$$.
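对于两个多元高斯分布,上式存在闭式解,可以直接由参数 $\mu$ 和 $\Sigma$ 计算。下面是一个基于 `numpy` 的计算草图(`gaussian_kl` 为示意用的函数名):

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """计算 D_KL( N(mu1, sigma1) || N(mu2, sigma2) ) 的闭式解。"""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    sigma1, sigma2 = np.asarray(sigma1, float), np.asarray(sigma2, float)
    k = mu1.shape[0]                 # 特征维度
    inv2 = np.linalg.inv(sigma2)
    diff = mu2 - mu1                 # 二次型 (mu1-mu2)^T inv2 (mu1-mu2) 对符号不敏感
    return 0.5 * (np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1))
                  - k
                  + diff @ inv2 @ diff
                  + np.trace(inv2 @ sigma1))
```

例如,`gaussian_kl(mu, sigma, mu, sigma)` 应为 `0`;交换两个分布的参数一般会得到不同的值,体现KL散度的不对称性。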
需要注意的是,KL散度衡量的是用 $q ( x )$ 近似 $p ( x )$ 时的信息损失,与 $p ( x )$ 对于 $q ( x )$ 的不同,具有不对称性,这一点与通常意义上的“距离”不同。

一般统计时常使用离散的数据点作为随机变量 $x$ 的取值,在计算中通过求和近似;但因为实验中可以直接得到高斯分布的参数 $\mu$ 和 $\Sigma$(而且可以使用`numpy`直接进行计算,计算量也较小),所以这里尝试直接计算两个分布之间的KL散度。

记两个高斯分布的总体分别为 $N\mathbf{(\boldsymbol{\mu_1}, \Sigma_1)}$,$N\mathbf{(\boldsymbol{\mu_2}, \Sigma_2)}$,则有 $$p(\mathbf{x})=\frac{1}{(2 \pi)^{k / 2}|\Sigma_1|^{1 / 2}} \exp (-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu_1})^{T} \Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu_1}))$$, $$q(\mathbf{x})=\frac{1}{(2 \pi)^{k / 2}|\Sigma_2|^{1 / 2}} \exp (-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu_2})^{T} \Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu_2}))$$.

可以得到 $$D_{K L}(p \| q)=\frac{1}{2}[\log \frac{|\Sigma_2|}{|\Sigma_1|}-k+(\mu_1-\mu_2)^{T} \Sigma_2^{-1}(\mu_1-\mu_2)+tr\\{\Sigma_2^{-1} \Sigma_1\\}]$$,其中 $k$ 是样本 $x$ 的特征维度。(具体的推导过程参考 [这里](https://mr-easy.github.io/2020-04-16-kl-divergence-between-2-gaussian-distributions/ "KL Divergence between 2 Gaussian Distributions"))

每次生成 `2000` 个样本(为了保持类别平衡,每组样本个数相同,即每组 `1000` 个),通过 `100` 次迭代画图,我们最终得到的KL散度与kNN预测准确率的关系如下:
- 当随机生成的协方差矩阵为对角阵时,

k=1时,
![KL散度与准确率的关系(k=1)](img/accs_kldiv_1nns_diagonal_large.png "k=1时KL散度与准确率的关系")

k=3时,
![KL散度与准确率的关系(k=3)](img/accs_kldiv_3nns_diagonal_large.png "k=3时KL散度与准确率的关系")

k=5时,
![KL散度与准确率的关系(k=5)](img/accs_kldiv_5nns_diagonal_large.png "k=5时KL散度与准确率的关系")

k=10时,
![KL散度与准确率的关系(k=10)](img/accs_kldiv_10nns_diagonal_large.png "k=10时KL散度与准确率的关系")

k=20时,
![KL散度与准确率的关系(k=20)](img/accs_kldiv_20nns_diagonal_large.png "k=20时KL散度与准确率的关系")

k=50时,
![KL散度与准确率的关系(k=50)](img/accs_kldiv_50nns_diagonal_large.png "k=50时KL散度与准确率的关系")


- 当随机生成的协方差矩阵不是对角阵时,

k=1时,
![KL散度与准确率的关系(k=1)](img/accs_kldiv_1nns_random.png "k=1时KL散度与准确率的关系")

k=3时,
![KL散度与准确率的关系(k=3)](img/accs_kldiv_3nns_random.png "k=3时KL散度与准确率的关系")

k=5时,
![KL散度与准确率的关系(k=5)](img/accs_kldiv_5nns_random.png "k=5时KL散度与准确率的关系")

k=10时,
![KL散度与准确率的关系(k=10)](img/accs_kldiv_10nns_random.png "k=10时KL散度与准确率的关系")

k=20时,
![KL散度与准确率的关系(k=20)](img/accs_kldiv_20nns_random.png "k=20时KL散度与准确率的关系")

k=50时,
![KL散度与准确率的关系(k=50)](img/accs_kldiv_50nns_random.png "k=50时KL散度与准确率的关系")

可以观察到,当KL散度超过一定的值以后,kNN的准确率基本都维持在较高的水平,两者有一定的相关性;但当KL散度较小时,准确率波动很大。

- 改变 `随机` 中均值和方差选取的范围,观察KL散度较小时其与kNN准确率之间的关系。

使 $\mu_i \sim U [ -10, 10 ], i\in \\{ 1, 2 \\}$ ;且 $\Sigma_{tr_{i}} \sim U [ 0, 50 ], i\in \\{ 1, 2 \\}$ ,其中 $tr_{i}$ 指协方差矩阵的对角元素;为了简化实验,对于非对角元素,我们直接设置为 `0`. 为了限制KL散度,剔除掉KL散度大于 `300` 的分布。

k=1时,
![KL散度与准确率的关系(k=1)](img/accs_kldiv_1nns_diagonal.png "k=1时KL散度与准确率的关系")

k=3时,
![KL散度与准确率的关系(k=3)](img/accs_kldiv_3nns_diagonal.png "k=3时KL散度与准确率的关系")

k=5时,
![KL散度与准确率的关系(k=5)](img/accs_kldiv_5nns_diagonal.png "k=5时KL散度与准确率的关系")

k=10时,
![KL散度与准确率的关系(k=10)](img/accs_kldiv_10nns_diagonal.png "k=10时KL散度与准确率的关系")

k=20时,
![KL散度与准确率的关系(k=20)](img/accs_kldiv_20nns_diagonal.png "k=20时KL散度与准确率的关系")

k=50时,
![KL散度与准确率的关系(k=50)](img/accs_kldiv_50nns_diagonal.png "k=50时KL散度与准确率的关系")

仍然可以观察到有一定的相关性,特别是当 `k` 较大时较为明显;但在更小的范围内,准确率的波动较大。

##### 2. Maximum Mean Discrepancy

接下来的 `2` 和 `3` ,限于能力,对其概念的了解并不深入,代码也调用了除`numpy`和`matplotlib`之外的包:`pytorch`、`scipy`。这里都直接使用已有的样本计算“距离”,而没有通过分布参数计算。

`Maximum Mean Discrepancy` 也可以用来度量两个分布之间的距离:将数据映射到更高维的特征空间,以两个分布在该空间中差距最大的统计量(均值嵌入之差)作为距离的度量标准。

令 $\mu_i \sim U [ -10, 10 ], i\in \\{ 1, 2 \\}$ ;且 $\Sigma_{tr_{i}} \sim U [ 0, 50 ], i\in \\{ 1, 2 \\}$ ,其中 $tr_{i}$ 指协方差矩阵的对角元素;为了简化实验,对于非对角元素,我们直接设置为 `0`. 在这里我们迭代 `20` 次,得到更为清晰的图片。

这里是`k`分别等于 `2, 10, 20, 50` 时的图:
![Maximum Mean Discrepancy](img/MMD.png "Maximum Mean Discrepancy")


##### 3. Wasserstein Distance
`Wasserstein distance` 衡量的是把数据从一个分布“移动成”另一个分布时,所需移动的平均距离的最小值。相比 `KL Divergence` ,它具有对称性,也可以描述如何从一个分布转化为另一个分布。

使 $\mu_i \sim U [ -10, 10 ], i\in \\{ 1, 2 \\}$ ;且 $\Sigma_{tr_{i}} \sim U [ 0, 50 ], i\in \\{ 1, 2 \\}$ ,其中 $tr_{i}$ 指协方差矩阵的对角元素;为了简化实验,对于非对角元素,我们直接设置为 `0`. 
在这里我们迭代 `20` 次,得到更为清晰的图片。 + +k=1时, +![Wasserstein Distance (k=1) ](img/dist1.png "k=1时Wasserstein Distance") + +k=3时, +![Wasserstein Distance (k=3) ](img/dist3.png "k=3时Wasserstein Distance") + +k=5时, +![Wasserstein Distance (k=5) ](img/dist5.png "k=5时Wasserstein Distance") + +k=10时, +![Wasserstein Distance (k=10) ](img/dist10.png "k=10时Wasserstein Distance") + +k=20时, +![Wasserstein Distance (k=20) ](img/dist20.png "k=20时Wasserstein Distance") + +k=50时, +![Wasserstein Distance (k=50) ](img/dist50.png "k=50时Wasserstein Distance") + +仍然可以观察到有一定的相关性,特别是当 `k` 较大时,较为明显。但在更小的范围内,准确率的波动较大。 + + +##### 小结 +当kNN的模型确定后,训练数据集直接决定了模型的表现。样本的数量和数据的分布相结合,会对结果产生一定的影响。 + +- 当数据分布的“距离”较远时,样本数量不足会降低模型的表现;但当样本数量超过一定值的时候,模型的表现就不会再提升,反而因为模型需要遍历所有的样本导致时间复杂度较高,效率较低;而当数据分布的“距离”较近时,模型本身的能力有限,样本数量也不会对模型产生明显的影响。从这一点看,在优化kNN时,考虑时间复杂度为 $O(N*D)$ ,一方面可以考虑通过一些算法(如 `Fisher`)将数据降维(减小 `D`);另一方面也可以控制样本数量,选择较具有代表性的样本(减小 `N`);或使用 `KD-tree` 提高搜索效率。从这些方面可以帮助提高kNN的性能。 + +- 数据的分布“距离”直接影响了kNN算法所能达到的上限,通过数据分布的参数直接计算的“距离”与kNN的表现之间有一定的相关性,当“距离”超过一定值的时候,kNN表现较为稳定;但当“距离”限于一定范围内时,波动很大。通过训练数据拟合的“距离”在预测kNN的表现时表现的更为可靠。一方面可能有度量“距离”方法的问题,另一方面也可以说明数据本身,比起分布,更为直接的对kNN造成了影响。 + +- 这里我们产生的样本是服从二元高斯分布的,或许可以进一步探究,(1)对于两类不同分布的样本来说,分布的“距离”与kNN的准确率是否仍有直接的关系;(2)如果是服从别的分布,如均匀分布、对数正态分布、伽马分布等的样本,分布间的“距离”是否仍然与kNN的准确率有一定联系,或者kNN对于这类数据并不适用。 + + + +### 2. 
kNN模型的优化 +决定kNN模型的三个基本要素为:(1)距离度量;(2)`k` 值的选择;(3)分类决策规定。 + +以下从这三个方面进行探究。 + +#### 1)距离计算 +kNN进行预测,选择k个最近邻时首先需要的就是衡量距离的方法。 + +- 距离计算方法 +一般kNN都使用欧氏距离进行计算,这里分别尝试了曼哈顿距离和欧氏距离,他们都属于 `Minkowski Distance` 的一种。 + +对于两个向量 $\boldsymbol{X}$ 和 $\boldsymbol{Y}$,`Minkowski Distance` 的计算公式为:$$\sqrt[p]{\sum_{i=1}^{n} { (x_{i} - y_{i})^{p}}} $$,其中 $n$ 为 $\boldsymbol{X}$ 和 $\boldsymbol{Y}$ 的维度。当 $p = 1$ 时,计算的就是曼哈顿距离;当 $p = 2$ 时,计算的是欧氏距离。 + +- 曼哈顿距离 +曼哈顿距离,也称 `L1-distance` ,与欧氏距离不同,曼哈顿距离的度量受坐标轴的影响,或者可以说是欧氏距离在坐标轴上的投影之和。 + +曼哈顿距离的计算公式即为:$$\textrm{Dist}(\boldsymbol{X}, \boldsymbol{Y}) = \sum_{i=1}^{n} {| x_{i}-y_{i}| }$$ ,其中 $n$ 为 $\boldsymbol{X}$ 和 $\boldsymbol{Y}$ 的维度。 + +- 欧氏距离 +欧氏距离,即 `L2-distance` ,直接衡量两个点在空间中的距离。 + +计算公式为:$$\textrm{Dist} (\boldsymbol{X}, \boldsymbol{Y}) = \sqrt{\sum_{i=1}^{n} { (x_{i}-y_{i})^2}}$$ ,其中 $n$ 为 $\boldsymbol{X}$ 和 $\boldsymbol{Y}$ 的维度。 + +鉴于我们的数据都服从二元高斯分布(维度为2),猜想这两种距离计算的方法不会对结果产生很大的影响。 + +在测试的过程中,随机生成了12组数据,每组数据一共有 `1200` 个样本,其中 `80%` 是训练集,`20%` 是测试集; +随机生成的12组数据中,类别数量不同,每个类别的样本数量也不同,具体值如下: + +| 类别数量 | 1 | 2 | 3 | 4 | +| :---: | :----------: | :----------: | :----------: | :----------: | +| 3 | $[400, 400, 400] $ | $[600, 400, 200]$ | $[800, 200, 200]$ | $[900, 200, 100]$ | +| 5 | $[240, 240, 240, 240, 240] $ | $[400, 300, 250, 125, 125]$ | $[500, 300, 200, 100, 100]$ | $[600, 400, 100, 70, 30]$ | +| 7 | $[170, 170, 170, 170, 170, 175, 175] $ | $[300, 200, 200, 150, 150, 150, 50]$ | $[400, 300, 200, 100, 100, 80, 20]$ | $[500, 400, 100, 100, 70, 20, 10]$ | + +在固定 `k` 的情况下观察准确率的变化,$k \in [1,19]$。 + +这里的 `随机` 指的是 $\mu_i \sim U[ -50, 50 ], i\in \\{ 1, 2 \\}$ ;且 $\Sigma_{tr_{i}} \sim U [ 0, 100 ], i\in \\{ 1, 2 \\}$ ,其中 $tr_{i}$ 指协方差矩阵的对角元素;而对于非对角元素,$\Sigma_{\tilde{tr}} \sim U [-\sqrt{\Sigma_{tr_1}\times\Sigma_{tr_2}}, \sqrt{\Sigma_{tr_1}\times\Sigma_{tr_2}} ]$ 。 + +生成的数据如下: +![生成的所有数据](img/data_batch_plotted_all_dist.png "all data (testing distances)") + +对于每个 `k` 时的准确率取均值,发现 `距离计算方法` 不同时,准确率的均值没有明显变化,结果如下: + +| k | Manhattan Distance | Euclidean 
Distance|
| :---:| :----------: | :----------: |
| 1 | 0.9368 | 0.9365 |
| 3 | 0.9462 | 0.9462 |
| 5 | 0.9493 | 0.9493 |
| 9 | 0.9545 | 0.9524 |
| 13 | 0.9517 | 0.9524 |
| 17 | 0.9538 | 0.9545 |
| 19 | 0.9520 | 0.9542 |


接下来都使用 `Euclidean Distance` 进行计算。


- 归一化/标准化

由于kNN的决策完全依赖数据间的距离,而我们使用的距离计算方法会均等地考虑空间中各个维度。如果数据的某一维度的绝对值普遍偏大,那么这一维度上的“距离”所占的比重就会比其他维度大很多,造成评判标准的不合理。因而在面对不同的数据时,常常会使用归一化或标准化的方法。

实验中产生的数据都服从高斯分布,比较规则,这项操作可能不会对实验结果有很大的改善。

我们选取的归一化/标准化的方法主要有如下两个:
1. `Min-max normalization`
记某一特征维度上的取值为 $\boldsymbol{x} = [ x_1,x_2,\cdots ,x_n ]$ ,对每个 $i \in \\{ 1, \cdots, n \\}$ ,
`Min-max normalization` 的计算公式为:$$\hat{x_{i}} = \frac{x_{i} - \min{\boldsymbol{x}}}{\max{\boldsymbol{x}} - \min{\boldsymbol{x}}}$$
这样做可以将数据压缩到 $[0, 1]$ 之间,但会改变原有的数据分布。

2. `Standardization`
记某一特征维度上的取值为 $\boldsymbol{x} = [ x_1,x_2,\cdots ,x_n ]$ ,对每个 $i \in \\{ 1, \cdots, n \\}$ ,
`Standardization` 的计算公式为:$$\hat{x_{i}} = \frac{x_{i} - \textrm{mean}(\boldsymbol{x})}{\textrm{std}(\boldsymbol{x})}$$ 变换后数据的均值为 `0`、标准差为 `1`(对高斯数据即近似服从 `N ( 0,1 )` );相比`Min-max normalization`,可以更好地保留原有的数据分布形状。

采取与之前同样的方法,随机生成12组数据,观察准确率.
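这两种预处理方法的一个简单实现草图如下(按特征维度逐列计算;`min_max_normalize`、`standardize` 为示意用的函数名):

```python
import numpy as np

def min_max_normalize(x):
    """逐列缩放到 [0, 1]:(x - min) / (max - min)。"""
    x = np.asarray(x, float)
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn)

def standardize(x):
    """逐列标准化:(x - mean) / std,变换后每列均值为 0、标准差为 1。"""
    x = np.asarray(x, float)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```

注意在实际使用时,应只用训练集的统计量来变换测试集,以避免信息泄漏。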
+ +生成的数据如下: +![生成的所有数据](img/data_batch_plotted_all_norm.png "all data (testing normalization)") + +得到的平均结果如下: + +| k | No Normalization | Min-Max Normalization | Standardization | +| :---: | :----------: | :----------: | :----------: | +| 1 | 0.9188 | 0.9163 | 0.9146 | +| 3 | 0.9191 | 0.9170 | 0.9208 | +| 5 | 0.9215 | 0.9198 | 0.9187 | +| 9 | 0.9267 | 0.9260 | 0.9247 | +| 13 | 0.9281 | 0.9271 | 0.9271 | +| 17 | 0.9250 | 0.9243 | 0.9243 | +| 20 | 0.9264 | 0.9240 | 0.9236 | + +在平均的情况下是否进行预处理对模型的表现影响较小。 + + +尝试比较极端的情况————两个维度之间的方差相差较大: + +生成三组数据,每组400个,共1200个: + +| | $1$ | $2$ | $3$ | +| :----: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 40 \end{bmatrix}$ | $\begin{bmatrix} 5 & 30 \end{bmatrix}$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 1500 \end{bmatrix}$ | $\begin{bmatrix} 2 & 0 \\\\ 0 & 1000 \end{bmatrix}$ | $\begin{bmatrix} 5 & 0 \\\\ 0 & 500 \end{bmatrix}$ | + +这是生成的数据集: +![总的数据](img/data_plotted_data_big_var.png "all data") + +这是训练集: +![训练集](img/data_plotted_train_data_big_var.png "training data") + +这是测试集: +![测试集](img/data_plotted_test_data_big_var.png "test data") + + +得到的准确率如下: + +| k | No Normalization | Min-Max Normalization | Standardization | +| :---: | :----------: | :----------: | :----------: | +| 1 | 0.8542 | 0.8875 | 0.8667 | +| 3 | 0.8833 | 0.9042 | 0.9000 | +| 5 | 0.8833 | 0.9083 | **0.9250** | +| 9 | 0.8792 | 0.9042 | 0.9125 | +| 13 | 0.8792 | 0.9042 | 0.9000 | +| 17 | 0.8667 | 0.9083 | 0.9042 | +| 20 | 0.8667 | 0.9083 | 0.8958 | + +在两个维度间方差较大的情况下,可以看到归一化/标准化对准确率的提升是有一定帮助的,而标准化产生的最佳结果较好。 + + +#### 2)k值的选择 +当其他值确定,且训练数据不变时,`k` 的选择决定了模型的决策边界。当我们在对 `k` 进行优化时,实际上在针对给定数据选取最合适的决策边界,即 `k`. + +通过几次实验尝试画出了k不同时的决策边界。 + +1. 
两类样本的情况
通过以下参数生成了两组数据(每组数据为 `100` 个):

| | $1$ | $2$ |
| ---- | :------------: | :------------: |
| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 30 \end{bmatrix}$ | $\begin{bmatrix} 2 & 30 \end{bmatrix}$ |
| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ |

![这是画出的决策边界](img/boundry_2clusters.png "decision boundary (2 classes)")

准确率如下:

| k | Accuracy |
| :---: | :---------: |
| 1 | 0.675 |
| **3** | 0.725 |
| 5 | 0.675 |
| 9 | 0.550 |
| 11 | 0.625 |
| 13 | 0.625 |
| 15 | 0.600 |
| 17 | 0.550 |
| 19 | 0.500 |


2. 三类样本的情况
通过以下参数生成了三组数据:

| | $1$ | $2$ | $3$ |
| :----: | :------------: | :------------: | :------------: |
| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 20 \end{bmatrix}$ | $\begin{bmatrix} 5 & 20 \end{bmatrix}$ | $\begin{bmatrix} 15 & 15 \end{bmatrix}$ |
| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ |

![这是画出的决策边界](img/boundry_3clusters.png "decision boundary (3 classes)")

准确率如下:

| k | Accuracy |
| :---: | :---------: |
| 1 | 0.900 |
| 3 | 0.883 |
| **5** | 0.917 |
| 9 | 0.900 |
| 11 | 0.900 |
| 13 | 0.900 |
| 15 | 0.833 |
| 17 | 0.900 |
| 19 | 0.900 |


3. 
五类样本的情况
通过以下参数生成了五组数据:

| | $1$ | $2$ | $3$ | $4$ | $5$ |
| :----: | :------------: | :------------: | :------------: | :------------: | :------------: |
| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 20 \end{bmatrix}$ | $\begin{bmatrix} 5 & 20 \end{bmatrix}$ | $\begin{bmatrix} 20 & 30 \end{bmatrix}$ | $\begin{bmatrix} 30 & 25 \end{bmatrix}$ | $\begin{bmatrix} 25 & 25 \end{bmatrix}$ |
| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $\begin{bmatrix} 10 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $\begin{bmatrix} 2 & 5 \\\\ 5 & 50 \end{bmatrix}$ |

![这是画出的决策边界](img/boundry_5clusters.png "decision boundary (5 classes)")


准确率如下:

| k | Accuracy |
| :---: | :---------: |
| 1 | 0.760 |
| 3 | 0.810 |
| 5 | 0.830 |
| 9 | 0.860 |
| 11 | 0.860 |
| **13** | 0.880 |
| **15** | 0.880 |
| 17 | 0.870 |
| 19 | 0.860 |

kNN的决策边界是非线性的。k较小时,决策边界较为陡峭,模型复杂度较高;随着k的增大,决策边界趋向平缓,模型复杂度降低,性能也可能随之下降。


#### 3)分类决策规定
我们在进行之前的实验时,都是运用 `投票` 方法进行决策;而直觉上,我们也可能想到距离更近的点所投的“票”应该更为重要。因而这里尝试改变分类决策的规定,根据距离进行加权运算。

选择的根据距离进行加权的公式为:$$w (x,x_{i}) = \exp{\\{-\lambda \|x - x_{i}\|^{2}\\}}, i \in \\{1,2,\cdots ,k \\} $$ 其中 $x$ 为待预测的实例,$x_{i}$ 为被选中的 `k` 个样本中的第 `i` 个,$\lambda \ge 0$ 为超参数,决定距离在最终决策中所占的权重:$\lambda$ 越大,距离所占的权重越大;$\lambda$ 越小,距离所占的权重越小;当 $\lambda = 0$ 时,则与 `投票` 方法相同,不考虑距离。

最终我们预测的 $x$ 属于各类别的概率为:$$\textrm{Pr} (y|x ) = \frac{{\textstyle \sum_{i=1}^{k}{w(x,x_{i} ) \delta(y,y_{i})}}}{{\textstyle \sum_{i=1}^{k}{w(x,x_{i})}}}$$
其中,$\delta (y,y_{i}) = \begin{cases} 1, & y = y_{i} \\\\ 0, & y \ne y_{i} \end{cases}$ 为示性函数,$\textrm{Pr} (y|x)$ 为待预测的实例的标签为 $y$ 的概率,$y_{i}$ 为被选中的 `k` 个样本中第 `i` 个样本的标签。

在实际操作中,由于各类别的分母相同,计算过程中省略了归一化这一步,直接比较加权和。

自定义的 `KNN` 类中,若 $\lambda$ 未指定,则自动从候选集 $\lambda \in \\{0, 0.1, 0.2, 0.5, 1, 2 \\}$ 中优化选取。

我们采用了和之前画决策边界时同样的分布,即:

1. 
两类样本的情况 + +通过以下参数生成了两组数据(每组数据为 `100` 个): + +| | $1$ | $2$ | +| ---- | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 30 \end{bmatrix}$ | $\begin{bmatrix} 2 & 30 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | + +![这是画出的决策边界](img/boundry_2clusters_w.png "decision boundary (2 classes with weights)") + +优化后的权重和准确率如下: + +| k | $\lambda$ | Accuracy | +| :---: | :-------: | :---------: | +| 1 | 0.2 | 0.625 | +| 3 | 2 | 0.625 | +| 5 | 1 | 0.675 | +| 9 | 0.2 | 0.775 | +| **11** | 0 | 0.800 | +| 13 | 0.1 | 0.775 | +| 15 | 0.5 | 0.775 | +| 17 | 2 | 0.750 | +| **19** | 0.5 | 0.800 | + + +2. 三类样本的情况 +通过以下参数生成了三组数据(每组数据为 `100` 个): + +| | $1$ | $2$ | $3$ | +| :----: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 20 \end{bmatrix}$ | $\begin{bmatrix} 5 & 20 \end{bmatrix}$ | $\begin{bmatrix} 15 & 15 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ | + +![这是画出的决策边界](img/boundry_3clusters_w.png "decision boundary (3 classes with weights)") + +优化后的权重和准确率如下: + +| k | $\lambda$ | Accuracy | +| :---: | :-------: | :---------: | +| 1 | 1 | 0.900 | +| 3 | 0.1 | 0.917 | +| **5** | 0.2 | 0.967 | +| 9 | 0.1 | 0.933 | +| 11 | 0.1 | 0.917 | +| **13** | 0.2 | 0.967 | +| **15** | 0.2 | 0.967 | +| **17** | 0 | 0.967 | +| **19** | 0.1 | 0.967 | + + +3. 
五类样本的情况 +通过以下参数生成了五组数据(每组数据为 `100` 个): + +| | $1$ | $2$ | $3$ | $4$ | $5$ | +| :----: | :------------: | :------------: | :------------: | :------------: | :------------: | +| $\boldsymbol{\\mu}$ | $\begin{bmatrix} 1 & 20 \end{bmatrix}$ | $\begin{bmatrix} 5 & 20 \end{bmatrix}$ | $\begin{bmatrix} 20 & 30 \end{bmatrix}$ | $\begin{bmatrix} 30 & 25 \end{bmatrix}$ | $\begin{bmatrix} 25 & 25 \end{bmatrix}$ | +| $\boldsymbol{\\Sigma}$ | $\begin{bmatrix} 1 & 0 \\\\ 0 & 10 \end{bmatrix}$ | $\begin{bmatrix} 10 & 15 \\\\ 15 & 40 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $\begin{bmatrix} 10 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $\begin{bmatrix} 2 & 5 \\\\ 5 & 50 \end{bmatrix}$ | + +![这是画出的决策边界](img/boundry_5clusters_w.png "decision boundary (5 classes with weights)") + +优化后的权重和准确率如下: + +| k | $\lambda$ | Accuracy | +| :---: | :-------: | :---------: | +| 1 | 1 | 0.760 | +| 3 | 2 | 0.800 | +| 5 | 2 | 0.810 | +| 9 | 2 | 0.800 | +| 11 | 0.1 | 0.810 | +| **13** | 1 | 0.830 | +| 15 | 2 | 0.800 | +| 17 | 2 | 0.800 | +| **19** | 1 | 0.830 | + +可以看到当加入权重时,会倾向于选择更大的k,准确率普遍有所提升(不过因为两次的数据并不相同,只是遵从相同的分布,可能有一定的偶然性)。 + +但kNN的 `decision boundary` 不再随 `k` 变化得那么明显,模型在 `k` 变大后的能力衰减较小。 + +##### 小结 +- 模型中 `距离度量方法` 的变化对于实验选取的数据的结果没有明显的影响,而kNN一般也使用 `Euclidean Distance` 进行度量。而在这里需要注意预处理时数据在不同特征维度上的方差,进而影响距离绝对值大小的因素,考虑对数据进行归一化/标准化操作。 +其中归一化的操作会修改数据的原始分布,造成一定的问题;标准化的操作可能在大多数情况下更好。 + +- `k` 值可以看做对模型的平滑处理,`k` 值越大,模型的复杂度降低,决策边界也会更加平缓,但预测的能力也会有所下降。 + +- 我们可以尝试通过改变 `分类决策规定` 来对模型进行调整,一般我们直接使用 `投票` 方法进行决策。对此,一种常见且符合直觉的方法就是根据距离调整 `k` 个训练样本 `投票` 所占的比例。通过加权算法,模型在 `k` 值变大时仍然可以保持一定的复杂度,提高能力。 + +在对 `k` 值和 `分类决策规定` 进行调整的过程中,我们可以在模型的复杂程度和稳定性(决策边界的平缓程度)之间做一些 trade-off 。 + + +## 总结 +- kNN是一个比较简单的监督学习方法,属于基于实例的非参数估计,因而其能力直接受数据影响,在本实验中,当样本量足够的时候,数据的分布对其分类准确率有直接的影响。本实验中探究的是kNN对于二元高斯分布产生数据的分类效果。kNN适用于哪类分布的数据,以及如何更好的度量数据的分布与kNN准确率的关系可以进一步探究。 + +- kNN的复杂度为 `O(N*D)`,因而会有 “Curse of Dimensionality” 的问题,我们可以从 `N` 和 `D` 两个角度进行优化,比如通过如 `KD-tree`的方法减少搜索的时间复杂度;或者通过一些算法对特征进行降维来减少 
`D`。而在实验中发现,样本数量的不断增加对于kNN预测的准确率没有很大的影响,所以在保持样本数量足够的前提下,我们可以选取更合适、更具有代表性的样本来减少 `N`。 + +- kNN的三个基本要素为:(1)距离度量;(2)`k` 值的选择;(3)分类决策规定。通过改变这三个要素,我们可以对模型进行一定的优化,使之更适合需要判断的数据。 + + +### 代码运行方法: +``` + python source.py 0 #0-5,为不同的lab number +``` diff --git a/assignment-1/submission/18300110042/img/MMD.png b/assignment-1/submission/18300110042/img/MMD.png new file mode 100644 index 0000000000000000000000000000000000000000..9e480548d5f6d576c474182fa70c555893b1d324 Binary files /dev/null and b/assignment-1/submission/18300110042/img/MMD.png differ diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_10nns_diagonal.png b/assignment-1/submission/18300110042/img/accs_kldiv_10nns_diagonal.png new file mode 100644 index 0000000000000000000000000000000000000000..4e9849c67ddd9edf818572ea8822cd9ceab88a75 Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_10nns_diagonal.png differ diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_10nns_diagonal_large.png b/assignment-1/submission/18300110042/img/accs_kldiv_10nns_diagonal_large.png new file mode 100644 index 0000000000000000000000000000000000000000..9046a45c112f977a9ebc42ffa942fcf1116e9f9e Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_10nns_diagonal_large.png differ diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_10nns_random.png b/assignment-1/submission/18300110042/img/accs_kldiv_10nns_random.png new file mode 100644 index 0000000000000000000000000000000000000000..94669f10b58b3aafd07baa55a6b448bd6d63180b Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_10nns_random.png differ diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_1nns_diagonal.png b/assignment-1/submission/18300110042/img/accs_kldiv_1nns_diagonal.png new file mode 100644 index 0000000000000000000000000000000000000000..1ed8bcaf3b0da7535cac018eec49a7757f71fea0 Binary files /dev/null and 
b/assignment-1/submission/18300110042/img/accs_kldiv_1nns_diagonal.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_1nns_diagonal_large.png b/assignment-1/submission/18300110042/img/accs_kldiv_1nns_diagonal_large.png
new file mode 100644
index 0000000000000000000000000000000000000000..506cf2165072be5616a2d5e30f4878a71ce281e5
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_1nns_diagonal_large.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_1nns_random.png b/assignment-1/submission/18300110042/img/accs_kldiv_1nns_random.png
new file mode 100644
index 0000000000000000000000000000000000000000..4c6f0be1495ecab3c2c695484e5ed2d4e84f3e72
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_1nns_random.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_20nns_diagonal.png b/assignment-1/submission/18300110042/img/accs_kldiv_20nns_diagonal.png
new file mode 100644
index 0000000000000000000000000000000000000000..164a17e030d5e11f470049025eefaff931cd0b17
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_20nns_diagonal.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_20nns_diagonal_large.png b/assignment-1/submission/18300110042/img/accs_kldiv_20nns_diagonal_large.png
new file mode 100644
index 0000000000000000000000000000000000000000..8434573b55ed5edf048c09cdab328f4e1b067459
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_20nns_diagonal_large.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_20nns_random.png b/assignment-1/submission/18300110042/img/accs_kldiv_20nns_random.png
new file mode 100644
index 0000000000000000000000000000000000000000..9ec5715674fbc35dc5b1d7f68e2c4a727f60bcb0
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_20nns_random.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_3nns_diagonal.png b/assignment-1/submission/18300110042/img/accs_kldiv_3nns_diagonal.png
new file mode 100644
index 0000000000000000000000000000000000000000..0e5987644b9ceef93d5bda14ad5862c124d4a15b
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_3nns_diagonal.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_3nns_diagonal_large.png b/assignment-1/submission/18300110042/img/accs_kldiv_3nns_diagonal_large.png
new file mode 100644
index 0000000000000000000000000000000000000000..7487bdf99524b465c03bf092bddfbc873363e6e0
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_3nns_diagonal_large.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_3nns_random.png b/assignment-1/submission/18300110042/img/accs_kldiv_3nns_random.png
new file mode 100644
index 0000000000000000000000000000000000000000..9fed2e781371dd4b6453d59b54b6302b3203bb79
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_3nns_random.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_50nns_diagonal.png b/assignment-1/submission/18300110042/img/accs_kldiv_50nns_diagonal.png
new file mode 100644
index 0000000000000000000000000000000000000000..a19fe3d0d17942370977fc431c39f718c35ea445
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_50nns_diagonal.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_50nns_diagonal_large.png b/assignment-1/submission/18300110042/img/accs_kldiv_50nns_diagonal_large.png
new file mode 100644
index 0000000000000000000000000000000000000000..27b898128a6014de89deff15a42b2322f2f8c86a
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_50nns_diagonal_large.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_50nns_random.png b/assignment-1/submission/18300110042/img/accs_kldiv_50nns_random.png
new file mode 100644
index 0000000000000000000000000000000000000000..f28948c5ab02093a0eae2eba203b05426e6e5520
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_50nns_random.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_5nns_diagonal.png b/assignment-1/submission/18300110042/img/accs_kldiv_5nns_diagonal.png
new file mode 100644
index 0000000000000000000000000000000000000000..7a1fcf7ab270cb9fe3632e5acf4e41d1e21650c2
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_5nns_diagonal.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_5nns_diagonal_large.png b/assignment-1/submission/18300110042/img/accs_kldiv_5nns_diagonal_large.png
new file mode 100644
index 0000000000000000000000000000000000000000..81623abf058b28136bffef667c67507881cc1882
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_5nns_diagonal_large.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_kldiv_5nns_random.png b/assignment-1/submission/18300110042/img/accs_kldiv_5nns_random.png
new file mode 100644
index 0000000000000000000000000000000000000000..11c0de98a997239eee14f9e46d9316b24a42e215
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_kldiv_5nns_random.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_test_1.png b/assignment-1/submission/18300110042/img/accs_test_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..1f177a5a01946a8018862e221d47e7a5aa124c72
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_test_1.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_test_2.png b/assignment-1/submission/18300110042/img/accs_test_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..01208478a4f536660d120ca4fa8a47c1c6485a7b
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_test_2.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_test_3.png b/assignment-1/submission/18300110042/img/accs_test_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..3ce585932b84b001a4e4bc769652aa885d12d77e
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_test_3.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_test_4.png b/assignment-1/submission/18300110042/img/accs_test_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..935579fe4f3e4566870ceadd050209a4a54df3cb
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_test_4.png differ
diff --git a/assignment-1/submission/18300110042/img/accs_test_5.png b/assignment-1/submission/18300110042/img/accs_test_5.png
new file mode 100644
index 0000000000000000000000000000000000000000..5803abe316944d6175bd5124ef42fbeec13579e5
Binary files /dev/null and b/assignment-1/submission/18300110042/img/accs_test_5.png differ
diff --git a/assignment-1/submission/18300110042/img/boundry_2clusters.png b/assignment-1/submission/18300110042/img/boundry_2clusters.png
new file mode 100644
index 0000000000000000000000000000000000000000..05fdfd6c5ae7ae0fbfde22cd06a56aead09078fa
Binary files /dev/null and b/assignment-1/submission/18300110042/img/boundry_2clusters.png differ
diff --git a/assignment-1/submission/18300110042/img/boundry_2clusters_w.png b/assignment-1/submission/18300110042/img/boundry_2clusters_w.png
new file mode 100644
index 0000000000000000000000000000000000000000..05fdfd6c5ae7ae0fbfde22cd06a56aead09078fa
Binary files /dev/null and b/assignment-1/submission/18300110042/img/boundry_2clusters_w.png differ
diff --git a/assignment-1/submission/18300110042/img/boundry_3clusters.png b/assignment-1/submission/18300110042/img/boundry_3clusters.png
new file mode 100644
index 0000000000000000000000000000000000000000..66a8dddd694f17aa2f412c08fcc08ceb75a93e18
Binary files /dev/null and b/assignment-1/submission/18300110042/img/boundry_3clusters.png differ
diff --git a/assignment-1/submission/18300110042/img/boundry_3clusters_w.png b/assignment-1/submission/18300110042/img/boundry_3clusters_w.png
new file mode 100644
index 0000000000000000000000000000000000000000..66a8dddd694f17aa2f412c08fcc08ceb75a93e18
Binary files /dev/null and b/assignment-1/submission/18300110042/img/boundry_3clusters_w.png differ
diff --git a/assignment-1/submission/18300110042/img/boundry_5clusters.png b/assignment-1/submission/18300110042/img/boundry_5clusters.png
new file mode 100644
index 0000000000000000000000000000000000000000..88aed93e47738b63c7f8c8b8712018c93eaa4e6b
Binary files /dev/null and b/assignment-1/submission/18300110042/img/boundry_5clusters.png differ
diff --git a/assignment-1/submission/18300110042/img/boundry_5clusters_w.png b/assignment-1/submission/18300110042/img/boundry_5clusters_w.png
new file mode 100644
index 0000000000000000000000000000000000000000..88aed93e47738b63c7f8c8b8712018c93eaa4e6b
Binary files /dev/null and b/assignment-1/submission/18300110042/img/boundry_5clusters_w.png differ
diff --git a/assignment-1/submission/18300110042/img/data_batch_plotted_all_dist.png b/assignment-1/submission/18300110042/img/data_batch_plotted_all_dist.png
new file mode 100644
index 0000000000000000000000000000000000000000..0cf7ff48201971c4273eb2c11df48a880fc0b307
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_batch_plotted_all_dist.png differ
diff --git a/assignment-1/submission/18300110042/img/data_batch_plotted_all_norm.png b/assignment-1/submission/18300110042/img/data_batch_plotted_all_norm.png
new file mode 100644
index 0000000000000000000000000000000000000000..84429d4aeb0c6341101121ed051bb6ab37e52aef
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_batch_plotted_all_norm.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_data_1.png b/assignment-1/submission/18300110042/img/data_plotted_data_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..b1145c9b1e0e3bfa692d16feb0375acdf68c3ec3
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_data_1.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_data_2.png b/assignment-1/submission/18300110042/img/data_plotted_data_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..2264fb038478e1057cb5b152432c70543a0ca500
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_data_2.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_data_3.png b/assignment-1/submission/18300110042/img/data_plotted_data_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..a40044a3990b9c3761ea86e61a46067b00fb9bf8
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_data_3.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_data_4.png b/assignment-1/submission/18300110042/img/data_plotted_data_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..f69af197376048c62ea9d5790db6ffec941b7dd7
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_data_4.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_data_5.png b/assignment-1/submission/18300110042/img/data_plotted_data_5.png
new file mode 100644
index 0000000000000000000000000000000000000000..03c93ffd629cc4d8139a99bb917152a69d2508e4
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_data_5.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_data_big_var.png b/assignment-1/submission/18300110042/img/data_plotted_data_big_var.png
new file mode 100644
index 0000000000000000000000000000000000000000..d1356dbba2e449ffc8844bb6abafff3aeff5a434
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_data_big_var.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_test_data_1.png b/assignment-1/submission/18300110042/img/data_plotted_test_data_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..c309f57dfe7f9923ecd82a7d9655ea7d741ebf0e
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_test_data_1.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_test_data_2.png b/assignment-1/submission/18300110042/img/data_plotted_test_data_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..b8f5f9858c2229cea79d4fab7f27a467b8372460
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_test_data_2.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_test_data_3.png b/assignment-1/submission/18300110042/img/data_plotted_test_data_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..5c665a1be7d66e7fae6e995c5d230cb0f2dc1d7a
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_test_data_3.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_test_data_4.png b/assignment-1/submission/18300110042/img/data_plotted_test_data_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..e26ae81c6501667d4899d17516a78a234938caa0
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_test_data_4.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_test_data_5.png b/assignment-1/submission/18300110042/img/data_plotted_test_data_5.png
new file mode 100644
index 0000000000000000000000000000000000000000..026878a9509bd4deba2eebf4ca2605a9404fc01f
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_test_data_5.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_test_data_big_var.png b/assignment-1/submission/18300110042/img/data_plotted_test_data_big_var.png
new file mode 100644
index 0000000000000000000000000000000000000000..334020220b0e5f26d241c0f5e2510be6f66fe918
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_test_data_big_var.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_train_data_1.png b/assignment-1/submission/18300110042/img/data_plotted_train_data_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..4cd8166132858a45bb291b9c910a70fdebd2fde4
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_train_data_1.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_train_data_2.png b/assignment-1/submission/18300110042/img/data_plotted_train_data_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..722aa9edd5d5c82aab7f93082b50601c5c3e4b92
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_train_data_2.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_train_data_3.png b/assignment-1/submission/18300110042/img/data_plotted_train_data_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..91c7b217e4a33679968c6529dc257b9184c8e309
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_train_data_3.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_train_data_4.png b/assignment-1/submission/18300110042/img/data_plotted_train_data_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..057eb512fb13bf80fb8aef184ba5ac9229048cca
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_train_data_4.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_train_data_5.png b/assignment-1/submission/18300110042/img/data_plotted_train_data_5.png
new file mode 100644
index 0000000000000000000000000000000000000000..6280e96757dde6b85afb09de9bfc19d063f68644
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_train_data_5.png differ
diff --git a/assignment-1/submission/18300110042/img/data_plotted_train_data_big_var.png b/assignment-1/submission/18300110042/img/data_plotted_train_data_big_var.png
new file mode 100644
index 0000000000000000000000000000000000000000..3c0e137f92d0a2b096e2b533214615a9f2a456d1
Binary files /dev/null and b/assignment-1/submission/18300110042/img/data_plotted_train_data_big_var.png differ
diff --git a/assignment-1/submission/18300110042/img/dist1.png b/assignment-1/submission/18300110042/img/dist1.png
new file mode 100644
index 0000000000000000000000000000000000000000..09cbd28466a6575bb537b7e8814503f12c56d4e9
Binary files /dev/null and b/assignment-1/submission/18300110042/img/dist1.png differ
diff --git a/assignment-1/submission/18300110042/img/dist10.png b/assignment-1/submission/18300110042/img/dist10.png
new file mode 100644
index 0000000000000000000000000000000000000000..5fa4f708bb6445e0a4da46ed775ed2ba654dce42
Binary files /dev/null and b/assignment-1/submission/18300110042/img/dist10.png differ
diff --git a/assignment-1/submission/18300110042/img/dist20.png b/assignment-1/submission/18300110042/img/dist20.png
new file mode 100644
index 0000000000000000000000000000000000000000..c9eca5605ab07f23672c774af3acfa6dd1957ea2
Binary files /dev/null and b/assignment-1/submission/18300110042/img/dist20.png differ
diff --git a/assignment-1/submission/18300110042/img/dist3.png b/assignment-1/submission/18300110042/img/dist3.png
new file mode 100644
index 0000000000000000000000000000000000000000..4aed63e721564867b87a0dd3afd31cb43d9338a5
Binary files /dev/null and b/assignment-1/submission/18300110042/img/dist3.png differ
diff --git a/assignment-1/submission/18300110042/img/dist5.png b/assignment-1/submission/18300110042/img/dist5.png
new file mode 100644
index 0000000000000000000000000000000000000000..d3aa5b03cf4a6c86c98d5d623db1683478513b8d
Binary files /dev/null and b/assignment-1/submission/18300110042/img/dist5.png differ
diff --git a/assignment-1/submission/18300110042/img/dist50.png b/assignment-1/submission/18300110042/img/dist50.png
new file mode 100644
index 0000000000000000000000000000000000000000..b8a35fb26d139318e86e9cc8b3c985f8e0b29158
Binary files /dev/null and b/assignment-1/submission/18300110042/img/dist50.png differ
diff --git a/assignment-1/submission/18300110042/knn_lab.dat b/assignment-1/submission/18300110042/knn_lab.dat
new file mode 100644
index 0000000000000000000000000000000000000000..41e9b68e149b69d5871dfa8da0eafa277394d71c
--- /dev/null
+++ b/assignment-1/submission/18300110042/knn_lab.dat
@@ -0,0 +1,117 @@
+{
+    "knn_lab": [
+        {
+            "means": {
+                "method": "fix",
+                "data": [ [1, 50], [15, 10], [10, 20] ]
+            },
+            "covs": {
+                "method": "fix",
+                "data": [
+                    [ [1, 0], [0, 10] ],
+                    [ [10, 15], [15, 40] ],
+                    [ [20, 0], [0, 30] ]
+                ]
+            },
+            "n_data": [400, 400, 400],
+            "k": [1, 3, 5, 9, 15, 20],
+            "dist": "euc",
+            "weights": 2,
+            "norm": "N"
+        },
+        {
+            "means": {
+                "method": "fix",
+                "data": [ [1, 10], [5, 15] ]
+            },
+            "covs": {
+                "method": "fix",
+                "data": [
+                    [ [73, 0], [0, 22] ],
+                    [ [21.2, 0], [0, 32.1] ]
+                ]
+            },
+            "n_data": [1000, 1000],
+            "k": [5],
+            "dist": "euc",
+            "weights": 2,
+            "norm": "N"
+        },
+        {
+            "means": {
+                "method": "random",
+                "data": [ [1, 10], [5, 15] ]
+            },
+            "covs": {
+                "method": "fix",
+                "data": [
+                    [ [73, 0], [0, 22] ],
+                    [ [21.2, 0], [0, 32.1] ]
+                ]
+            },
+            "n_data": [1000, 1000],
+            "k": [50],
+            "dist": "euc",
+            "weights": 0,
+            "norm": "N"
+        },
+        {
+            "means": {
+                "method": "fix",
+                "data": [ [1, 20], [5, 20], [15, 15] ]
+            },
+            "covs": {
+                "method": "fix",
+                "data": [
+                    [ [1, 0], [0, 10] ],
+                    [ [10, 15], [15, 40] ],
+                    [ [20, 0], [0, 30] ]
+                ]
+            },
+            "n_data": [100, 100, 100],
+            "k": [1, 3, 5, 9, 11, 13, 15, 17, 19],
+            "dist": "euc",
+            "weights": 0,
+            "norm": "N"
+        },
+        {
+            "means": {
+                "method": "fix",
+                "data": [ [1, 30], [2, 30] ]
+            },
+            "covs": {
+                "method": "fix",
+                "data": [
+                    [ [1, 0], [0, 10] ],
+                    [ [1, 0], [0, 10] ]
+                ]
+            },
+            "n_data": [100, 100],
+            "k": [1, 3, 5, 9, 11, 13, 15, 17, 19],
+            "dist": "euc",
+            "weights": 0,
+            "norm": "N"
+        },
+        {
+            "means": {
+                "method": "fix",
+                "data": [ [1, 20], [5, 20], [20, 30], [30, 25], [25, 25] ]
+            },
+            "covs": {
+                "method": "fix",
+                "data": [
+                    [ [1, 0], [0, 10] ],
+                    [ [10, 15], [15, 40] ],
+                    [ [20, 0], [0, 30] ],
+                    [ [10, 0], [0, 30] ],
+                    [ [2, 5], [5, 50] ]
+                ]
+            },
+            "n_data": [100, 100, 100, 100, 100],
+            "k": [1, 3, 5, 9, 11, 13, 15, 17, 19],
+            "dist": "euc",
+            "weights": 0,
+            "norm": "N"
+        }
+    ]
+}
\ No newline at end of file
diff --git a/assignment-1/submission/18300110042/source.py b/assignment-1/submission/18300110042/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..f72b02ca335921ffda687c593bbb4c612e0e453f
--- /dev/null
+++ b/assignment-1/submission/18300110042/source.py
@@ -0,0 +1,397 @@
+import sys
+import numpy as np
+import matplotlib.pyplot as plt
+
+
+class KNN:
+    def __init__(self, k=5, weights=0, norm='N', dist='euc', cv_records=False, verbose=False):
+        # init model's hyperparameters
+        self.k = k
+        self.weights = weights
+        self.norm = norm
+        self.dist = dist
+        self.cv_records = cv_records
+        self.verbose = verbose
+
+    def fit(self, train_data, train_label):
+        # check data input
+        assert train_data.shape[0] == train_label.shape[0]
+        assert train_data.shape[1]
+
+        # init train data and train labels
+        self.train_data = train_data
+        self.train_label = train_label
+
+        # search the best hyperparameter set and optimize the model if requested;
+        # hyperparameters which can be optimized: (k, weights, norm, dist)
+        best_hyperparam = self.grid_search_cv()
+
+        # modify the knn model's hyperparameters
+        if best_hyperparam:
+            for hp in best_hyperparam:
+                setattr(self, hp[0], hp[1])
+
+    def predict(self, test_data):
+        train_data_indices = np.arange(self.train_data.shape[0])
+        return self.predict_cv(test_data, train_data_indices, self.k, self.weights, self.norm, self.dist)
+
+    def grid_search_cv(self, cv=10):
+        best_hyperparam = []
+
+        # construct hyperparams for grid search
+        hyperparams = dict()
+        if self.k == 0:
+            k_up_bound = 50
+            if self.train_data.shape[0] < 50:
+                k_up_bound = self.train_data.shape[0]
+            hyperparams['k'] = [k for k in range(1, k_up_bound)]
+        if self.norm == 'auto':
+            hyperparams['norm'] = ['N', 'min_max', 'standard']
+        if self.weights == -1:
+            hyperparams['weights'] = [0, 0.1, 0.2, 0.5, 1, 2]
+        if self.dist == 'auto':
+            hyperparams['dist'] = ['euc', 'manhattan']
+
+        # return if optimization is not requested
+        if not hyperparams:
+            return best_hyperparam
+
+        # construct parameter combinations
+        p_names, p_values = zip(*sorted(hyperparams.items()))
+        hps = [tuple(hp) for hp in p_values]
+        combns = [[]]
+        for hp in hps:
+            combns = [x + [y] for x in combns for y in hp]
+
+        # grid search for the best hyperparameters
+        accuracy_scores = []
+        for comb in combns:
+            params = dict(zip(p_names, comb))
+            # cross validate to evaluate the model
+            score = self.cross_validation(cv, params)
+            accuracy_scores.append(score)
+
+        best_comb = sorted(zip(combns, accuracy_scores), key=lambda x: x[1], reverse=True)[0]
+        best_hyperparam = list(zip(p_names, best_comb[0]))
+        return best_hyperparam
+
+    def cross_validation(self, cv, hyperparam):
+        # shuffle the train data
+        shuffled_indexs = np.random.permutation(self.train_data.shape[0])
+
+        # k-fold split train data
+        kfolds = np.array_split(shuffled_indexs, cv)
+        kfolds = [f for f in kfolds if len(f) != 0]
+
+        accuracy_score = 0
+        for i in range(len(kfolds)):
+            val_data = self.train_data[kfolds[i]]
+            val_label = self.train_label[kfolds[i]]
+            train_data_indices = np.concatenate(kfolds[:i] + kfolds[i+1:]).flatten()
+
+            k = hyperparam.get('k', self.k)
+            weights = hyperparam.get('weights', self.weights)
+            norm = hyperparam.get('norm', self.norm)
+            dist = hyperparam.get('dist', self.dist)
+            predict_label = self.predict_cv(val_data, train_data_indices, k, weights, norm, dist)
+
+            accuracy_score += np.mean(np.equal(predict_label, val_label))
+
+        return accuracy_score / len(kfolds)
+
+    def predict_cv(self, test_data, train_data_indices, k, weights, norm, dist):
+        # normalize data
+        train_data = self.train_data[train_data_indices]
+        train_label = self.train_label[train_data_indices]
+        normparams = self.calc_normparams(train_data, norm)
+        norm_train_data, norm_test_data = self.normalize_data(train_data, test_data, norm, normparams)
+
+        # find k nearest neighbors
+        predict_labels = []
+        nn_labels_list, nn_distances_list = self.get_nearest_neighbors(norm_test_data, norm_train_data, train_label, k, dist)
+
+        if (nn_labels_list >= 0).all() and not weights:
+            # plain majority vote
+            for labels in nn_labels_list:
+                pred = np.bincount(labels).argmax()
+                predict_labels.append(pred)
+        else:
+            # distance-weighted vote
+            d_weights = np.exp(-weights * nn_distances_list ** 2)
+            for i in range(test_data.shape[0]):
+                votes = {}
+                labels = nn_labels_list[i]
+                for j in range(k):
+                    l = labels[j]
+                    if l in votes:
+                        votes[l] += d_weights[i][j]
+                    else:
+                        votes[l] = d_weights[i][j]
+                pred = sorted(votes.items(), key=lambda x: x[1], reverse=True)[0][0]
+                predict_labels.append(pred)
+
+        return np.array(predict_labels)
+
+    def get_nearest_neighbors(self, test_data, train_data, train_label, k, dist):
+        # calc distances
+        distances_list = np.vstack([self.get_distance(d, train_data, dist) for d in test_data])
+
+        # find the k nearest neighbors
+        nn_indices_list = np.argsort(distances_list, axis=-1)[:, :k]
+        nn_labels_list = train_label[nn_indices_list]
+        nn_distances_list = np.sort(distances_list, axis=-1)[:, :k]
+
+        return nn_labels_list, nn_distances_list
+
+    def get_distance(self, x1, x2, dist='euc'):
+        if dist == 'manhattan':
+            distance = np.sum(np.absolute(x1 - x2), axis=-1)
+        else:  # default: euclidean distance
+            distance = np.sqrt(np.sum((x1 - x2)**2, axis=-1))
+        return distance
+
+    def normalize_data(self, train_data, test_data, norm, normparams):
+        if norm == 'min_max':
+            norm_train_data = (train_data - normparams['f_min']) / normparams['denom']
+            norm_test_data = (test_data - normparams['f_min']) / normparams['denom']
+        elif norm == 'standard':
+            norm_train_data = (train_data - normparams['mean']) / normparams['sigma']
+            norm_test_data = (test_data - normparams['mean']) / normparams['sigma']
+        else:  # default: no normalization
+            norm_train_data = train_data
+            norm_test_data = test_data
+        return norm_train_data, norm_test_data
+
+    def calc_normparams(self, data, norm):
+        params = dict()
+        if norm == 'min_max':
+            feature_max = data.max(axis=0)
+            feature_min = data.min(axis=0)
+            denom = feature_max - feature_min
+            params['f_min'] = feature_min
+            params['denom'] = denom
+        if norm == 'standard':
+            mean = np.mean(data, axis=0)
+            sigma = np.std(data, axis=0)
+            params['mean'] = mean
+            params['sigma'] = sigma
+        return params
+
+
+"""
+------------------ below is code for experiment ------------------
+"""
+
+
+def load_lab_data(file_name):
+    import os, json
+    knn_lab_list = []
+    if os.path.exists(file_name):
+        with open(file_name, 'r') as f:
+            json_data = json.loads(f.read())
+            knn_lab_list = json_data.get('knn_lab', [])
+    return knn_lab_list
+
+
+def parse_lab_data(lab):
+    n_data = lab['n_data']
+    ks = lab['k']
+    weights = lab['weights']
+    norm = lab['norm']
+    dist = lab['dist']
+    d_means = lab['means']
+    means = []
+    if d_means['method'] == 'fix':
+        means = d_means['data']
+    else:
+        for m in d_means['data']:
+            mean = np.random.uniform(m[0], m[1], 2)
+            means.append(mean)
+    d_covs = lab['covs']
+    covs = []
+    if d_covs['method'] == 'fix':
+        covs = d_covs['data']
+    else:
+        for c in d_covs['data']:
+            cov = np.zeros((2, 2))
+            cov[0, 0] = cov[0, 0] + np.random.uniform(c[0], c[1], 1)
+            cov[1, 1] = cov[1, 1] + np.random.uniform(c[0], c[1], 1)
+            x = np.sqrt(cov[0, 0] * cov[1, 1])
+            # cov[0, 1] = np.random.uniform(-x, x, 1)
+            cov[1, 0] = cov[0, 1]
+            covs.append(cov)
+    return means, covs, n_data, ks, weights, norm, dist
+
+
+def create_clustered_data(d_means, d_covs, n_data):
+    ds = []
+    ls = []
+    for i in range(len(d_means)):
+        d = np.random.multivariate_normal(d_means[i], d_covs[i], n_data[i])
+        ds.append(d)
+        ls.append(np.ones((n_data[i],), dtype=int) * i)
+    return ds, ls
+
+
+def combine_all_data(ds, ls):
+    data = np.concatenate(ds)
+    label = np.concatenate(ls)
+    return data, label
+
+
+def generate_lab_data(data, label, rate=0.2):
+    idx = np.arange(len(data))
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    split = int(len(data) * (1 - rate))
+    train_data, test_data = data[:split, ], data[split:, ]
+    train_label, test_label = label[:split, ], label[split:, ]
+    return (train_data, train_label), (test_data, test_label)
+
+
+def run_lab(labnum):
+    labs = load_lab_data('knn_lab.dat')
+    lab = labs[labnum]
+    means, covs, n_data, ks, weights, norm, dist = parse_lab_data(lab)
+    lab['means'] = means
+    lab['covs'] = covs
+    ds, ls = create_clustered_data(means, covs, n_data)
+    data, label = combine_all_data(ds, ls)
+    (train_data, train_label), (test_data, test_label) = generate_lab_data(data, label)
+    accs = []
+    models = []
+    for k in ks:
+        model = KNN(k, weights, norm, dist)
+        model.fit(train_data, train_label)
+        predict_label = model.predict(test_data)
+        models.append(model)
+        accs.append(np.mean(np.equal(predict_label, test_label)))
+    print("accs =", accs)
+    return models, lab, ds, data, label, accs
+
+
+def plot_data(data, labels, title='data', save=False,
+              colours=['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown', 'tab:pink', 'tab:grey', 'tab:olive', 'tab:cyan']):
+    assert data.shape[0] == labels.shape[0]
+
+    num_samples = data.shape[0]
+    label_record = sorted(set(labels))
+    assert len(colours) >= len(label_record)
+    label_dict = {k: v for v, k in enumerate(label_record)}
+    data_record = [[] for l in label_record]
+    for i in range(num_samples):
+        data_record[label_dict[labels[i]]].append(data[i])
+    plt.title(title)
+    for t in range(len(label_record)):
+        data_t = np.array(data_record[t])
+        plt.scatter(data_t[:, 0], data_t[:, 1], c=colours[t])
+    if save:
+        plt.savefig(f'data_plotted_{title}')
+    plt.show()
+
+
+def plot_decision_boundary(labnum, save=False,
+                           colours=['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown', 'tab:pink', 'tab:grey', 'tab:olive', 'tab:cyan']):
+    models, lab, ds, data, labels, accs = run_lab(labnum)
+    for m in models:
+        print(m.weights)
+    ks = lab['k']
+    assert data.shape[0] == labels.shape[0]
+    label_record = sorted(set(labels))
+    num_classes = len(label_record)
+    assert len(colours) >= len(label_record)
+    label_dict = {k: v for v, k in enumerate(label_record)}
+
+    x_min, x_max = np.min(data[:, 0]) - 1, np.max(data[:, 0]) + 1
+    y_min, y_max = np.min(data[:, 1]) - 1, np.max(data[:, 1]) + 1
+    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
+
+    assert len(ks) == 9
+
+    fig, axs = plt.subplots(3, 3, sharex='col', sharey='row', figsize=(15, 12))
+    indices = [(x, y) for x in [0, 1, 2] for y in [0, 1, 2]]
+    titles = ['KNN (k=%d)' % k for k in ks]
+
+    for idx, model, title in zip(indices, models, titles):
+        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
+        Z = Z.reshape(xx.shape)
+
+        axs[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.5)
+        colour_labels = [colours[label_dict[label]] for label in labels]
+        axs[idx[0], idx[1]].scatter(data[:, 0], data[:, 1], s=20, c=colour_labels, edgecolors=colour_labels)
+        axs[idx[0], idx[1]].set_title(title)
+
+    if save:
+        plt.savefig(f'decision_boundary_plotted_{num_classes}classes')
+
+    plt.show()
+
+
+def get_KLdiv(mean_1, cov_1, mean_2, cov_2):
+    assert len(mean_1) == len(mean_2)
+    num_dims = len(mean_1)
+    mu1 = np.array(mean_1)
+    mu2 = np.array(mean_2)
+    cov2_inv = np.linalg.inv(cov_2)
+
+    logd = np.log(np.linalg.det(cov_2) / np.linalg.det(cov_1))
+    trace_cov = np.trace(np.matmul(cov2_inv, cov_1))
+    mean_cov = (mu2 - mu1).T.dot(cov2_inv).dot((mu2 - mu1))
+    kldiv = 1/2 * (logd + trace_cov + mean_cov - num_dims)
+    return kldiv
+
+
+def plot_distance(labnum):
+    dists = []
+    accss = []
+    for i in range(20):
+        models, lab, ds, data, label, accs = run_lab(labnum)
+        dists.append(wasserstein_distance(ds[0], ds[1]))
+        accss.append(accs[0])
+    dtoa = zip(dists, accss)
+    dtoa = dict(sorted(dtoa, key=lambda x: x[0]))
+    fig, ax = plt.subplots()
+    ax.plot(dtoa.keys(), dtoa.values())
+    plt.show()
+
+
+def gaussian_kernel(source, target, kernel_mul=2.0, kernel_num=5, fix_sigma=None):
+    import torch
+    n_samples = int(source.size()[0]) + int(target.size()[0])
+    total = torch.cat([source, target], dim=0)
+    total0 = total.unsqueeze(0).expand(int(total.size(0)), int(total.size(0)), int(total.size(1)))
+    total1 = total.unsqueeze(1).expand(int(total.size(0)), int(total.size(0)), int(total.size(1)))
+    L2_distance = ((total0 - total1)**2).sum(2)
+    if fix_sigma:
+        bandwidth = fix_sigma
+    else:
+        bandwidth = torch.sum(L2_distance.data) / (n_samples**2 - n_samples)
+    bandwidth /= kernel_mul ** (kernel_num // 2)
+    bandwidth_list = [bandwidth * (kernel_mul**i) for i in range(kernel_num)]
+    kernel_val = [torch.exp(-L2_distance / bandwidth_temp) for bandwidth_temp in bandwidth_list]
+    return sum(kernel_val)
+
+
+def mmd(source, target, kernel_mul=2.0, kernel_num=5, fix_sigma=None):
+    import torch
+    batch_size = int(source.size()[0])
+    kernels = gaussian_kernel(source, target,
+                              kernel_mul=kernel_mul, kernel_num=kernel_num, fix_sigma=fix_sigma)
+    XX = kernels[:batch_size, :batch_size]
+    YY = kernels[batch_size:, batch_size:]
+    XY = kernels[:batch_size, batch_size:]
+    YX = kernels[batch_size:, :batch_size]
+    loss = torch.mean(XX + YY - XY - YX)
+    return loss
+
+
+def wasserstein_distance(x, y):
+    from scipy.spatial.distance import cdist
+    from scipy.optimize import linear_sum_assignment
+    d = cdist(x, y)
+    assignment = linear_sum_assignment(d)
+    return d[assignment].sum() / len(x)
+
+
+def maximum_mean_discrepancy(x, y):
+    import torch
+    from torch.autograd import Variable
+    X = Variable(torch.Tensor(x))
+    Y = Variable(torch.Tensor(y))
+    return mmd(X, Y).item()
+
+
+if __name__ == '__main__':
+    if len(sys.argv) > 1:
+        labnum = sys.argv[1]
+        models, lab, ds, data, label, accs = run_lab(int(labnum))