diff --git a/assignment-3/submission/17307130243/README.md b/assignment-3/submission/17307130243/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f15dd712c0095e7942df4d87ae84249832112f83
--- /dev/null
+++ b/assignment-3/submission/17307130243/README.md
@@ -0,0 +1,234 @@
# Clustering Algorithms

## 0. Overview
This assignment implements the **K-Means** algorithm and builds a **GMM** (Gaussian mixture model) using only the `numpy` package, generates several kinds of data, and runs both algorithms on them for clustering. On top of **K-Means**, a variant that automatically selects the number of clusters is also implemented. All clustering experiments are visualized with the `matplotlib` package, and the results are briefly analyzed.

## 1. Algorithm Implementation

### 1.1 K-Means

#### 1.1.1 Algorithm Outline

K-Means is an iterative clustering method for unlabeled numerical data. Given a number of clusters $K$, it seeks a mapping $C$ from the index set of the data to $\{0,...,K-1\}$, i.e. a partition of all data into $K$ classes, that makes the within-cluster dissimilarity
$$W(C)=\frac{1}{2}\sum_{1\leq k\leq K}\sum_{C(i)=C(i')=k}d(x_i,x_{i'})$$

as small as possible, where $d$ measures the difference or dissimilarity between two samples.
Concretely, K-Means uses the squared Euclidean distance between samples as the dissimilarity measure:
$$d(x_i,x_{i'})=\|x_i-x_{i'}\|^2=\sum_{p}(x_{ip}-x_{i'p})^2$$
Letting $N_k$ be the number of samples in cluster $k$, it follows easily that
$$W(C)=\frac{1}{2}\sum_{1\leq k\leq K}\sum_{C(i)=C(i')=k}\|x_i-x_{i'}\|^2=\sum\limits_{k=1}^KN_k\sum\limits_{C(i)=k}\Vert x_i-\bar x_k\Vert^2$$

K-Means does not minimize this objective directly; instead it decreases it iteratively. The algorithm first obtains $K$ initial cluster centers in some way, and each subsequent iteration performs two steps:

- **Assignment:** for every sample, compute its Euclidean distance to each cluster center and assign it to the cluster of the nearest center
- **Update:** for every cluster, take the mean of its samples as the new cluster center

The algorithm stops once the stopping criterion is met.
Note that K-Means may converge to a local optimum.

#### 1.1.2 Implementation
The K-Means implementation is the class `KMeans` in [source.py](./source.py).

- **Initialization**

K-Means is sensitive to its initial values: the choice of the $K$ initial centers can affect convergence speed, and when the data do not cluster well or $K$ is chosen poorly, different initial centers can even produce completely different clusterings. The implementation (method `init_centers` of class `KMeans`) provides two initialization schemes: **random initialization** and **k-means++ initialization**, with **k-means++** as the default.

**Random initialization**: draw $K$ samples from the training data at random as the initial cluster centers.

**k-means++**: random sampling is somewhat blind: several initial centers may be similar and fall inside the same cluster, causing slow convergence or a tendency to converge to a local rather than the global optimum. k-means++ addresses this with one main idea: the initial centers should be as far apart as possible. Concretely:

1. Pick one point from the training data at random and add it to the set of initial centers.
2. For every remaining sample $x$, compute its distance $D(x)$ to the nearest chosen center.
3. Select sample $x$ as the next center with probability $\frac{D(x)}{\sum_{x\in\mathcal{X}}D(x)}$, i.e. points farther from the chosen centers are more likely to become the next center.
4.
Repeat steps 2 and 3 until there are $K$ cluster centers.

For simplicity, the implementation directly selects the $x$ with the largest $D(x)$ from step 2 as the new center.

- **Clustering quality metric**

The implementation uses the sum of squared errors (SSE) on the training data as the metric of clustering quality:
$$SSE=\sum_{k=1}^K\sum_{C(n)=k}\|x_n-m_k\|^2$$
The SSE is computed and stored once per iteration. The algorithm stops when the SSE difference between two consecutive iterations is small enough or the maximum number of iterations is reached; the last SSE obtained measures the clustering quality of the model.

- **Prediction**
The method `predict` of class `KMeans` assigns every given sample to the cluster of its nearest center.

### 1.2 Gaussian Mixture Model (GMM)

#### 1.2.1 Model Outline
A Gaussian mixture model is a probability distribution of the form
$$p(x|\theta)=\sum_{k=1}^K\pi_kf(x|\mu_k,\sigma_k^2),\quad \sum_{k=1}^K\pi_k=1,\quad f(x|\mu_k,\sigma_k^2)\sim\mathcal{N}(\mu_k,\sigma_k^2),\quad \theta=(\pi,\mu,\sigma^2)$$
Thanks to its expressiveness, the Gaussian mixture model is often used to approximate the distribution of other continuous random variables or vectors.
Because the mixture density is a sum of composite exponential functions, the usual (log-)maximum likelihood estimate (MLE) is hard to obtain in closed form, so the parameters are generally estimated with the EM algorithm combined with MLE or maximum a posteriori estimation, as follows:
- First, introduce latent variables $\gamma_{ik}, 1\leq i\leq N, 1\leq k \leq K$, with $\gamma_{ik}=1$ if the $i$-th observation comes from the $k$-th Gaussian component and $\gamma_{ik}=0$ otherwise.
  The complete-data likelihood is then
  $$p(x,\gamma|\mu,\sigma^2)=\prod_{k=1}^K\prod_{i=1}^N[\pi_kf(x_i|\mu_k,\sigma_k^2)]^{\gamma_{ik}}$$
  with log-likelihood
  $$\sum_{k=1}^K\sum_{i=1}^N\gamma_{ik}[\log\pi_k-\frac{1}{2}\log(2\pi)-\log\sigma_k-\frac{1}{2\sigma_k^2}(x_i-\mu_k)^2]$$
- Initialize all parameters and start iterating.
- E-step:
  1. Build the $Q$ function:
     with $\theta^{(t)}$ the parameters produced by the previous iteration,
     $$Q(\theta,\theta^{(t)})=\mathbb{E}[\log p(x,\gamma|\theta)|x, \theta^{(t)}]=\sum_{k=1}^K\sum_{i=1}^N\mathbb{E}[\gamma_{ik}][\log\pi_k-\frac{1}{2}\log(2\pi)-\log\sigma_k-\frac{1}{2\sigma_k^2}(x_i-\mu_k)^2]$$
     i.e. the latent variables in the likelihood are replaced by their conditional expectations, which makes an MLE-like maximization possible.
  2.
Replace $\gamma$ by its expectation, which gives
   $$\hat{\gamma}_{ik}=\mathbb{E}[\gamma_{ik}]=\frac{\pi_kf(x_i|\theta_k^{(t)})}{\sum_{k'=1}^K\pi_{k'}f(x_i|\theta_{k'}^{(t)})}, \quad\forall i,k$$
- M-step:
  Solve $\theta^{(t+1)}=\arg\max_{\theta}Q(\theta,\theta^{(t)})$, which yields the parameter updates
  $$\mu_k=\frac{\sum_{i}\hat{\gamma}_{ik}x_i}{\sum_{i}\hat{\gamma}_{ik}}$$
  $$\sigma_k^2=\frac{\sum_{i}\hat{\gamma}_{ik}(x_i-\mu_k)^2}{\sum_{i}\hat{\gamma}_{ik}}$$
  $$\hat{\pi}_k=\frac{\sum_{i}\hat{\gamma}_{ik}}{N}$$

The E-step and M-step alternate until the $Q$ function no longer changes noticeably.
The derivation above is for one-dimensional random variables; the multivariate case is analogous, needing only minor adjustments, and is omitted here.
#### 1.2.2 Implementation

The GMM code is the class `GaussianMixture` in [source.py](./source.py).
- Parameter initialization
  The model parameters are:
  > weights: the weight of each Gaussian component
  > means: the mean of each Gaussian component
  > covariances: the covariance matrix of each Gaussian component
  > gamma: the latent variables

  `gamma` and `weights` are initialized uniformly, and `covariances` as identity matrices.
  `means` offers the same **random** and **k-means++** initializations as the `KMeans` model, with **k-means++** as the default.

- Stopping criterion

  The value of the $Q$ function is computed and stored once per iteration; iteration stops when the difference between two consecutive values drops below a given threshold or the maximum number of iterations is reached.
- E-step: `e_step()` computes the $Q$ function and updates the latent variables with their expectations.
- M-step: `m_step()` updates all model parameters.
- Numerical considerations:
  1. A sample lying far from every cluster may get a probability close to zero under every component; with finite numerical precision the corresponding row of the $\gamma$ matrix can then become all zeros, which breaks normalization. When this happens, the row is reset to a constant vector whose entries are equal and sum to 1.
  2. A covariance matrix produced during iteration may fail to be positive definite, or even to be invertible. To avoid this, the identity matrix times a small positive number is added to every covariance matrix; see [Regularized Gaussian Covariance Estimation](https://freemind.pluskid.org/machine-learning/regularized-gaussian-covariance-estimation/) for why this is reasonable.

## 2. Experiments

### 2.1 Basic Experiments

#### 2.1.1 Effect of Initialization on K-Means
The experiment code is `init_expr()` in [source.py](./source.py).
Samples are drawn from three Gaussians (200 per class) with the parameters below, and `KMeans` is run with k-means++ and random initialization respectively; to reduce the variance introduced by chance, each setting is run three times.
$$\mu: [-2, 0],[2,0],[1.5,2]$$
$$\Sigma: I,\ 0.7I,\ 1.5I$$
The figures below show the clustering results and the cluster centers at each stage under the two initializations.
|**K-means++**|**random**|
|--|--|
|![kmeans++](./img/init_expr_k-means 0.png)|![random](./img/init_expr_random1.png)|

The figures show that the initial centers produced by k-means++ are rather extreme, typically lying on the edges of the clusters, while the randomly chosen centers on the right fall within one small region. The final clusterings of the two methods are similar.

The figure below plots SSE against the number of iterations for the six runs:
+ +
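As an aside on where these initial centers come from, the deterministic k-means++ variant used here (always take the point with the largest $D(x)$, as described in section 1.1.2) can be sketched as follows. `farthest_first_init` is an illustrative name, not the method in `source.py`:

```python
import numpy as np

def farthest_first_init(data, n_clusters, seed=0):
    """Greedy k-means++ variant: after a random first center, repeatedly
    pick the sample farthest from its nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]        # step 1: random first center
    for _ in range(n_clusters - 1):
        # D(x): distance from each sample to its nearest chosen center
        dists = np.min(
            np.linalg.norm(data[:, None, :] - np.asarray(centers)[None, :, :], axis=2),
            axis=1,
        )
        centers.append(data[np.argmax(dists)])       # greedy step 3: farthest point
    return np.asarray(centers)
```

Because the farthest remaining point is always chosen, the selected centers tend to sit on cluster edges, consistent with the extreme initial centers visible in the k-means++ figure above.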
Because the k-means++ initial centers lie on cluster edges and far from one another, the initial SSE is typically high, but it drops sharply after a single iteration and the algorithm converges quickly, generally faster than with random initialization. This illustrates the strength of k-means++.

#### 2.1.2 Specially Shaped Data
These experiments generate datasets of special shapes and cluster them; the generation code is the class `GenData` in [source.py](./source.py).

- **Rectangular blocks**
The generation code is the method `square` of `GenData`; the experiment code is `square_expr`.
Generation: sample the uniform distribution on $[-1, 1]^2$ and discard samples whose x-coordinate lies between -0.05 and 0.05, leaving two rectangular block-shaped classes.

|Original data|K-means result|GMM result|
|--|--|--|
|![square1](./img/square.png)|![square2](./img/square_kmeans_outcome.png)|![square3](./img/square_gmm_outcome.png)|

K-means clusters these data well, with the cluster centers at the centers of the rectangles. GMM struggles to converge; the figure shows the state after one of its iterations, and its clustering of this dataset is clearly poor.

- **Circular data**
  The generation code is the method `three_circles` of `GenData`; the experiment code is `circle_expr`.
  Generation: sample the uniform distribution on the unit disk and translate the samples; repeated sampling yields three mutually tangent disks.

|Original data|K-means result|GMM result|
|--|--|--|
|![circle1](./img/circle.png)|![circle2](./img/circle_kmeans_outcome.png)|![circle3](./img/circle_gmm_outcome.png)|

Both k-means and GMM cluster these data well. The k-means centers are almost exactly the centers of the three disks, whereas the GMM means deviate far from them: the training data can only imitate the region near the edge of a Gaussian distribution. This result reflects k-means's strength on uniformly distributed data with clear inter-class boundaries and convex cluster shapes, and at the same time GMM's power to imitate distributions quite unlike its own components.

- **Ring data**
The generation code is the method `ring` of `GenData`; the experiment code is `ring_expr`.
Generation: sample the uniform distributions on two circles, then add zero-mean, small-variance Gaussian noise to each sample, yielding two ring-shaped datasets.

|Original data|K-means result|GMM result|
|--|--|--|
|![ring1](./img/ring.png)|![ring2](./img/ring_kmeans_outcome.png)|![ring3](./img/ring_gmm_outcome.png)|

Neither algorithm clusters this dataset well.
#### 2.1.3 Gaussian Mixtures

The data in these experiments are mixtures of Gaussians generated from the following means and covariances:
$$\mu: [-2, 0],[2,0],[1.5,2]$$
$$\Sigma: E,\ 0.7E,\ 1.5E,\quad E=\begin{bmatrix}1 & \rho\\ \rho & 1\end{bmatrix}$$

- **Mixtures with different $\rho$**

|$\rho$|True classes|K-means result|GMM result|
|--|--|--|--|
|0.1|![](./img/gmm_init_0.1.png)|![](./img/gmm_01_balance_kmeans_outcome.png)|![](./img/gmm_01_balance_gmm_outcome.png)|
|0.5|![](./img/gmm_init_0.5.png)|![](./img/gmm_05_balance_kmeans_outcome.png)|![](./img/gmm_05_balance_gmm_outcome.png)|
|0.9|![](./img/gmm_init_0.9.png)|![](./img/gmm_09_balance_kmeans_outcome.png)|![](./img/gmm_09_balance_gmm_outcome.png)|
+|0.99|![](./img/gmm_init_0.99.png)|![](./img/gmm_099_balance_kmeans_outcome.png)|![](./img/gmm_099_balance_gmm_outcome.png)| + +- **等比例不同规模** + +|每类样本数|K-means聚类结果|GMM聚类结果| +|--|--|--| +|50|![](./img/gmm_05_balance_s_kmeans_outcome.png)|![](./img/gmm_05_balance_s_gmm_outcome.png)| +|200|![](./img/gmm_05_balance_kmeans_outcome.png)|![](./img/gmm_05_balance_gmm_outcome.png) +|500|![](./img/gmm_05_balance_l_kmeans_outcome.png)|![](./img/gmm_05_balance_l_gmm_outcome.png) + +- **不均衡数据** + +|各类样本个数|K-means聚类结果|GMM聚类结果| +|--|--|--| +|(80,200,600)|![](./img/gmm_05_imbalance_kmeans_outcome.png)|![](./img/gmm_05_imbalance_gmm_outcome.png)| + +K-means算法倾向于把数据划分成等规模的类别,在不均衡数据集上可能会表现不好。 +### 2.2 自动选择聚簇数量 +这部分实验实现了基于KMeans的算法的自动选择聚簇数量的聚类算法。代码位于类`ClusteringAlgorithm`。 +#### 2.2.1 Elbow算法 + +Elbow 是一种用于确定聚类数量的启发式方法,需要作出损失函数(SSE)与聚类个数K 的关系图,并寻找曲线的拐点作为聚类个数,满足:当K小于该点时,SSE快速降低,K大于该点时,SSE降低的速度放缓,即SSE前后两点的差值在该点处有显著降低。 + +代码实现位于`ClusteringAlgorithm.elbow` +实验中,使用了`tester_demo.py`中的`data_1`数据集,其为含有三个子高斯分布的高斯混合分布。 +以下为SSE与K 的关系曲线: +可以看到 3 是满足要求的拐点,恰恰也是真实的聚簇个数。 + +
+ +
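The SSE-vs-$K$ computation behind the elbow curve can be sketched independently of the repo's classes. `kmeans_sse` below is a bare-bones Lloyd iteration used only to produce the curve; it is an illustrative stand-in for the `KMeans` class, not the actual `ClusteringAlgorithm.elbow` code (which also asks the user to read the chosen $K$ off the plot):

```python
import numpy as np

def kmeans_sse(data, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's algorithm; returns the final SSE for a given k."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters where they are
                centers[j] = data[labels == j].mean(axis=0)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

def elbow_curve(data, k_max=10):
    """SSE for k = 1..k_max; the elbow is the k where the drop flattens out."""
    return [kmeans_sse(data, k) for k in range(1, k_max + 1)]
```

Plotting `elbow_curve(data)` against `range(1, 11)` should show the same qualitative bend at the true number of clusters.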
+ +#### 2.2.2 Gap Statistic +elbow方法需要人工根据曲线来选择最合适的K,不能算作完全自动的聚簇个数选择方法。而且,有时SSE与K的关系曲线不那么清晰,难以用肉眼找出最好的K。Gap Statistic 方法可以视作一种自动化的elbow方法,主要步骤如下: +定义 **$Gap(k)=\mathbb{E}\log(SSE) - \log SSE$** +其中$\mathbb{E}\log(SSE)$通过MC模拟产生,具体生成方式是:在样本所在的矩形区域中均匀采样得到与训练数据数量相同的样本,对该组样本作KMeans算法聚类,得到一个SSE。重复多次后(default=20), 计算$\log SSE$的均值,得到$\mathbb{E}\log(SSE)$的估计$w$。 +为了修正MC采样带来的误差,计算标准差来修正Gap: +$$sd(k)=\sqrt{\frac{1}{B}\sum_b(\log(SSE_{kb})-w)^2}$$ + +$$s_k=\sqrt{\frac{1 + B}{B}}sd(k)$$ +最后,选择满足$Gap(k+1)-s_{k+1}\leq Gap(k)$的最小的K作为最优的聚类个数。 + +以下实验沿用上面的数据集,运用Gap Statistic方法求得最优的聚类个数为3,符合事实,可见Gap Statistic实现正确且有效。 + + +
+ +
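The gap statistic procedure can likewise be sketched end to end. The names below are illustrative, the sketch bundles its own bare-bones Lloyd iteration so that it is self-contained, and `n_ref` plays the role of $B$:

```python
import numpy as np

def kmeans_sse(data, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's algorithm; returns the final SSE for a given k."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    return float((d ** 2).sum())

def gap_statistic_k(data, k_max=8, n_ref=20, seed=0):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        # E[log SSE] under the uniform reference distribution, by Monte Carlo
        log_ref = np.log([
            kmeans_sse(rng.uniform(lo, hi, size=data.shape), k)
            for _ in range(n_ref)
        ])
        gaps.append(log_ref.mean() - np.log(kmeans_sse(data, k)))
        s.append(np.sqrt(1 + 1 / n_ref) * log_ref.std())   # sd correction
    for k in range(k_max - 1):
        if gaps[k] >= gaps[k + 1] - s[k + 1]:
            return k + 1
    return k_max
```

The final loop is exactly the selection rule above: return the smallest $k$ whose gap already beats the corrected gap at $k+1$.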
## 3. Summary
**K-Means** has a simple principle, converges quickly, and can work very well on some datasets. From the metric it uses (Euclidean distance) and its assignment rule, K-Means partitions the whole space with hyperplanes (straight lines in 2D), so every cell of the partition is a convex polytope. K-Means is therefore most effective on datasets whose clusters are convex and connected; otherwise the original data may need some transformation to cluster well, as the algorithm itself is not very flexible. K-Means clusters purely from the data that exist: small regions containing no data barely affect the result, yet "no data here" is itself information. For the elongated Gaussian mixtures above, for instance, the K-Means partition runs almost perpendicular to the direction of the distribution, a weakness of considering only distances between samples and their means.

**GMM**, used as a clustering tool, performs soft clustering: it outputs the probability that a sample belongs to each class instead of a hard assignment, so it can handle samples lying in overlapping regions, and the cluster shapes it fits are more varied. Overall GMM is more flexible than K-Means, but every iteration is computationally expensive, and on some datasets the number of clusters may need to be selected automatically before GMM's advantages show, which makes the total cost considerable.
## Running the Code
> python source.py
\ No newline at end of file
diff --git a/assignment-3/submission/17307130243/img/.keep b/assignment-3/submission/17307130243/img/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assignment-3/submission/17307130243/img/circle.png b/assignment-3/submission/17307130243/img/circle.png
new file mode 100644
index 0000000000000000000000000000000000000000..e5f6c5265d8dfa9cf6d83b68ef8a6ea87fefd1fb
Binary files /dev/null and b/assignment-3/submission/17307130243/img/circle.png differ
diff --git a/assignment-3/submission/17307130243/img/circle_gmm_outcome.png b/assignment-3/submission/17307130243/img/circle_gmm_outcome.png
new file mode 100644
index 0000000000000000000000000000000000000000..d5351c5e5a4d97fc9ab9ccdcf04c85dff4de5ada
Binary files /dev/null and b/assignment-3/submission/17307130243/img/circle_gmm_outcome.png differ
diff --git a/assignment-3/submission/17307130243/img/circle_kmeans_outcome.png b/assignment-3/submission/17307130243/img/circle_kmeans_outcome.png
new file mode 100644
index 0000000000000000000000000000000000000000..2f3503a4a47c6aedc2d7951f48067b4e71451bc9
Binary files /dev/null and b/assignment-3/submission/17307130243/img/circle_kmeans_outcome.png differ
diff --git a/assignment-3/submission/17307130243/img/elbow.png b/assignment-3/submission/17307130243/img/elbow.png
new file mode 100644
index 0000000000000000000000000000000000000000..42e792206d812a1e19fd2fb42e634a0f71cfd9ed
Binary files /dev/null and b/assignment-3/submission/17307130243/img/elbow.png differ diff --git a/assignment-3/submission/17307130243/img/gap_statistic.png b/assignment-3/submission/17307130243/img/gap_statistic.png new file mode 100644 index 0000000000000000000000000000000000000000..22861360d183d65b473cae7b78757eac286faf94 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gap_statistic.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_01_balance.png b/assignment-3/submission/17307130243/img/gmm_01_balance.png new file mode 100644 index 0000000000000000000000000000000000000000..c94473bec8b23004756926e361f8179e4cd379e3 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_01_balance.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_01_balance_gmm_outcome.png b/assignment-3/submission/17307130243/img/gmm_01_balance_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..297bfc232f5b2e6afd7ca77be141f7472059ee31 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_01_balance_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_01_balance_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_01_balance_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..d6111a7d3a25e3448788836133d2b6a25ab9fdb7 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_01_balance_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance.png b/assignment-3/submission/17307130243/img/gmm_05_balance.png new file mode 100644 index 0000000000000000000000000000000000000000..62fbf8cd3ed6476c7d3e5f9828e48bea2e166e7b Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_gmm_outcome.png 
b/assignment-3/submission/17307130243/img/gmm_05_balance_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..1ac733fc2f5bd739afe52179230d5355fbd1ea49 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_balance_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..15caf173bbdedc86e82a84e71c720ddda771db1c Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_l.png b/assignment-3/submission/17307130243/img/gmm_05_balance_l.png new file mode 100644 index 0000000000000000000000000000000000000000..b126c94cca4ec9fc504815e9ac7e5d3699dc9bee Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_l.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_l_gmm_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_balance_l_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..18529fcf0c2a9ecdc86db03be2aa1c53c0721020 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_l_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_l_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_balance_l_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..43715dfcceea731ec673f5aead8293ae4ebbe2d3 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_l_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_s.png b/assignment-3/submission/17307130243/img/gmm_05_balance_s.png new file mode 100644 index 
0000000000000000000000000000000000000000..e96d126a647263f56cf50efc4b8ca320e3ef4924 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_s.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_s_gmm_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_balance_s_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..7cf97604df751921c724489b1e52c197589c7ae8 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_s_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_balance_s_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_balance_s_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..294d60335c962ce8e7140b803bb451240e2f0eb1 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_balance_s_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_imbalance.png b/assignment-3/submission/17307130243/img/gmm_05_imbalance.png new file mode 100644 index 0000000000000000000000000000000000000000..dc3325962b271906d47a6930ca71f254fc4cef6d Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_imbalance.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_imbalance_gmm_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_imbalance_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..2031e0450aa8fea03176862ee1d1f2d1c34b9838 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_05_imbalance_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_05_imbalance_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_05_imbalance_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..af11521393401920f22ff3dd4a765ef0df01c1f4 Binary files /dev/null and 
b/assignment-3/submission/17307130243/img/gmm_05_imbalance_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_099_balance_gmm_outcome.png b/assignment-3/submission/17307130243/img/gmm_099_balance_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..ef22712ba5d092890cc48a8e71180e13a0357c33 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_099_balance_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_099_balance_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_099_balance_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..cab3edd943195fb68825790a827393fbe2f48bb0 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_099_balance_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_09_balance_gmm_outcome.png b/assignment-3/submission/17307130243/img/gmm_09_balance_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..45520f346ac3c833a973c53b2a7c0f371b1fc672 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_09_balance_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_09_balance_kmeans_outcome.png b/assignment-3/submission/17307130243/img/gmm_09_balance_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..9c80e782ba13dbe1c15c1fbe85a87430b461f857 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_09_balance_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_init.png b/assignment-3/submission/17307130243/img/gmm_init.png new file mode 100644 index 0000000000000000000000000000000000000000..0b37bcb7f6092e91fbb5fea74da9b90461570ffe Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_init.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_init_0.1.png 
b/assignment-3/submission/17307130243/img/gmm_init_0.1.png new file mode 100644 index 0000000000000000000000000000000000000000..bb0efbb769f2dbf979d8837cf32e3d991682730a Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_init_0.1.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_init_0.5.png b/assignment-3/submission/17307130243/img/gmm_init_0.5.png new file mode 100644 index 0000000000000000000000000000000000000000..0de0c013aff2b781b864d2fbaa45e83c66bf041d Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_init_0.5.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_init_0.9.png b/assignment-3/submission/17307130243/img/gmm_init_0.9.png new file mode 100644 index 0000000000000000000000000000000000000000..16d3a5d5b9e7b620d9d18445593a598c04e839d6 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_init_0.9.png differ diff --git a/assignment-3/submission/17307130243/img/gmm_init_0.99.png b/assignment-3/submission/17307130243/img/gmm_init_0.99.png new file mode 100644 index 0000000000000000000000000000000000000000..e51ed04bf9319c11b9a8de47bee38ac09e938de1 Binary files /dev/null and b/assignment-3/submission/17307130243/img/gmm_init_0.99.png differ diff --git a/assignment-3/submission/17307130243/img/init_expr_k-means 0.png b/assignment-3/submission/17307130243/img/init_expr_k-means 0.png new file mode 100644 index 0000000000000000000000000000000000000000..1a41040d03adef16f685a6aab2125e91114efe26 Binary files /dev/null and b/assignment-3/submission/17307130243/img/init_expr_k-means 0.png differ diff --git a/assignment-3/submission/17307130243/img/init_expr_line.png b/assignment-3/submission/17307130243/img/init_expr_line.png new file mode 100644 index 0000000000000000000000000000000000000000..c0a693ca6e14294309f6fb60e1db982c1a2c1f40 Binary files /dev/null and b/assignment-3/submission/17307130243/img/init_expr_line.png differ diff --git 
a/assignment-3/submission/17307130243/img/init_expr_random1.png b/assignment-3/submission/17307130243/img/init_expr_random1.png new file mode 100644 index 0000000000000000000000000000000000000000..e91e2a47443eff6e7b158a6f08ca2c3a914dd74c Binary files /dev/null and b/assignment-3/submission/17307130243/img/init_expr_random1.png differ diff --git a/assignment-3/submission/17307130243/img/ring.png b/assignment-3/submission/17307130243/img/ring.png new file mode 100644 index 0000000000000000000000000000000000000000..8dc1959e9ae9a8b426181d68d65c49bb50ca37a6 Binary files /dev/null and b/assignment-3/submission/17307130243/img/ring.png differ diff --git a/assignment-3/submission/17307130243/img/ring_gmm_outcome.png b/assignment-3/submission/17307130243/img/ring_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..344b4ba4659d1a1831f5916ea2c421e1a8876175 Binary files /dev/null and b/assignment-3/submission/17307130243/img/ring_gmm_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/ring_kmeans_outcome.png b/assignment-3/submission/17307130243/img/ring_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..89de87931f45be8a7111532afd871ac99051a934 Binary files /dev/null and b/assignment-3/submission/17307130243/img/ring_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/img/square.png b/assignment-3/submission/17307130243/img/square.png new file mode 100644 index 0000000000000000000000000000000000000000..d0fe12b69bccde056396851840278ade38b9861a Binary files /dev/null and b/assignment-3/submission/17307130243/img/square.png differ diff --git a/assignment-3/submission/17307130243/img/square_gmm_outcome.png b/assignment-3/submission/17307130243/img/square_gmm_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..34059a9c9ae87a6f022ecaf5bd2c74b87a2b9e48 Binary files /dev/null and b/assignment-3/submission/17307130243/img/square_gmm_outcome.png 
differ diff --git a/assignment-3/submission/17307130243/img/square_kmeans_outcome.png b/assignment-3/submission/17307130243/img/square_kmeans_outcome.png new file mode 100644 index 0000000000000000000000000000000000000000..0b10d01ab40a447b996431f38bfd347044e9676d Binary files /dev/null and b/assignment-3/submission/17307130243/img/square_kmeans_outcome.png differ diff --git a/assignment-3/submission/17307130243/source.py b/assignment-3/submission/17307130243/source.py new file mode 100644 index 0000000000000000000000000000000000000000..1726ecf422da91ec0a775135f4341ef81d2cb25b --- /dev/null +++ b/assignment-3/submission/17307130243/source.py @@ -0,0 +1,789 @@ +import numpy as np + +import matplotlib.pyplot as plt +from scipy.spatial import voronoi_plot_2d # 作图 +from scipy.spatial import Voronoi + + +class KMeans: + + def __init__(self, n_clusters, init='k-means++', + max_iter=300, tol=1e-4, normal=False): + """ + + :param n_clusters:int, the number of clusters to form + :param init:{'k-means++', 'random'}, callable or array-like of shape(n_clusters, n_features) + :param max_iter:int, default=300, maximum number of iterations for a single run + :param tol:float, default=1e-4 + + Attributes: + centers:NdArray of shape(n_clusters, n_features) + labels:NdArray of shape(n_samples,), labels of training data + n_iter:int, the number of iterations + sse: list of float,sum of square error + + + """ + # parameters + self.n_clusters = n_clusters + self.init = init + self.max_iter = max_iter + self.tol = tol + self.normal = normal + + # attributes + self.centers = None + self.labels = None + self.n_iter = 0 + self.sse = [] + + def init_centers(self, train_data, ): + """ + 初始化聚类中心. 
+ :param train_data: NdArray of shape(n_samples, n_features), + the input samples + :return: centers: NdArray of shape (n_clusters, n_features) + """ + assert train_data.shape[0] >= self.n_clusters + if hasattr(self.init, '__array__'): + # 直接赋值 + self.centers = self.init + + elif self.init == 'random': + # 从训练数据中随机选取n_clusters个样本作为聚类中心初始值 + idx = np.random.permutation(train_data.shape[0])[: self.n_clusters] + self.centers = train_data[idx] + + elif self.init == 'k-means++': + # 采用k-means++算法初始化 + self.centers = k_means_plusplus(train_data, self.n_clusters) + + return self.centers + + def update_centers(self, train_data, labels): + """单步计算各簇中心.""" + # dimension of data + n_features = train_data.shape[1] + centers = np.zeros((self.n_clusters, n_features)) + for i in range(self.n_clusters): + centers[i, :] = np.mean(train_data[labels == i], axis=0) + + self.centers = centers + return centers + + def cal_sse(self, train_data, labels): + """计算SSE, SSE的大小用以评估聚类效果.""" + distances = np.zeros(train_data.shape[0]) + for i in range(self.n_clusters): + distances[labels == i] = np.linalg.norm(train_data[labels == i] - self.centers[i], axis=1) + return np.sum(distances ** 2) + + def predict(self, data): + """ + 判断数据所属类别. + :param data: NdArray of shape(n_samples, n_features) + :return: labels: NdArray of shape(n_samples,) + """ + if len(data.shape) == 1: + data = data.reshape((-1, 1)) + + distances = np.zeros((data.shape[0], self.n_clusters)) + for j in range(self.n_clusters): + distances[:, j] = np.linalg.norm(data - self.centers[j], axis=1) + + labels = np.argmin(distances, axis=1) + return labels + + def update_labels(self, train_data): + """单步中对训练数据进行类别划分.""" + self.labels = self.predict(train_data) + return self.labels + + def fit(self, train_data): + """ + 模型训练过程. 
+ :param train_data: NdArray of shape(n_samples, n_features) + :return: init_centers: NdArray of shape(n_clusters, n_features) + """ + if len(train_data.shape) == 1: + train_data = train_data.reshape((-1, 1)) + + # 归一化 + if self.normal: + data_ = normalize(train_data) + else: + data_ = train_data + + # 参数初始化 + self.sse = [] + self.n_iter = 0 + # 聚类中心初始化 + init_centers = self.init_centers(data_) + curr_centers = init_centers + + # 迭代 + for _ in range(self.max_iter): + self.n_iter += 1 + labels = self.update_labels(data_) + self.sse.append(self.cal_sse(data_, labels)) + self.update_centers(data_, labels) + if np.linalg.norm(self.centers - curr_centers, ord='fro') <= self.tol: + break + curr_centers = self.centers + + return init_centers + + +class GaussianMixture: + + def __init__(self, n_clusters, max_iter=30, tol=1e-4, init='k-means++', normal=True): + """ + + :param n_clusters:int, the number of clusters to form + :param max_iter:int, default=300, maximum number of iterations for a single run + :param tol:float, default=1e-4 + :param init:{'k-means++', 'random'}, callable or array-like of shape(n_clusters, n_features) + """ + # parameters + self.n_clusters = n_clusters + self.max_iter = max_iter + self.tol = tol + self.init = init + self.normal = normal + + # attributes + self.weights = None # (n_clusters,) + self.means = None # (n_clusters, n_features) + self.covariances = None # (n_clusters, n_features, n_features) + self.gamma = None # (n_samples, n_clusters) + self.n_iter = 0 + self.Q_funcs = [] + self.labels = None + + def init_params(self, train_data): + """ + 初始化混合高斯分布的各项参数以及隐变量. 
Specifically:
            means: initialized according to self.init ('random' or 'k-means++')
            weights & gamma: uniform initialization
            covariances: initialized as identity matrices

        :param train_data: NdArray of shape(n_samples, n_features)
        :return: None
        """
        n_samples, n_features = train_data.shape
        # uniform
        self.weights = np.ones(self.n_clusters) / self.n_clusters
        self.covariances = np.array([np.identity(n_features)] * self.n_clusters)
        self.gamma = np.ones((n_samples, self.n_clusters)) / self.n_clusters

        # means
        if hasattr(self.init, '__array__'):
            assert self.init.shape == (self.n_clusters, n_features)
            self.means = self.init
        elif self.init == 'random':
            # randomly pick n_clusters samples from the training data as
            # initial centers: permute the sample indices, not the data array
            idx = np.random.permutation(train_data.shape[0])[: self.n_clusters]
            self.means = train_data[idx]

        elif self.init == 'k-means++':
            # k-means++ initialization
            self.means = k_means_plusplus(train_data, self.n_clusters)

    def cal_q(self, train_data):
        """
        Compute the value of the Q function for the current data and parameters,
        i.e. the conditional expectation of the complete-data log-likelihood
        given the data and the current parameters.
        :param train_data: NdArray of shape(n_samples, n_features)
        :return: q: float, the value of the Q function
        """
        n_samples, dim = train_data.shape
        q = 0
        for j in range(self.n_clusters):
            # per-component quantities
            cov = self.covariances[j, :, :]
            cov_inv = np.linalg.inv(cov)
            det = np.abs(np.linalg.det(cov))

            # accumulate this component's contribution
            q += (np.log(self.weights[j]) - 0.5 * np.log(det)
                  - 0.5 * dim * np.log(2 * np.pi)) * np.einsum('i->', self.gamma[:, j])
            for i in range(n_samples):
                v = train_data[i] - self.means[j]
                q -= self.gamma[i, j] * 0.5 * np.matmul(v, np.matmul(v, cov_inv))

        return q

    def e_step(self, train_data):
        """
        E step of the EM algorithm for GMM.
+ :return: None + """ + + # update gamma + n_samples = train_data.shape[0] + for j in range(n_samples): + new_gamma_j = [] + single_data = train_data[j] + for k in range(self.n_clusters): + new_gamma_j.append(self.weights[k] * + multi_normal_pdf(single_data, + self.means[k], + self.covariances[k])) + new_gamma_j = np.array(new_gamma_j) + if sum(new_gamma_j) == 0: + self.gamma[j, :] = np.ones(self.n_clusters) / self.n_clusters + else: + self.gamma[j, :] = new_gamma_j / sum(new_gamma_j) + + # calculate q function + q = self.cal_q(train_data) + self.Q_funcs.append(q) + + def m_step(self, train_data): + """ + M Step of EM Algorithm for GMM. + :return: None + """ + # update weights, not completed, to be normalized + self.weights = np.einsum('ij->j', self.gamma) + + # update means and covariances + for k in range(self.n_clusters): + # update means + self.means[k] = np.matmul(self.gamma[:, k], train_data) / self.weights[k] + + # update covariances + s = 0 + for j in range(train_data.shape[0]): + v = train_data[j] - self.means[k] + s += self.gamma[j, k] * np.matmul(v.reshape((-1, 1)), v.reshape((1, -1))) + self.covariances[k] = s / self.weights[k] + + # to avoid singular cov matrix + self.covariances[k] += 1e-3 * np.eye(train_data.shape[1]) + + # update weight finished + self.weights = self.weights / np.einsum('i->', self.weights) + + def fit(self, train_data): + """ + EM Algorithm process. + :return: None + """ + if len(train_data.shape) == 1: + train_data = train_data.reshape((-1, 1)) + + # 参数初始化 + self.n_iter = 0 + self.Q_funcs = [] + + # normalization + if self.normal: + data = normalize(train_data) + else: + data = train_data + + self.init_params(data) + for _ in range(self.max_iter): + self.e_step(data) + self.n_iter += 1 + + if (self.n_iter > 1 and + np.abs(self.Q_funcs[-1] - self.Q_funcs[-2]) < self.tol): + break + self.m_step(data) + + self.labels = np.argmax(self.gamma, axis=1) + + def predict(self, test_data): + """ + 给定数据集,给出其最有可能所属的子高斯分布. 
+        :param test_data: NdArray of shape(n, n_features)
+        :return: labels: NdArray of shape(n,)
+        """
+        if len(test_data.shape) == 1:
+            test_data = test_data.reshape((-1, 1))
+
+        if self.normal:
+            test_data = normalize(test_data)
+
+        labels = np.zeros(test_data.shape[0])
+        for i in range(test_data.shape[0]):
+            prob = []
+            for k in range(self.n_clusters):
+                prob.append(self.weights[k] *
+                            multi_normal_pdf(test_data[i],
+                                             self.means[k],
+                                             self.covariances[k]))
+            labels[i] = np.argmax(prob)
+
+        return labels
+
+    def pdf(self, x):
+        """
+        Density function of the Gaussian mixture.
+        :param x: NdArray of shape(n_features,)
+        :return: pdf: float
+        """
+        density = 0
+        for n in range(self.n_clusters):
+            density += self.weights[n] * multi_normal_pdf(x, self.means[n], self.covariances[n])
+
+        return density
+
+
+class ClusteringAlgorithm:
+
+    def __init__(self):
+        self.centers = None
+        self.model = None
+        self.n_clusters = 1
+
+    def elbow(self, train_data):
+        """Plot SSE against K and let the user pick the elbow."""
+        k_candidates = np.arange(1, 11)
+        k_sse = np.zeros(10)
+        for i in range(10):
+            model = KMeans(n_clusters=k_candidates[i])
+            model.fit(train_data)
+            k_sse[i] = model.sse[-1]
+        plt.plot(k_candidates, k_sse, marker='.')
+        plt.title("SSE vs K")
+        plt.xlabel("K")
+        plt.ylabel("SSE")
+        plt.savefig("elbow.png", dpi=300)
+        plt.show()
+
+        # input() returns a string, so cast it before building the model
+        k_optimal = int(input("optimal k:"))
+        self.model = KMeans(k_optimal)
+
+    def gap_statistic(self, train_data, figure=False):
+        n_samples, n_features = train_data.shape
+        k_max = max(10, int((n_samples / 2) ** 0.5))
+        data_min = train_data.min(axis=0)
+        data_max = train_data.max(axis=0)
+        k_candidates = np.arange(1, 1 + k_max)
+        k_sse = np.zeros(k_max)
+        s = np.zeros(k_max)
+        expect = np.zeros(k_max)
+
+        # number of Monte Carlo reference datasets
+        n_simulate = 20
+        scalar = ((n_simulate + 1) / n_simulate) ** 0.5
+        k_optimal = 0
+
+        for k in k_candidates:
+            model = KMeans(n_clusters=k)
+            model.fit(train_data)
+            k_sse[k - 1] = model.sse[-1]
+            simulate_sse = np.zeros(n_simulate)
+            for i in range(n_simulate):
+                simulate_data = np.random.uniform(data_min, data_max,
+                                                  [n_samples, n_features])
+                model.fit(simulate_data)
+                simulate_sse[i] = model.sse[-1]
+
+            expect[k - 1] = np.mean(np.log(simulate_sse))
+            s[k - 1] = scalar * np.std(np.log(simulate_sse))
+
+        gap = expect - np.log(k_sse)
+        # the last candidate has no successor, so stop at k_max - 1
+        for k in range(k_max - 1):
+            if gap[k] >= (gap[k + 1] - s[k + 1]):
+                k_optimal = k + 1
+                break
+        if k_optimal == 0:
+            # no k satisfied the gap criterion: fall back to the largest candidate
+            k_optimal = k_max
+
+        if figure:
+            plt.figure()
+            plt.plot(k_candidates, expect, marker='.', label='EW_k')
+            plt.plot(k_candidates, np.log(k_sse), marker='.', label='logW_k')
+            plt.plot(k_candidates, gap, marker='.', label='gap')
+            plt.legend()
+            plt.grid()
+            plt.xlabel("the number of clusters K")
+            plt.savefig("gap_statistic.png", dpi=300)
+            plt.show()
+
+        self.model = KMeans(k_optimal)
+
+    def fit(self, train_data):
+        if len(train_data.shape) == 1:
+            train_data = train_data.reshape((-1, 1))
+        self.model.fit(train_data)
+
+    def predict(self, test_data):
+        return self.model.predict(test_data)
+
+
+colors = ['lightskyblue', 'sandybrown', 'mediumpurple', 'darkseagreen',
+          'salmon', 'darkturquoise', 'hotpink', 'slategrey']
+
+
+class plot:
+    def __init__(self):
+        pass
+
+    @staticmethod
+    def contour(f, scale):
+        """
+        Contour plot of a density.
+        :param f: density function of two scalar arguments
+        :param scale: plot range (x_l, x_u, y_l, y_u)
+        :return: None
+        """
+        x_l, x_u, y_l, y_u = scale
+        X = np.linspace(int(x_l), int(x_u), 50)
+        Y = np.linspace(int(y_l), int(y_u), 50)
+
+        X, Y = np.meshgrid(X, Y)
+        Z = np.zeros_like(X)
+        for i in range(X.shape[0]):
+            for j in range(X.shape[1]):
+                Z[i, j] = f(X[i, j], Y[i, j])
+
+        plt.contour(X, Y, Z, cmap='Greys')
+
+    @staticmethod
+    def scatter(data_, labels, centers, voronoi=True):
+        """
+        Visualize a clustering result (2-D data only).
+        :param data_: NdArray of shape(n_samples, n_features)
+        :param labels: NdArray of shape(n_samples,)
+        :param centers: NdArray of shape(n_clusters, n_features)
+        :param voronoi: bool
+        :return: None
+        """
+        # draw the Voronoi diagram of the centers
+        if voronoi:
+            vor = Voronoi(points=centers)
+            voronoi_plot_2d(vor)
+
+        for i in range(data_.shape[0]):
+            plt.scatter(data_[i, 0], data_[i, 1], color=colors[int(labels[i])], alpha=0.85)
+
+
+"""
+Data generation and clustering experiments
+"""
+
+
+class GenData(object):
+    def __init__(self):
+        pass
+
+    @staticmethod
+    def square():
+        """
+        Generate two rectangular 2-D clusters separated by a vertical gap.
+        :return: data, n_clusters
+        """
+        data = np.zeros((800, 2))
+        for i in range(800):
+            new_data = np.random.uniform([-1, -1], [1, 1])
+            while -0.05 <= new_data[0] <= 0.05:
+                new_data = np.random.uniform([-1, -1], [1, 1])
+            data[i] = new_data
+        return data, 2
+
+    @staticmethod
+    def circle(num):
+        """
+        Generate points uniformly distributed in the unit disk.
+        :param num: int, number of data points
+        :return: data
+        """
+        data = []
+        for i in range(num):
+            new_data = np.random.uniform([-1, -1], [1, 1])
+            while np.linalg.norm(new_data) > 1:
+                new_data = np.random.uniform([-1, -1], [1, 1])
+            data.append(new_data)
+
+        return np.array(data)
+
+    def three_circles(self):
+        """
+        Three mutually tangent disks.
+        :return: data, n_clusters
+        """
+        circle1 = self.circle(200) * 0.5 + [0.5, 0]
+        circle2 = self.circle(200) * 0.5 + [-0.5, 0]
+        circle3 = self.circle(200) * 0.5 + [0, 0.5 * 3 ** 0.5]
+        circle = np.concatenate([circle1, circle2, circle3])
+        np.random.shuffle(circle)
+        return circle, 3
+
+    @staticmethod
+    def ring():
+        """Two concentric noisy rings of radius 1 and 2."""
+        ring1 = []  # ring of radius 1
+        ring2 = []  # ring of radius 2
+
+        for i in range(200):
+            alpha = np.random.uniform(0, 2 * np.pi)
+            error = np.random.normal(0, 0.09)
+            r = 1 + error
+            ring1.append(r * np.array([np.cos(alpha), np.sin(alpha)]))
+
+        for i in range(400):
+            alpha = np.random.uniform(0, 2 * np.pi)
+            error = np.random.normal(0, 0.09)
+            r = 2 + error
+            ring2.append(r * np.array([np.cos(alpha), np.sin(alpha)]))
+
+        return np.concatenate([ring1, ring2]), 2
+
+    @staticmethod
+    def mixed_gaussian(rho, sizes=(200, 200, 200)):
+        """
+        Generate a Gaussian mixture with three components.
+        :param rho: float in (0, 1), correlation coefficient
+        :param sizes: tuple, size of each component
+        :return: data, n_clusters
+        """
+        cov = np.array([[1, rho], [rho, 1]])
+        data1 = np.random.multivariate_normal([-2, 0], cov, size=sizes[0])
+        plt.scatter(data1[:, 0], data1[:, 1], color=colors[0], alpha=0.8, edgecolors='white')
+        data2 = np.random.multivariate_normal([2, 0], cov * 0.7, size=sizes[1])
+        plt.scatter(data2[:, 0], data2[:, 1], color=colors[1], alpha=0.8, edgecolors='white')
+        data3 = np.random.multivariate_normal([1.5, 2], cov * 1.5, size=sizes[2])
+        plt.scatter(data3[:, 0], data3[:, 1], color=colors[2], alpha=0.8, edgecolors='white')
+        plt.title("rho={}".format(rho))
+        plt.savefig("gmm_init_{}.png".format(rho), dpi=300)
+        plt.show()
+        return np.concatenate([data1, data2, data3]), 3
+
+
+def init_expr():
+    """
+    Effect of the initialization method on K-means.
+    :return: None
+    """
+    data_gen = GenData()
+    train_data, _ = data_gen.mixed_gaussian(0.1, (100, 200, 300))
+    means = np.array([[-2, 0], [2, 0], [1.5, 2]])
+
+    # the full dataset
+    plt.scatter(train_data[:, 0], train_data[:, 1], color='grey', alpha=0.6)
+    plt.savefig("total_init_expr.png", dpi=300)
+    plt.show()
+    sse = []
+
+    methods = ['random', 'k-means++']
+
+    # run each initialization method three times
+    for j in range(2):
+        for i in range(3):
+            model = KMeans(3, init=methods[j])
+            init_centers = model.fit(train_data)
+
+            plot.scatter(train_data, model.labels, model.centers)
+            plt.xlim(-5, 5)
+            plt.ylim(-3, 6)
+
+            # mark the initial, true and final centers
+            plt.scatter(init_centers[:, 0], init_centers[:, 1], marker='x', color='black', label='initial centers')
+            plt.scatter(means[:, 0], means[:, 1], marker='x', color='gold', label='actual centers')
+            plt.scatter(model.centers[:, 0], model.centers[:, 1], marker='x', color='red', label='final centers')
+            plt.legend()
+            plt.savefig("init_expr_{}{}.png".format(methods[j], i), dpi=300)
+            plt.show()
+            sse.append(model.sse)
+
+    legends = ['random1', 'random2', 'random3', 'k-means++1', 'k-means++2', 'k-means++3']
+    for i in range(len(sse)):
+        mk = '--' if i < 3 else '-'
+        plt.plot(sse[i], linestyle=mk, color=colors[i], label=legends[i])
+
+    plt.xlabel("iteration")
+    plt.ylabel("SSE")
+    plt.legend()
+    plt.grid()
+    plt.title("model under different initializing methods".capitalize())
+    plt.savefig("init_expr_line.png", dpi=300)
+    plt.show()
+
+
+methods = ['k-means', 'GMM']
+
+
+def general(train_data, n_cluster, filename, method='k-means'):
+    """Fit one model on the data, then plot its loss curve and clustering."""
+    assert method in methods
+    model = KMeans(n_cluster, normal=True) if method == 'k-means' else GaussianMixture(n_cluster, max_iter=50)
+    model.fit(train_data)
+    loss = model.sse if method == 'k-means' else model.Q_funcs
+    name = 'loss' if method == 'k-means' else 'q_function'
+    # loss curve
+    plt.plot(loss)
+    plt.xlabel("iteration")
+    plt.ylabel(name)
+    plt.grid()
+    plt.savefig(name + filename, dpi=300)
+    plt.show()
+
+    # clustering result
+    centers = model.centers if method == 'k-means' else model.means
+    plot.scatter(train_data, model.labels, centers, False)
+    plt.scatter(centers[:, 0], centers[:, 1], marker='x', color='red')
+    plt.savefig(filename, dpi=300)
+    plt.show()
+
+
+def general_compare(train_data, n_clusters, data_name):
+    """Run K-means and GMM on the same data and save both results."""
+    plt.scatter(train_data[:, 0], train_data[:, 1], color='grey', alpha=0.5)
+    plt.savefig("{}.png".format(data_name), dpi=300)
+    plt.show()
+    general(train_data, n_clusters, data_name + "_kmeans_outcome.png")
+    general(train_data, n_clusters, data_name + "_gmm_outcome.png", "GMM")
+
+
+def square_expr():
+    """Clustering on the square data."""
+    gen = GenData()
+    train_data, n_clusters = gen.square()
+    general_compare(train_data, n_clusters, "square")
+
+
+def circle_expr():
+    """Clustering on the three-circle data."""
+    gen = GenData()
+    train_data, n_clusters = gen.three_circles()
+    general_compare(train_data, n_clusters, "circle")
+
+
+def ring_expr():
+    """Clustering on the ring data."""
+    gen = GenData()
+    train_data, n_clusters = gen.ring()
+    general_compare(train_data, n_clusters, "ring")
+
+
+def mixed_gaussian_expr():
+    np.random.seed(16)
+    gen = GenData()
+    train_data, n_clusters = gen.mixed_gaussian(0.1)
+    general_compare(train_data, n_clusters, "gmm_01_balance")
+
+    train_data, n_clusters = gen.mixed_gaussian(0.5)
+    general_compare(train_data, n_clusters, "gmm_05_balance")
+
+    train_data, n_clusters = gen.mixed_gaussian(0.9)
+    general_compare(train_data, n_clusters, "gmm_09_balance")
+
+    train_data, n_clusters = gen.mixed_gaussian(0.99)
+    general_compare(train_data, n_clusters, "gmm_099_balance")
+
+    train_data, n_clusters = gen.mixed_gaussian(0.5, (50, 50, 50))
+    general_compare(train_data, n_clusters, "gmm_05_balance_s")
+
+    train_data, n_clusters = gen.mixed_gaussian(0.5, (500, 500, 500))
+    general_compare(train_data, n_clusters, "gmm_05_balance_l")
+
+    train_data, n_clusters = gen.mixed_gaussian(0.5, (80, 200, 600))
+    general_compare(train_data, n_clusters, "gmm_05_imbalance")
+
+
+"""utils"""
+
+
+def k_means_plusplus(dataset, n_clusters):
+    """
+    Select initial cluster centers with the k-means++ (farthest-point) rule.
+    :param dataset: NdArray of shape(n_samples, n_features)
+    :param n_clusters: int, number of clusters and centroids to form
+    :return: centers: NdArray of shape(n_clusters, n_features)
+    """
+    n_samples = dataset.shape[0]
+    centers = [dataset[np.random.randint(0, n_samples)]]
+
+    # choose the remaining n_clusters - 1 centers
+    for _ in range(1, n_clusters):
+        max_dist = 0
+        next_center = dataset[0]
+        for data in dataset:
+            # distance from data to each chosen center
+            dist_centers = np.linalg.norm(data - centers, axis=1)
+            min_dist = np.min(dist_centers)
+
+            # keep the point farthest from its nearest chosen center
+            if min_dist > max_dist:
+                max_dist = min_dist
+                next_center = data
+
+        centers.append(next_center)
+    return np.array(centers)
+
+
+def multi_normal_pdf(x, mean, cov):
+    """Normal density; cov is a variance (1-D) or a covariance matrix."""
+    if isinstance(x, (float, int)):
+        # 1-D case: cov is the variance, consistent with the matrix branch below
+        return np.exp(-0.5 * (x - mean) ** 2 / cov) / (2 * np.pi * cov) ** 0.5
+
+    d = x.shape[0]
+    inv = np.linalg.pinv(cov)
+    det = 1 / np.linalg.det(inv)
+    v = x - mean
+    exp_part = -0.5 * np.matmul(v, np.matmul(v, inv))
+    return np.exp(exp_part) / ((2 * np.pi) ** d * det) ** 0.5
+
+
+def normalize(data):
+    """Standardize each feature to zero mean and unit variance."""
+    mean = np.mean(data, axis=0)
+    std = np.std(data, axis=0)
+    return (data - mean) / std
+
+
+def shuffle(*datas):
+    """Concatenate several datasets and shuffle them with their labels."""
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int) * i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+if __name__ == '__main__':
+    # initialization experiment
+    init_expr()
+
+    square_expr()
+
+    circle_expr()
+
+    ring_expr()
+
+    mixed_gaussian_expr()
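The `k_means_plusplus` helper above simplifies the sampling step of k-means++ to a deterministic farthest-point choice. A minimal self-contained sketch of that rule, vectorized over all points; the name `farthest_point_seeding` and the `rng` parameter are illustrative, not part of source.py:

```python
import numpy as np


def farthest_point_seeding(dataset, n_clusters, rng=None):
    """Deterministic k-means++ variant: instead of sampling the next center
    with probability proportional to D(x), take the point whose distance to
    its nearest chosen center, D(x), is largest."""
    rng = np.random.default_rng(rng)
    # first center: a uniformly random sample
    centers = [dataset[rng.integers(dataset.shape[0])]]
    for _ in range(1, n_clusters):
        # D(x): distance from every point to its nearest chosen center
        diffs = dataset[:, None, :] - np.asarray(centers)[None, :, :]
        d = np.min(np.linalg.norm(diffs, axis=2), axis=1)
        centers.append(dataset[np.argmax(d)])
    return np.array(centers)
```

With two well-separated clusters, the second center always lands in the cluster opposite the first, which is exactly the behavior the README relies on to avoid seeding two centers inside one clump.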