diff --git a/assignment-3/submission/18307130104/img/data.png b/assignment-3/submission/18307130104/img/data.png
new file mode 100644
index 0000000000000000000000000000000000000000..ed1b9b72af278e67e2b13b7ed558d16479b9c225
Binary files /dev/null and b/assignment-3/submission/18307130104/img/data.png differ
diff --git a/assignment-3/submission/18307130104/img/data4.png b/assignment-3/submission/18307130104/img/data4.png
new file mode 100644
index 0000000000000000000000000000000000000000..aec1120aa909e038445885452fa724f25c11a042
Binary files /dev/null and b/assignment-3/submission/18307130104/img/data4.png differ
diff --git a/assignment-3/submission/18307130104/img/elbow.png b/assignment-3/submission/18307130104/img/elbow.png
new file mode 100644
index 0000000000000000000000000000000000000000..663ff8d277bbaef134407ecbc8ad2e65cece527f
Binary files /dev/null and b/assignment-3/submission/18307130104/img/elbow.png differ
diff --git a/assignment-3/submission/18307130104/img/res-bad-initialize.png b/assignment-3/submission/18307130104/img/res-bad-initialize.png
new file mode 100644
index 0000000000000000000000000000000000000000..c0ccab0a874b0766b81653ade932839f988900af
Binary files /dev/null and b/assignment-3/submission/18307130104/img/res-bad-initialize.png differ
diff --git a/assignment-3/submission/18307130104/img/res.png b/assignment-3/submission/18307130104/img/res.png
new file mode 100644
index 0000000000000000000000000000000000000000..b2ae48ffcf67944270cc310cbd63b5bd5e646f08
Binary files /dev/null and b/assignment-3/submission/18307130104/img/res.png differ
diff --git a/assignment-3/submission/18307130104/readme.md b/assignment-3/submission/18307130104/readme.md
new file mode 100644
index 0000000000000000000000000000000000000000..e48e221687b384b33563c197aab95bded9bb2873
--- /dev/null
+++ b/assignment-3/submission/18307130104/readme.md
@@ -0,0 +1,156 @@
+# Course Report
+
+18307130104
+
+This is my report for prml assignment-3; the code can be found in source.py.
+
+Assignment-3 implements the KMeans, GaussianMixture, and ClusteringAlgorithm classes. The ClusteringAlgorithm class combines the Elbow Method with the KMeans class to determine the number of clusters in the data.
+
+## Class Implementation Notes
+
+### KMeans
+
+The key idea of KMeans is that each cluster keeps a centroid; to classify a point, compute its distance to every centroid and assign it to the cluster whose centroid is nearest.
+
+KMeans updates each cluster's centroid from the points currently assigned to that cluster: the new centroid is the mean of those points. To see why this is reasonable, take normally distributed data as an example. For predictions to match reality, the centroid should be close to the distribution's $\mu$; given a set of points known to come from one distribution, the most direct way to estimate its $\mu$ is to take their mean, which agrees with the standard parameter estimate for a normal distribution.
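+
+As a minimal sketch of one such iteration (illustrative only, not the submitted code; `data` is assumed to be an (N, D) array and `cents` a (K, D) array):
+
+```python
+import numpy as np
+
+def kmeans_step(data, cents):
+    # assign every point to the cluster with the nearest centroid
+    dists = ((data[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
+    labels = dists.argmin(axis=1)
+    # move each centroid to the mean of its assigned points
+    for k in range(len(cents)):
+        if (labels == k).any():
+            cents[k] = data[labels == k].mean(axis=0)
+    return labels, cents
+```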
+
+A crucial part of KMeans is how the initial centroids are chosen. Since the training data is shuffled when generated, we could simply take the first K points as the centroids. With this method, however, two initial centroids may land close together, which makes the whole algorithm converge slowly. We therefore want well-separated initial centroids, so that the different clusters can be told apart quickly.
+
+This leads to KMeans++, which optimizes the choice of initial centroids. It first picks one point as the first centroid, then repeatedly adds the point whose minimum distance to the already-chosen centroids is largest, until K centroids have been found.
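+
+A minimal sketch of this farthest-point initialization (standard KMeans++ instead samples the next centroid with probability proportional to the squared distance; `data` is assumed to be a shuffled (N, D) array):
+
+```python
+import numpy as np
+
+def init_centroids(data, k):
+    cents = [data[0]]  # the first centroid is simply the first point
+    for _ in range(1, k):
+        # each point's squared distance to its nearest chosen centroid
+        d2 = ((data[:, None, :] - np.array(cents)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
+        # the point farthest from all chosen centroids becomes the next one
+        cents.append(data[np.argmax(d2)])
+    return np.array(cents)
+```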
+
+### GMM
+
+The GMM is only implemented for one- and two-dimensional data. Because the parameters of a two-dimensional normal distribution are stored differently, the two cases are handled separately. The explanation below uses the one-dimensional case; the changes needed for two dimensions are described afterwards.
+
+The algorithm consists of initialization, an E step, and an M step; the E and M steps alternate until training finishes.
+
+#### Initialization
+
+Initialization is a serious issue for a GMM. With poorly chosen initial points and few training epochs, clusters end up merged or split: several true clusters get treated as one, or a single cluster gets broken into several.
+
+The first idea that comes to mind is to spread the initial $\mu$ values evenly between the smallest and largest points. This works well when the gaps between the clusters' $\mu$ values are fairly even, but on a data distribution like the one shown below, the result after 100 training epochs is unsatisfactory.
+
+![](img/data4.png)
+
+After 100 training epochs, each distribution's $p_i$, $\mu$, and $\sigma$ are:
+
+> pi
+>
+> [3.38068955e-01 3.34008836e-01 1.32723871e-19 3.27922208e-01]
+>
+> mu
+>
+> [[-14.19834808]
+>
+> [ 19.9809874 ]
+>
+> [149.95577527]
+>
+> [149.94683807]]
+>
+> sigma
+>
+> [[309.49790345]
+>
+> [ 0.8893968 ]
+>
+> [ 3.94501471]
+>
+> [ 0.97396421]]
+
+The two clusters on the left were treated as one cluster; the giveaway is the merged cluster's large $\sigma$. The rightmost cluster was split into two; the giveaway there is one cluster's near-zero $p_i$. This suggests a fix: during training, split any cluster whose $\sigma$ grows unusually large, and merge any cluster whose $p_i$ becomes tiny into a nearby one. A search shows such an algorithm does exist, namely ISODATA, whose idea is quite similar.
+
+I did not improve the GMM along those lines, though; I used a simpler approach instead. Since KMeans was already implemented, it can produce the initial partition directly: train it for just a few rounds, take its centroids as each cluster's $\mu$, and estimate each cluster's normal-distribution parameters from KMeans's classification of the data to obtain $\sigma$. This proves quite effective; the results are:
+
+> pi
+>
+> [0.328 0.3356 0.2672 0.0692]
+>
+> mu
+>
+> [[ 2.00081537e+01]
+> [ 1.50029575e+02]
+> [-1.99476037e+01]
+> [ 3.14833874e-02]]
+>
+> sigma
+>
+> [[0.96566592]
+> [0.97531871]
+> [0.97361227]
+> [1.00238844]]
+
+As the numbers show, a GMM initialized with KMeans no longer produces a tiny $p_i$ or an oversized $\sigma$.
+
+#### E Step
+
+The E step computes $\gamma_{nk}$, the posterior probability that sample $x^{(n)}$ belongs to the $k$-th Gaussian: $\gamma_{nk}=\frac{\pi_k N(x^{(n)};\mu_k,\sigma_k)}{\sum_{k'} \pi_{k'} N(x^{(n)};\mu_{k'},\sigma_{k'})}$
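+
+A vectorized sketch of this step for one-dimensional data (assuming `x` is an (N,) array, `pi` and `mu` are (K,) arrays, and `std` holds the K standard deviations; scipy's `norm.pdf` supplies the density):
+
+```python
+import numpy as np
+from scipy.stats import norm
+
+def e_step(x, pi, mu, std):
+    # dens[n, k] = pi_k * N(x_n; mu_k, sigma_k)
+    dens = pi[None, :] * norm.pdf(x[:, None], mu[None, :], std[None, :])
+    # normalize over k to obtain the responsibilities gamma[n, k]
+    return dens / dens.sum(axis=1, keepdims=True)
+```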
+
+#### M Step
+
+The M step updates the parameters. Because of the constraint
+
+$\sum_{k=1}^K p_k = 1$
+
+we cannot simply take derivatives of the likelihood; instead, a Lagrange multiplier is introduced to remove the constraint.
+
+$L(p, \lambda)=\sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log p_k + \lambda(\sum_{k=1}^K p_k -1)$
+
+$\frac {\partial L(p,\lambda)}{\partial p_j}=\sum_{i=1}^N \frac 1 {p_j} \gamma_{ij}+\lambda$
+
+Setting the partial derivative to zero gives
+
+$\sum_{i=1}^N \gamma_{ij}=-\lambda p_j \quad (*)$
+
+Summing both sides over $j$,
+
+$\sum_{j=1}^K \sum_{i=1}^N \gamma_{ij}=\sum_{j=1}^K -\lambda p_j = -\lambda$
+
+Since also
+
+$\sum_{j=1}^K \gamma_{ij} = 1$
+
+it follows that
+
+$\lambda=-N$
+
+Substituting back into $(*)$ gives
+
+$p_j=\frac {\sum_{i=1}^N \gamma_{ij}}{N}$
+
+The likelihood-maximizing $\mu$ and $\sigma$ are then
+
+$\mu_j = \frac {\sum_{i=1}^N x_i \gamma_{ij}}{\sum_{i=1}^N \gamma_{ij}}$
+
+$\sigma_j^2 = \frac{\sum_{i=1}^N (x_i - \mu_j)^2\gamma_{ij}}{\sum_{i=1}^N \gamma_{ij}}$
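+
+A sketch of these three updates for one-dimensional data (`gamma` is the (N, K) responsibility matrix from the E step; the last value returned is the variance, matching $\sigma_j^2$ above):
+
+```python
+import numpy as np
+
+def m_step(x, gamma):
+    nk = gamma.sum(axis=0)                      # effective size of each cluster
+    pi = nk / len(x)                            # p_j = (sum_i gamma_ij) / N
+    mu = (gamma * x[:, None]).sum(axis=0) / nk  # weighted means
+    var = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
+    return pi, mu, var
+```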
+
+#### Changes for the Two-Dimensional Case
+
+For two-dimensional data, what needs to change is the implementation of the normal density function. Since the test data is mostly one-dimensional, the two-dimensional case uses the simplest random initialization; the algorithmic improvements focus on the one-dimensional case.
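+
+For reference, a sketch of the density with a diagonal covariance, matching the semantics of `multi_normal` in source.py (`var` is the per-dimension variance vector):
+
+```python
+import numpy as np
+
+def diag_normal_pdf(x, mu, var):
+    d = len(mu)
+    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
+    return norm_const * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
+```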
+
+## Experiments
+
+To evaluate KMeans++, a test was run on the dataset shown below.
+
+![](img/data.png)
+
+When the initial centroids happen to lie close together, the result is:
+
+![](img/res-bad-initialize.png)
+
+With KMeans++ initialization, the clustering improves markedly:
+
+![](img/res.png)
+
+The GMM experiments were already covered above, in the discussion of the improved GMM initialization.
+
+## Automatically Choosing the Number of Clusters
+
+The Elbow Method is combined with KMeans: given an upper bound on K, the algorithm tries every candidate K and uses KMeans's performance at each K to decide on the final K. The measure is the sum of squared Euclidean distances from every point to its cluster's centroid.
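+
+The sweep itself is short; a sketch mirroring the loop in `ClusteringAlgorithm.fit` (here `train_data` and `upper` are placeholders; `KMeans.fit` returns the within-cluster squared error):
+
+```python
+import numpy as np
+from source import KMeans
+
+train_data = np.random.normal(size=(100, 2))  # placeholder data
+upper = 10                                    # placeholder upper bound on K
+# sse[i] is the within-cluster squared error for K = i + 2
+sse = [KMeans(k).fit(train_data) for k in range(2, upper)]
+```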
+
+On the test dataset, the resulting curve looks like this:
+
+![](img/elbow.png)
+
+Although the eye can easily tell that both K = 3 and K = 4 work well, defining what counts as the "elbow" takes some thought.
+
+My approach: for each $K_i \in (2, K_{max})$, compute the average rate of change over $(2, K_i)$ and over $(K_i, K_{max})$; if the two rates differ by more than a threshold, $K_i$ joins the set of candidate values, and the candidate with the largest difference is chosen as K. If no candidate qualifies, either $K_{max}$ is too small or K should be 2.
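+
+A sketch of this selection rule, mirroring `ClusteringAlgorithm.fit` (`sse` is the list from the sweep above, with `sse[i]` the error for K = i + 2; 0.3 is the threshold the code uses):
+
+```python
+best_k, best_gap = 2, 0  # fall back to K = 2 if nothing qualifies
+n = len(sse)
+for i in range(1, n - 1):
+    left = (sse[0] - sse[i]) / i                 # mean decrease over (2, K_i)
+    right = (sse[i] - sse[n - 1]) / (n - 1 - i)  # mean decrease over (K_i, K_max)
+    gap = left - right
+    if gap > 0.3 * max(left, right) and gap > best_gap:
+        best_k, best_gap = i + 2, gap
+```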
+
diff --git a/assignment-3/submission/18307130104/source.py b/assignment-3/submission/18307130104/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6cb5387dde5652241a044dc3a111528b223558a
--- /dev/null
+++ b/assignment-3/submission/18307130104/source.py
@@ -0,0 +1,246 @@
+import numpy as np
+import matplotlib.pyplot as plt
+
+class KMeans:
+ def __init__(self, n_clusters):
+ self.K = n_clusters
+ self.Cent = []
+ self.MAXDIS = 1e9
+
+ def Dis(self, a, b):
+        # mean squared difference (orders candidates like squared Euclidean distance)
+        return np.mean((a - b) ** 2)
+
+ def GetCluster(self, s):
+ mdis = self.MAXDIS
+ ntype = -1
+ for i in range(len(self.Cent)):
+ ndis = self.Dis(self.Cent[i], s)
+ # print(self.Cent[i], s, ndis)
+ if ndis < mdis:
+ mdis = ndis
+ ntype = i
+ return ntype
+
+ def fit(self, train_data, max_turn = 300):
+        # the data is shuffled at generation time, so the first point can serve as the first centroid
+ if self.K > len(train_data):
+            print("error: n_clusters exceeds the number of samples")
+ return
+        # choose the remaining centroids with a better method (KMeans++-style farthest-point selection)
+ self.Cent.append(train_data[0])
+ for i in range(1, self.K):
+ mxc = None
+ mx = 0
+ for d in range(len(train_data)):
+ ti = self.GetCluster(train_data[d])
+ di = self.Dis(self.Cent[ti], train_data[d])
+ if di > mx:
+ mx = di
+ mxc = d
+            if mxc is None:
+ print('error')
+ return
+ self.Cent.append(train_data[mxc])
+        # initialize every point's cluster label as unassigned (-1)
+ T = []
+ D = []
+ for i in range(len(train_data)):
+ T.append(-1)
+ D.append(i)
+ D = np.array(D)
+ changed = True
+ turn = 0
+        while changed and turn < max_turn:
+            # loop until no point changes cluster, capped at max_turn rounds
+ changed = False
+ for i in range(len(train_data)):
+ ntype = self.GetCluster(train_data[i])
+ if ntype != T[i]:
+                    # the point's cluster assignment changed
+ T[i] = ntype
+ changed = True
+ for i in range(self.K):
+                # build a boolean filter selecting the points of cluster i
+ filt = []
+ have_item = False
+ for j in range(len(train_data)):
+ if T[j] == i:
+ have_item = True
+ filt.append(True)
+ else:
+ filt.append(False)
+                # new centroid: the mean of the points assigned to cluster i
+ if have_item:
+ self.Cent[i] = np.mean(train_data[filt], axis = 0)
+ turn += 1
+ ret = 0
+        # accumulate the within-cluster squared error for the Elbow Method
+ for i in range(len(train_data)):
+ ret += float(self.Dis(self.Cent[T[i]], train_data[i]))
+ return ret
+
+ def predict(self, test_data):
+ ret = []
+ for i in range(len(test_data)):
+ ret.append(self.GetCluster(test_data[i]))
+ return np.array(ret)
+
+class GaussianMixture:
+
+ def __init__(self, n_clusters):
+ self.K = n_clusters
+ self.pi = np.random.randint(0, 100, size=n_clusters)
+ self.pi = self.pi / np.sum(self.pi)
+ self.mu = None
+ self.sigma = None
+ self.gama = None
+
+ def normal(self, mu, sigma, x):
+        # 1-D Gaussian density; note `sigma` holds the variance, as stored by fit
+        mu = float(mu)
+        sigma = float(sigma)
+        x = float(x)
+        return (1/((2*np.pi*sigma) ** 0.5 + 1e-10)) * np.exp(-(x-mu)**2/(2*sigma + 1e-10))
+
+ def multi_normal(self, mu, sigma, x):
+        # in practice only the two-dimensional case occurs
+ x = (x - mu).reshape(x.shape[0], 1)
+ sigma = np.diag(sigma)
+ return float((1 / ((np.linalg.det(sigma) ** 0.5) * ((2 * np.pi) ** (x.shape[0] / 2)))) * \
+ np.exp(-0.5 * np.matmul(np.matmul(x.T, np.linalg.inv(sigma)), x)))
+
+ def fit(self, train_data):
+ n = train_data.shape[0]
+ if(len(train_data.shape) == 1 or train_data.shape[1] == 1):
+ train_data = train_data.reshape(-1, 1)
+            # one-dimensional case
+ kmeans_init = KMeans(self.K)
+            # initialize mu and sigma from a short KMeans run
+ kmeans_init.fit(train_data, max_turn=10)
+ tpe = kmeans_init.predict(train_data)
+ mu = []
+ sigma = []
+ for i in range(self.K):
+ mu.append([float(kmeans_init.Cent[i])])
+ tmp = 0
+ cnt = 0
+ for j in range(len(train_data)):
+ if i == tpe[j]:
+ tmp += train_data[j] ** 2
+ cnt += 1
+                sigma.append(tmp/cnt - kmeans_init.Cent[i] ** 2)  # variance: E[x^2] - mean^2
+ self.mu = np.array(mu)
+ self.sigma = np.array(sigma)
+ for steps in range(50):
+                # E step: compute the responsibilities gamma
+ tmp_q = []
+ for i in range(n):
+ tmp_q.append([])
+ tot = 1e-10
+ for j in range(self.K):
+ multin = self.normal(self.mu[j], self.sigma[j], train_data[i])
+ qi = self.pi[j] * multin
+ tot += qi
+ tmp_q[i].append(qi)
+ for j in range(self.K):
+ tmp_q[i][j] /= tot
+ self.gama = np.array(tmp_q)
+                # M step: update pi, mu, and sigma
+ self.pi = np.sum(self.gama, axis = 0) / (np.sum(self.gama) + 1e-10)
+ self.mu = np.zeros((self.K, 1))
+ for k in range(self.K):
+ self.mu[k] = np.average(train_data, axis = 0, weights=self.gama[:, k])
+ self.sigma = np.zeros((self.K, 1))
+ for k in range(self.K):
+ self.sigma[k] = 1e-5 + np.average((train_data - self.mu[k]) ** 2, axis = 0, weights=self.gama[:, k])
+ else:
+ m = train_data.shape[1]
+ self.mu = np.array([train_data[i] for i in range(self.K)])
+ self.sigma = np.array([[10, 10] for i in range(self.K)])
+ for steps in range(50):
+                # E step: compute the responsibilities gamma
+ tmp_q = []
+ for i in range(n):
+ tmp_q.append([])
+ tot = 1e-10
+ for j in range(self.K):
+ multin = self.multi_normal(self.mu[j], self.sigma[j], train_data[i])
+ qi = self.pi[j] * multin
+ tot += qi
+ tmp_q[i].append(qi)
+ for j in range(self.K):
+ tmp_q[i][j] /= tot
+ self.gama = np.array(tmp_q)
+                # M step: update pi, mu, and sigma
+ self.pi = np.sum(self.gama, axis = 0) / (np.sum(self.gama) + 1e-10)
+ self.mu = np.zeros((self.K, 2))
+ for k in range(self.K):
+ self.mu[k] = np.average(train_data, axis=0, weights=self.gama[:, k])
+ self.sigma = np.zeros((self.K, 2))
+ for k in range(self.K):
+ self.sigma[k] = 1e-3 + np.average((train_data - self.mu[k]) ** 2, axis=0, weights=self.gama[:, k])
+
+ def predict(self, test_data):
+ out = []
+ if(len(test_data.shape) == 1 or test_data.shape[1] == 1):
+            # one-dimensional
+ for i in range(test_data.shape[0]):
+ mxp = 0
+ cluster_id = -1
+ for j in range(self.K):
+ pi = self.pi[j] * self.normal(self.mu[j], self.sigma[j], test_data[i])
+ if(pi > mxp):
+ mxp = pi
+ cluster_id = j
+ out.append(cluster_id)
+ else:
+            # two-dimensional
+ for i in range(test_data.shape[0]):
+ mxp = 0
+ cluster_id = -1
+ for j in range(self.K):
+ pi = self.pi[j] * self.multi_normal(self.mu[j], self.sigma[j], test_data[i])
+ if(pi > mxp):
+ mxp = pi
+ cluster_id = j
+ out.append(cluster_id)
+ return np.array(out)
+
+class ClusteringAlgorithm:
+
+ def __init__(self):
+ self.cluster_num = 2
+ self.model = None
+
+ def fit(self, train_data, upper):
+        # (unused) within-cluster squared error for K = 1, kept as a baseline
+        cent = np.mean(train_data, axis = 0)
+ tot = 0
+ for p in train_data:
+ tot += float(np.mean((p - cent) ** 2))
+ retlist = []
+        ks = []  # candidate K values ("id" would shadow the builtin)
+ for i in range(2, upper):
+ kmeans = KMeans(i)
+ ret = kmeans.fit(train_data)
+ retlist.append(ret)
+            ks.append(i)
+ mx = 0
+        # if no K qualifies, fall back to K = 2
+ self.cluster_num = 2
+ n = len(retlist)
+ for i in range(1, n - 1):
+ del1 = (retlist[0] - retlist[i]) / i
+ del2 = (retlist[i] - retlist[n - 1]) / (n - 1 - i)
+ delta = del1 - del2
+            # keep the qualifying K with the largest difference
+ if delta > 0.3 * max(del1, del2) and delta > mx:
+ mx = delta
+ self.cluster_num = i + 2
+        # l1 = plt.plot(ks, retlist, 'b--')
+ # plt.savefig(f'./img/elbow.png')
+ # plt.show()
+ self.model = KMeans(self.cluster_num)
+ self.model.fit(train_data)
+
+ def predict(self, test_data):
+ return self.model.predict(test_data)
diff --git a/assignment-3/submission/18307130104/tester_demo.py b/assignment-3/submission/18307130104/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ff13c20b11a2e5c14529ede9d0f9c1632fd75de
--- /dev/null
+++ b/assignment-3/submission/18307130104/tester_demo.py
@@ -0,0 +1,267 @@
+import numpy as np
+import sys
+import matplotlib.pyplot as plt
+from source import KMeans, GaussianMixture, ClusteringAlgorithm
+
+
+def shuffle(*datas):
+ data = np.concatenate(datas)
+ label = np.concatenate([
+ np.ones((d.shape[0],), dtype=int)*i
+ for (i, d) in enumerate(datas)
+ ])
+ N = data.shape[0]
+ idx = np.arange(N)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ return data, label
+
+
+def data_1():
+ mean = (1, 2)
+ cov = np.array([[73, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[21.2, 0], [0, 32.1]])
+ y = np.random.multivariate_normal(mean, cov, (200,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ data, _ = shuffle(x, y, z)
+ return (data, data), 3
+
+
+def data_2():
+ train_data = np.array([
+ [23, 12, 173, 2134],
+ [99, -12, -126, -31],
+ [55, -145, -123, -342],
+ ])
+ return (train_data, train_data), 2
+
+
+def data_3():
+ train_data = np.array([
+ [23],
+ [-2999]
+ ])
+ return (train_data, train_data), 2
+
+def data_4():
+ mean = -20
+ cov = 1
+ x = np.random.normal(mean, cov, 800)
+
+ mean = 0
+ cov = 1
+ y = np.random.normal(mean, cov, 200)
+
+ mean = 150
+ cov = 1
+ z = np.random.normal(mean, cov, 1000)
+
+ mean = 20
+ cov = 1
+ k = np.random.normal(mean, cov, 1000)
+
+ # plt.scatter(x, np.zeros(len(x)))
+ # plt.scatter(y, np.zeros(len(y)))
+ # plt.scatter(z, np.zeros(len(z)))
+ # plt.scatter(k, np.zeros(len(k)))
+ # plt.show()
+ data, _ = shuffle(x, y, z, k)
+ return (data[0:2500], data[2500:3000]), 4
+
+def data_5():
+ mean = (1, 2, 1, 4)
+ cov = np.array([[73, 0, 0, 0], [0, 22, 0, 0], [0, 0, 11, 0], [0, 0, 0, 20]])
+ x = np.random.multivariate_normal(mean, cov, (5000,))
+
+ mean = (16, -5, 16, -5)
+ cov = np.array([[21.2, 0, 0, 0], [0, 32.1, 0, 0], [0, 0, 11, 0], [0, 0, 0, 20]])
+ y = np.random.multivariate_normal(mean, cov, (2000,))
+
+ mean = (10, 22, 10, 22)
+ cov = np.array([[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]])
+ z = np.random.multivariate_normal(mean, cov, (3000,))
+
+ data, _ = shuffle(x, y, z)
+ return (data, data), 3
+
+def display(data, name):
+ datas =[[],[],[]]
+ for kind in range(3):
+ for i in range(len(data[kind])):
+ datas[kind].append(data[kind][i])
+
+ for each in datas:
+ each = np.array(each)
+ if(each.size > 0):
+ plt.scatter(each[:, 0], each[:, 1])
+ plt.savefig(f'img/{name}')
+ plt.show()
+
+def displayres(data, label, name):
+ datas =[[],[],[]]
+ for i in range(len(data)):
+ datas[label[i]].append(data[i])
+
+ for each in datas:
+ each = np.array(each)
+ if(each.size > 0):
+ plt.scatter(each[:, 0], each[:, 1])
+ plt.savefig(f'img/{name}')
+ plt.show()
+
+def data_6():
+ mean = (1, 2)
+ cov = np.array([[33, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[11, 0], [0, 12]])
+ y = np.random.multivariate_normal(mean, cov, (200,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ data, _ = shuffle(x, y, z)
+ display([x, y, z], 'data')
+ return (data, data), 3
+
+def data_7():
+ # mean = (1, 2)
+ # cov = np.array([[33, 0], [0, 22]])
+ # x = np.random.multivariate_normal(mean, cov, (2000,))
+
+ # mean = (16, -5)
+ # cov = np.array([[11, 0], [0, 12]])
+ # y = np.random.multivariate_normal(mean, cov, (3000,))
+
+ # mean = (10, 22)
+ # cov = np.array([[10, 5], [5, 10]])
+ # z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ # mean = (50, 60)
+ # cov = np.array([[10, 5], [5, 10]])
+ # z = np.random.multivariate_normal(mean, cov, (4000,))
+
+ # data, _ = shuffle(x, y, z)
+ # return (data[0: 8000], data[8000: 10000]), 4
+ mean = -20
+ cov = 1
+ x = np.random.normal(mean, cov, 800)
+
+ mean = 30
+ cov = 1
+ y = np.random.normal(mean, cov, 200)
+
+ mean = 80
+ cov = 1
+ z = np.random.normal(mean, cov, 1000)
+
+ mean = 140
+ cov = 1
+ a = np.random.normal(mean, cov, 1000)
+
+ plt.scatter(x, np.zeros(len(x)))
+ plt.scatter(y, np.zeros(len(y)))
+ plt.scatter(z, np.zeros(len(z)))
+ plt.scatter(a, np.zeros(len(a)))
+ plt.show()
+
+ data, _ = shuffle(x, y, z, a)
+ return (data[0:2500], data[2500:3000]), 4
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+ model = algorithm_class(n_clusters)
+ model.fit(train_data)
+ res = model.predict(test_data)
+ # displayres(test_data, res, "res")
+ # print(res, test_data)
+ assert len(
+ res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+ return res
+
+def test_without_giving_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+ model = algorithm_class()
+ model.fit(train_data, 10)
+ res = model.predict(test_data)
+ return model.cluster_num
+
+def testcase_1_1():
+ test_with_n_clusters(data_1, KMeans)
+ return True
+
+
+def testcase_1_2():
+ res = test_with_n_clusters(data_2, KMeans)
+ return res[0] != res[1] and res[1] == res[2]
+
+def testcase_1_3():
+ res = test_with_n_clusters(data_7, KMeans)
+ print(res)
+ return True
+
+
+def testcase_2_1():
+ test_with_n_clusters(data_1, GaussianMixture)
+ return True
+
+
+def testcase_2_2():
+ res = test_with_n_clusters(data_3, GaussianMixture)
+ return res[0] != res[1]
+
+def testcase_2_3():
+ res = test_with_n_clusters(data_4, GaussianMixture)
+ return True
+
+def testcase_2_4():
+ res = test_with_n_clusters(data_7, GaussianMixture)
+ return True
+
+def testcase_3_1():
+ res = test_without_giving_n_clusters(data_4, ClusteringAlgorithm)
+ return res == 3
+
+def test_all(err_report=False):
+ testcases = [
+ ["KMeans-1", testcase_1_1, 4],
+ ["KMeans-2", testcase_1_2, 4],
+ # ["KMeans-3", testcase_1_3, 4],
+ # ["KMeans-4", testcase_1_4, 4],
+ # ["KMeans-5", testcase_1_5, 4],
+ ["GMM-1", testcase_2_1, 4],
+ ["GMM-2", testcase_2_2, 4],
+ # ["GMM-3", testcase_2_3, 4],
+ ["GMM-4", testcase_2_4, 4],
+ # ["GMM-5", testcase_2_5, 4],
+ ["ClusteringAlgorithm", testcase_3_1, 4],
+ ]
+ sum_score = sum([case[2] for case in testcases])
+ score = 0
+ for case in testcases:
+ try:
+ res = case[2] if case[1]() else 0
+ except Exception as e:
+ if err_report:
+ print("Error [{}] occurs in {}".format(str(e), case[0]))
+ res = 0
+ score += res
+ print("+ {:14} {}/{}".format(case[0], res, case[2]))
+ print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+ if len(sys.argv) > 1 and sys.argv[1] == "--report":
+ test_all(True)
+ else:
+ test_all()