diff --git a/assignment-3/submission/18307130130/README.md b/assignment-3/submission/18307130130/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..caa6772ec9d59a0b3644c50aa7b9b932d6c30a5e
--- /dev/null
+++ b/assignment-3/submission/18307130130/README.md
@@ -0,0 +1,193 @@
+# Assignment-3 Report
+
+> Name: 李睿琛
+>
+> Student ID: 18307130130
+
+## 1. K-Means
+
+### 1.1 Algorithm
+
+Choose k initial centers; compute the distance from every sample to every center and assign each sample to the nearest one; recompute each cluster's center from its members; repeat until no sample changes cluster.
+
+**Choosing the initial centers**
+
+The initial centers strongly influence the final clustering: a poor choice easily yields a local rather than a global optimum. A farthest-point variant of K-Means++ is implemented here:
+
+* Pick one sample at random as the first center.
+* Compute the distance from every sample to the first center.
+* Take the sample with the largest distance as the second center.
+* Compute each sample's distance to its nearest already-chosen center, and take the sample with the largest such distance as the next center.
+* Repeat until k centers have been chosen.
+
+**Distance metric**
+
+Euclidean distance: `np.sqrt(np.sum(np.power(vecA - vecB, 2)))`
+
+### 1.2 Prediction
+
+A test sample is assigned to the class of its nearest cluster center. Stars mark the cluster centers.
+
+![](./img/clusterKMEANS.png)
+
+
+### 1.3 Output
+
+The number of samples whose cluster assignment changed in each iteration:
+
+```
+Samples whose cluster changed: 1049
+Samples whose cluster changed: 155
+Samples whose cluster changed: 133
+Samples whose cluster changed: 96
+Samples whose cluster changed: 60
+Samples whose cluster changed: 37
+```
+
+## 2. GMM
+
+A GMM can be seen as a generalization of K-Means: it models not only the mean of each cluster but also its covariance. It assumes the data are drawn from K Gaussian components, each carrying a weight; to generate a sample, a component is first chosen with probability proportional to its weight, and the sample is then drawn from that Gaussian. Fitting runs this process in reverse, estimating the parameters of the three Gaussians below from the data:
+
+```python
+# first cluster
+num1, mu1, var1 = 400, [0.5, 0.5], [1, 3]
+X1 = np.random.multivariate_normal(mu1, np.diag(var1), num1)
+# second cluster
+num2, mu2, var2 = 600, [5.5, 2.5], [2, 2]
+X2 = np.random.multivariate_normal(mu2, np.diag(var2), num2)
+# third cluster
+num3, mu3, var3 = 1000, [1, 7], [6, 2]
+X3 = np.random.multivariate_normal(mu3, np.diag(var3), num3)
+```
+
+### 2.1 Algorithm
+
+After initializing the means and covariances, run the EM algorithm: alternate the E-step and the M-step until the log-likelihood converges.
+
+* **E-step:**
+
+![](./img/estep.png)
+
+* **M-step:**
+
+![](./img/mstep.png)
+
+**Gaussian density function**
+
+```python
+def Gaussian(self, x, mean, cov):
+    """
+    Gaussian probability density function
+    :param x: input sample
+    :param mean: mean vector
+    :param cov: covariance matrix
+    :return: density of x
+    """
+    dim = np.shape(cov)[0]
+    # regularize cov in case its determinant is zero
+    covdet = np.linalg.det(cov + np.eye(dim) * 0.001)
+    covinv = np.linalg.inv(cov + np.eye(dim) * 0.001)
+    xdiff = (x - mean).reshape((1, dim))
+    # probability density
+    prob = 1.0/(np.power(np.power(2*np.pi, dim)*np.abs(covdet), 0.5))*\
+           np.exp(-0.5*xdiff.dot(covinv).dot(xdiff.T))[0][0]
+    return prob
+```
+
+**Log-likelihood convergence**
+
+```python
+while np.abs(loglikelyhood - oldloglikelyhood) > 0.0001:
+    ...
+    loglikelyhood = np.sum([np.log(np.sum([self.weights_[k]*self.Gaussian(train_data[n], self.means_[k], self.covariances_[k]) for k in range(self.k)])) for n in range(n_samples)])
+```
+
+**K-Means-based parameter initialization**
+
+On the one-dimensional data in `tester_demo`, randomly initializing the **means and covariances** gave poor clustering results, so K-Means is used for initialization instead.
+
+```python
+# random initialization, abandoned
+self.covariances_ = np.array([np.eye(d)] * self.k) * 0.1
+self.means_ = np.array([1.0 / self.k] * self.k)
+
+# initialize the means and covariances with K-Means
+Kmeansmodel = KMeans(self.k)
+Kmeansmodel.fit(train_data)
+result = Kmeansmodel.predict(train_data)
+# initialize the means
+self.means_ = Kmeansmodel.centroids
+self.covariances_ = []
+for i in range(self.k):
+    tmp = train_data[np.where(result == i)] - self.means_[i]
+    cov = np.eye(self.means_[0].shape[1])
+    cov_tmp = np.array(np.power(np.sum(np.multiply(tmp, tmp), axis=0), 0.5))
+    for index in range(len(cov_tmp[0])):
+        cov[index][index] = cov_tmp[0][index]
+    # initialize the covariance matrix
+    self.covariances_.append(cov)
+```
+
+### 2.2 Prediction
+
+For a test sample, compute its probability under each component; the most probable component is the predicted class.
+
+![](./img/clusterGMM.png)
+
+GMM relies on per-sample loops rather than vectorized matrix operations, so it is **slow.**
+
+
+### 2.3 Output
+
+The log-likelihood change in each iteration:
+
+```
+2  old likelihood:  -13877.619900915008  new likelihood:  -13447.877245238586
+3  old likelihood:  -13447.877245238586  new likelihood:  -13227.228798616086
+4  old likelihood:  -13227.228798616086  new likelihood:  -13172.244969227962
+5  old likelihood:  -13172.244969227962  new likelihood:  -13168.077341797663
+6  old likelihood:  -13168.077341797663  new likelihood:  -13166.698044582836
+```
+
+
+## 3. Choosing the number of clusters automatically
+
+The data are the same as for GMM.
+
+### 3.1 Elbow Method
+
+Elbow method formula:
+$$
+D_k = \sum _{i=1}^{K}\sum_{x_j\in C_i} dist(x_j, \mu_i)^2
+$$
+The turning point is the `Elbow`: while k is below the `Elbow`, each increment of k sharply reduces $$D_k$$; beyond it, $$D_k$$ changes far less.
+
+![](./img/Elbow1.png)
+
+On some datasets the elbow is not pronounced, in which case the BIC criterion can be used instead.
+
+### 3.2 BIC criterion
+
+BIC measures both goodness of fit and model complexity [1], where $$\hat l(D)$$ is the maximized log-likelihood:
+$$
+BIC(\phi) = \hat l(D)- {\frac{p_\phi}{2}} \times \log R
+$$
+
+![](./img/likelihood.png)
+
+Output: in (4, -9789.54526448437), 4 is the number of clusters and -9789.54526448437 the corresponding BIC value.
+
+```python
+[(4, -9789.54526448437), (3, -9811.33552800443), (5, -9958.834817656798), (6, -9993.632222469756), (7, -10090.665980345215), (8, -10132.64857344288), (2, -10138.346440921536), (9, -10151.237746415201)]
+```
+
+![](./img/bic.png)
+
+k=4 is therefore selected; since the BIC values for k=3 and k=4 are close, k=3 would also be acceptable. Stars mark the cluster centers:
+
+![](./img/clusterBIC.png)
+
+
+## 4. Reference
+
+[1] [Notes on Bayesian Information Criterion Calculation for X-means Clustering](https://github.com/bobhancock/goxmeans/blob/master/doc/BIC_notes.pdf)
\ No newline at end of file
diff --git a/assignment-3/submission/18307130130/img/Elbow1.png b/assignment-3/submission/18307130130/img/Elbow1.png
new file mode 100644
index 0000000000000000000000000000000000000000..53e73a90625febd25b49133034e2a31a928e39ab
Binary files /dev/null and b/assignment-3/submission/18307130130/img/Elbow1.png differ
diff --git a/assignment-3/submission/18307130130/img/bic.png b/assignment-3/submission/18307130130/img/bic.png
new file mode 100644
index 0000000000000000000000000000000000000000..27acefa521e918c245263317e451cce578b55b98
Binary files /dev/null and b/assignment-3/submission/18307130130/img/bic.png differ
diff --git a/assignment-3/submission/18307130130/img/clusterBIC.png b/assignment-3/submission/18307130130/img/clusterBIC.png
new file mode 100644
index 0000000000000000000000000000000000000000..bc1c9ae59cf12d257a6b0f406ddf185457d8b6bd
Binary files /dev/null and b/assignment-3/submission/18307130130/img/clusterBIC.png differ
diff --git a/assignment-3/submission/18307130130/img/clusterGMM.png b/assignment-3/submission/18307130130/img/clusterGMM.png
new file mode 100644
index 0000000000000000000000000000000000000000..088068d980f9551eeec13408bddf0f83246b39e9
Binary files /dev/null and b/assignment-3/submission/18307130130/img/clusterGMM.png differ
diff --git a/assignment-3/submission/18307130130/img/clusterKMEANS.png b/assignment-3/submission/18307130130/img/clusterKMEANS.png
new file mode 100644
index 0000000000000000000000000000000000000000..cf8519d6962ae98996f7c75617aadd307fcf3161
Binary files /dev/null and b/assignment-3/submission/18307130130/img/clusterKMEANS.png differ
diff --git a/assignment-3/submission/18307130130/img/estep.png b/assignment-3/submission/18307130130/img/estep.png
new file mode 100644
index 0000000000000000000000000000000000000000..6600c71dc49360cdf67edd94b64ebdd82c7c00a8
Binary files /dev/null and b/assignment-3/submission/18307130130/img/estep.png differ
diff --git a/assignment-3/submission/18307130130/img/likelihood.png b/assignment-3/submission/18307130130/img/likelihood.png
new file mode 100644
index 0000000000000000000000000000000000000000..cfd4d171f394496ca0240ea627e662c1a3876aae
Binary files /dev/null and b/assignment-3/submission/18307130130/img/likelihood.png differ
diff --git a/assignment-3/submission/18307130130/img/mstep.png b/assignment-3/submission/18307130130/img/mstep.png
new file mode 100644
index 0000000000000000000000000000000000000000..8b8eb8a37c253da0b3f602fe10c618ee52090daf
Binary files /dev/null and b/assignment-3/submission/18307130130/img/mstep.png differ
diff --git a/assignment-3/submission/18307130130/source.py b/assignment-3/submission/18307130130/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a8372f68cbaaa035fd564dfc32a18e4514e2403
--- /dev/null
+++ b/assignment-3/submission/18307130130/source.py
@@ -0,0 +1,364 @@
+import numpy as np
+import matplotlib.pyplot as plt
+from numpy.random import multivariate_normal
+plt.style.use('seaborn')
+from numpy.random import randint
+def randomname(N):
+    # random N-digit number used to name saved figures
+    x = 0
+    for i in range(N):
+        x = x * 10 + randint(0, 9)
+    return x
+
+def drawcenter(centers):
+    data = np.transpose(centers).tolist()
+    plt.scatter(data[0], data[1], marker="*", color='yellow', s=100)
+
+def drawpartition(clusters, centers):  # scatter plot of all clusters
+    k = len(clusters)
+    cmap = plt.cm.get_cmap("nipy_spectral", k+1)
+    for i in range(k):
+        data = np.transpose(clusters[i]).tolist()
+        plt.scatter(data[0], data[1], color=cmap(i), s=10)
+    if centers != []:
+        drawcenter(centers)
+    # plt.savefig("./img/%d.png" % randomname(5), dpi=400)
+    plt.show()
+
+# shuffle the data and separate the labels
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int)*i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+# Euclidean distance, returns a scalar
+def distEclud(vecA, vecB):
+    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
+
+# farthest-point (K-Means++-style) centroid seeding
+def createCent(data, k):
+    '''
+    Initialize the centroids.
+    :param data: data set
+    :param k: number of clusters
+    :return: k initial centroids
+    '''
+    centroids = []
+    # step 1: pick a random sample point as the first centroid
+    centroids.append(data[np.random.randint(data.shape[0]), :])
+    #plotCent(data, np.array(centroids))
+    # iterate k-1 more times
+    for c_id in range(k - 1):
+        dist = []
+        for i in range(data.shape[0]):  # loop over all points
+            point = data[i, :]
+            d = float('inf')
+            for j in range(len(centroids)):  # distance from this point to its nearest chosen centroid
+                temp_dist = distEclud(point, centroids[j])
+                d = min(d, temp_dist)
+            dist.append(d)
+        dist = np.array(dist)
+        next_centroid = data[np.argmax(dist), :]  # index of the farthest point, i.e. the i above
+        centroids.append(next_centroid)  # take it as the next centroid and continue
+        dist = []
+    #plotCent(data, np.array(centroids))
+    centroids = np.mat(centroids)
+    return centroids
+
+# compute the BIC of a fitted KMeans model with m clusters
+def BIC(kmeans, X):
+    centers = kmeans.centroids
+    labels = np.array(kmeans.clusterAssment[:, 0].reshape(1, -1))
+    labels = labels[0]
+    labels = labels.astype("int")
+    m = kmeans.k
+    n = np.bincount(labels)
+    N, d = X.shape
+    cl_var = (1.0 / (N - m) / d) * sum([distEclud(X[np.where(labels == i)], centers[i]) ** 2 for i in range(m)])
+    const_term = 0.5 * m * np.log(N) * (d+1)
+    bic = np.sum([n[i] * np.log(n[i]) - n[i] * np.log(N) -
+                  ((n[i] * d) / 2) * np.log(2 * np.pi * cl_var) -
+                  ((n[i] - 1) * d / 2) for i in range(m)]) - const_term
+    return bic
+
+class KMeans:
+
+    def __init__(self, n_clusters):
+        self.k = n_clusters          # number of clusters
+        self.n_samples = 0           # number of samples
+        self.centroids = None        # cluster centers
+        self.clusterAssment = None   # per-sample cluster index and squared distance to its center
+
+    def fit(self, train_data):
+
+        self.n_samples = train_data.shape[0]
+        # create the k initial centroid vectors
+        self.centroids = createCent(train_data, k=self.k)
+        # (n_samples, 2) zero array recording each sample's cluster and squared distance
+        self.clusterAssment = np.zeros((self.n_samples, 2))
+
+        # stop once no sample changes cluster
+        clusterChanged = True
+        while clusterChanged:
+            cnt = 0
+            clusterChanged = False
+
+            # loop over every sample vector
+            for i in range(self.n_samples):
+                minDist = float('inf')
+                minIndex = -1
+                # loop over the k centroids
+                for j in range(self.k):
+                    distJI = distEclud(self.centroids[j], train_data[i])
+                    if distJI < minDist:
+                        minDist = distJI
+                        minIndex = j
+                # sample i changed cluster: keep iterating
+                if self.clusterAssment[i, 0] != minIndex:
+                    clusterChanged = True
+                    cnt += 1
+                    # record the changed sample's new assignment and squared error
+                    self.clusterAssment[i, :] = minIndex, minDist**2
+            print("Samples whose cluster changed:", cnt)
+
+            # recompute the cluster centers
+            for i in range(self.k):
+                # select the samples currently assigned to centroid i
+                ptsInClust = train_data[self.clusterAssment[:, 0] == i]
+                # their mean becomes the new center
+                self.centroids[i,:] = np.mean(ptsInClust, axis=0)
+
+    def predict(self, test_data):
+        length = test_data.shape[0]
+        res = []
+        # loop over the test samples
+        for i in range(length):
+            minDist = float('inf')
+            minIndex = -1
+            # the nearest cluster center gives the class
+            for j in range(self.k):
+                distJI = distEclud(self.centroids[j], test_data[i])
+                if distJI < minDist:
+                    minDist = distJI
+                    minIndex = j
+            res.append(minIndex)
+        res = np.array(res)
+        return res
+
+class GaussianMixture:
+
+    def __init__(self, n_clusters, reg_covar: float = 1e-06, max_iter: int = 100):
+        self.k = n_clusters
+        self.means_ = None        # component means
+        self.covariances_ = None  # component covariance matrices
+        self.weights_ = None      # component weights
+        self.reg_covar = reg_covar  # regularization against singular covariance matrices
+
+    def Gaussian(self, x, mean, cov):
+        """
+        Gaussian probability density function
+        :param x: input sample
+        :param mean: mean vector
+        :param cov: covariance matrix
+        :return: density of x
+        """
+        dim = np.shape(cov)[0]
+        # regularize cov in case its determinant is zero
+        covdet = np.linalg.det(cov + np.eye(dim) * 0.001)
+        covinv = np.linalg.inv(cov + np.eye(dim) * 0.001)
+        xdiff = (x - mean).reshape((1, dim))
+        # probability density
+        prob = 1.0/(np.power(np.power(2*np.pi, dim)*np.abs(covdet), 0.5))*\
+            np.exp(-0.5*xdiff.dot(covinv).dot(xdiff.T))[0][0]
+        return prob
+
+    def fit(self, train_data):
+        n_samples, n_feature = train_data.shape
+        self.reg_covar = self.reg_covar * np.identity(n_feature)
+        self.weights_ = np.random.rand(self.k)
+        self.weights_ /= np.sum(self.weights_)
+        P_mat = np.zeros((n_samples, self.k))
+
+        # initialize the means and covariances with K-Means
+        Kmeansmodel = KMeans(self.k)
+        Kmeansmodel.fit(train_data)
+        result = Kmeansmodel.predict(train_data)
+        self.means_ = Kmeansmodel.centroids
+        self.covariances_ = []
+        for i in range(self.k):
+            # build a diagonal covariance matrix from each K-Means cluster's spread
+            tmp = np.array(train_data[np.where(result == i)] - self.means_[i])
+            cov = np.eye(n_feature)
+            cov_tmp = np.power(np.sum(tmp**2, axis=0), 0.5)
+            for index in range(len(cov_tmp)):
+                cov[index][index] = cov_tmp[index]
+            self.covariances_.append(cov)
+
+        loglikelyhood = 0
+        oldloglikelyhood = 1
+        cnt = 0
+        while np.abs(loglikelyhood - oldloglikelyhood) > 0.0001:
+            cnt += 1
+            oldloglikelyhood = loglikelyhood
+            #### E-step: compute the responsibilities ####
+            for n in range(n_samples):
+                response = [self.weights_[i] * self.Gaussian(train_data[n], self.means_[i], self.covariances_[i])
+                            for i in range(self.k)]
+                response = np.array(response)
+                sum_response = np.sum(response)
+                sum_response = self.k if sum_response == 0 else sum_response
+                P_mat[n] = (response / sum_response).reshape(P_mat[n].shape)
+            #### M-step: update the parameters ####
+            for j in range(self.k):
+                # nk: effective number of samples belonging to component j
+                nk = np.sum([P_mat[n][j] for n in range(n_samples)])
+                # update the component weight
+                self.weights_[j] = 1.0 * nk / n_samples
+                # update the component mean
+                self.means_[j] = (1.0/nk) * np.sum([P_mat[n][j] * train_data[n] for n in range(n_samples)], axis=0)
+                xdiffs = train_data - self.means_[j]
+                # update the component covariance matrix
+                self.covariances_[j] = (1.0/nk)*np.sum([P_mat[n][j]*xdiffs[n].reshape((n_feature,1)).dot(xdiffs[n].reshape((1,n_feature))) for n in range(n_samples)], axis=0)
+            # compute the log-likelihood
+            loglikelyhood = np.sum(
+                [np.log(np.sum([self.weights_[k]*self.Gaussian(train_data[n], self.means_[k]
+                , self.covariances_[k]) for k in range(self.k)])) for n in range(n_samples)])
+            print(cnt, " old likelihood: ", oldloglikelyhood, " new likelihood: ", loglikelyhood)
+
+    def predict(self, test_data):
+        n_samples = test_data.shape[0]
+        P_mat = np.zeros((test_data.shape[0], self.k))
+
+        # assign each sample to the most probable Gaussian component
+        for n in range(n_samples):
+            response = [self.weights_[i] * self.Gaussian(test_data[n], self.means_[i], self.covariances_[i])
+                        for i in range(self.k)]
+            response = np.array(response)
+            sum_response = np.sum(response)
+            P_mat[n] = (response / sum_response).reshape(P_mat[n].shape)
+        return np.argmax(P_mat, axis=1)
+
+class ClusteringAlgorithm:
+
+    def __init__(self, kmax=10):
+        self.kmax = kmax  # upper bound of the search over k
+        self.k = 0        # selected number of clusters
+
+    def fit(self, train_data):
+        self.bic_list = dict()  # BIC value for each candidate k
+        self.elbow = []         # distortions for the elbow method
+        for i in range(2, self.kmax):
+            # fit a KMeans model with k = i
+            kmeans = KMeans(i)
+            kmeans.fit(train_data)
+            # compute its BIC
+            bic = BIC(kmeans, train_data)
+            self.bic_list[i] = bic
+            # total within-cluster squared distance for the elbow method
+            Sum = 0
+            labels = kmeans.clusterAssment[:, 0]
+            labels = labels.astype("int")
+            for index in range(len(train_data)):
+                Sum += distEclud(kmeans.centroids[labels[index]], train_data[index]) ** 2
+            self.elbow.append(Sum)
+        # pick the k with the highest BIC
+        self.bic_list = sorted(self.bic_list.items(), key=lambda x: x[1], reverse=True)
+        self.k = self.bic_list[0][0]
+
+    def predict(self, test_data):
+        # refit and predict with the selected k
+        kmeans = KMeans(self.k)
+        kmeans.fit(test_data)
+        self.centroids = kmeans.centroids
+        return kmeans.predict(test_data)
+
+def generate_data1():
+    mean = (0, 0)
+    cov = np.array([[10, 0], [0, 10]])
+    x = np.random.multivariate_normal(mean, cov, (3500,))
+
+    mean = (-10, 10)
+    cov = np.array([[10, 0], [0, 10]])
+    y = np.random.multivariate_normal(mean, cov, (750,))
+
+    mean = (10, 10)
+    cov = np.array([[10, 0], [0, 10]])
+    z = np.random.multivariate_normal(mean, cov, (750,))
+
+    mean = (-10, -10)
+    cov = np.array([[10, 0], [0, 10]])
+    w = np.random.multivariate_normal(mean, cov, (750,))
+
+    mean = (10, -10)
+    cov = np.array([[10, 0], [0, 10]])
+    t = np.random.multivariate_normal(mean, cov, (750,))
+    data, _ = shuffle(x, y, z, w, t)
+    return data
+
+def generate_data2():
+    mean = (0, 0, 0)
+    cov = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
+    x = np.random.multivariate_normal(mean, cov, (500,))
+
+    mean = (0, 10, 10)
+    cov = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
+    y = np.random.multivariate_normal(mean, cov, (500,))
+
+    mean = (0, 10, -10)
+    cov = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
+    z = np.random.multivariate_normal(mean, cov, (500,))
+
+    mean = (0, -10, 10)
+    cov = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
+    w = np.random.multivariate_normal(mean, cov, (500,))
+
+    mean = (0, -10, -10)
+    cov = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10]])
+    t = np.random.multivariate_normal(mean, cov, (500,))
+    data, _ = shuffle(x, y, z, w, t)
+    return data
+
+def generate_data3():
+    # first cluster
+    num1, mu1, var1 = 400, [0.5, 0.5], [1, 3]
+    X1 = np.random.multivariate_normal(mu1, np.diag(var1), num1)
+    # second cluster
+    num2, mu2, var2 = 600, [5.5, 2.5], [2, 2]
+    X2 = np.random.multivariate_normal(mu2, np.diag(var2), num2)
+    # third cluster
+    num3, mu3, var3 = 1000, [1, 7], [6, 2]
+    X3 = np.random.multivariate_normal(mu3, np.diag(var3), num3)
+
+    data, _ = shuffle(X1, X2, X3)
+    return data
+
+if __name__ == "__main__":
+    # exercise KMeans
+    data = generate_data2()
+    model = KMeans(5)
+    model.fit(data)
+    res = model.predict(data)
+    clusters = [data[np.where(res == i)] for i in range(model.k)]
+
+    # only applicable to 2-D data
+    # drawpartition(clusters, model.centroids)
+
+    # exercise ClusteringAlgorithm
+    data = generate_data3()
+    model = ClusteringAlgorithm(10)
+    model.fit(data)
+    res = model.predict(data)
+    plt.plot(range(2, 2+len(model.elbow)), model.elbow, 'o-', color='b')  # 'o-': circle markers joined by lines
+    plt.savefig("./img/%d.jpg" % randomname(5), dpi=400)
+    plt.show()
+    clusters = [data[np.where(res == i)] for i in range(model.k)]
+    # drawpartition(clusters, model.centroids)
\ No newline at end of file
diff --git a/assignment-3/submission/18307130130/tester_demo.py b/assignment-3/submission/18307130130/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ec0e8091691d4aaaa6b53dbb695fde9e826d89
--- /dev/null
+++ b/assignment-3/submission/18307130130/tester_demo.py
@@ -0,0 +1,117 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int)*i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+def data_1():
+    mean = (1, 2)
+    cov = np.array([[73, 0], [0, 22]])
+    x = np.random.multivariate_normal(mean, cov, (800,))
+
+    mean = (16, -5)
+    cov = np.array([[21.2, 0], [0, 32.1]])
+    y = np.random.multivariate_normal(mean, cov, (200,))
+
+    mean = (10, 22)
+    cov = np.array([[10, 5], [5, 10]])
+    z = np.random.multivariate_normal(mean, cov, (1000,))
+
+    data, _ = shuffle(x, y, z)
+    return (data, data), 3
+
+
+def data_2():
+    train_data = np.array([
+        [23, 12, 173, 2134],
+        [99, -12, -126, -31],
+        [55, -145, -123, -342],
+    ])
+    return (train_data, train_data), 2
+
+
+def data_3():
+    train_data = np.array([
+        [23],
+        [-2999],
+        [-2955],
+    ])
+    return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+    model = algorithm_class(n_clusters)
+    model.fit(train_data)
+    res = model.predict(test_data)
+    assert len(res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+    return res
+
+
+def testcase_1_1():
+    test_with_n_clusters(data_1, KMeans)
+    return True
+
+
+def testcase_1_2():
+    res = test_with_n_clusters(data_2, KMeans)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+    test_with_n_clusters(data_1, GaussianMixture)
+    return True
+
+
+def testcase_2_2():
+    res = test_with_n_clusters(data_3, GaussianMixture)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+    testcases = [
+        ["KMeans-1", testcase_1_1, 4],
+        ["KMeans-2", testcase_1_2, 4],
+        # ["KMeans-3", testcase_1_3, 4],
+        # ["KMeans-4", testcase_1_4, 4],
+        # ["KMeans-5", testcase_1_5, 4],
+        ["GMM-1", testcase_2_1, 4],
+        ["GMM-2", testcase_2_2, 4],
+        # ["GMM-3", testcase_2_3, 4],
+        # ["GMM-4", testcase_2_4, 4],
+        # ["GMM-5", testcase_2_5, 4],
+    ]
+    sum_score = sum([case[2] for case in testcases])
+    score = 0
+    for case in testcases:
+        try:
+            res = case[2] if case[1]() else 0
+        except Exception as e:
+            if err_report:
+                print("Error [{}] occurs in {}".format(str(e), case[0]))
+            res = 0
+        score += res
+        print("+ {:14} {}/{}".format(case[0], res, case[2]))
+    print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "--report":
+        test_all(True)
+    else:
+        test_all()
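
The farthest-point seeding described in section 1.1 of the README and implemented by `createCent` can also be checked in isolation. Below is a minimal standalone sketch of the same maximin idea; the function name `farthest_point_seeds` and the toy data are illustrative only, not part of the submission above:

```python
import numpy as np

def farthest_point_seeds(data, k, rng=None):
    """Seed k cluster centers: pick one sample at random, then repeatedly
    take the sample farthest from its nearest already-chosen center
    (the maximin variant of K-Means++ seeding from section 1.1)."""
    rng = np.random.default_rng(rng)
    centers = [data[rng.integers(data.shape[0])]]
    for _ in range(k - 1):
        # distance from every sample to its nearest chosen center
        d = np.linalg.norm(data[:, None, :] - np.asarray(centers)[None, :, :],
                           axis=2).min(axis=1)
        centers.append(data[np.argmax(d)])
    return np.asarray(centers)
```

On well-separated points this picks one seed per cluster regardless of which sample is drawn first, which is the spreading behaviour `createCent` relies on to avoid poor local optima.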