The goal of the K-Means algorithm is to partition n data points into k clusters (where k is a fixed, pre-specified number) so that each point belongs to the cluster whose mean (i.e., cluster center) is nearest to it. This makes points within a cluster as similar as possible (minimizing within-cluster variance) while keeping points in different clusters as dissimilar as possible.
In mathematical terms:
Suppose the data set consists of n data points, written $\{x_1, x_2, \ldots, x_n\}$, where each data point $x_i$ is a d-dimensional real vector. K-Means seeks a set of k clusters $C = \{C_1, C_2, \ldots, C_k\}$ that minimizes the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

Here $\lVert x_i - \mu_j \rVert^2$ is the squared Euclidean distance between data point $x_i$ and its cluster center $\mu_j$, and the cluster center $\mu_j$ is the mean of all points in cluster $C_j$.
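As a quick sanity check of this formula, here is a minimal sketch that computes the WCSS directly from the definition (the function and variable names are illustrative, not part of the code used later on this page):

```python
import numpy as np

# For each cluster, sum the squared Euclidean distances of its points
# to the cluster mean (the centroid mu_j).
def wcss(points, labels, k):
    total = 0.0
    for j in range(k):
        members = points[labels == j]
        center = members.mean(axis=0)             # centroid mu_j
        total += ((members - center) ** 2).sum()  # squared distances to mu_j
    return total

# Tiny worked example: two obvious clusters on a line.
# Centroids are 1 and 11, so WCSS = 1 + 1 + 1 + 1 = 4.0
pts = np.array([[0.0], [2.0], [10.0], [12.0]])
lbl = np.array([0, 0, 1, 1])
print(wcss(pts, lbl, 2))
```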
The algorithm proceeds as follows:
1. Choose k initial centroids (for example, at random).
2. Assignment step: assign each point to the cluster of its nearest centroid.
3. Update step: recompute each centroid as the mean of the points currently assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached).
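The steps above can be sketched in a few lines of NumPy (a simplified illustration with hypothetical names; it does not handle the empty-cluster edge case):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2 (assignment): squared distances, shape (n_points, k)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 3 (update): move each centroid to the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups: one centroid ends up near x=0, one near x=10
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans_sketch(X, 2)
print(centroids.tolist())
```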
The main assumptions k-means makes about the data are:
- Clusters are roughly spherical (isotropic), so Euclidean distance is a meaningful measure of similarity.
- Clusters are of similar size and variance.
- Features are on comparable scales, since the distance computation weights all dimensions equally.
Because of these assumptions, the data usually need preprocessing to improve k-means performance: standardize the features onto comparable scales, and consider removing extreme outliers, which can drag centroids away from the bulk of a cluster.
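For instance, z-score standardization (zero mean, unit variance per feature) keeps a large-scale feature from dominating the distance computation. A minimal sketch with plain NumPy and made-up numbers:

```python
import numpy as np

# Two features on very different scales distort Euclidean distance,
# so we standardize each column to mean 0 and standard deviation 1.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # and standard deviation 1
```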
This section focuses on the elbow method.
The elbow method takes its name from the shape of its plot. We pick the number of clusters with the help of a chart whose horizontal axis is the number of clusters and whose vertical axis is the sum of squared distances from the points in each cluster to their centroid (a measure of how tight the clustering is), called the within-cluster sum of squares (WCSS).
At the start, all the data form a single cluster (k = 1), and the points are far from their centroid because everything is lumped together. As the number of clusters grows, each cluster contains fewer points, and the distances from the points to their centroids naturally shrink, so the value on the vertical axis (the WCSS) begins to fall.
As you keep increasing the number of clusters, however, the improvement slows down and the downward trend of the curve flattens out, like the bend of an elbow. This "elbow point" usually means that adding more clusters no longer brings a significant improvement; in other words, splitting the data further will not noticeably raise the quality of the clustering.
So the elbow method tells us to find that elbow point on the chart, beyond which increasing the number of clusters yields sharply diminishing returns. The elbow point is usually taken as the best number of clusters.
The concrete steps for choosing the number of clusters with the elbow method are:
Step 1: Initialization
Choose a range of cluster counts, e.g. from 1 up to 10 or more, depending on the size and characteristics of the dataset.
Step 2: Run k-means for each cluster count
For each k in your range, run the k-means algorithm. The steps are the same as in [the definition and principles of K-Means] on this page.
After each run, compute the within-cluster sum of squares for that run:

$$\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

Here $C_j$ is the $j$-th cluster, $\mu_j$ is the centroid of $C_j$, $x$ is a data point in $C_j$, and $\lVert x - \mu_j \rVert$ is the Euclidean distance from $x$ to its centroid $\mu_j$.
Step 3: Plot the elbow curve
Create a chart with the number of clusters k on the horizontal axis and the corresponding WCSS on the vertical axis. As k increases, the WCSS typically decreases, because with more centroids every point ends up closer to some centroid.
Step 4: Find the elbow point
Look at the chart for the point where the curve starts to flatten, i.e. where the rate at which the WCSS decreases slows down markedly. This point resembles a human elbow, hence the name.
Step 5: Choose the number of clusters
Take the k at the elbow point as the final number of clusters. It is usually a good trade-off: beyond this point, adding more clusters does not significantly improve the clustering.
The elbow method is not always conclusive. In some cases the curve is very smooth and the elbow is hard to pin down; it may then need to be combined with other methods or with domain knowledge to determine the best number of clusters.
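When eyeballing the curve is hard, one rough programmatic heuristic is to pick the k at the sharpest bend, i.e. the largest second difference of the WCSS curve. This is only a sketch with illustrative numbers (the function name and WCSS values are made up), and it should not replace visual inspection or domain knowledge:

```python
import numpy as np

# Pick the k where the WCSS curve bends most sharply, using the
# discrete second difference as a crude curvature proxy.
def elbow_by_second_diff(ks, wcss):
    d2 = np.diff(wcss, n=2)            # second differences of the curve
    return ks[int(np.argmax(d2)) + 1]  # +1: each diff shortens the array

ks = list(range(1, 8))
wcss = [100, 60, 25, 22, 20, 19, 18]   # illustrative curve with an elbow at k=3
print(elbow_by_second_diff(ks, wcss))
```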
Reference code:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Suppose we have some data points in 2-D space;
# here we generate random data as an example
np.random.seed(42)
data = np.random.rand(200, 2)

# Choose the range of k, e.g. from 1 to 10
K_range = range(1, 11)

# List for the within-cluster sums of squares
inertias = []

# Run k-means for each k and record its WCSS
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(data)
    # Append this run's WCSS to the list
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow method')
plt.show()
Output:
See the evaluation-metrics section.
By analyzing the original dataset testSet_kmeans.txt over several test runs, we found that setting the number of clusters to 4 (k = 4) is reasonable.
Reference code (the function implementations and the dataset are in the [Code] section of this page; for readability, only the main function is shown here):
if __name__ == '__main__':
    data_mat = mat(load_data("testSet_kmeans.txt"))
    centroid, cluster_assment = kMeans(data_mat, 4)
    wcss = sum(cluster_assment[:, 1])
    print("wcss is ", wcss)
    plot_cluster(data_mat, cluster_assment, centroid)
Output:
wcss is 390.49592193262447
This looks like a good clustering, but for some data the result can be much worse. Recall the WCSS formula:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

To get a good clustering we want centroids that minimize the WCSS, but this objective is non-convex, which means it has local optima: the algorithm may converge to a local optimum rather than the global one (see the detailed discussion of convex functions).
As shown in the figure below:
The function has two local optima, so different choices of initial centroids lead to different final clusterings.
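This sensitivity to initialization can be demonstrated directly. In the hedged sketch below (the data and the two `init` arrays are made up for illustration), the same data are clustered twice with scikit-learn's `KMeans`, once from a sensible start and once from a deliberately bad one; the bad start converges to a local optimum with a much larger WCSS:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated square clusters of four points each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 0], [10, 1], [11, 0], [11, 1],
              [0, 10], [0, 11], [1, 10], [1, 11]], dtype=float)

# One starting centroid per true cluster...
good_init = np.array([[0.5, 0.5], [10.5, 0.5], [0.5, 10.5]])
# ...versus two centroids crowded into the same cluster.
bad_init = np.array([[0.5, 0.5], [0.4, 0.6], [5.0, 5.0]])

# n_init=1 so each run uses exactly the given starting centroids.
good = KMeans(n_clusters=3, init=good_init, n_init=1).fit(X)
bad = KMeans(n_clusters=3, init=bad_init, n_init=1).fit(X)
print(good.inertia_, bad.inertia_)  # the bad start yields a larger WCSS
```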
Next, let us look at a concrete example using the dataset testSet2_kmeans.txt.
Reference code:
if __name__ == '__main__':
    data_mat = mat(load_data("testSet2_kmeans.txt"))
    centroid, cluster_assment = kMeans(data_mat, 3)
    wcss = sum(cluster_assment[:, 1])
    print("wcss is ", wcss)
    plot_cluster(data_mat, cluster_assment, centroid)
    test_diff_k()
Output:
wcss is 653.712046244568
Because K-Means is sensitive to the positions of the initial centroids, and the initial centroids are chosen at random, the clustering is sometimes poor, as in the first figure: the data clearly consist of roughly 3 clusters, yet K-Means fails to separate them well.
The choice of the cluster count k likewise directly affects the result.
Moreover, K-Means works well on convex data, where it can partition the points into roughly spherical clusters by distance, but it is helpless on non-convexly shaped data. Let us see what happens when k-means is run on interleaved, crescent-shaped data.
Reference code:
if __name__ == '__main__':
    data_mat, c = make_moons(n_samples=1000, noise=0.1)
    centroid, cluster_assment = kMeans(data_mat, 2)
    wcss = sum(cluster_assment[:, 1])
    print("wcss is ", wcss)
    plot_cluster(data_mat, cluster_assment, centroid)
Output:
wcss is 265.7778014831852
Clearly, the clustering is poor.
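For non-convex shapes like these, density-based clustering is a common alternative. As a hedged sketch (the `eps` and `min_samples` values are assumptions tuned for this synthetic data, not universal settings), DBSCAN separates the two moons that k-means cannot:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: non-convex clusters that defeat k-means.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

# DBSCAN groups points by density instead of distance to a centroid,
# so it can follow arbitrarily shaped clusters.
db = DBSCAN(eps=0.25, min_samples=5).fit(X)
labels = db.labels_

# Label -1 marks noise points, so exclude it when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```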
The code here is based on the k-means source code (not written by this site's development team).
The dataset used is
# k_means.py
# -*- coding: utf-8 -*-
# @Author: huzhu
# @Date: 2019-10-29 09:31:43
# @Last Modified by: huzhu
# @Last Modified time: 2019-11-14 21:14:20

import codecs
from numpy import *
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import make_moons
import matplotlib.animation as animation
from sklearn.cluster import KMeans

# Load a tab-separated data file
def load_data(path):
    """
    @brief Loads a data set.
    @param path The path
    @return data set
    """
    data_set = list()
    with codecs.open(path) as f:
        for line in f.readlines():
            data = line.strip().split("\t")
            flt_data = list(map(float, data))
            data_set.append(flt_data)
    return data_set

# Euclidean distance between two vectors
def dist_eucl(vecA, vecB):
    """
    @brief the similarity function
    @param vecA The vector a
    @param vecB The vector b
    @return the euclidean distance
    """
    return sqrt(sum(power(vecA - vecB, 2)))

# Generate random initial centroids inside the data's bounding box
def rand_cent(data_mat, k):
    """
    @brief select random centroids
    @param data_mat The data matrix
    @param k The number of centroids
    @return centroids
    """
    n = shape(data_mat)[0:2][1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = data_mat[:, j].min()
        rangeJ = float(data_mat[:, j].max() - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

# Implementation of the K-Means algorithm
def kMeans(data_mat, k, dist=dist_eucl, create_cent=rand_cent):
    """
    @brief kMeans algorithm
    @param data_mat The data matrix
    @param k The number of clusters
    @param dist The distance function
    @param create_cent The centroid-initialization function
    @return the centroids and the cluster assignment
    """
    m = shape(data_mat)[0]
    # Cluster assignment: (cluster index, squared distance) for each point
    cluster_assment = mat(zeros((m, 2)))
    # Randomly initialize the centroids
    centroid = create_cent(data_mat, k)
    cluster_changed = True
    # Iterate until no point changes cluster
    while cluster_changed:
        cluster_changed = False
        for i in range(m):
            min_index = -1
            min_dist = inf
            for j in range(k):
                distance = dist(data_mat[i, :], centroid[j, :])
                if distance < min_dist:
                    min_dist = distance
                    min_index = j
            if cluster_assment[i, 0] != min_index:
                cluster_changed = True
            cluster_assment[i, :] = min_index, min_dist**2
        # Recompute each centroid as the mean of its cluster's points
        for j in range(k):
            per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == j)[0]]
            centroid[j, :] = mean(per_data_set, axis=0)
    return centroid, cluster_assment

# Plot the clustering result
def plot_cluster(data_mat, cluster_assment, centroid):
    """
    @brief plot cluster and centroid
    @param data_mat The data matrix
    @param cluster_assment The cluster assignment
    @param centroid The centroid
    @return
    """
    plt.figure(figsize=(15, 6), dpi=80)
    plt.subplot(121)
    plt.plot(data_mat[:, 0], data_mat[:, 1], 'o')
    plt.title("source data", fontsize=15)
    plt.subplot(122)
    k = shape(centroid)[0]
    colors = [plt.cm.get_cmap("Spectral")(each) for each in linspace(0, 1, k)]
    for i, col in zip(range(k), colors):
        per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == i)[0]]
        plt.plot(per_data_set[:, 0], per_data_set[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=10)
    plt.plot(centroid[:, 0], centroid[:, 1], '+', color='k', markersize=18)
    plt.title("K-Means Cluster, k =" + str(k), fontsize=15)
    plt.show()

# Plot the non-convex objective surface
def plot_noncov():
    """
    @brief plot the non-convex objective surface
    """
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')  # fixed: add_subplot, not sub_plot
    x1 = linspace(-2, 2, 100)
    x2 = linspace(-2, 2, 100)
    mu1 = array([1, 1])
    mu2 = array([-1, -1])
    Z = zeros((len(x1), len(x2)))
    for i in range(len(x1)):
        for j in range(len(x2)):
            itemx = x1[i]
            itemy = x2[j]
            z1 = dist_eucl(mu1, [itemx, itemy])
            z2 = dist_eucl(mu2, [itemx, itemy])
            Z[i, j] = min(z1, z2)
    X1, X2 = meshgrid(x1, x2)
    ax.plot_surface(X1, X2, Z, rstride=1, cstride=1, cmap='rainbow')
    plt.show()

# Test the effect of different k values on the clustering result
def test_diff_k():
    plt.figure(figsize=(15, 4), dpi=80)
    data_mat = mat(load_data("data/testSet2_kmeans.txt"))
    for idx, k_val in enumerate([2, 3, 4]):
        centroid, cluster_assment = kMeans(data_mat, k_val)
        plt.subplot(1, 3, idx + 1)
        colors = [plt.cm.Spectral(each) for each in linspace(0, 1, k_val)]
        for i, col in zip(range(k_val), colors):
            per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == i)[0]]
            plt.plot(per_data_set[:, 0], per_data_set[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=10)
        plt.plot(centroid[:, 0], centroid[:, 1], '+', color='k', markersize=18)
        plt.title("K-Means Cluster, k = " + str(k_val), fontsize=15)
    plt.show()

# Draw and save a GIF that shows the clustering process
def plot_fig(data_mat):
    """
    @brief draw and save an animated gif of the clustering process
    @param data_mat The data matrix
    """
    centroid_list = list()
    cluster_assment_list = list()

    # Same as kMeans, but records the centroids and assignments at
    # every iteration so that they can be animated
    def sub_kMeans(data_mat, k, dist=dist_eucl, create_cent=rand_cent):
        m = shape(data_mat)[0]
        cluster_assment = mat(zeros((m, 2)))  # (cluster index, squared distance)
        centroid = create_cent(data_mat, k)
        cluster_changed = True
        while cluster_changed:
            centroid_list.append(array(centroid))
            cluster_assment_list.append(array(cluster_assment))
            cluster_changed = False
            for i in range(m):
                min_index = -1
                min_dist = inf
                for j in range(k):
                    distance = dist(data_mat[i, :], centroid[j, :])
                    if distance < min_dist:
                        min_dist = distance
                        min_index = j
                if cluster_assment[i, 0] != min_index:
                    cluster_changed = True
                cluster_assment[i, :] = min_index, min_dist**2
            # Recompute each centroid as the mean of its cluster's points
            for j in range(k):
                per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == j)[0]]
                centroid[j, :] = mean(per_data_set, axis=0)
        return centroid_list, cluster_assment_list

    centroid_list, cluster_assment_list = sub_kMeans(data_mat, 4)

    fig, ax = plt.subplots()
    plt.scatter(data_mat[:, 0].flatten().A[0], data_mat[:, 1].flatten().A[0])
    plt.title("K-Means Cluster Process", fontsize=15)

    def update(frame):
        try:
            ax.lines.pop()
        except Exception:
            pass
        centroid = matrix(centroid_list[frame])
        cluster_assment = matrix(cluster_assment_list[frame])
        k = shape(centroid)[0]
        colors = [plt.cm.Spectral(each) for each in linspace(0, 1, k)]
        for i, col in zip(range(k), colors):
            per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == i)[0]]
            line, = plt.plot(per_data_set[:, 0], per_data_set[:, 1], 'o', markerfacecolor=tuple(col),
                             markeredgecolor='k', markersize=10)
        line, = plt.plot(centroid[:, 0], centroid[:, 1], '*', color='k', markersize=18)
        return line,

    anim = animation.FuncAnimation(fig, update, frames=len(centroid_list),
                                   interval=1000, repeat_delay=1000)
    plt.show()
    anim.save('test_animation.gif', writer='pillow')

# Cluster analysis with sklearn's KMeans
def kmeans_lib():
    data_mat = mat(load_data("testSet2_kmeans.txt"))
    estimator = KMeans(n_clusters=3)        # build the estimator
    estimator.fit(data_mat)                 # cluster
    label_pred = estimator.labels_          # cluster labels
    print(label_pred)
    centroids = estimator.cluster_centers_  # cluster centers
    inertia = estimator.inertia_            # total within-cluster sum of squares
    # Transpose so labels form a column vector, as plot_cluster expects
    plot_cluster(data_mat, mat(label_pred).T, centroids)
    print(centroids)
    print(inertia)
The code above defines all the functions. In the main function, we call them to cluster the data and verify our claims.
Of course, in an actual competition you do not need to implement a k-means algorithm yourself, because you can simply call the third-party library sklearn. Below we again use the dataset testSet_kmeans.txt and run k-means clustering on it.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the dataset
X = np.loadtxt('testSet_kmeans.txt')

# Use the elbow method to find a good k
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()

# Choose k from the elbow plot
# (inspect the plot and decide on k yourself)
k = 4  # the best k observed here is 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Compute the silhouette score
silhouette_avg = silhouette_score(X, y_kmeans)
print(f"For k={k}, the silhouette score is: {silhouette_avg}")

# Visualize the clustering result
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

plt.title(f'K-Means Clustering with K={k}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Output:
For k=4, the silhouette score is: 0.6558213071798628
Advantages:
- Simple to understand and implement, and fast enough to scale to large datasets.
- Works well when the clusters are convex and roughly spherical.
Disadvantages:
- The number of clusters k must be specified in advance.
- Sensitive to the random initial centroids, so it may converge to a local optimum.
- Performs poorly on non-convex cluster shapes.
- Sensitive to feature scales and to outliers.