Now that you understand the k-means algorithm, let's talk about k-means++, which is an improvement to k-means specifically in how the initial cluster centers are chosen.
Imagine a pile of colored balloons scattered on the floor, and you want to divide them into a few groups so that the colors within each group are as similar as possible. With k-means, you might pick a few balloons at random as each group's representative (called a "centroid"), assign every balloon to a group based on these representatives, and then keep adjusting the representatives' positions until you arrive at the best grouping.
But if the initial representatives are chosen poorly, you may find yourself adjusting many times before reaching a satisfactory grouping. This is where k-means++ comes in: it provides a smarter way to choose the starting representatives.
The core idea of k-means++ is this: first, pick one balloon at random as the first centroid. Each subsequent centroid is then chosen with the positions of the existing centroids taken into account. Concretely, the probability that a not-yet-chosen balloon becomes the next centroid is proportional to the squared distance from that balloon to its nearest already-chosen centroid. Balloons far from the existing centroids therefore have a higher chance of being selected next.
For example, if the first randomly chosen centroid is a red balloon, the next centroid is unlikely to be another red balloon, because the red balloons lie close together. It is more likely to be a green or blue balloon, since those are farther from the red ones. This ensures the initial centroids are spread apart, which helps the algorithm find a good grouping faster.
Once all the centroids have been chosen, k-means++ proceeds exactly like ordinary k-means, iteratively adjusting the centroid positions and the balloon assignments until the best grouping is found.
In short, the cleverness of k-means++ lies in its strategy for choosing the initial centroids: by taking distance into account, it makes the final result more stable and accurate, and usually speeds up convergence as well.
K-Means++ is an improved version of the K-Means clustering algorithm that optimizes the selection of the initial centroids. K-Means is very sensitive to the initial centroids: a poor choice can lead to bad clustering results or require many more iterations to converge. K-Means++ selects the initial centroids through a specific probabilistic scheme in order to obtain better clustering results.
The basic procedure of K-Means++ can be broken into the following steps:
1. Pick one data point uniformly at random as the first centroid.
2. For every remaining point x, compute D(x), the distance from x to its nearest already-chosen centroid.
3. Choose the next centroid at random, with each point's selection probability proportional to D(x)².
4. Repeat steps 2-3 until k centroids have been chosen.
5. Run standard K-Means with these centroids as the initialization.
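The k-means++ seeding procedure can be sketched as a compact NumPy function. This is a minimal sketch: the function name `kmeans_pp_init` and the toy points are illustrative, not part of the article's own code.

```python
import numpy as np

def kmeans_pp_init(X, n_clusters, rng=None):
    """k-means++ seeding: pick initial centroids by D(x)^2 sampling."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: choose the first centroid uniformly at random
    centers = [X[rng.integers(n)]]
    for _ in range(1, n_clusters):
        # Step 2: squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        # Step 3: sample the next centroid with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Toy data: two tight clumps far apart (illustrative)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = kmeans_pp_init(X, 2, rng=0)
print(centers.shape)  # (2, 2)
```

Because the second centroid is drawn with probability proportional to D(x)², it almost always lands in the clump that the first centroid missed.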
The main advantage of K-Means++ is that the initial centroids are chosen more sensibly, which typically leads to faster convergence and to clustering results that are more stable and accurate.
In K-Means++, the probability of selecting data point $x$ as the next centroid is

$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$

where $x$ is a data point (a candidate for the new centroid), $D(x)$ is the distance from $x$ to its nearest already-chosen centroid, and the denominator is the sum of squared distances from all data points to their nearest centroids.
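As a quick numeric check of this selection rule, consider a hypothetical toy example (not from the article's dataset): one centroid already chosen at the origin, and three points at distances 1, 2, and 3 from it. Their selection probabilities should be 1/14, 4/14, and 9/14.

```python
import numpy as np

# Hypothetical toy example: one centroid already chosen at the origin
centroid = np.array([0.0, 0.0])
points = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])

# D(x): distance from each point to its nearest (here, only) centroid
d = np.linalg.norm(points - centroid, axis=1)

# P(x) = D(x)^2 / sum of D(x')^2 over all points
probs = d ** 2 / np.sum(d ** 2)
print(probs)  # [0.07142857 0.28571429 0.64285714]
```

The farthest point is nine times more likely to be chosen than the nearest one, which is exactly the "spread the centroids apart" behavior described above.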
In summary, by improving the selection of the initial centroids, K-Means++ clusters better, converges faster, and reduces the dependence on a lucky random initialization.
See the evaluation metrics section.
The source-code link and data file are in the [Code] section of the K-Means algorithm.
import codecs
from numpy import *
import matplotlib.pyplot as plt

def load_data(path):
    """
    @brief Loads a tab-separated data file.
    @param path The file path
    @return data set
    """
    data_set = list()
    with codecs.open(path) as f:
        for line in f:
            data = line.strip().split("\t")
            flt_data = list(map(float, data))
            data_set.append(flt_data)
    return data_set

def dist_eucl(vecA, vecB):
    """
    @brief the similarity function
    @param vecA The vector a
    @param vecB The vector b
    @return the euclidean distance
    """
    return sqrt(sum(power(vecA - vecB, 2)))

def get_closest_dist(point, centroid):
    """
    @brief Gets the closest distance.
    @param point The point
    @param centroid The list of chosen centroids
    @return The distance from the point to its nearest centroid.
    """
    min_dist = inf
    for j in range(len(centroid)):
        distance = dist_eucl(point, centroid[j])
        if distance < min_dist:
            min_dist = distance
    return min_dist

def kpp_cent(data_mat, k):
    """
    @brief k-means++ centroid initialization
    @param data_mat The data matrix
    @param k num of clusters
    @return init centroids
    """
    data_set = data_mat.getA()
    centroid = list()
    # Pick the first centroid uniformly at random
    # (numpy.random.randint excludes the upper bound, so the index is valid)
    centroid.append(data_set[random.randint(0, len(data_set))])
    d = [0 for _ in range(len(data_set))]
    for _ in range(1, k):
        total = 0.0
        for i in range(len(data_set)):
            # Weight each point by its squared distance D(x)^2
            # to the nearest already-chosen centroid
            d[i] = get_closest_dist(data_set[i], centroid) ** 2
            total += d[i]
        # Roulette-wheel selection: far points are more likely to be chosen
        total *= random.rand()
        for j in range(len(d)):
            total -= d[j]
            if total > 0:
                continue
            centroid.append(data_set[j])
            break
    return mat(centroid)

def kpp_Means(data_mat, k, dist=dist_eucl, create_cent=kpp_cent):
    """
    @brief k-means++ clustering
    @param data_mat The data matrix
    @param k num of clusters
    @param dist The distance function
    @param create_cent The centroid-initialization function
    @return the centroids and the cluster assignment
    """
    m = shape(data_mat)[0]
    # One row per point: (cluster index, squared distance to its centroid)
    cluster_assment = mat(zeros((m, 2)))
    # Initialize centroids with k-means++ seeding
    centroid = create_cent(data_mat, k)
    cluster_changed = True
    while cluster_changed:
        cluster_changed = False
        # Assign every point to its nearest centroid
        for i in range(m):
            min_index = -1
            min_dist = inf
            for j in range(k):
                distance = dist(data_mat[i, :], centroid[j, :])
                if distance < min_dist:
                    min_dist = distance
                    min_index = j
            if cluster_assment[i, 0] != min_index:
                cluster_changed = True
            cluster_assment[i, :] = min_index, min_dist**2
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == j)[0]]
            centroid[j, :] = mean(per_data_set, axis=0)
    return centroid, cluster_assment

def plot_cluster(data_mat, cluster_assment, centroid):
    """
    @brief plot clusters and centroids
    @param data_mat The data matrix
    @param cluster_assment The cluster assignment
    @param centroid The centroids
    """
    plt.figure(figsize=(15, 6), dpi=80)
    plt.subplot(121)
    plt.plot(data_mat[:, 0], data_mat[:, 1], 'o')
    plt.title("source data", fontsize=15)
    plt.subplot(122)
    k = shape(centroid)[0]
    colors = [plt.cm.Spectral(each) for each in linspace(0, 1, k)]
    for i, col in zip(range(k), colors):
        per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == i)[0]]
        plt.plot(per_data_set[:, 0], per_data_set[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=10)
    plt.plot(centroid[:, 0], centroid[:, 1], '+', color='k', markersize=18)
    plt.title("k-Means++ Cluster, k = %d" % k, fontsize=15)
    plt.show()


if __name__ == '__main__':
    data_mat = mat(load_data("testSet2_kmeans.txt"))
    centroid, cluster_assment = kpp_Means(data_mat, 3)
    wcss = sum(cluster_assment[:, 1])
    print("wcss is ", wcss)
    plot_cluster(data_mat, cluster_assment, centroid)
Output:
wcss is 291.8508858651727
The Python machine-learning library scikit-learn (sklearn) provides a K-means implementation with K-means++ initialization built in. When using sklearn.cluster.KMeans, set the init parameter to 'k-means++' (which is the default) to use K-means++ initialization. We again use the testSet2_kmeans.txt dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.loadtxt('testSet2_kmeans.txt')

# Use the elbow method to pick a good value of k
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()

k = 3  # cluster the data into 3 groups

# scikit-learn's KMeans with k-means++ initialization (the default)
kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Silhouette coefficient as a quality measure
silhouette_avg = silhouette_score(X, labels)
print(f"For k={k}, silhouette score: {silhouette_avg}")

# Visualize the clustering result
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=200, alpha=0.5)
plt.show()
Output:
For k=3, silhouette score: 0.7031071078946134
Advantages:
- Better-placed initial centroids generally mean faster convergence and fewer iterations.
- Clustering results are more stable and accurate, with less dependence on a lucky random initialization.
Disadvantages:
- The seeding step must compute distances from every point to the already-chosen centroids, which adds computational cost compared with purely random initialization.
- The seeding is inherently sequential (each centroid depends on the previous ones), making it harder to parallelize.
Overall, the k-means++ clustering algorithm improves clustering quality and stability over plain k-means, at the cost of some extra computation during initialization.
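As an illustration of that trade-off, the sketch below compares init='random' with init='k-means++' on synthetic data. The blob parameters are illustrative assumptions, and the exact numbers vary with the data and seeds.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 4 well-separated blobs (illustrative parameters)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

results = {}
for init in ("random", "k-means++"):
    # n_init=1 isolates the effect of a single initialization
    results[init] = [
        KMeans(n_clusters=4, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(20)
    ]
    print(f"{init:>10}: mean WCSS {np.mean(results[init]):.1f}, "
          f"worst {np.max(results[init]):.1f}")
```

On well-separated data like this, random initialization occasionally seeds two centroids in the same blob and converges to a much larger worst-case WCSS, while k-means++ seeding typically stays near the best solution across seeds.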