bi-kmeans, i.e. bisecting K-means, is an improved variant of the K-means clustering method. It was devised mainly to address problems that plain K-means runs into, especially on large data sets: poor efficiency and a tendency to get stuck in local optima.
In the original K-means algorithm, the number of clusters must be specified up front. The algorithm picks that many points at random as centers and then iterates: each data point is assigned to the cluster of its nearest center, new centers are computed as the cluster means, and the process repeats until a stopping condition is met. When the data set is large or its distribution is complex, the computation becomes very expensive, and if the initial centers are chosen badly, the final result may be far from optimal.
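To make the assign/recompute loop concrete, here is a single iteration on made-up 1-D data (the points and the initial centers are invented purely for illustration):

```python
import numpy as np

# One iteration of K-means on toy 1-D data.
points = np.array([1.0, 2.0, 9.0, 10.0])
centroids = np.array([0.0, 5.0])   # a (deliberately poor) initial guess

# Assignment step: each point joins the cluster of its nearest center.
labels = np.array([np.abs(centroids - p).argmin() for p in points])
print(labels)      # [0 0 1 1]

# Update step: each center moves to the mean of its assigned points.
centroids = np.array([points[labels == j].mean() for j in range(2)])
print(centroids)   # [1.5 9.5]
```

Repeating these two steps until the assignments stop changing is the whole of K-means.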
bi-kmeans is designed to remedy these shortcomings. Its basic idea is not to form all the clusters at once, but to start from a single cluster containing every data point. That cluster is split in two, choosing the split that leaves the smallest total error. The splitting then repeats: at each step, among all current clusters, the algorithm picks the one whose split would reduce the total error the most and splits it, until the preset number of clusters is reached.
This divide-and-conquer strategy often makes bi-kmeans more effective than the original K-means in practice. On large data sets, for instance, each split only works on the points of one cluster, so the amount of data handled per step is relatively small, which improves running efficiency. And because the clustering is refined gradually, split by split, the risk of falling into a poor local optimum is also reduced.
Bi-Kmeans is an improved version of the K-means clustering algorithm. K-means is a classic clustering method whose goal is to partition n data points into k clusters so that every point belongs to the cluster whose center (centroid) is nearest to it, thereby minimizing the sum of distances from the points to their cluster centers. However, K-means requires the number of clusters k to be fixed in advance, is sensitive to the choice of initial centers, and easily gets trapped in local optima.
Bi-Kmeans (bisecting K-means) builds on K-means with a top-down splitting strategy. It does not carve out all k clusters at once; instead it starts from one large cluster and gradually splits it into smaller ones until the target number is reached.
The basic principle and steps of Bi-Kmeans are as follows:
1. Initialization: treat all data points as a single initial cluster.
2. Splitting: for each current cluster, try dividing it into two sub-clusters by running K-means with k = 2 on that cluster's points.
3. Evaluating the split: for each candidate split, compute the cost function (WCSS, the within-cluster sum of squares) before and after the split, and keep the split that lowers the cost the most. The cost function is defined as

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $k$ is the number of clusters, $C_i$ is the set of points in the $i$-th cluster, $\mu_i$ is the centroid of the $i$-th cluster, and $x$ ranges over the data points in that cluster.
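As a quick sanity check of this definition, WCSS can be computed directly with numpy; the data, labels, and centroids here are made up for illustration:

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-cluster sum of squares: total squared distance of
    every point to the centroid of its assigned cluster."""
    return sum(np.sum((points[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

# Two toy clusters; every point lies exactly 1 away from its centroid.
pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
lab = np.array([0, 0, 1, 1])
cen = np.array([[0.0, 1.0], [10.0, 1.0]])
print(wcss(pts, lab, cen))  # 4 points x 1^2 each -> 4.0
```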
4. Repeating: repeat steps 2 and 3 until the preset number of clusters is reached or further splitting no longer significantly lowers the cost function.
Through this top-down strategy, Bi-Kmeans refines the clustering step by step and, to a degree, overcomes K-means' sensitivity to the choice of initial centroids. And since each step operates on a single cluster rather than re-clustering all data points, its computational cost is usually lower than that of K-means, especially on large data sets.
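The four steps above can be sketched in plain numpy. This is a minimal illustration, not the article's full implementation: `two_means` is a bare-bones Lloyd's algorithm with k = 2, and the helper names, toy data, and seed are invented for the example:

```python
import numpy as np

def two_means(points, n_iter=10, seed=0):
    """Bare-bones k-means with k=2; returns (labels, sse)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), 2, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # assignment step: nearest centroid by Euclidean distance
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each non-empty centroid to its cluster mean
        for j in (0, 1):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    sse = np.sum((points - centroids[labels]) ** 2)
    return labels, sse

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly split the cluster whose 2-means split reduces
    total SSE the most, until k clusters exist."""
    labels = np.zeros(len(points), dtype=int)
    for new_id in range(1, k):
        best_gain, best = -np.inf, None
        for i in range(new_id):          # clusters so far: 0 .. new_id-1
            mask = labels == i
            if mask.sum() < 2:
                continue
            sse_before = np.sum((points[mask] - points[mask].mean(axis=0)) ** 2)
            sub_labels, sse_after = two_means(points[mask], seed=seed)
            gain = sse_before - sse_after  # SSE of other clusters is unchanged
            if gain > best_gain:
                best_gain, best = gain, (mask, sub_labels)
        if best is None:                 # nothing left to split
            break
        mask, sub_labels = best
        idx = np.flatnonzero(mask)
        labels[idx[sub_labels == 1]] = new_id  # second half gets the new label
    return labels
```

On three well-separated blobs with k = 3, the two splits peel the blobs apart one at a time; the label values themselves are arbitrary, only the grouping matters.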
See the section on evaluation metrics.
import codecs
from numpy import *
import matplotlib.pyplot as plt

def load_data(path):
    """
    @brief Load a tab-separated data set.
    @param path The file path
    @return data set as a list of float lists
    """
    data_set = list()
    with codecs.open(path) as f:
        for line in f.readlines():
            data = line.strip().split("\t")
            flt_data = list(map(float, data))
            data_set.append(flt_data)
    return data_set


def rand_cent(data_mat, k):
    """
    @brief select k random centroids inside the data's bounding box
    @param data_mat The data matrix
    @param k number of centroids
    @return centroids
    """
    n = shape(data_mat)[1]
    centroids = mat(zeros((k, n)))
    if shape(data_mat)[0] == 0:
        return centroids
    for j in range(n):
        minJ = min(data_mat[:, j])
        rangeJ = float(max(data_mat[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids


def dist_eucl(vecA, vecB):
    """
    @brief the similarity function
    @param vecA The vector a
    @param vecB The vector b
    @return the euclidean distance
    """
    return sqrt(sum(power(vecA - vecB, 2)))

def k_Means(data_mat, k, dist=dist_eucl, create_cent=rand_cent):
    """
    @brief kMeans algorithm
    @param data_mat The data matrix
    @param k num of clusters
    @param dist The distance function
    @param create_cent The centroid initialization function
    @return the centroids and the cluster assignment
    """
    m = shape(data_mat)[0]
    # per-point assignment: (cluster index, squared distance to its centroid)
    cluster_assment = mat(zeros((m, 2)))
    # randomly initialize the centroids
    centroid = create_cent(data_mat, k)
    cluster_changed = True
    while cluster_changed:
        cluster_changed = False
        # assign every point to its nearest centroid
        for i in range(m):
            min_index = -1
            min_dist = inf
            for j in range(k):
                distance = dist(data_mat[i, :], centroid[j, :])
                if distance < min_dist:
                    min_dist = distance
                    min_index = j
            if cluster_assment[i, 0] != min_index:
                cluster_changed = True
            cluster_assment[i, :] = min_index, min_dist ** 2
        # recompute each centroid as the mean of the points assigned to it
        for j in range(k):
            per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == j)[0]]
            if shape(per_data_set)[0] > 0:  # skip empty clusters
                centroid[j, :] = mean(per_data_set, axis=0)
    return centroid, cluster_assment

def bi_kMeans(data_mat, k, dist=dist_eucl):
    """
    @brief bisecting kMeans algorithm
    @param data_mat The data matrix
    @param k num of clusters
    @param dist The distance function
    @return the centroids and the cluster assignment
    """
    m = shape(data_mat)[0]

    # per-point assignment: (cluster index, squared distance to its centroid)
    cluster_assment = mat(zeros((m, 2)))

    # start from a single cluster whose centroid is the mean of all points
    centroid0 = mean(data_mat, axis=0).tolist()[0]
    cent_list = [centroid0]
    print(cent_list)

    # initialize WCSS: squared distance of every point to the initial centroid
    for j in range(m):
        cluster_assment[j, 1] = dist(mat(centroid0), data_mat[j, :]) ** 2

    while len(cent_list) < k:
        lowest_sse = inf
        for i in range(len(cent_list)):
            # try a k=2 k-means split on each current cluster
            ptsin_cur_cluster = data_mat[nonzero(cluster_assment[:, 0].A == i)[0], :]
            centroid_mat, split_cluster_ass = k_Means(ptsin_cur_cluster, k=2)
            # total SSE = SSE of the split cluster + SSE of the untouched clusters
            sse_split = sum(split_cluster_ass[:, 1])
            sse_nonsplit = sum(cluster_assment[nonzero(cluster_assment[:, 0].A != i)[0], 1])
            print("sse_split, sse_nonsplit", sse_split, sse_nonsplit)
            # keep the split that yields the lowest total SSE
            if sse_split + sse_nonsplit < lowest_sse:
                best_cent_tosplit = i
                best_new_cents = centroid_mat
                best_cluster_ass = split_cluster_ass.copy()
                lowest_sse = sse_split + sse_nonsplit
        print('the bestCentToSplit is: ', best_cent_tosplit)
        print('the len of bestClustAss is: ', len(best_cluster_ass))
        # relabel: one half keeps the old index, the other gets a new index
        best_cluster_ass[nonzero(best_cluster_ass[:, 0].A == 1)[0], 0] = len(cent_list)
        best_cluster_ass[nonzero(best_cluster_ass[:, 0].A == 0)[0], 0] = best_cent_tosplit
        cent_list[best_cent_tosplit] = best_new_cents[0, :].tolist()[0]
        cent_list.append(best_new_cents[1, :].tolist()[0])
        cluster_assment[nonzero(cluster_assment[:, 0].A == best_cent_tosplit)[0], :] = best_cluster_ass
    return mat(cent_list), cluster_assment

def plot_cluster(data_mat, cluster_assment, centroid):
    """
    @brief plot clusters and centroids
    @param data_mat The data matrix
    @param cluster_assment The cluster assignment
    @param centroid The centroids
    @return
    """
    plt.figure(figsize=(15, 6), dpi=80)
    plt.subplot(121)
    plt.plot(data_mat[:, 0], data_mat[:, 1], 'o')
    plt.title("source data", fontsize=15)
    plt.subplot(122)
    k = shape(centroid)[0]
    colors = [plt.cm.Spectral(each) for each in linspace(0, 1, k)]
    for i, col in zip(range(k), colors):
        per_data_set = data_mat[nonzero(cluster_assment[:, 0].A == i)[0]]
        plt.plot(per_data_set[:, 0], per_data_set[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=10)
    plt.plot(centroid[:, 0], centroid[:, 1], '+', color='k', markersize=18)
    plt.title("bi_KMeans Cluster, k = %d" % k, fontsize=15)
    plt.show()

if __name__ == '__main__':
    data_mat = mat(load_data("testSet2_kmeans.txt"))
    centroid, cluster_assment = bi_kMeans(data_mat, 3)
    wcss = sum(cluster_assment[:, 1])
    print("wcss is ", wcss)
    plot_cluster(data_mat, cluster_assment, centroid)
Output:
[[-0.15772275000000002, 1.2253301166666664]]
sse_split, sse_nonsplit 584.8476068364444 0.0
the bestCentToSplit is: 0
the len of bestClustAss is: 60
sse_split, sse_nonsplit 40.51109841484639 441.31482269559115
sse_split, sse_nonsplit 51.845405378816565 143.53278414085307
the bestCentToSplit is: 1
the len of bestClustAss is: 34
wcss is 195.3781895196697
Advantages:
- Less sensitive to the choice of initial centroids than plain K-means, because each step only solves a small k = 2 sub-problem.
- Each split operates on a single cluster rather than the whole data set, so it runs efficiently on large data sets.
Disadvantages:
- The greedy splitting order can still land in a local optimum, and a poor early split is never undone.
- Like K-means, it is sensitive to outliers and still requires the target number of clusters k.
In short, Bi-Kmeans is an improvement over K-means: it handles large data sets efficiently, but problems such as local optima and sensitivity to outliers remain.