1. Algorithm Idea

The goal of k-means clustering is to partition n points into k clusters so that each point belongs to the cluster whose mean (i.e., cluster center) is nearest to it.
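Concretely, k-means searches for the assignment that minimizes the within-cluster sum of squared distances to the cluster centers. A minimal NumPy sketch of that objective (the toy points and labels below are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: 6 points that form 2 obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])  # assumed cluster assignment

# Each cluster center is the mean of the points assigned to it.
centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# The quantity k-means minimizes: within-cluster sum of squared distances.
wcss = sum(np.sum((X[labels == k] - centers[k]) ** 2) for k in range(2))
print(wcss)
```

Any other assignment of these six points to two clusters would give a larger `wcss`; k-means iteratively searches for an assignment that makes it small.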

2. Clustering vs. Classification

Classification: data is assigned to known categories based on its features or attributes. In other words, the categories exist in advance; a model is trained on already-labeled data to learn what distinguishes each class, and is then used to classify new, unlabeled data.
Clustering: you do not know in advance how many groups the data falls into; cluster analysis groups the data (or users) into several clusters. Clustering requires no labeled training data. It is like school districting: you attend whichever elementary school is closest to your home.

3. Supervised vs. Unsupervised Learning

In classification, the label is the category; in regression, the label is a concrete y value.
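The difference shows up directly in the scikit-learn API: a supervised model's `fit` takes labels, an unsupervised one's does not. A minimal sketch (the toy data is made up; `KNeighborsClassifier` stands in for any supervised model):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])  # known labels, used only by the classifier

# Supervised: the classifier learns from (X, y); labels are required.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
pred = clf.predict([[0.0, 0.1]])

# Unsupervised: KMeans sees only X and invents its own cluster ids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(pred, km.labels_)
```

Note that the cluster ids KMeans returns are arbitrary: which group gets id 0 can change between runs, whereas a classifier's outputs are the original label values.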

4. Algorithm Steps

1) Choose k initial cluster centers (for example, k randomly chosen points).
2) Assign each point to its nearest center.
3) Recompute each center as the mean of the points assigned to it.
4) Repeat steps 2-3 until the centers stop moving (or a maximum number of iterations is reached).
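A minimal plain-NumPy sketch of the standard k-means (Lloyd's) iteration; it does not handle empty clusters and is for illustration only:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct random points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Demo on two well-separated toy groups.
pts = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                [5.0, 5.0], [5.1, 5.1], [4.9, 5.0]])
centers, labels = kmeans(pts, k=2)
print(centers, labels)
```

Production implementations (such as sklearn's) add smarter initialization (k-means++), multiple restarts, and empty-cluster handling on top of this loop.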

5. sklearn Code

    # coding:utf-8
    """
    ====================================
    Demonstration of k-means assumptions
    ====================================
    This example is meant to illustrate situations where k-means will produce
    unintuitive and possibly unexpected clusters. In the first three plots, the
    input data does not conform to some implicit assumption that k-means makes and
    undesirable clusters are produced as a result. In the last plot, k-means
    returns intuitive clusters despite unevenly sized blobs.
    """
    print(__doc__)

    # Author: Phil Roth <mr.phil.roth@gmail.com>
    # License: BSD 3 clause
    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    plt.figure(figsize=(12, 12))

    n_samples = 1500
    random_state = 170
    X, y = make_blobs(n_samples=n_samples, random_state=random_state)  # generate a dataset for clustering

    # Incorrect number of clusters
    y_pred = KMeans(n_clusters=2, random_state=random_state).fit_predict(X)

    plt.subplot(221)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.title("Incorrect Number of Blobs")

    # Anisotropically distributed data
    transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
    X_aniso = np.dot(X, transformation)
    y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_aniso)

    plt.subplot(222)
    plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
    plt.title("Anisotropically Distributed Blobs")

    # Different variance
    X_varied, y_varied = make_blobs(n_samples=n_samples,
                                    cluster_std=[1.0, 2.5, 0.5],
                                    random_state=random_state)
    y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)

    plt.subplot(223)
    plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
    plt.title("Unequal Variance")

    # Unevenly sized blobs
    X_filtered = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
    y_pred = KMeans(n_clusters=3,
                    random_state=random_state).fit_predict(X_filtered)

    plt.subplot(224)
    plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
    plt.title("Unevenly Sized Blobs")

    plt.show()

The resulting figure shows the four cases side by side.
Since clustering is not much needed in our current business work, I will not dig deeper into this code for now; this section can be fleshed out later if the need arises.
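For quick reference, the fitted `KMeans` object exposes its results through a few attributes. A minimal sketch (the blob parameters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=170)
km = KMeans(n_clusters=3, n_init=10, random_state=170).fit(X)

# labels_: cluster index of each sample
# cluster_centers_: the k centroids, one row per cluster
# inertia_: the within-cluster sum of squares that KMeans minimized
print(km.cluster_centers_.shape, km.labels_.shape, km.inertia_)
```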