Biclustering documents with the Spectral Co-clustering algorithm

Translator: @N!no · Proofreader: pending

This example demonstrates the Spectral Co-clustering algorithm on the twenty newsgroups dataset. The 'comp.os.ms-windows.misc' category is excluded because it contains many posts that consist of nothing but data.

The TF-IDF vectorized posts form a word-frequency matrix, which is then biclustered using Dhillon's Spectral Co-clustering algorithm. The resulting document-word biclusters indicate subsets of words that are used more often in those subsets of documents. A minimal sketch of what this produces is shown below.
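As a small, self-contained illustration of co-clustering (the 4x4 matrix below is invented for this sketch and is not part of the example), SpectralCoclustering assigns rows and columns joint block labels:

import numpy as np
from sklearn.cluster import SpectralCoclustering

# Two planted blocks: rows 0-1 use columns 0-1, rows 2-3 use columns 2-3.
X = np.array([[5, 4, 0, 0],
              [6, 5, 0, 0],
              [0, 0, 7, 3],
              [0, 0, 2, 8]])
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print(model.row_labels_)     # rows 0-1 share one label, rows 2-3 the other
print(model.column_labels_)  # the columns are grouped the same way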

For a few of the best biclusters, the most common document categories and the ten most important words are printed. The best biclusters are determined by their normalized cut; the best words are determined by comparing their sums inside and outside the bicluster. A worked example of the normalized cut follows.
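Concretely, the normalized cut computed in `bicluster_ncut` below is the total weight of matrix entries that cross the bicluster boundary, divided by the weight inside it, so lower is better. A minimal sketch on a toy matrix (the matrix and index sets are illustrative only):

import numpy as np

# Toy word-frequency matrix; rows 0-1 and columns 0-1 form the
# candidate bicluster.
X = np.array([[3., 2., 0.],
              [1., 4., 1.],
              [0., 0., 5.]])
rows, cols = [0, 1], [0, 1]
row_comp = [i for i in range(X.shape[0]) if i not in rows]
col_comp = [j for j in range(X.shape[1]) if j not in cols]

weight = X[np.ix_(rows, cols)].sum()  # mass inside the bicluster
cut = X[np.ix_(row_comp, cols)].sum() + X[np.ix_(rows, col_comp)].sum()
print(cut / weight)  # 0.1 -- lower means a cleaner bicluster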

For comparison, the documents are also clustered using MiniBatchKMeans. The document clusters derived from the biclusters achieve a better V-measure than the clusters found by MiniBatchKMeans.
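V-measure makes this a fair comparison because it scores a clustering against the true categories without requiring matching label ids. A quick illustration with toy labels (not taken from the example):

from sklearn.metrics.cluster import v_measure_score

# Identical partitions score 1.0 even when the label ids are permuted ...
print(v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# ... while splitting one true class across two clusters is penalized.
print(v_measure_score([0, 0, 1, 1], [0, 0, 1, 2]))  # 0.8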

Output:

Vectorizing...
Coclustering...
Done in 2.75s. V-measure: 0.4387
MiniBatchKMeans...
Done in 5.69s. V-measure: 0.3344

Best biclusters:
----------------
bicluster 0 : 1829 documents, 2524 words
categories : 22% comp.sys.ibm.pc.hardware, 19% comp.sys.mac.hardware, 18% comp.graphics
words : card, pc, ram, drive, bus, mac, motherboard, port, windows, floppy

bicluster 1 : 2391 documents, 3275 words
categories : 18% rec.motorcycles, 17% rec.autos, 15% sci.electronics
words : bike, engine, car, dod, bmw, honda, oil, motorcycle, behanna, ysu

bicluster 2 : 1887 documents, 4232 words
categories : 23% talk.politics.guns, 19% talk.politics.misc, 13% sci.med
words : gun, guns, firearms, geb, drugs, banks, dyer, amendment, clinton, cdt

bicluster 3 : 1146 documents, 3263 words
categories : 29% talk.politics.mideast, 26% soc.religion.christian, 25% alt.atheism
words : god, jesus, christians, atheists, kent, sin, morality, belief, resurrection, marriage

bicluster 4 : 1732 documents, 3967 words
categories : 26% sci.crypt, 23% sci.space, 17% sci.med
words : clipper, encryption, key, escrow, nsa, crypto, keys, intercon, secure, wiretap
from collections import defaultdict
import operator
import sys
from time import time

import numpy as np

from sklearn.cluster import SpectralCoclustering
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.cluster import v_measure_score

print(__doc__)


def number_normalizer(tokens):
    """Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By
    applying this form of dimensionality reduction, some methods may perform
    better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))


# exclude 'comp.os.ms-windows.misc'
categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos',
              'rec.motorcycles', 'rec.sport.baseball',
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
              'sci.med', 'sci.space', 'soc.religion.christian',
              'talk.politics.guns', 'talk.politics.mideast',
              'talk.politics.misc', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target

vectorizer = NumberNormalizingVectorizer(stop_words='english', min_df=5)
cocluster = SpectralCoclustering(n_clusters=len(categories),
                                 svd_method='arpack', random_state=0)
kmeans = MiniBatchKMeans(n_clusters=len(categories), batch_size=20000,
                         random_state=0)

print("Vectorizing...")
X = vectorizer.fit_transform(newsgroups.data)

print("Coclustering...")
start_time = time()
cocluster.fit(X)
y_cocluster = cocluster.row_labels_
print("Done in {:.2f}s. V-measure: {:.4f}".format(
    time() - start_time,
    v_measure_score(y_cocluster, y_true)))

print("MiniBatchKMeans...")
start_time = time()
y_kmeans = kmeans.fit_predict(X)
print("Done in {:.2f}s. V-measure: {:.4f}".format(
    time() - start_time,
    v_measure_score(y_kmeans, y_true)))

# get_feature_names_out() replaces get_feature_names(), which was removed
# in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()
document_names = list(newsgroups.target_names[i] for i in newsgroups.target)


def bicluster_ncut(i):
    """Normalized cut of bicluster i: crossing weight / internal weight."""
    rows, cols = cocluster.get_indices(i)
    if not (np.any(rows) and np.any(cols)):
        # an empty bicluster can never be among the best
        return sys.float_info.max
    row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]
    col_complement = np.nonzero(np.logical_not(cocluster.columns_[i]))[0]
    # Note: the following is identical to X[rows[:, np.newaxis], cols].sum()
    # but much faster in scipy <= 0.16
    weight = X[rows][:, cols].sum()
    cut = (X[row_complement][:, cols].sum() +
           X[rows][:, col_complement].sum())
    return cut / weight


def most_common(d):
    """Items of a defaultdict(int) with the highest values.

    Like Counter.most_common in Python >= 2.7.
    """
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)


bicluster_ncuts = list(bicluster_ncut(i)
                       for i in range(len(newsgroups.target_names)))
best_idx = np.argsort(bicluster_ncuts)[:5]

print()
print("Best biclusters:")
print("----------------")
for idx, cluster in enumerate(best_idx):
    n_rows, n_cols = cocluster.get_shape(cluster)
    cluster_docs, cluster_words = cocluster.get_indices(cluster)
    if not len(cluster_docs) or not len(cluster_words):
        continue

    # categories
    counter = defaultdict(int)
    for i in cluster_docs:
        counter[document_names[i]] += 1
    cat_string = ", ".join("{:.0f}% {}".format(float(c) / n_rows * 100, name)
                           for name, c in most_common(counter)[:3])

    # words
    out_of_cluster_docs = cocluster.row_labels_ != cluster
    out_of_cluster_docs = np.where(out_of_cluster_docs)[0]
    word_col = X[:, cluster_words]
    word_scores = np.array(word_col[cluster_docs, :].sum(axis=0) -
                           word_col[out_of_cluster_docs, :].sum(axis=0))
    word_scores = word_scores.ravel()
    important_words = list(feature_names[cluster_words[i]]
                           for i in word_scores.argsort()[:-11:-1])

    print("bicluster {} : {} documents, {} words".format(
        idx, n_rows, n_cols))
    print("categories : {}".format(cat_string))
    print("words : {}\n".format(', '.join(important_words)))