Contents

• Clustering overview
• K-Means
• Evaluating clustering

Unsupervised Learning

What if we don’t have labeled data?
Then we are doing unsupervised learning.

Can we still assign reasonable labels?
– We could try to cluster documents (cluster = class)
– Clustering is a type of unsupervised learning

Other types of unsupervised learning (not covered in this course):
– Autoencoders, rule mining, PCA

Clustering: Definition

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• Clustering is a very common form of unsupervised learning.


Classification vs. Clustering

• Classification: supervised learning
• Clustering: unsupervised learning
• Classification: Classes are human-defined and part of the input to the learning algorithm.
• Clustering: Clusters are inferred from the data without human input.
– There are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, algorithm hyperparameters

Desiderata for clustering

• General goal: put related docs in the same cluster, put unrelated docs in different clusters.
– How do we formalize this?
• The number of clusters should be appropriate for the data set we are clustering.
– Initially, we will assume the number of clusters K is given.
– There are (semi)automatic methods for determining K.
• Secondary goals in clustering (not always necessary)
– Avoid very small and very large clusters
– Define clusters that are easy to explain to the user

Flat vs. Hierarchical clustering

• Flat algorithms
– Usually start with a random (partial) partitioning of docs into groups
– Refine iteratively
– Common algorithm: K-means
• Hierarchical algorithms
– Create a hierarchy
• Cluster the clusters
– Bottom-up, agglomerative
– Top-down, divisive

Hard vs. Soft clustering

• Hard clustering: Each document belongs to exactly one cluster.
– More common and generally easier to do
– Example algorithm: K-means
• Soft clustering: A document is assigned a membership probability for each cluster.
– A higher probability means the document is more likely to belong to that cluster or to exhibit its characteristics.
– A document can belong to multiple clusters.
– Example algorithm: Gaussian Mixture Models (fitted via EM)
We will only consider flat, hard clustering in this course.

Flat clustering algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.
• Given: a set of documents and the number K
• Find: a partition into K clusters that optimizes the chosen partitioning criterion
• Global optimization: exhaustively enumerate all partitions and pick the optimal one (intractable in practice)
• Effective heuristic method: K-means algorithm

Document representation for clustering

• Represent documents as vectors
– Sparse count vectors, LSA, aggregated word2vec embeddings, pre-trained language models (e.g. BERT)
• We can measure similarity between vectors by Euclidean distance, which is almost equivalent to cosine distance.
– Except: cosine distance is length-normalized
• Some clustering algorithms (e.g. Affinity propagation) only require pairwise similarities between documents (e.g. edit distance)
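
A minimal sketch (hypothetical numpy vectors, not from the slides) of why Euclidean and cosine distance agree once vectors are length-normalized: for unit vectors, ||a - b||² = 2(1 - cos(a, b)), so the two measures rank document pairs identically.

```python
import numpy as np

# Hypothetical dense document vectors (e.g., averaged word2vec embeddings).
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.5, 0.3, 0.4])

# Length-normalize both vectors.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cosine_sim = float(a @ b)                    # cosine similarity of unit vectors
euclidean_sq = float(np.sum((a - b) ** 2))   # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)),
# so both measures induce the same ranking of document pairs.
print(euclidean_sq, 2 * (1 - cosine_sim))
```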

Centroids in K-means

• Each cluster in K-means is defined by a centroid.
• Definition of centroid:

  \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}

  where we use ω to denote a cluster.
• Centroids act as centers of gravity, abstractions of the cluster.
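
A tiny illustration (hypothetical numpy vectors) of computing a centroid as the component-wise mean of the vectors assigned to a cluster:

```python
import numpy as np

# Hypothetical document vectors currently assigned to one cluster ω.
cluster_docs = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 0.0],
])

# Centroid = (1/|ω|) * sum of the vectors in ω, i.e. the component-wise mean.
centroid = cluster_docs.mean(axis=0)
print(centroid)  # [3. 2.]
```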

K-means

• Objective/partitioning criterion: minimize the average squared difference from the centroid
• We try to find the minimum average squared difference by iterating two steps:
– Reassignment: assign each vector to its closest centroid
– Re-computation: recompute each centroid as the average of the vectors that were assigned to it

K-means algorithm

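Since the pseudocode figure did not survive extraction, here is a minimal sketch of flat, hard K-means in Python/numpy (the function name, convergence test, and empty-cluster handling are my own choices, not taken from the slides):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Cluster the rows of X (one document vector per row) into K clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K documents at random as the initial centroids
    # (random seed selection; see the later section on initialization).
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(max_iters):
        # Reassignment: assign each vector to its closest centroid.
        labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)

        # Re-computation: each centroid becomes the mean of its assigned vectors
        # (empty clusters simply keep their previous centroid here).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        # Stop once the centroids no longer move: a fixed point has been reached.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment and residual sum of squares (RSS).
    labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
    rss = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, rss

# Hypothetical usage on random vectors standing in for document embeddings.
X = np.random.default_rng(1).normal(size=(30, 5))
labels, centroids, rss = kmeans(X, K=3)
```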

K-means is guaranteed to converge
RSS (residual sum of squares) = sum of all squared distances between each document vector and its closest centroid (written out as a formula after this list)
• RSS decreases during each reassignment step – because each point is assigned to a closer centroid
• RSS decreases during each re-computation step – see the next section
• There are only a finite number of clusterings
• Thus: we must reach a fixed point.
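
Written out with the centroid definition above, the quantity K-means drives down is

  \mathrm{RSS} = \sum_{k=1}^{K} \sum_{\vec{x} \in \omega_k} \left\| \vec{x} - \vec{\mu}(\omega_k) \right\|^2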

Re-computation decreases average distance

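The derivation in the original figure did not survive extraction; the standard argument is as follows. For a fixed cluster ω, the vector v that minimizes the cluster’s contribution to RSS is found by setting the gradient to zero,

  \frac{\partial}{\partial \vec{v}} \sum_{\vec{x} \in \omega} \|\vec{v} - \vec{x}\|^2 = 2 \sum_{\vec{x} \in \omega} (\vec{v} - \vec{x}) = 0 \;\;\Rightarrow\;\; \vec{v} = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x} = \vec{\mu}(\omega),

so recomputing each centroid as the mean of its assigned vectors can only decrease (or leave unchanged) the RSS.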

K-means is guaranteed to converge

• But we don’t know how long convergence will take!
• If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10–20 iterations).
• However, complete convergence can take many more iterations.

Optimality of K-means

• Convergence does not mean that we converge to the optimal clustering!
• This is the great weakness of K-means.
• If we start with a bad set of seeds, the resulting clustering can be terrible.

Initialization of K-means

• Random seed selection is just one of many ways K-means can be initialized.
• Random seed selection is not very robust: it’s easy to get a suboptimal clustering.
• Better ways of computing initial centroids:
– Select seeds using a heuristic (e.g., ignore outliers or find a set of seeds that has “good coverage” of the document space)
– Use hierarchical clustering to find good seeds
– Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and keep the clustering with the lowest RSS (see the sketch after this list)
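
As a concrete illustration of the last strategy (this uses scikit-learn, which is not mentioned on the slides): KMeans with init="random" and n_init=10 runs K-means from 10 different random seed sets and keeps the solution with the lowest RSS (exposed there as inertia_).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical document vectors (rows) standing in for real embeddings.
X = np.random.default_rng(0).normal(size=(100, 20))

# 10 random restarts; the run with the lowest RSS (inertia_) is kept.
km = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_)      # RSS of the best run
print(km.labels_[:10])  # cluster assignments of the first 10 documents
```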

Evaluation

• Internal criteria
– Example of an internal criterion: RSS in K-means
• But an internal criterion often does not evaluate the actual utility of a clustering to an application.
• Alternative: external criteria
– Evaluate with respect to a human-defined classification

External criterion: Purity

If we have ground-truth classes, we can measure clustering quality with purity.

  \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|

where Ω = {ω_1, …, ω_K} is the set of clusters, C = {c_1, …, c_J} is the set of classes, and N is the total number of points; that is, the number of correctly assigned points divided by the total number of points.

Example for computing purity

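A small hypothetical example (the example in the original figure is not recoverable): suppose 17 documents fall into three clusters whose majority classes account for 5, 4, and 3 documents; then purity = (5 + 4 + 3) / 17 ≈ 0.71. In code:

```python
import numpy as np

# Hypothetical cluster assignments and true classes for 17 documents.
clusters = np.array([0] * 6 + [1] * 6 + [2] * 5)
classes = np.array(list("xxxxxo") + list("xxoooo") + list("xxddd"))

# Purity: for each cluster, count its majority class, then divide by N.
majority_total = sum(
    max(np.sum(classes[clusters == k] == c) for c in np.unique(classes))
    for k in np.unique(clusters)
)
purity = majority_total / len(clusters)
print(purity)  # 12 / 17 ≈ 0.71
```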

How many clusters?

• Number of clusters K is given in many applications.
– E.g., there may be an external constraint on K. Example: in web search result clustering, it may not be helpful to present more than 10–15 clusters to the user.
• What if there is no external constraint? Is there a “right” number of clusters?
• One way: define an optimization criterion that penalizes a larger number of clusters
– Given docs, find the K for which the optimum is reached.
– We can’t use RSS or average squared distance from the centroid as the criterion, because it would always choose K = N clusters (each document in its own cluster, giving RSS = 0).
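
One common form of such a penalized criterion (a standard choice; the exact criterion intended on the slide is not recoverable) is to pick

  K^{*} = \arg\min_{K} \left[ \mathrm{RSS}_{\min}(K) + \lambda K \right]

where RSS_min(K) is the lowest RSS achievable with K clusters and λ > 0 is a weight that trades off goodness of fit against the number of clusters.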