“How can we modify the k-means algorithm to diminish sensitivity to outliers?”。我们如何修改k-means算法来降低对异常值的敏感度?”
K-medoids
Instead of taking the mean value of the object in a cluster as a reference point, we can pick actual objects to represent the clusters. 我们可以选取实际的对象来表示聚类,而不是将聚类中对象的平均值作为参考点。
The k-medoids method is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. 在存在噪声和异常值的情况下,k-medoids方法比k-均值更稳健,因为与均值相比,medoid受异常值或其他极值的影响更小。
Partitioning Around Medoids (PAM) 分割环绕物件法
PAM algorithm is a popular realisation of k-medoids clustering.
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.
The quality of clustering can be measured by an absolute-error criterion (total cost):,
where is the sum of the absolute error for all objects
in the dataset, and
is the representative object of
.
PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)。PAM对小数据集有效,但对大数据集不能很好地扩展(由于计算复杂性)
Algorithm
- Arbitrarily choose k objects in D as the initial representative objects。 任意选择D中的k个对象作为初始代表对象
- Assign each remaining object to the cluster with the nearest representative object。 将剩余的每个对象分配给具有最接近的代表对象的集群
- Randomly select a non-representative object,
随机选择一个不具有代表性的对象,·
- For each representative object
,
- Repeat 2-4 until there is no change.