What is Good Clustering?
- A good clustering method produces high-quality clusters, with
- high intra-class similarity: objects within a cluster are cohesive
- low inter-class similarity: clusters are distinct from one another
- The quality of a clustering method depends on
- the similarity measure used by the method
- its implementation, and
- its ability to discover some or all of the hidden patterns
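
As a rough illustration of cohesion versus separation, the sketch below compares the mean pairwise distance inside clusters with the mean pairwise distance across clusters. The function name `mean_intra_inter`, the toy points, and the choice of Euclidean distance are assumptions made for this example, not something the notes prescribe.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Hypothetical helper: assumes at least one same-cluster pair and
    at least one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: the pair counts toward cohesion; otherwise toward separation.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A small intra-cluster mean together with a large inter-cluster mean suggests cohesive, well-separated clusters under the chosen distance; a different distance function may rank the same clustering differently.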
Measure the Quality of Clustering
- Dissimilarity/similarity metric: measures how alike or how distinct two objects are
- Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
- The definitions of distance functions usually differ considerably for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables
- Weights should be associated with different variables based on the application and data semantics
- Quality of clustering:
- There is usually a separate “quality” function that measures the “goodness” of a cluster
- It is hard to define “similar enough” or “good enough”
- The answer is typically highly subjective
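
To make the points about type-specific distance functions and variable weights concrete, here is a minimal sketch of a weighted distance d(i, j) over records with mixed variable types, loosely in the spirit of Gower's dissimilarity. The function `mixed_distance`, the `kinds`/`ranges`/`weights` parameters, and the sample records are all hypothetical and chosen only for illustration.

```python
def mixed_distance(a, b, kinds, ranges, weights):
    """Weighted average of per-variable dissimilarities, each in [0, 1].

    kinds[k]   : "numeric", "ordinal", "boolean", or "categorical"
    ranges[k]  : value range (max - min) for numeric variables,
                 number of ranks for ordinal variables, ignored otherwise
    weights[k] : application-specific importance of variable k
    """
    total = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "numeric":
            # Interval-scaled: absolute difference normalized by the range.
            d = abs(x - y) / rng
        elif kind == "ordinal":
            # Ranks 1..rng mapped onto [0, 1].
            d = abs(x - y) / (rng - 1)
        else:
            # Boolean or categorical: simple mismatch.
            d = 0.0 if x == y else 1.0
        total += w * d
    return total / sum(weights)


# Hypothetical records: (age, smoker, favorite color, size rank 1..3)
kinds = ("numeric", "boolean", "categorical", "ordinal")
ranges = (40, None, None, 3)
weights = (2, 1, 1, 1)  # assumption: age matters twice as much here
a = (30, True, "red", 1)
b = (50, True, "blue", 3)
d_ab = mixed_distance(a, b, kinds, ranges, weights)
print(d_ab)
```

Changing the weights changes which objects count as "similar", which is one reason the judgment of what is "good enough" remains subjective.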