What is Good Clustering?

    • A good clustering method will produce high-quality clusters with
      • high intra-class similarity: cohesive within clusters
      • low inter-class similarity: distinctive between clusters
    • The quality of a clustering method depends on
      • the similarity measure used by the method
      • its implementation, and
      • its ability to discover some or all of the hidden patterns
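
    The two criteria above (cohesion within clusters, separation between clusters) can be illustrated with a small sketch. It assumes Euclidean distance as the dissimilarity measure and uses made-up 2-D points; the function names are hypothetical, not part of any library.

    ```python
    from itertools import combinations
    from math import dist  # Euclidean distance (Python 3.8+)


    def mean_intra_cluster_distance(clusters):
        """Average distance over all pairs of points that share a cluster."""
        pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
        return sum(pairs) / len(pairs)


    def mean_inter_cluster_distance(clusters):
        """Average distance over all pairs of points in different clusters."""
        pairs = [dist(p, q)
                 for a, b in combinations(clusters, 2)
                 for p in a for q in b]
        return sum(pairs) / len(pairs)


    # Hypothetical toy data: the same six points, grouped two different ways.
    good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
    bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

    for name, clusters in [("good", good), ("bad", bad)]:
        intra = mean_intra_cluster_distance(clusters)
        inter = mean_inter_cluster_distance(clusters)
        print(f"{name}: intra={intra:.2f} inter={inter:.2f}")
    ```

    A small intra-cluster distance corresponds to high intra-class similarity, and a large inter-cluster distance corresponds to low inter-class similarity, so the first grouping should score better on both counts.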

    Measure the Quality of Clustering

    • Dissimilarity/Similarity metric: measures how alike or how distinct two objects are
      • Similarity is expressed in terms of a distance function d(i, j), which is typically a metric
      • The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables
      • Weights should be associated with different variables based on applications and data semantics
    • Quality of clustering:
      • There is usually a separate “quality” function that measures the “goodness” of a cluster.
      • It is hard to define “similar enough” or “good enough”
        • The answer is typically highly subjective
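
    To make the points about variable types and weights concrete, here is a minimal sketch of a weighted distance over mixed variables. It is similar in spirit to Gower's dissimilarity coefficient, but it is only an illustration under stated assumptions: interval variables are scaled by a known range, boolean and categorical variables use simple 0/1 matching, and the function name, variable kinds, weights, and sample records are all hypothetical.

    ```python
    def weighted_distance(x, y, kinds, weights, ranges):
        """Weighted average of per-variable dissimilarities, each in [0, 1].

        kinds[k]   -- "interval" or "categorical" (booleans count as categorical)
        weights[k] -- application-specific importance of variable k
        ranges[k]  -- max minus min of variable k (only used for "interval")
        """
        total = 0.0
        for xk, yk, kind, w, r in zip(x, y, kinds, weights, ranges):
            if kind == "interval":
                d = abs(xk - yk) / r            # scale the difference into [0, 1]
            elif kind == "categorical":
                d = 0.0 if xk == yk else 1.0    # simple matching
            else:
                raise ValueError(f"unknown variable kind: {kind}")
            total += w * d
        return total / sum(weights)


    # Hypothetical records: (age, smoker, favourite colour)
    kinds = ("interval", "categorical", "categorical")
    ranges = (50, None, None)   # assume ages span 50 years in the data set
    weights = (2, 1, 1)         # assume age matters twice as much here

    a = (30, True, "red")
    b = (40, True, "blue")
    print(weighted_distance(a, b, kinds, weights, ranges))
    ```

    Changing the weights changes which objects count as "similar", which is one reason the quality of a clustering is hard to judge objectively.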