Evaluation of Clustering - Assessing Clustering Tendency - Machine Learning

Clustering tendency assessment determines whether a given data set has a non-random structure, which may lead to meaningful clusters.

Assess whether non-random structure exists in the data by measuring the probability that the data was generated by a uniform distribution.
Test spatial randomness with a statistical test, the Hopkins Statistic:
- Given a dataset $D$, regarded as a sample of a random variable $o$, determine how far $o$ is from being uniformly distributed in the data space.
- Sample $n$ points, $p_1, \ldots, p_n$, uniformly from the range of $D$. For each $p_i$, find its nearest neighbour in $D$: $x_i = \min\{dist(p_i, v)\}$ where $v$ in $D$.
  - For example, if $D$ consists of real-valued observations whose minimum value is 0.5 and maximum value is 6.2, then $p_i$ is a random value sampled uniformly between 0.5 and 6.2.
- Sample $n$ points, $q_1, \ldots, q_n$, uniformly from $D$ ($q_i \in D$). For each $q_i$, find its nearest neighbour in $D - \{q_i\}$: $y_i = \min\{dist(q_i, v)\}$ where $v$ in $D$ and $v \neq q_i$.
  - Unlike $p_i$, $q_i$ is one of the existing values in $D$ (i.e. $q_i \in D$).
- Calculate the Hopkins Statistic:

  $$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$

  - If $D$ is uniformly distributed, $\sum x_i$ and $\sum y_i$ will be close to each other and $H$ is close to 0.5.
  - If $D$ is highly skewed, $\sum y_i$ will be much smaller than $\sum x_i$ and $H$ is close to 0.
- If $D$ is uniformly distributed then it contains no meaningful clusters.
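
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation, and it rests on several assumptions: the statistic is taken as $H = \sum y_i / (\sum x_i + \sum y_i)$, the distance is Euclidean, the "range of $D$" is the axis-aligned bounding box of the data, and the $q_i$ are drawn without replacement. The names `hopkins` and `nearest_distance` are hypothetical, and both test datasets are synthetic.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Distance from `point` to its nearest neighbour in `data`.

    `skip_index` excludes one row, so a data point is not matched with itself.
    """
    return min(
        math.dist(point, v)
        for j, v in enumerate(data)
        if j != skip_index
    )


def hopkins(data, n, rng):
    """Hopkins Statistic H = sum(y) / (sum(x) + sum(y)) for a list of points."""
    dims = len(data[0])
    # Assumption: the "range of D" is the bounding box of the data.
    lows = [min(v[d] for v in data) for d in range(dims)]
    highs = [max(v[d] for v in data) for d in range(dims)]

    # x_i: distance from each uniformly generated point p_i to the data.
    x = []
    for _ in range(n):
        p = [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        x.append(nearest_distance(p, data))

    # y_i: distance from each sampled data point q_i to the rest of the data.
    # Assumption: the q_i are drawn without replacement.
    y = []
    for i in rng.sample(range(len(data)), n):
        y.append(nearest_distance(data[i], data, skip_index=i))

    return sum(y) / (sum(x) + sum(y))


rng = random.Random(0)
# Synthetic data: two tight blobs versus points spread over the unit square.
clustered = [
    [rng.gauss(c, 0.01), rng.gauss(c, 0.01)]
    for c in (0.1, 0.9)
    for _ in range(100)
]
uniform = [[rng.random(), rng.random()] for _ in range(200)]

h_clustered = hopkins(clustered, 20, rng)
h_uniform = hopkins(uniform, 20, rng)
print(f"clustered: H = {h_clustered:.3f}")
print(f"uniform:   H = {h_uniform:.3f}")
```

Under these assumptions the clustered data should give a value near 0 and the uniform data a value near 0.5. The brute-force nearest-neighbour search costs $O(n \cdot |D|)$ distance computations, which is acceptable for small data; a spatial index would be the usual replacement for larger sets.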