Hopkins statistic

From HandWiki

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) measures the cluster tendency of a data set.[1] It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test in which the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.[2] With the definition given below, a value close to 1 indicates that the data are strongly clustered, while randomly distributed data give values around 0.5.[3]

Preliminaries

A typical formulation of the Hopkins statistic follows.[2]

Let $X$ be the set of $n$ data points.
Generate a random sample $\tilde{X}$ of $m \ll n$ data points sampled without replacement from $X$.
Generate a set $Y$ of $m$ uniformly randomly distributed data points.
Define two distance measures,
$u_i$, the minimum distance (given some suitable metric) of $y_i \in Y$ to its nearest neighbour in $X$, and
$w_i$, the minimum distance of $\tilde{x}_i \in \tilde{X} \subseteq X$ to its nearest neighbour $x_j \in X$, $\tilde{x}_i \neq x_j$.
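The sampling steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `hopkins_distances`, the Euclidean metric, the brute-force neighbour search, and the choice of the data's bounding box as the window for the uniform points are all assumptions that the formulation above leaves open.

```python
import math
import random


def hopkins_distances(data, m, rng=None):
    """Return (u, w): nearest-neighbour distances for m uniform points
    and for m sampled data points. Hypothetical helper for illustration."""
    rng = rng or random.Random()
    n = len(data)
    d = len(data[0])

    # Assumption: uniform points are drawn from the bounding box of the data.
    lows = [min(p[k] for p in data) for k in range(d)]
    highs = [max(p[k] for p in data) for k in range(d)]

    # Indices of the sample drawn without replacement from the data.
    sample_idx = rng.sample(range(n), m)

    # The set Y of m uniformly distributed points.
    y_points = [
        tuple(rng.uniform(lows[k], highs[k]) for k in range(d))
        for _ in range(m)
    ]

    # u_i: distance from each uniform point to its nearest data point.
    u = [min(math.dist(y, x) for x in data) for y in y_points]

    # w_i: distance from each sampled data point to its nearest other
    # data point (the point itself is excluded by index).
    w = [
        min(math.dist(data[i], data[j]) for j in range(n) if j != i)
        for i in sample_idx
    ]
    return u, w
```

Passing a seeded generator, for example `hopkins_distances(points, 10, random.Random(0))`, makes the sampling reproducible. The brute-force search costs on the order of m·n distance evaluations; a spatial index would be the usual replacement for large data sets.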

Definition

With the above notation, if the data are $d$-dimensional, then the Hopkins statistic is defined as:[4]

$$H = \frac{\sum_{i=1}^{m} u_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}$$

Under the null hypothesis, this statistic has a Beta(m, m) distribution.
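As a rough sketch, the statistic and a tail probability under the Beta(m, m) null distribution can be computed from the two lists of distances. The function names here are hypothetical. The cumulative distribution function uses the standard identity that, for integer parameters, the Beta(m, m) CDF at h equals the probability that a Binomial(2m − 1, h) variable is at least m, so only the Python standard library is needed.

```python
import math


def hopkins_statistic(u, w, d):
    """Hopkins statistic from distances u_i and w_i for d-dimensional data."""
    su = sum(x ** d for x in u)
    sw = sum(x ** d for x in w)
    return su / (su + sw)


def beta_mm_cdf(h, m):
    """CDF of Beta(m, m) at h for integer m, via the binomial-sum identity."""
    n = 2 * m - 1
    return sum(
        math.comb(n, j) * h ** j * (1 - h) ** (n - j)
        for j in range(m, n + 1)
    )


def upper_tail_p_value(h, m):
    """P(H >= h) under the null hypothesis of a uniform random distribution."""
    return 1.0 - beta_mm_cdf(h, m)
```

The upper tail is the relevant one when the distances u_i dominate the distances w_i, which pushes H toward 1; the lower tail is given by `beta_mm_cdf(h, m)` directly.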

Notes and references

  1. Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany (Annals Botany Co) 18 (2): 213–227. doi:10.1093/oxfordjournals.aob.a083391. 
  2. Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542). 1. pp. 149–153. doi:10.1109/FUZZY.2004.1375706. ISBN 0-7803-8353-2.
  3. Aggarwal, Charu C. (2015). Data Mining. Cham: Springer International Publishing. p. 158. doi:10.1007/978-3-319-14142-8. ISBN 978-3-319-14141-1. http://link.springer.com/10.1007/978-3-319-14142-8.
  4. Cross, G.R.; Jain, A.K. (1982). "Measurement of clustering tendency". Theory and Application of Digital Control: 315–320. doi:10.1016/B978-0-08-027618-2.50054-1.