Hopkins statistic

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set.[1] It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test where the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.[2] With the definition given below, the value approaches 1 if individuals are aggregated and tends to 0.5 if they are randomly distributed.[3]

Preliminaries

A typical formulation of the Hopkins statistic follows.[2]

Let [math]\displaystyle{ X }[/math] be the set of [math]\displaystyle{ n }[/math] data points.
Generate a random sample [math]\displaystyle{ \overset{\sim}{X} }[/math] of [math]\displaystyle{ m \ll n }[/math] data points sampled without replacement from [math]\displaystyle{ X }[/math].
Generate a set [math]\displaystyle{ Y }[/math] of [math]\displaystyle{ m }[/math] points distributed uniformly at random over the same sampling region as [math]\displaystyle{ X }[/math].
Define two distance measures,
[math]\displaystyle{ u_i, }[/math] the distance (given some suitable metric) from [math]\displaystyle{ y_i \in Y }[/math] to its nearest neighbour in [math]\displaystyle{ X }[/math], and
[math]\displaystyle{ w_i, }[/math] the distance from [math]\displaystyle{ \overset{\sim}{x}_i \in \overset{\sim}{X}\subseteq X }[/math] to its nearest neighbour [math]\displaystyle{ x_j \in X,\, \overset{\sim}{x}_i\ne x_j. }[/math]
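
The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the function name `hopkins_distances`, the Euclidean metric, and the choice to draw Y uniformly from the axis-aligned bounding box of X are all assumptions made here for the example. The nearest-neighbour search is brute force, so it is only practical for small data sets.

```python
import math
import random

def hopkins_distances(X, m, seed=None):
    """Return (u, w) for data X, given as a list of n points with d coordinates each."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Sample m of the n data points without replacement (by index).
    sample_idx = rng.sample(range(n), m)
    # Assumption: Y is uniform over the axis-aligned bounding box of X.
    lo = [min(p[k] for p in X) for k in range(d)]
    hi = [max(p[k] for p in X) for k in range(d)]
    Y = [[rng.uniform(lo[k], hi[k]) for k in range(d)] for _ in range(m)]
    # u_i: Euclidean distance from y_i to its nearest neighbour in X.
    u = [min(math.dist(y, x) for x in X) for y in Y]
    # w_i: distance from each sampled point to its nearest *other* point in X.
    w = [min(math.dist(X[i], X[j]) for j in range(n) if j != i)
         for i in sample_idx]
    return u, w
```

For example, `u, w = hopkins_distances(points, m=20, seed=0)` returns two lists of 20 distances each, where `points` is a hypothetical list of coordinate lists.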

Definition

With the above notation, if the data are [math]\displaystyle{ d }[/math]-dimensional, then the Hopkins statistic is defined as:[4]

[math]\displaystyle{ H=\frac{\sum_{i=1}^m{u_i^d}}{\sum_{i=1}^m{u_i^d}+\sum_{i=1}^m{w_i^d}} \, }[/math]

Under the null hypothesis, this statistic has a Beta(m, m) distribution.
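
As a rough illustration of the definition and its null distribution, the following Python sketch computes H from given distances and an upper-tail probability under Beta(m, m). The function names are hypothetical. The tail probability uses the standard identity that, for integer parameters, the Beta(m, m) distribution function at h equals the probability that a Binomial(2m − 1, h) variable is at least m. Because the u_i tend to exceed the w_i when the points of X are clustered, the sketch treats large values of H as evidence against the null hypothesis; that one-sided choice is an assumption of this example.

```python
import math

def hopkins_statistic(u, w, d):
    """H = sum(u_i^d) / (sum(u_i^d) + sum(w_i^d)) for d-dimensional data."""
    u_d = sum(ui ** d for ui in u)
    w_d = sum(wi ** d for wi in w)
    return u_d / (u_d + w_d)

def beta_mm_upper_tail(h, m):
    """P(H >= h) when H ~ Beta(m, m), for integer m >= 1."""
    n = 2 * m - 1
    # 1 - CDF(h) = P(Binomial(n, h) <= m - 1)
    return sum(math.comb(n, j) * h ** j * (1 - h) ** (n - j)
               for j in range(m))
```

With `u = [2, 2]`, `w = [1, 1]` and `d = 2`, `hopkins_statistic` gives 8 / (8 + 2) = 0.8, and `beta_mm_upper_tail(0.5, m)` is 0.5 for any m because Beta(m, m) is symmetric about 0.5.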

Notes and references

  1. Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany (Annals Botany Co) 18 (2): 213–227. doi:10.1093/oxfordjournals.aob.a083391.
  2. Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542). 1. pp. 149–153. doi:10.1109/FUZZY.2004.1375706. ISBN 0-7803-8353-2.
  3. Aggarwal, Charu C. (2015). Data Mining. Cham: Springer International Publishing. p. 158. doi:10.1007/978-3-319-14142-8. ISBN 978-3-319-14141-1. http://link.springer.com/10.1007/978-3-319-14142-8.
  4. Cross, G.R.; Jain, A.K. (1982). "Measurement of clustering tendency". Theory and Application of Digital Control: 315–320. doi:10.1016/B978-0-08-027618-2.50054-1.
