Cophenetic correlation
In statistics, and especially in biostatistics, cophenetic correlation[1] (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of DNA sequences, or other taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters.[2] This coefficient has also been proposed for use as a test for nested clusters.[3]
Calculating the cophenetic correlation coefficient
Suppose that the original data {Xi} have been modeled using a cluster method to produce a dendrogram {Ti}; that is, a simplified model in which data that are "close" have been grouped into a hierarchical tree. Define the following distance measures.
- [math]\displaystyle{ x(i,j) = |X_i-X_j| }[/math], the Euclidean distance between the ith and jth observations.
- [math]\displaystyle{ t(i,j) }[/math], the dendrogrammatic distance between the model points [math]\displaystyle{ T_i }[/math] and [math]\displaystyle{ T_j }[/math]. This distance is the height of the node at which these two points are first joined together.
Then, letting [math]\displaystyle{ \bar{x} }[/math] be the average of the x(i, j), and letting [math]\displaystyle{ \bar{t} }[/math] be the average of the t(i, j), the cophenetic correlation coefficient c is given by[4]
- [math]\displaystyle{ c = \frac {\sum_{i\lt j} [x(i,j) - \bar{x}][t(i,j) - \bar{t}]}{\sqrt{\sum_{i\lt j}[x(i,j)-\bar{x}]^2 \sum_{i\lt j}[t(i,j)-\bar{t}]^2}}. }[/math]
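As a hedged illustration of the formula above, the following minimal Python sketch computes c for four one-dimensional points. The sample data, the choice of single linkage, and the helper names `single_linkage_cophenetic` and `cophenetic_correlation` are assumptions made for this example only; the source does not prescribe a particular clustering method.

```python
from itertools import combinations
from math import sqrt


def single_linkage_cophenetic(points):
    """Return {(i, j): t(i, j)} for 1-D points clustered by naive single linkage.

    Illustrative helper (not from any library): t(i, j) is the height of the
    merge at which observations i and j first end up in the same cluster.
    """
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(abs(points[i] - points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Every pair split across the two clusters is first joined at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return coph


def cophenetic_correlation(x, t):
    """Pearson correlation between the x(i, j) and t(i, j) values, taken over i < j."""
    x_bar = sum(x) / len(x)
    t_bar = sum(t) / len(t)
    num = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((ti - t_bar) ** 2 for ti in t))
    return num / den


points = [0.0, 1.0, 5.0, 7.0]                       # assumed sample data
pairs = list(combinations(range(len(points)), 2))   # all pairs with i < j
x = [abs(points[i] - points[j]) for i, j in pairs]  # original distances x(i, j)
coph = single_linkage_cophenetic(points)
t = [coph[p] for p in pairs]                        # dendrogram distances t(i, j)
c = cophenetic_correlation(x, t)
print(f"cophenetic correlation: {c:.4f}")
```

In this sketch the two tight pairs keep their original distances, while all four cross-group pairs are assigned the same merge height, which is why c falls short of 1.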
Software implementation
In R, the cophenetic correlation can be calculated with the dendextend package.[5]
In Python, the SciPy package also has an implementation.[6]
In MATLAB, the Statistics and Machine Learning Toolbox contains an implementation.[7]
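For the SciPy implementation, a minimal usage sketch might look like the following. The synthetic two-cluster data, the random seed, and the choice of average linkage are assumptions made for illustration; `pdist`, `linkage`, and `cophenet` are the SciPy functions involved.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Assumed sample data: two well-separated 2-D clusters of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(10, 2)),
])

Y = pdist(X)                      # condensed vector of Euclidean distances x(i, j)
Z = linkage(Y, method="average")  # hierarchical clustering (average linkage assumed)

# With Y supplied, cophenet returns the coefficient c and the condensed
# vector of cophenetic (dendrogram) distances t(i, j).
c, coph_dists = cophenet(Z, Y)

# Cross-check: c should equal the Pearson correlation of the two vectors.
r = np.corrcoef(Y, coph_dists)[0, 1]
print(f"c = {c:.4f}, Pearson r = {r:.4f}")
```

Comparing c across linkage methods on the same data is one common way to use the coefficient, although which method is appropriate depends on the application.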
References
- ↑ Sokal, R. R. and F. J. Rohlf. 1962. The comparison of dendrograms by objective methods. Taxon, 11:33-40
- ↑ Dorthe B. Carr, Chris J. Young, Richard C. Aster, and Xioabing Zhang, Cluster Analysis for CTBT Seismic Event Monitoring (a study prepared for the U.S. Department of Energy)
- ↑ Rohlf, F. J. and David L. Fisher. 1968. Test for hierarchical structure in random data sets. Systematic Zool., 17:407-412
- ↑ MathWorks Statistics Toolbox
- ↑ "Introduction to dendextend". https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html.
- ↑ "scipy.cluster.hierarchy.cophenet — SciPy v0.14.0 Reference Guide". https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.cophenet.html.
- ↑ "Cophenetic correlation coefficient - MATLAB cophenet". https://www.mathworks.com/help/stats/cophenet.html.
External links
Original source: https://en.wikipedia.org/wiki/Cophenetic correlation.