Correlate summation analysis

Correlate summation analysis is a data mining method designed to find the variables that are most covariant with all of the other variables being studied, relative to clustering. It takes two forms. Aggregate correlate summation is, for a given variable, the product of two terms: the total of the negative logarithms of the p-values for all of the correlations to that variable, and the variable's normalized standard deviation (its standard deviation-to-mean quotient). Discrete correlate summation is likewise the product of two terms: the total of the absolute values of the logarithms of the ratios between two groups' p-values for the correlations to a given variable, and the absolute value of the logarithm of the ratio of the two groups' means for that variable.

Correlate summation template

A zipped Excel template performs a correlate summation analysis for up to 100 variables for 4 groups of 15 subjects.

The paper [1] describing the method is embedded in the spreadsheet.

Discrete correlate summation

Given two groups, an m-by-m correlation matrix was constructed for the m variables of each group. Each column holds all of the correlations (r) between a given variable and each of the other variables. Because variables may have either heterogeneous or homogeneous numbers of data points (n), the n for each individual correlation was calculated by assigning each data point a value of one and taking the sum of the products for each pair in that correlation, so that only subjects with data for both variables are counted.
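The pairwise count described above can be sketched in a few lines of Python. This is an illustration only: marking a missing data point with `None`, the function name `pairwise_counts`, and the sample values are all assumptions made for the example, not part of the published method.

```python
def pairwise_counts(columns):
    """Return the m-by-m matrix of pairwise sample sizes n.

    `columns` is a list of m variables; each variable is a list with one
    entry per subject. None marks a missing data point (an assumption of
    this sketch).
    """
    # Indicator arrays: 1 where a data point exists, 0 where it is missing.
    masks = [[0 if value is None else 1 for value in col] for col in columns]
    m = len(masks)
    # n for a pair of variables is the sum of the products of their indicators.
    return [
        [sum(a * b for a, b in zip(masks[i], masks[j])) for j in range(m)]
        for i in range(m)
    ]


# Made-up values for three variables measured on five subjects.
x = [3.1, 2.9, None, 3.3, 3.0]
y = [142, None, 150, 160, 138]
z = [1.2, 1.4, 1.1, None, 1.3]
counts = pairwise_counts([x, y, z])
```

Each diagonal entry of `counts` is the number of data points for one variable, and each off-diagonal entry is the n to use for the corresponding correlation.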

The correlations were tested for linearity using Student's t-distribution to evaluate:

[math]\displaystyle{ t=\frac{|r|}{\sqrt{\frac{1-r^2}{n-2}}} }[/math]

with (n − 2) degrees of freedom, using a two-tailed test.[2]
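As a rough illustration, the statistic and its two-tailed probability can be computed with the Python standard library alone. The function names are hypothetical, and the numerical integration is a choice made for this sketch; the original template runs in a spreadsheet and may obtain the probability differently.

```python
import math


def correlation_t(r, n):
    """t statistic for a correlation r computed from n paired data points."""
    if abs(r) >= 1.0:
        return math.inf
    return abs(r) / math.sqrt((1.0 - r * r) / (n - 2))


def two_tailed_p(r, n, steps=2000):
    """Two-tailed p-value of the t statistic with n - 2 degrees of freedom.

    Substituting t = sqrt(df) * tan(theta) turns the tail area of the
    t density into a finite integral of cos(theta) ** (df - 1), which is
    evaluated with Simpson's rule (steps must be even).
    """
    df = n - 2
    if df < 1:
        raise ValueError("need at least 3 paired data points")
    t = correlation_t(r, n)
    lower = math.atan(t / math.sqrt(df))  # atan(inf) is pi/2, so p becomes 0
    upper = math.pi / 2.0
    h = (upper - lower) / steps

    def integrand(theta):
        return math.cos(theta) ** (df - 1)

    total = integrand(lower) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    integral = total * h / 3.0
    # Normalizing constant of the t density, via log-gamma for stability.
    log_coef = math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
    return 2.0 * math.exp(log_coef) / math.sqrt(math.pi) * integral
```

For example, `two_tailed_p(0.632, 10)` is close to 0.05, in line with the tabulated critical value of r for 8 degrees of freedom.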

The correlation matrices were thus transformed into linear probability matrices. For the two groups, the absolute value of the logarithm of the ratio of the two p-values for each comparison gives a log correlation ratio (logcr) that grows as the ratio approaches zero or infinity. Each column was totaled to form the discrete correlate summation array. In the same way, the log mean ratio (logmr) was obtained for each variable as the absolute value of the logarithm of the ratio of the two groups' means. The correlate summation was then multiplied by the log mean ratio to yield the discrete mean-correlate summation (DCΣx).[1]
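The steps in this section can be sketched as follows, starting from two p-value matrices that have already been computed. Several details are assumptions of the sketch rather than statements from the source: base-10 logarithms, skipping the diagonal (each variable against itself), and the function name. All p-values must be positive, and the two group means for a variable must be nonzero and of the same sign, or the logarithms are undefined.

```python
import math


def discrete_correlate_summation(p_a, p_b, means_a, means_b):
    """Return (DCS array, logmr array, DCSx array) for two groups.

    p_a and p_b are m-by-m matrices of correlation p-values for groups
    A and B; means_a and means_b hold the m group means.
    Assumptions: base-10 logarithms, diagonal entries are skipped.
    """
    m = len(p_a)
    dcs = []
    for j in range(m):  # one column per variable
        column_total = 0.0
        for i in range(m):
            if i != j:
                # Log correlation ratio: large when the two p-values differ.
                column_total += abs(math.log10(p_a[i][j] / p_b[i][j]))
        dcs.append(column_total)
    # Log mean ratio for each variable.
    logmr = [abs(math.log10(a / b)) for a, b in zip(means_a, means_b)]
    # Discrete mean-correlate summation: column total times log mean ratio.
    dcsx = [total * ratio for total, ratio in zip(dcs, logmr)]
    return dcs, logmr, dcsx


# Made-up p-values and means for three variables in two groups.
p_a = [[1.0, 0.01, 0.5], [0.01, 1.0, 0.2], [0.5, 0.2, 1.0]]
p_b = [[1.0, 0.1, 0.5], [0.1, 1.0, 0.02], [0.5, 0.02, 1.0]]
dcs, logmr, dcsx = discrete_correlate_summation(
    p_a, p_b, [10.0, 20.0, 5.0], [100.0, 20.0, 50.0]
)
```

In this toy input the second variable has the largest correlate summation, but its group means are equal, so its logmr and therefore its DCΣx are zero.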

Aggregate correlate summation

As in the discrete correlate summation, a linear probability matrix was calculated, here for all of the data without grouping. The negative logarithm was taken of each p-value, and the columns were totaled to give the aggregate correlate summation (ACΣ) array. The standard deviation of each variable was divided by its mean to normalize the variances between variables. Data with a bimodal distribution will have a larger normalized standard deviation (nSD) than data with a normal distribution. The nSD array multiplied by the ACΣ array yielded the aggregate mean-correlate summation (ACΣx).[1]
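A minimal sketch of the aggregate calculation follows, taking a precomputed p-value matrix and the raw data columns as input. The use of base-10 logarithms, the sample (n − 1) standard deviation, the skipped diagonal, and the function name are assumptions made for the example; the input values are invented and serve only to show the arithmetic.

```python
import math
import statistics


def aggregate_correlate_summation(p_values, columns):
    """Return (ACS array, nSD array, ACSx array) for ungrouped data.

    p_values is the m-by-m matrix of correlation p-values; columns holds
    the m variables as lists of complete observations.
    Assumptions: base-10 logarithms, sample standard deviation, and
    diagonal entries are skipped.
    """
    m = len(p_values)
    # Column totals of the negative log p-values.
    acs = [
        sum(-math.log10(p_values[i][j]) for i in range(m) if i != j)
        for j in range(m)
    ]
    # Normalized standard deviation: standard deviation divided by the mean.
    nsd = [statistics.stdev(col) / statistics.mean(col) for col in columns]
    # Aggregate mean-correlate summation.
    acsx = [spread * total for spread, total in zip(nsd, acs)]
    return acs, nsd, acsx


# Invented inputs, chosen so the arithmetic is easy to follow.
p_values = [[1.0, 0.01, 0.1], [0.01, 1.0, 0.001], [0.1, 0.001, 1.0]]
columns = [[2.0, 4.0, 6.0], [1.0, 1.0, 4.0], [5.0, 5.0, 5.0]]
acs, nsd, acsx = aggregate_correlate_summation(p_values, columns)
```

A variable with no spread, such as the third column here, has an nSD of zero and so drops out of the ACΣx ranking regardless of its ACΣ.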

Non-linear modeling

A linear correlation between variables for a given sample set is typically the initial step in the investigation of relationships, which may lead to an underlying mechanism. The variation (either inherent or in response to a challenge) in a given population gives rise to correlations between variables in which only a portion of the underlying sigmoidal (control) relationship may be evident. Generally, when data defy linear regression, the data patterns indicate a power relationship of the general type:

[math]\displaystyle{ y=mx^a }[/math]

Type 1: a < 0 is a hyperbolic function

Type 2: a = 0 is a horizontal line

Type 3: 0 < a < 1 is a root function

Type 4: a = 1 is actually a linear function

Type 5: a > 1 is a power function

(In all five cases a log-log plot yields a straight line.)[3]
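The log-log property follows from taking logarithms of both sides, log y = log m + a·log x, so the exponent a is the slope of a straight-line fit on log-log axes. The sketch below fits m and a by ordinary least squares and labels the result with the five types listed above. The function names and the tolerance used to decide whether a equals 0 or 1 are assumptions of this illustration, and all x and y values must be positive.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = m * x**a on log-log axes; returns (m, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    a = sxy / sxx                          # slope of the log-log line
    m = math.exp(mean_y - a * mean_x)      # intercept, transformed back
    return m, a


def power_type(a, tol=1e-6):
    """Label an exponent with the five types (tol is an assumed tolerance)."""
    if a < -tol:
        return "Type 1: hyperbolic function"
    if abs(a) <= tol:
        return "Type 2: horizontal line"
    if a < 1.0 - tol:
        return "Type 3: root function"
    if abs(a - 1.0) <= tol:
        return "Type 4: linear function"
    return "Type 5: power function"


# Synthetic data generated from y = 3 * x**2.
xs = [1.0, 2.0, 4.0, 8.0]
m_fit, a_fit = fit_power_law(xs, [3.0 * x ** 2 for x in xs])
```

With measured data the fitted exponent will rarely land exactly on 0 or 1, so in practice the tolerance would be replaced by a confidence interval on the slope.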

On a positive sigmoidal/logistic curve, the initial, intermediate and late portions resemble power, linear and root functions, respectively. Also, the late portion of a negative control function is reminiscent of a hyperbolic curve.

In an analysis of variable correlation, the sigmoidal relationship over the entire data range (which in some cases is not fully sampled) should be considered. This type of analysis is accomplished either by regression with a logistic curve or by simple linear regression followed by further investigation of the Type 1, 3 and 5 power relationships.[1]


  1. Westwood, B.; Chappell, M. (2006). "Application of correlate summation to data clustering in the estrogen- and salt-sensitive female mRen2.Lewis rat". Proceedings of the 1st international workshop on Text mining in bioinformatics (TMBIO '06). ACM. pp. 21–26. doi:10.1145/1183535.1183542. ISBN 1-59593-526-6.
  2. Swinscow, T. (1997). Statistics at Square One. BMJ Publishing Group.
  3. Mandel, J. (1984). The Statistical Analysis of Experimental Data. Dover Publications, Mineola, NY.