Gaussian copula mixture model

A Gaussian copula mixture model (GCMM) consists of a weighted sum of a finite number of joint distributions, each of which contains a Gaussian copula. It is a generalization of the usual Gaussian mixture model (GMM): when the marginal distributions are restricted to be Gaussian, the model reduces to a GMM. To begin, the multivariate Gaussian copula is defined by the following cumulative distribution function:

[math]\displaystyle{ F(u|P)=\int_{-\infty}^{\Psi^{-1}(u_{1})}\ldots\int_{-\infty}^{\Psi^{-1}(u_{D})}\frac{1}{(2\pi)^{D/2}|P|^{1/2}}\exp(-\frac{1}{2}v^TP^{-1}v)dv }[/math] whose density is given by

[math]\displaystyle{ f(u|P)=\frac{1}{(2\pi)^{D/2}|P|^{1/2}}\exp(-\frac{1}{2}\Psi^{-1}(u)^TP^{-1}\Psi^{-1}(u))\prod_{d=1}^D\frac{1}{\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}(\Psi^{-1}(u_d))^2)} }[/math]

where

[math]\displaystyle{ \begin{array}{ll} \Psi & \text{is the cumulative distribution function of the one-dimensional standard normal distribution, with density } \psi;\\ \Psi^{-1}(u)=[\Psi^{-1}(u_{1})\ldots\Psi^{-1}(u_{D})] & \text{is the componentwise Gaussianized vector;}\\ P & \text{is the copula correlation matrix;}\\ D & \text{is the number of dimensions.}\\ \end{array} }[/math]
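
For concreteness, the copula density above can be evaluated numerically once [math]\displaystyle{ P }[/math] is given. The following is a minimal Python sketch (not part of the original article); the function name and the example correlation matrix are illustrative assumptions.

<syntaxhighlight lang="python">
# Minimal sketch: evaluating the Gaussian copula density f(u|P).
# Illustrative only; function name and example matrix are assumptions.
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, P):
    """Density of the Gaussian copula with correlation matrix P at u in (0,1)^D."""
    v = norm.ppf(u)                          # v_d = Psi^{-1}(u_d)
    quad = v @ np.linalg.solve(P, v)         # v^T P^{-1} v
    log_num = -0.5 * quad - 0.5 * np.linalg.slogdet(P)[1]   # multivariate normal kernel
    log_den = -0.5 * np.sum(v ** 2)          # product of standard normal kernels
    return np.exp(log_num - log_den)         # the (2*pi) factors cancel

# Example: a two-dimensional copula with correlation 0.6.
P = np.array([[1.0, 0.6], [0.6, 1.0]])
print(gaussian_copula_density(np.array([0.3, 0.7]), P))
</syntaxhighlight>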

Then, with the Gaussianization of the original data on each dimension, a GCMM for the joint distribution of a random vector [math]\displaystyle{ X }[/math] can be defined as follows: [math]\displaystyle{ F(X|\pi)=\sum_{k=1}^{K}\pi_k\int_{-\infty}^{Y_{k1}}\ldots\int_{-\infty}^{Y_{kD}}\frac{1}{(2\pi)^{D/2}|P_k|^{1/2}}\exp(-\frac{1}{2}Y^TP_k^{-1}Y)dY }[/math]

where

[math]\displaystyle{ \begin{array}{ll} x=[x_1\ldots x_D] & \text{is the vector of marginal observations;}\\ Y_{k}=[Y_{k1} \ldots Y_{kD}] & \text{is the vector of the transformed data;}\\ Y_{kd}=\Psi^{-1}(F_{kd}(x_{d})) & \text{is the } d\text{-th dimension of the transformed data;}\\ Z_{kd}=f_{kd}(x_{d})=\frac{\partial F_{kd}}{\partial x}(x_{d}) & \text{is the density of the marginal distribution;}\\ \pi_k & \text{is the weight of the } k\text{-th copula.}\\ \end{array} }[/math]

Its density is given by

[math]\displaystyle{ f(X|\pi)=\sum_{k=1}^{K} \pi_k\frac{1}{(2\pi)^{D/2}|P_k|^{1/2}}\exp(-\frac{1}{2}Y_{k}^TP_k^{-1}Y_{k})\prod_{d=1}^D\frac{Z_{kd}}{\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}(Y_{kd})^2)} }[/math]

The density above is conditioned on the cumulative probability values and the Gaussianized random variables, both of which are determined by the marginal distributions. The marginal distribution on each dimension for each component can be estimated via nonparametric methods such as kernel smoothing (Bowman 1998 [1]).
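
As an illustration of this step, the sketch below estimates a smoothed marginal cdf and pdf with a Gaussian kernel and then forms the Gaussianized variables [math]\displaystyle{ Y }[/math] and densities [math]\displaystyle{ Z }[/math] that enter the GCMM density. The rule-of-thumb bandwidth and the clipping of probabilities are assumptions; Bowman et al. (1998) discuss more careful bandwidth selection.

<syntaxhighlight lang="python">
# Minimal sketch: kernel-smoothed marginal cdf/pdf and the Gaussianized data.
import numpy as np
from scipy.stats import norm

def kernel_cdf_pdf(x_obs, x_eval, bandwidth=None):
    """Kernel-smoothed estimates of the marginal cdf F(x) and pdf f(x)."""
    x_obs = np.asarray(x_obs, dtype=float)
    if bandwidth is None:                         # simple rule-of-thumb bandwidth
        bandwidth = 1.06 * x_obs.std() * len(x_obs) ** (-1 / 5)
    z = (np.asarray(x_eval)[:, None] - x_obs[None, :]) / bandwidth
    F_hat = norm.cdf(z).mean(axis=1)              # smoothed empirical cdf
    f_hat = norm.pdf(z).mean(axis=1) / bandwidth  # kernel density estimate
    return F_hat, f_hat

x = np.random.standard_t(df=3, size=500)          # heavy-tailed toy sample
F_hat, Z = kernel_cdf_pdf(x, x)
Y = norm.ppf(np.clip(F_hat, 1e-6, 1 - 1e-6))      # Y = Psi^{-1}(F(x)), clipped to stay finite
</syntaxhighlight>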

Basic Properties of GCMM

A GCMM is defined by separating the mixture of copulas from the marginal distributions, which may lead to behavior different from that of a GMM. To understand the properties of a GCMM, its likelihood function is studied so that appropriate estimation algorithms can be designed. The major properties of a GCMM are discussed below:

1 A GCMM has a bounded likelihood function value on bounded domains and tractable derivatives, conditioned on the estimated marginal probability functions. The likelihood function is given below:

[math]\displaystyle{ L=\sum_{n=1}^{N} \ln \Big(\sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{D/2}|P_k|^{1/2}} \exp(-\frac{1}{2}(Y_{n,k})^TP_k^{-1}Y_{n,k})\prod_{i=1}^D \frac{Z_{n,ki}}{\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}(Y_{n,ki})^2) } \Big) }[/math]
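
A minimal sketch of evaluating this log-likelihood is given below, assuming the Gaussianized data [math]\displaystyle{ Y_{n,k} }[/math] and marginal densities [math]\displaystyle{ Z_{n,ki} }[/math] have already been computed for each component; the array layout and function name are assumptions, not the article's code.

<syntaxhighlight lang="python">
# Minimal sketch: the GCMM log-likelihood L, given per-component Y and Z.
import numpy as np

def gcmm_log_likelihood(Y, Z, pi, P):
    """Y, Z: arrays of shape (K, N, D); pi: (K,) weights; P: (K, D, D) correlations."""
    K, N, D = Y.shape
    comp_log = np.empty((K, N))
    for k in range(K):
        quad = np.einsum('nd,de,ne->n', Y[k], np.linalg.inv(P[k]), Y[k])
        log_copula = (-0.5 * quad - 0.5 * np.linalg.slogdet(P[k])[1]
                      + 0.5 * np.sum(Y[k] ** 2, axis=1))   # divide out the N(0,1) kernels
        comp_log[k] = np.log(pi[k]) + log_copula + np.log(Z[k]).sum(axis=1)
    m = comp_log.max(axis=0)                                # log-sum-exp over components
    return np.sum(m + np.log(np.exp(comp_log - m).sum(axis=0)))
</syntaxhighlight>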

We provide the following theorem to demonstrate the features of such a likelihood function; the proof is given in the full dissertation on GCMM.

Under suitable conditions, the likelihood function is bounded above on a bounded region; it is non-decreasing and negative semi-definite with respect to the densities [math]\displaystyle{ Z_{n,ki} }[/math]; and it may contain both local minima and local maxima with respect to the transformed variables [math]\displaystyle{ Y_{n,k} }[/math].

2 The value of its likelihood function is non-decreasing during iterations of the Expectation-Maximization (EM) algorithms applied to a GCMM, and these algorithms converge globally to local maxima under mild conditions (Wu 1983 [11]). The design and properties of these EM algorithms are discussed in the next section.

3 Model selection can be conducted through the Akaike information criterion (Fan 2009 [5]), and clustering methods such as k-means or hierarchical clustering can be used to set the initial parameters of each component.
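
As a sketch of this selection step, fitted models with different numbers of components can be compared through the AIC. The parameter count below includes only the mixture weights and the copula correlation entries and ignores the nonparametric marginals; this counting convention is an assumption, not prescribed by the article.

<syntaxhighlight lang="python">
# Minimal sketch: AIC-based choice of the number of components K.
def gcmm_aic(log_lik, K, D):
    """AIC = 2 * (#free parameters) - 2 * log-likelihood for a K-component GCMM."""
    n_params = (K - 1) + K * D * (D - 1) // 2   # weights plus correlation entries
    return 2 * n_params - 2 * log_lik

# Hypothetical usage: fit a GCMM for each candidate K (e.g. with the EM algorithm
# of the next section) and keep the K with the smallest AIC.
</syntaxhighlight>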

Linkage with GMM

Gaussian mixture models have been used widely in various applications, and the Expectation-Maximization algorithm has been used to estimate their parameters; the convergence properties of such EM algorithms are discussed in Xu and Jordan (1996 [12]). However, each component of a GMM is a multivariate Gaussian distribution that cannot effectively capture heavy tails, and the number of components becomes sensitive to heavy tails. Introducing more flexible components may help to further reduce the number of components when working with heavy-tailed data.

On the other hand, copulas have been used in research to model dependence. The definition of a copula in the two-dimensional case is given below:

Let [math]\displaystyle{ P }[/math] be a conditional bivariate distribution function with continuous margins [math]\displaystyle{ F_X }[/math] and [math]\displaystyle{ F_Y }[/math], and let [math]\displaystyle{ \mathcal{F} }[/math] be some conditioning set. Then there exists a unique conditional copula [math]\displaystyle{ C:[0,1]\times[0,1]\rightarrow[0,1] }[/math] such that (Sklar 1959 [9]): [math]\displaystyle{ P(x,y|\mathcal{F})=C(F_X(x|\mathcal{F}),F_Y(y|\mathcal{F})|\mathcal{F}), \forall x, y \in \mathbb{R}. }[/math] This definition can easily be generalized to higher dimensions. The advantages of the copula method include the following:

1 Heavy-tailed joint distributions can be modeled;

2 Marginal distributions and their dependence structure can be studied separately;

3 Copulas can be calibrated to data sets that are sparse and unevenly distributed.
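
The sketch below illustrates the copula construction above by sampling: draws from a Gaussian copula are pushed through two arbitrary marginal quantile functions. The particular margins (Student-t and exponential) and the correlation value are illustrative assumptions.

<syntaxhighlight lang="python">
# Minimal sketch: building a joint sample from a Gaussian copula and arbitrary margins.
import numpy as np
from scipy.stats import norm, t, expon

rng = np.random.default_rng(0)
P = np.array([[1.0, 0.7], [0.7, 1.0]])            # copula correlation matrix
L = np.linalg.cholesky(P)
z = rng.standard_normal((1000, 2)) @ L.T          # correlated standard normals
u = norm.cdf(z)                                   # copula sample in (0,1)^2
x = t.ppf(u[:, 0], df=3)                          # heavy-tailed margin F_X^{-1}
y = expon.ppf(u[:, 1], scale=2.0)                 # skewed margin F_Y^{-1}
# (x, y) has t(3) and exponential margins with Gaussian-copula dependence.
</syntaxhighlight>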

Upper tail dependence can be studied using copulas (Nelsen 2006 [7]), and copulas can be estimated using a two-step maximum likelihood method whose properties are discussed in White (1994 [10]). In the two-dimensional case, Archimedean copulas such as BB1 are more flexible than the Gaussian copula in capturing heavy tails, but the estimation of higher-dimensional Archimedean copulas is not as fully studied as in the two-dimensional case (Hofert et al. 2012 [6]). Factor copula models have been introduced to control model complexity as well (Oh and Patton 2011 [4]). In transportation research, different copula families have been reviewed and applied to model residential neighborhood choices (Bhat 2009 [14]). Within this context, a mixture of Gaussian copulas presents an effective alternative for improving model performance when one wants to study complex dependence structures based on simple copulas.

Finally, Gaussian copula mixture models are developed to meet both needs. They can be viewed as an extension of Gaussian mixture models (Paalanen et al. 2006; Yang and Ahuja 1998) and aim to address the following two concerns:

1 Heavy-tailed data may require an increasing number of clusters to fit a GMM: heavy tails in the marginal distributions can lead to a greater number of clusters. However, if the heavy tails appear independently on each dimension, an increasing number of clusters should not be needed to describe them; in other words, the multidimensional clusters should be introduced in the copula space rather than in the original data space, and the heavy-tailed marginal distributions should be modeled separately. This intuition leads to the GCMM, in which the marginal distributions can be updated using nonparametric methods while the mixture model captures the dependence structure. Such a model potentially requires fewer clusters.

2 GMMs are usually applied to a synchronized panel data matrix with dimension [math]\displaystyle{ M }[/math] and number of observations [math]\displaystyle{ N }[/math]. In many problems there are also numerous unsynchronized data on each dimension, the number of which is denoted [math]\displaystyle{ n_m }[/math] for the [math]\displaystyle{ m }[/math]-th dimension. Such data should be utilized to update the joint distribution shared by the different dimensions. As a concrete example, suppose there are 500 observations of travel time on link A and 400 observations on link B, of which 300 pairs are synchronized between A and B. A GMM will use only the 300 synchronized pairs to update the mixture model, and a GCMM will likewise use these 300 pairs to update the mixture copula structure; however, the GCMM will further use the unsynchronized 200 observations on A and 100 observations on B to update their marginal travel time distributions, which in turn contributes to the estimation of the copula mixture during the iterations.

Expectation-Maximization Algorithms for GCMM

1 The Base Case Algorithm with Synchronized Data

The algorithm updates the mixture of copulas and the marginal distributions separately. When estimating a GMM, the weights [math]\displaystyle{ \pi^m_{k} }[/math] and correlation matrices of the components [math]\displaystyle{ P^m_k }[/math], together with the sufficient statistics (mean [math]\displaystyle{ \mu^m_{ki} }[/math] and standard deviation [math]\displaystyle{ \sigma^m_{ki} }[/math]) of the marginal normal distributions, are updated (Dempster 1977 [3]) based on the posterior probability [math]\displaystyle{ \gamma^m_{nk} }[/math]. In a GCMM, the sufficient statistics of the marginal normal distributions are replaced with nonparametric estimators of the marginal pdf [math]\displaystyle{ f^m_{ki} }[/math] and cdf [math]\displaystyle{ F^m_{ki} }[/math] to improve flexibility.

The major challenge of algorithm design lies in how the marginal distributions should be updated considering the posterior probability. An updating formula is developed and given by the following theorem:

In the GCMM base case, the marginal distributions are updated by the following formula, with the necessary normalizations: [math]\displaystyle{ F'_{ki}(c)=\sum_n \gamma_{nk} 1_{x_{ni} \leq c} }[/math]
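
A minimal sketch of this update, with the normalization made explicit, is given below; the function name and the grid argument are assumptions.

<syntaxhighlight lang="python">
# Minimal sketch: posterior-weighted empirical cdf F'_{ki}(c) = sum_n gamma_nk 1{x_ni <= c}.
import numpy as np

def weighted_marginal_cdf(x_i, gamma_k, c_grid):
    """Posterior-weighted empirical cdf of dimension i for component k."""
    w = np.asarray(gamma_k, dtype=float)
    w = w / w.sum()                                        # the necessary normalization
    indic = np.asarray(x_i)[None, :] <= np.asarray(c_grid)[:, None]
    return indic @ w                                       # F'_{ki} evaluated on c_grid
</syntaxhighlight>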

Based on the theorem, the algorithm is further developed below:

1 Expectation step: [math]\displaystyle{ D^m_{nk}=\prod_{i=1}^D\frac{Z^m_{n,ki}}{\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}(Y^m_{n,ki})^2)} }[/math] and [math]\displaystyle{ r^m_{nk}=\frac{\pi_k\frac{D^m_{nk}}{|P^m_k|^{1/2}}\exp(-\frac{1}{2}(Y^m_{nk})^TP_k^{m,-1}Y^m_{nk})}{\sum_{j=1}^K \pi_j\frac{D^m_{nj}}{|P^m_j|^{1/2}}\exp(-\frac{1}{2}(Y^m_{nj})^TP_j^{m,-1}Y^m_{nj})} }[/math]

2 Maximization step: [math]\displaystyle{ \pi_k^m=\frac{\sum_{n=1}^N r^m_{nk}}{N}, \qquad P_k^m=\frac{\sum_{n=1}^{N}r^m_{nk}Y^m_{nk}(Y^m_{nk})^T}{\sum_{n=1}^{N}r^m_{nk}}, \qquad F_{ki}^m(y)=\frac{\sum_n r_{nk}^m 1_{x_{ni} \leq y}}{\sum_n r_{nk}^m} }[/math] for the [math]\displaystyle{ k }[/math]-th copula and [math]\displaystyle{ i }[/math]-th dimension.
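
The sketch below assembles these two steps into one EM iteration for the base case. It is a schematic reading of the formulas above, not the article's implementation: the array layout is an assumption, the marginal estimates producing [math]\displaystyle{ Y }[/math] and [math]\displaystyle{ Z }[/math] are assumed to be refreshed separately (for example with the weighted cdf update above), and the final rescaling of [math]\displaystyle{ P_k }[/math] to a correlation matrix is an added normalization.

<syntaxhighlight lang="python">
# Minimal sketch: one base-case EM iteration for the GCMM (synchronized data).
import numpy as np

def em_step(Y, Z, pi, P):
    """Y, Z: (K, N, D) Gaussianized data and marginal densities per component;
    pi: (K,) component weights; P: (K, D, D) copula correlation matrices."""
    K, N, D = Y.shape
    log_resp = np.empty((K, N))
    for k in range(K):                          # E-step: log of pi_k * component density
        quad = np.einsum('nd,de,ne->n', Y[k], np.linalg.inv(P[k]), Y[k])
        log_resp[k] = (np.log(pi[k]) - 0.5 * np.linalg.slogdet(P[k])[1]
                       - 0.5 * quad + 0.5 * np.sum(Y[k] ** 2, axis=1)
                       + np.log(Z[k]).sum(axis=1))
    log_resp -= log_resp.max(axis=0)            # stabilize before normalizing
    r = np.exp(log_resp)
    r /= r.sum(axis=0)                          # r[k, n]: responsibility of component k
    pi_new = r.sum(axis=1) / N                  # M-step: mixture weights
    P_new = np.empty_like(P)
    for k in range(K):                          # M-step: weighted scatter of Y
        Pk = np.einsum('n,nd,ne->de', r[k], Y[k], Y[k]) / r[k].sum()
        d = np.sqrt(np.diag(Pk))
        P_new[k] = Pk / np.outer(d, d)          # rescale to a correlation matrix (assumption)
    return r, pi_new, P_new
</syntaxhighlight>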

The key observation here is that heavy-tail phenomena fall into two classes: heavy tails in the marginal distributions and heavy tails in the dependence structure. The GCMM separates their estimation and controls the number of clusters purely based on the complexity of the heavy tails in the dependence structure (the latter). In this manner, the number of clusters can be further reduced, and the mixture of copulas is robust to heavy tails in the marginal distributions (the former).

2 Algorithm with Unsynchronized Data

The GCMM with unsynchronized data is developed based on the rationale that unsynchronized data on each dimension can be used to update the marginal distributions, given that the estimation of the marginal distributions is separated from the mixture of copulas. An additional posterior probability [math]\displaystyle{ \gamma^{'m}_{n_i,k} }[/math] is introduced to represent the probability that the [math]\displaystyle{ n_i }[/math]-th unsynchronized observation on the [math]\displaystyle{ i }[/math]-th dimension belongs to the [math]\displaystyle{ k }[/math]-th component. An additional loop is then inserted into the Expectation-Maximization algorithm for the GCMM base case, which further updates [math]\displaystyle{ \gamma^{'m}_{n_i,k} }[/math] based on the new information.

The major challenge of algorithm design lies in how the marginal distributions should be further updated given unsynchronized data and the existing nonparametric estimator. An updating formula is developed and given by the following theorem:

In the GCMM with unsynchronized data, the marginal distributions are updated by the following formulas, with the necessary normalizations: [math]\displaystyle{ \gamma'_{n_i,k}=\frac{\pi_k f_{ki}(x_{n_i})}{\sum_{j=1}^K \pi_j f_{ji}(x_{n_i})} }[/math] and [math]\displaystyle{ F'_{ki}(c)=\sum_n \gamma_{nk} 1_{x_{ni} \leq c}+\sum_{n_i} \gamma'_{n_i,k} 1_{x_{n_i} \leq c} }[/math]

Based on the theorem, the algorithm is further developed below:

1 Generate an initial clustering of the link travel time distribution for [math]\displaystyle{ X_i }[/math] and classify the marginal data into each such cluster.

2 Expectation step:

2.1 update [math]\displaystyle{ r^m_{nk} }[/math] for synchronized data;

2.2 update [math]\displaystyle{ r^{'m}_{n_i,k} }[/math] for unsynchronized data using the following Bayes formula: [math]\displaystyle{ r^{'m}_{n_i,k}=\frac{\pi_k f^{m-1}_{ki}(x_{n_i})}{\sum_{j=1}^K \pi_j f^{m-1}_{ji}(x_{n_i})} }[/math]

3 Maximum step:

3.1 Convert the marginal observations into probability values using the marginal cumulative distribution functions.

3.2 Conduct maximum likelihood estimation for the Gaussian copula mixture model given [math]\displaystyle{ r_{n,k} }[/math].

3.3 Update the marginal link travel time cumulative distribution functions [math]\displaystyle{ F_n }[/math] according to the posterior classification probabilities [math]\displaystyle{ r^m_{n,k} }[/math] and [math]\displaystyle{ r^{'m}_{n_i,k} }[/math]: [math]\displaystyle{ F^m_{ki}(y)=\frac{\sum_n r^m_{nk} 1_{x_{ni}\leq y}+\sum_{n_i} r^{'m}_{n_i,k} 1_{x_{n_i}\leq y}}{\sum_n r^m_{nk}+\sum_{n_i} r^{'m}_{n_i,k}} }[/math] for the [math]\displaystyle{ k }[/math]-th copula and [math]\displaystyle{ i }[/math]-th dimension (kernel smoothing can be applied).

4 Check convergence; if the algorithm has not converged, go to Step 2.
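
The sketch below shows, under the same assumptions as the base-case sketch, the two pieces that this extended algorithm adds: the Bayes update of the posterior weights for the unsynchronized observations (Step 2.2) and the pooled marginal cdf update (Step 3.3). The function names and the representation of the marginal densities as callables are assumptions.

<syntaxhighlight lang="python">
# Minimal sketch: the extra updates needed for unsynchronized data.
import numpy as np

def unsynchronized_responsibilities(x_extra_i, pi, f_ki):
    """Step 2.2: posterior weights for the extra observations on dimension i.
    f_ki is a list of K callables giving the current marginal pdf of
    dimension i under each component (e.g. kernel estimates)."""
    dens = np.stack([pi[k] * f_ki[k](x_extra_i) for k in range(len(pi))])
    return dens / dens.sum(axis=0)               # shape (K, n_i)

def pooled_marginal_cdf(x_i, r_k, x_extra_i, r_extra_k, c_grid):
    """Step 3.3: marginal cdf for component k, dimension i, pooling the
    synchronized observations (weights r_k) and the unsynchronized ones
    (weights r_extra_k)."""
    x_all = np.concatenate([np.asarray(x_i), np.asarray(x_extra_i)])
    w_all = np.concatenate([np.asarray(r_k), np.asarray(r_extra_k)])
    w_all = w_all / w_all.sum()
    indic = x_all[None, :] <= np.asarray(c_grid)[:, None]
    return indic @ w_all                          # evaluated on c_grid; smoothing optional
</syntaxhighlight>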

In comparison with the Base case algorithm, the differences are highlighted below:

1 In the Expectation step: 1.1 update [math]\displaystyle{ r_{nk}^m }[/math] for synchronized data; 1.2 update [math]\displaystyle{ r^{'m}_{n_i,k} }[/math] for unsynchronized data using the Bayes formula [math]\displaystyle{ r^{'m}_{n_i,k}=\frac{\pi_k^m f_{ki}^m(x_{n_i})}{\sum_{j=1}^K \pi_j^m f_{ji}^m(x_{n_i})} }[/math]

2 In the Maximization step: in each iteration, update the marginal cdfs [math]\displaystyle{ F_n }[/math] according to both [math]\displaystyle{ r_{nk} }[/math] and [math]\displaystyle{ r'_{n_i,k} }[/math]; for the [math]\displaystyle{ k }[/math]-th copula and [math]\displaystyle{ i }[/math]-th dimension: [math]\displaystyle{ F^m_{ki}(y)=\frac{\sum_n r^m_{nk} 1_{x_{ni}\leq y}+\sum_{n_i} r^{'m}_{n_i,k} 1_{x_{n_i}\leq y}}{\sum_n r^m_{nk}+\sum_{n_i} r^{'m}_{n_i,k}} }[/math]

Overall, the underlying question is whether the synchronized data alone adequately represent the joint distribution and whether the unsynchronized data can add to our understanding of it. Bringing unsynchronized data into the Expectation-Maximization algorithm enlarges the information set of the probability space ([math]\displaystyle{ \Omega,\mathcal{F},P }[/math]), so that a deeper elaboration of the data is possible (Cinlar 2011 [2]). This is a significant improvement over the GMM, beyond the added flexibility of the marginal distributions.

References

  1. Gaussian copula model slides, http://dataspace.princeton.edu/jspui/handle/88435/dsp01vd66w212h
  2. Full dissertation on GCMM, http://dataspace.princeton.edu/jspui/handle/88435/dsp01vd66w212h
  3. Xu, L. and M. I. Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures." Neural Computation, Vol. 8, No. 1, 1996, pp. 129–151.
  4. Sklar, M., "Fonctions de répartition à n dimensions et leurs marges." Publications de l'Institut de Statistique de l'Université de Paris, Vol. 8, 1959, pp. 229–231.
  5. Nelsen, R., An Introduction to Copulas. Springer Science+Business Media, 2006.
  6. White, H., Estimation, Inference and Specification Analysis. Cambridge University Press, 1994.
  7. Hofert, M., M. Mächler, and A. J. McNeil, "Estimators for Archimedean copulas in high dimensions," 2012.
  8. Oh, D. H. and A. J. Patton, "Modelling Dependence in High Dimensions with Factor Copulas," 2011.
  9. Bowman, A., P. Hall, and T. Prvan, "Bandwidth selection for the smoothing of distribution functions." Biometrika, Vol. 85, No. 4, 1998, p. 799.
  10. Paalanen, P., J.-K. Kamarainen, J. Ilonen, and H. Kälviäinen, "Feature representation and discrimination based on Gaussian mixture model probability densities – practices and algorithms." Pattern Recognition, Vol. 39, No. 7, 2006, pp. 1346–1358.
  11. Yang, M.-H. and N. Ahuja, "Gaussian mixture model for human skin color and its applications in image and video databases." Proc. SPIE 3656, Storage and Retrieval for Image and Video Databases VII, 1998. doi:10.1117/12.333865.