Biology:Denoising Algorithm based on Relevance network Topology


Denoising Algorithm based on Relevance network Topology (DART) is an unsupervised algorithm that estimates an activity score for a pathway in a gene expression matrix, following a denoising step.[1] In DART, a weighted average is used where the weights reflect the degree of the nodes in the pruned network.[1] The denoising step removes prior information that is inconsistent with a data set. This strategy substantially improves unsupervised predictions of pathway activity that are based on a prior model, which was learned from a different biological system or context.[1]

Pre-existing methods such as gene set enrichment analysis (GSEA) attempt to infer pathway activity.[2] However, GSEA does not construct a structured list of genes. SPIA (Signaling Pathway Impact Analysis)[3] is a method that uses phenotype information to evaluate pathway activity between two phenotypes. However, it does not identify the pathway gene subset that could be used to differentiate individual samples.[3] CORG is used to identify a relevant gene subset. It is a supervised method, and it does not perform as well as DART when analyzing independent data sets.[1]

Understanding molecular pathway activity is crucial for risk assessment, clinical diagnosis and treatment. Meta-analysis of complex genomic data often runs into difficulties such as extracting useful information from large data sets, eliminating confounding factors and providing sensible interpretations. Different approaches have been taken to identify relevant pathways in order to improve predictions based on gene expression.

Method

Strategy

  1. Build a network of all genes that are involved in the pathway
  2. Evaluate the consistency of the prior regulatory information
  3. Remove inconsistent prior information (the denoising step)
  4. Estimate pathway activity
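The four steps can be sketched in a few lines of Python, using the Fisher-transformed correlation test and the degree-weighted average that the rest of this section defines. This is a minimal illustration under stated assumptions, not the published implementation: the function names (`pearson`, `fisher_p_value`, `dart_activity`) are hypothetical, and the consistency rule used here (an edge is kept only if the sign of its correlation agrees with the product of the two prior directions) is an assumption of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_p_value(c, n_samples):
    """Two-sided p-value for a correlation c via Fisher's transform."""
    c = max(min(c, 0.999999), -0.999999)       # keep the logarithm finite
    gamma = 0.5 * math.log((1 + c) / (1 - c))  # Fisher's transform
    z = gamma * math.sqrt(n_samples - 3)       # null sd is 1/sqrt(n_s - 3)
    return math.erfc(abs(z) / math.sqrt(2))    # normal two-sided tail

def zscores(x):
    """Standardize one gene's expression profile across samples."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / (n - 1))
    return [(a - m) / sd for a in x]

def dart_activity(expr, sigma, p_threshold=1e-4):
    """expr: gene -> expression per sample; sigma: gene -> +1 or -1 (prior).

    Returns (activity score per sample, degree of each gene in the
    pruned network).
    """
    genes = list(expr)
    n_samples = len(expr[genes[0]])
    degree = {g: 0 for g in genes}
    # Steps 1-3: build the relevance network and drop inconsistent edges.
    for a in range(len(genes)):
        for b in range(a + 1, len(genes)):
            gi, gj = genes[a], genes[b]
            c = pearson(expr[gi], expr[gj])
            significant = fisher_p_value(c, n_samples) < p_threshold
            # Assumed rule: correlation sign must match the prior directions.
            consistent = c * sigma[gi] * sigma[gj] > 0
            if significant and consistent:
                degree[gi] += 1
                degree[gj] += 1
    # Step 4: degree-weighted average of signed z-scores.
    norm = math.sqrt(sum(k * k for k in degree.values()))
    if norm == 0:
        return [0.0] * n_samples, degree
    z = {g: zscores(expr[g]) for g in genes}
    scores = [sum(sigma[g] * degree[g] * z[g][s] for g in genes) / norm
              for s in range(n_samples)]
    return scores, degree
```

Genes whose prior direction disagrees with the data, or that do not correlate significantly with any other pathway gene, end up with degree zero and therefore contribute nothing to the score.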

Pearson correlations are first computed between the pathway's regulatory genes, at the level of transcription, across a gene expression data set. Each correlation coefficient then undergoes Fisher's transform:

[math]\displaystyle{ \gamma_{ij}=\frac{1}{2} \log{\frac{1+c_{ij}}{1-c_{ij}}} }[/math]

Where cij is the Pearson correlation coefficient between genes i and j. Under the null hypothesis of no correlation, γij is approximately normally distributed with mean zero and standard deviation 1/√(ns − 3), where ns is the number of tumor samples. The p-value threshold was set at 0.0001, and gene pairs with a significant correlation are considered relevant in the network. The activity score, which also takes the neighbors of each gene into account, is then predicted as:

[math]\displaystyle{ \vec S_{W\;AV}=\frac{1}{\sqrt{\sum_{i\in N}k_i^2}}\sum_{i\in N}\sigma_ik_i\vec Z_i }[/math]

Where ki is the number of neighbors of gene i in the network, Zi is the normalized expression profile (z-score) of gene i, and σi is a binary variable encoding the prior direction (1 means upregulated upon activation and -1 means downregulated). SW AV is the resulting activity score, which estimates the activation level of the pathway. A linear regression model was then applied to the estimated activation levels of each pair of pathways i and j. Here tij and pij denote the t-statistic and p-value associated with that regression, and p < 0.05 indicates significance. To assess consistency across the validation data sets D, the performance measure Vij is defined as:

[math]\displaystyle{ V_{ij}=\sum_{d\in D}\sigma_{ij}^{(d)}\left|t_{ij}^{(d)}\right|S\left(p_{ij}^{(d)}\right) }[/math]

Where S is defined by

[math]\displaystyle{ S(p_{ij})= \begin{cases} 1 & p_{ij}\le 0.05 \\ 0 & p_{ij} > 0.05 \end{cases} }[/math]

S is the threshold function for a given pair of pathways, and σij(d) is defined by

[math]\displaystyle{ \sigma_{ij}^{(d)}= \begin{cases} 1 & \mathrm{sign}\left(t_{ij}^{(train)}\right)=\mathrm{sign}\left(t_{ij}^{(d)}\right),\;d\in D \\ -1 & \mathrm{sign}\left(t_{ij}^{(train)}\right)=-\mathrm{sign}\left(t_{ij}^{(d)}\right),\;d\in D \end{cases} }[/math]
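The three formulas above combine into a short computation for a single pathway pair. The sketch below is illustrative only: the function name `performance_measure` and its input layout are hypothetical, and treating a t-statistic of exactly zero as positive is a choice made here, since the definition does not cover that case.

```python
import math

def performance_measure(t_train, validation, alpha=0.05):
    """V_ij for one pathway pair.

    t_train:    t-statistic of the pair in the training data set.
    validation: list of (t_statistic, p_value), one per validation set d.
    """
    v = 0.0
    for t_d, p_d in validation:
        s = 1 if p_d <= alpha else 0  # threshold function S
        same_sign = math.copysign(1, t_train) == math.copysign(1, t_d)
        sigma = 1 if same_sign else -1  # penalize opposite direction
        v += sigma * abs(t_d) * s
    return v

# One agreeing significant set, one disagreeing significant set,
# and one non-significant set (which contributes nothing).
v_ij = performance_measure(3.0, [(2.5, 0.01), (-4.0, 0.001), (1.0, 0.3)])
```

A pair whose correlation is reproduced in the same direction in most validation sets gets a large positive value, while reversed but significant correlations pull the value down.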

σij is the score that captures the directionality of a correlation: a prediction in the opposite direction is penalized by being given a value of -1. tij is the t-statistic of the interpathway correlation. The performance measure Vij therefore accounts for the significance of the correlation between pathways, its direction, and its magnitude. A two-tailed paired Wilcoxon test is performed to compare the distributions under the hypotheses being tested.

Advantages and limitations

DART gives improved performance and higher accuracy in inferring pathway activity from the prior information in pathway databases. However, pre-existing information and large databases are needed in order for DART to run. In other words, DART requires well-established prior gene expression data to start with; it can then evaluate the consistency of that information and remove (denoise) anything irrelevant.

Application

DART has been applied successfully in cancer genomics. The algorithm has been shown to be a strong method for estimating pathway activity and perturbation signature activity in breast and lung cancer gene expression data sets.[1] Imaging traits such as those from mammography (the process of using low-energy X-rays to examine human breast tissue) play an important role in tumor diagnosis. Studies have shown that women with increased mammographic density (MMD) have a higher risk of developing breast cancer.[4] The estrogen receptor 1 gene (ESR1) encodes estrogen receptor alpha, which is activated by estrogen. Polymorphisms in ESR1 are associated with breast cancer risk through differences in breast density. DART successfully predicted an inverse correlation between ESR1 signaling and MMD. It can be used on both simulated and real multidimensional cancer genomic data, and it gives more reliable predictions of pathway activation, which would be helpful in association studies.

References

  1. Jiao Y; Lawler K; et al. (19 October 2011). "DART: Denoising Algorithm based on Relevance network Topology improves molecular pathway activity inference". BMC Bioinformatics 12: 403. doi:10.1186/1471-2105-12-403. PMID 22011170.
  2. Subramanian A; Tamayo P; Mukherjee S; Ebert BL; et al. (30 September 2005). "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles". PNAS 102 (43): 15545–50. doi:10.1073/pnas.0506580102. PMID 16199517.
  3. Tarca AL; Draghici S; Khatri P; Hassan SS; et al. (2009). "A novel signaling pathway impact analysis". Bioinformatics 25 (1): 75–82. doi:10.1093/bioinformatics/btn577. PMID 18990722.
  4. Li J; Eriksson L; Humphreys K; Czene K; et al. (2010). "Genetic variation in the estrogen metabolic pathway and mammographic density as an intermediate phenotype of breast cancer". Breast Cancer Res 12 (2): R19. doi:10.1186/bcr2488. PMID 20214802.