Mega2, the Manipulation Environment for Genetic Analysis

Mega2
Original author(s): Charles P. Kollar, Nandita Mukhopadhyay, Lee Almasy, Mark Schroeder, William P. Mulvihill (previous programmers)
Developer(s): Daniel E. Weeks, Robert V. Baron, Justin R. Stickel
Initial release: 16 January 2000
Stable release: 5.0.1 / 13 December 2018
Written in: C++
Operating system: Linux, Mac OS X, Microsoft Windows
Type: Applied statistical genetics, bioinformatics
License: GNU General Public License version 3
Website: watson.hgen.pitt.edu/register/

Mega2 (short for Manipulation Environment for Genetic Analysis) allows applied statistical geneticists to convert their data from several input formats to a large number of output formats suitable for analysis by commonly used software packages.[1][2][3][4] In a typical human genetics study, the analyst often needs several different software programs to analyze the data, and each program usually requires that the data be formatted to its own precise input specifications. Converting data into these multiple formats can be tedious, time-consuming, and error-prone. By providing validated conversion pipelines, Mega2 can accelerate analyses while reducing errors.

Mega2 produces a common intermediate data representation using SQLite3, which enables the data to be accessed by other programs and languages. In particular, the Mega2R R package converts the SQLite3 data into R data frames. Several R functions illustrate how data can be extracted from the data frames for common R analyses, such as those performed by SKAT and pedgene. The key is the ability to efficiently extract genotypes corresponding to chosen subsets of markers, which facilitates gene-based association testing by automating the loop over genes in the genome. Other functions convert the data to VCF format and to GenABEL format. More information is available in the Mega2R package documentation.
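
As an illustration of the gene-based extraction described above, the following minimal Python sketch loops over genes and pulls the genotypes of the markers that fall inside each one. The table and column names (markers, genotypes, genes, start_pos, end_pos) and the demo data are assumptions made for this example; they are not the schema of the actual Mega2 SQLite3 database, and the Mega2R package does this work in R rather than Python.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE markers (name TEXT PRIMARY KEY, chrom TEXT, pos INTEGER);
        CREATE TABLE genotypes (sample TEXT, marker TEXT, genotype TEXT);
        CREATE TABLE genes (name TEXT PRIMARY KEY, chrom TEXT,
                            start_pos INTEGER, end_pos INTEGER);
    """)
    con.executemany("INSERT INTO markers VALUES (?, ?, ?)", [
        ("rs1", "1", 100), ("rs2", "1", 250), ("rs3", "1", 900), ("rs4", "2", 150),
    ])
    con.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", [
        ("GENE_A", "1", 50, 300), ("GENE_B", "2", 100, 200),
    ])
    con.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", [
        ("ind1", "rs1", "A/A"), ("ind1", "rs2", "A/G"),
        ("ind1", "rs3", "G/G"), ("ind1", "rs4", "C/T"),
        ("ind2", "rs1", "A/G"), ("ind2", "rs2", "G/G"),
        ("ind2", "rs3", "A/G"), ("ind2", "rs4", "C/C"),
    ])
    return con


def genotypes_by_gene(con):
    """Return {gene: {marker: {sample: genotype}}} for markers inside each gene."""
    result = {}
    genes = con.execute(
        "SELECT name, chrom, start_pos, end_pos FROM genes ORDER BY name"
    ).fetchall()
    for gene, chrom, start, end in genes:
        # Select only the markers located within this gene's boundaries.
        rows = con.execute(
            """SELECT g.marker, g.sample, g.genotype
               FROM genotypes AS g JOIN markers AS m ON m.name = g.marker
               WHERE m.chrom = ? AND m.pos BETWEEN ? AND ?
               ORDER BY m.pos, g.sample""",
            (chrom, start, end),
        ).fetchall()
        per_marker = {}
        for marker, sample, genotype in rows:
            per_marker.setdefault(marker, {})[sample] = genotype
        result[gene] = per_marker
    return result


if __name__ == "__main__":
    demo = build_demo_db()
    for gene, block in genotypes_by_gene(demo).items():
        print(gene, block)
```

Each per-gene block could then be passed to a gene-based association test; in this sketch the marker rs3 lies outside both genes and is therefore never extracted.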

Mega2 has been used to facilitate genetic analyses of a wide variety of human traits, including hereditary dystonia,[5] Ehlers-Danlos syndrome,[6] multiple sclerosis,[7] and gliomas.[8] A list of articles citing Mega2 is available on PubMed Central.

Mega2, which focuses on data reformatting, should not be confused with MEGA (Molecular Evolutionary Genetics Analysis), a program that focuses on molecular evolution and phylogenetics.

Input file formats

Mega2 accepts input data in a variety of widely used file formats. These contain, at a minimum, data about the phenotypes, the marker genotypes, any family structures, and map positions of the markers.

Input format | Description | Links
LINKAGE | pre-Makeped or post-Makeped formats | Linkage User Guide (PDF), LINKAGE format
Mega2[1][2][3][4] | simplified/augmented LINKAGE-format | Mega2 format
VCF or BCF[9] | Variant Call Format or Binary Variant Call Format | Variant Call Format (Wikipedia entry), BCF documentation
IMPUTE2[10][11] | IMPUTE2 GEN and BGEN formats | IMPUTE2 documentation, GEN format, BGEN format
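
To give a sense of what reading one of these inputs involves, here is a minimal Python sketch that parses the genotype (GT) calls from VCF text into simple in-memory records. The Variant class and the parse_vcf function are hypothetical names introduced for this example; this is not how Mega2 itself reads VCF files, and the sketch assumes every data line carries a FORMAT column and per-sample genotype columns.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    marker_id: str
    ref: str
    alts: list
    genotypes: dict  # sample name -> tuple of allele strings (None if missing)


def parse_vcf(text):
    """Parse minimal VCF text into Variant records, keeping only the GT field."""
    samples = []
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, marker_id, ref, alt = fields[:5]
        alts = alt.split(",")
        alleles = [ref] + alts  # allele index 0 is REF, 1.. are ALT alleles
        gt_index = fields[8].split(":").index("GT")
        genotypes = {}
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            codes = gt.replace("|", "/").split("/")  # ignore phasing here
            genotypes[sample] = tuple(
                None if code == "." else alleles[int(code)] for code in codes
            )
        variants.append(Variant(chrom, int(pos), marker_id, ref, alts, genotypes))
    return variants


if __name__ == "__main__":
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
              "FORMAT", "ind1", "ind2"]
    demo = "\n".join([
        "##fileformat=VCFv4.2",
        "\t".join(header),
        "\t".join(["1", "100", "rs1", "A", "G", ".", "PASS", ".", "GT", "0/1", "1|1"]),
    ])
    for variant in parse_vcf(demo):
        print(variant.marker_id, variant.genotypes)
```

Translating allele indices into allele strings at read time keeps the in-memory records independent of VCF conventions, which is one plausible reason to use an intermediate representation before writing other formats.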

Output file formats

Mega2 supports conversion to the following output formats.

Output format | Links
ASPEX format | ASPEX
Beagle format[12][13] | BEAGLE
CRANEFOOT format[14] | CRANEFOOT
Eigenstrat format[15][16] | EIGENSOFT
FBAT format[17] | FBAT
GeneHunter
GeneHunter-Plus
IQLS/Idcoefs format[18][19] | IQLS, Idcoefs
Linkage format[20][21][22][23] | Linkage User Guide (PDF), LINKAGE format
Loki
MaCH/minimac3 format[24][25] | MaCH, minimac3
MLB-QTL
Mega2 annotated format[1][2][3][4] | Mega2 format
Mendel format[26] | Mendel
Merlin format[27] | Merlin
Merlin/SimWalk2-NPL format[27][28] | Merlin, SimWalk2
PANGAEA MORGAN format[29][30] | MORGAN
PAP format[31] | PAP
PLINK format[32] (bed, lgen, or ped formats) | PLINK
PREST format[33][34] | PREST
PSEQ format | PSEQ
Pre-makeped LINKAGE format[20][21][22][23] | Linkage User Guide (PDF), LINKAGE format
ROADTRIPS format[35] | ROADTRIPS
SAGE format | SAGE, openSAGE
SHAPEIT format[36][37][38][39][40] | SHAPEIT
SIMULATE format[41] | SIMULATE
FASTSLINK
SOLAR
SPLINK
SUP format[42][43] | SUP
SimWalk2 format[28] | SimWalk2
Structure
VCF format[9] | Variant Call Format (Wikipedia entry)
Vintage Mendel format[26][44] | Vintage Mendel
Vitesse
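
As a hedged illustration of what writing one of these outputs involves, the following Python sketch emits PLINK text-format lines (.map and .ped) from a small in-memory data set. The function names and the dictionary layout are hypothetical and chosen only for this example; Mega2's own C++ implementation is not shown here. The sketch assumes the usual PLINK text conventions: four columns per .map line (chromosome, marker, genetic distance, base-pair position), six leading columns per .ped line (family, individual, father, mother, sex, phenotype), and 0 for a missing allele.

```python
def to_plink_map(markers):
    """Build .map lines: chromosome, marker name, genetic distance, bp position."""
    # Genetic distance is written as 0 because this sketch has no genetic map.
    return [" ".join([chrom, name, "0", str(pos)]) for chrom, name, pos in markers]


def to_plink_ped(people, markers):
    """Build .ped lines: six pedigree columns, then two alleles per marker."""
    lines = []
    for person in people:
        cols = [
            person["fid"], person["iid"], person["father"], person["mother"],
            str(person["sex"]), str(person["pheno"]),
        ]
        for _chrom, name, _pos in markers:
            # Markers without a genotype for this person are written as "0 0".
            a1, a2 = person["genotypes"].get(name, (None, None))
            cols += [a1 or "0", a2 or "0"]
        lines.append(" ".join(cols))
    return lines


if __name__ == "__main__":
    demo_markers = [("1", "rs1", 100), ("1", "rs2", 250)]
    demo_people = [
        {"fid": "fam1", "iid": "ind1", "father": "0", "mother": "0",
         "sex": 1, "pheno": 2, "genotypes": {"rs1": ("A", "G")}},
        {"fid": "fam1", "iid": "ind2", "father": "0", "mother": "0",
         "sex": 2, "pheno": 1, "genotypes": {"rs1": ("G", "G"), "rs2": ("C", "T")}},
    ]
    print("\n".join(to_plink_map(demo_markers)))
    print("\n".join(to_plink_ped(demo_people, demo_markers)))
```

Iterating over the marker list in one fixed order for both files keeps the allele columns of the .ped lines aligned with the rows of the .map file, which is the kind of bookkeeping that is easy to get wrong when converting by hand.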

Documentation

The Mega2 documentation is available in both HTML and PDF formats.

References

  1. Mukhopadhyay, N; Almasy L; Schroeder M; Mulvihill WP; Weeks DE (1999). "Mega2, a data-handling program for facilitating genetic linkage and association analyses". Am J Hum Genet 65: A436.
  2. Mukhopadhyay, N; Almasy L; Schroeder M; Mulvihill WP; Weeks DE (2005). "Mega2: data-handling for facilitating genetic linkage and association analyses". Bioinformatics 21 (10): 2556–2557. doi:10.1093/bioinformatics/bti364. PMID 15746282.
  3. Kollar, CP; Baron RV; Mukhopadhyay N; Weeks DE (October 2013). "Mega2: enhanced data-handling for facilitating genetic linkage and association analyses". Presented at the 63rd Annual Meeting of the American Society of Human Genetics, Boston: Abstract 1831. http://abstracts.ashg.org/cgi-bin/2013/ashg13s.pl?author=kollar&sort=ptimes&sbutton=Detail&absno=130121140&sid=32111.
  4. "Mega2: validated data-reformatting for linkage and association analyses". Source Code Biol Med 9 (1): 26. 2014. doi:10.1186/s13029-014-0026-y. PMID 25687422.
  5. "Mutations in the autoregulatory domain of beta-tubulin 4a cause hereditary dystonia". Ann Neurol 73 (4): 546–553. 2013. doi:10.1002/ana.23832. PMID 23424103. 
  6. "Mutations in FKBP14 cause a variant of Ehlers-Danlos syndrome with progressive kyphoscoliosis, myopathy, and hearing loss". Am J Hum Genet 90 (2): 201–216. 2012. doi:10.1016/j.ajhg.2011.12.004. PMID 22265013. 
  7. "Exome sequencing identifies a novel multiple sclerosis susceptibility variant in the TYK2 gene". Neurology 79 (5): 406–411. 2012. doi:10.1212/wnl.0b013e3182616fc4. PMID 22744673. 
  8. "Genome-wide high-density SNP linkage search for glioma susceptibility loci: results from the Gliogene Consortium". Cancer Res 71 (24): 7568–7575. 2011. doi:10.1158/0008-5472.can-11-0013. PMID 22037877. 
  9. "The variant call format and VCFtools". Bioinformatics 27 (15): 2156–8. 2011. doi:10.1093/bioinformatics/btr330. PMID 21653522.
  10. "A flexible and accurate genotype imputation method for the next generation of genome-wide association studies". PLOS Genet 5 (6): e1000529. 2009. doi:10.1371/journal.pgen.1000529. PMID 19543373. 
  11. "Genotype imputation for genome-wide association studies". Nat Rev Genet 11 (7): 499–511. 2010. doi:10.1038/nrg2796. PMID 20517342. 
  12. "Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering". Am J Hum Genet 81 (5): 1084–1097. 2007. doi:10.1086/521987. PMID 17924348. 
  13. "A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals". Am J Hum Genet 84 (2): 210–223. 2009. doi:10.1016/j.ajhg.2009.01.005. PMID 19200528. 
  14. "High-throughput pedigree drawing". Eur J Hum Genet 13 (8): 987–989. 2005. doi:10.1038/sj.ejhg.5201430. PMID 15870825. 
  15. "Principal components analysis corrects for stratification in genome-wide association studies". Nat Genet 38 (8): 904–909. 2006. doi:10.1038/ng1847. PMID 16862161. 
  16. "Population structure and eigenanalysis". PLOS Genet 2 (12): e190. 2006. doi:10.1371/journal.pgen.0020190. PMID 17194218. 
  17. "Implementing a unified approach to family-based tests of association". Genet Epidemiol 19 (Suppl 1): S36–42. 2000. doi:10.1002/1098-2272(2000)19:1+<::aid-gepi6>3.3.co;2-d. PMID 11055368. 
  18. "An Incomplete-Data Quasi-likelihood Approach to Haplotype-Based Genetic Association Studies on Related Individuals". J Am Stat Assoc 104 (487): 1251–1260. 2009. doi:10.1198/jasa.2009.tm08507. PMID 20428335. 
  19. Abney M (2009). "A graphical algorithm for fast computation of identity coefficients and generalized kinship coefficients". Bioinformatics 25 (12): 1561–1563. doi:10.1093/bioinformatics/btp185. PMID 19359355. 
  20. Reference text not provided in the source (ref name: LINKAGE1984).
  21. Reference text not provided in the source (ref name: LINKAGE1985).
  22. Reference text not provided in the source (ref name: LINKAGE1986).
  23. Reference text not provided in the source (ref name: LINKAGE1988).
  24. "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing". Nat Genet 44 (8): 955–959. 2012. doi:10.1038/ng.2354. PMID 22820512. 
  25. "minimac2: faster genotype imputation". Bioinformatics 31 (5): 782–784. 2015. doi:10.1093/bioinformatics/btu704. PMID 25338720. 
  26. "Mendel: the Swiss army knife of genetic analysis programs". Bioinformatics 29 (12): 1568–1570. 2013. doi:10.1093/bioinformatics/btt187. PMID 23610370.
  27. "Merlin--rapid analysis of dense genetic maps using sparse gene flow trees". Nat Genet 30 (1): 97–101. 2002. doi:10.1038/ng786. PMID 11731797.
  28. "Descent graphs in pedigree analysis: Applications to haplotyping, location scores, and marker-sharing statistics". Am J Hum Genet 58 (6): 1323–1337. 1996. PMID 8651310.
  29. Thompson EA (1994). "Monte Carlo likelihood in the genetic mapping of complex traits". Philos Trans R Soc Lond B Biol Sci 344 (1310): 345–350; discussion 350–341. doi:10.1098/rstb.1994.0073. PMID 7800704. 
  30. Thompson EA (1994). "Monte Carlo likelihood in genetic mapping". Statistical Science 9 (3): 355–366. doi:10.1214/ss/1177010381. 
  31. Hasstedt SJ (2005). "jPAP: Document-driven software for genetic analysis". Genet Epidemiol 29: 255. 
  32. "Statistical tests for detection of misspecified relationships by use of genome-screen data". Am J Hum Genet 66 (3): 1076–1094. 2000. doi:10.1086/302800. PMID 10712219. 
  33. "Enhanced pedigree error detection". Hum Hered 54 (2): 99–110. 2002. doi:10.1159/000067666. PMID 12566741. 
  34. "ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure". Am J Hum Genet 86 (2): 172–184. 2010. doi:10.1016/j.ajhg.2010.01.001. PMID 20137780. 
  35. "A linear complexity phasing method for thousands of genomes". Nat Methods 9 (2): 179–81. 2012. doi:10.1038/nmeth.1785. PMID 22138821.
  36. "Improved whole-chromosome phasing for disease and population genetic studies". Nat Methods 10 (1): 5–6. 2013. doi:10.1038/nmeth.2307. PMID 23269371.
  37. "Haplotype estimation using sequencing reads". Am J Hum Genet 93 (4): 687–96. 2013. doi:10.1016/j.ajhg.2013.09.002. PMID 24094745.
  38. "A general approach for haplotype phasing across the full spectrum of relatedness". PLOS Genet 10 (4): e1004234. 2014. doi:10.1371/journal.pgen.1004234. PMID 24743097.
  39. "Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel". Nat Commun 5: 3934. 2014. doi:10.1038/ncomms4934. PMID 25653097. Bibcode: 2014NatCo...5.3934..
  40. "A chromosome-based method for rapid computer simulation". Am J Hum Genet 51: A202. 1992. 
  41. Lemire M (2006). "SUP: an extension to SLINK to allow a larger number of marker loci to be simulated in pedigrees conditional on trait values". BMC Genet 7: 40. doi:10.1186/1471-2156-7-40. PMID 16803631. 
  42. "Programs for pedigree analysis: MENDEL, FISHER, and dGENE". Genet Epidemiol 5 (6): 471–472. 1988. doi:10.1002/gepi.1370050611. PMID 3061869. https://deepblue.lib.umich.edu/bitstream/2027.42/101847/1/1370050611_ftp.pdf. 
