Biology:Plant genome assembly


A plant genome assembly represents the complete genomic sequence of a plant species, reconstructed into chromosomes and organellar genomes from DNA (deoxyribonucleic acid) fragments obtained with different sequencing technologies.

Structure

Plant genomes vary widely in structure and complexity, from small genomes such as that of green algae (15 Mbp)[1] to very large and complex genomes that typically have much higher ploidy, higher rates of heterozygosity and more repetitive elements than species from other kingdoms.[2] One of the most complex plant genome assemblies available is that of loblolly pine (22 Gbp).[3] Because of this complexity, plant genome sequences cannot be assembled back into chromosomes using only the short reads produced by next-generation sequencing (NGS) technologies,[4][5] and therefore most available plant genome assemblies that used NGS alone are highly fragmented, contain large numbers of contigs, and leave genome regions unfinished. Highly repetitive sequences, often longer than 10 kbp, are the main challenge in plants.[6][7] Most of the chromosomal sequences are produced by the activity of mobile genetic elements (MGEs) in the plant genomes.[8] MGEs are divided into two classes: class I, or retrotransposons, and class II, or DNA transposons. In plants, long terminal repeat (LTR) retrotransposons are predominant and constitute from 15%[9] to 90% of the genome.[10] Polyploidy is another challenge in assembling a plant genome; an estimated ~80% of plants are polyploids.[11]

Assemblies

The first complete plant genome assembly, that of Arabidopsis thaliana, was finished in 2000,[12] making it the third multicellular eukaryotic genome published, after C. elegans[13] and D. melanogaster.[14] Unlike many other plants (e.g. Malus), Arabidopsis has convenient traits such as a small nuclear genome (135 Mbp) and a short generation time (8 weeks from seed to seed). The genome has five chromosomes and is approximately 4% of the size of the human genome. It was sequenced and annotated by the Arabidopsis Genome Initiative (AGI).

The initiative to sequence the genome of rice (Oryza sativa)[15] began in September 1997, when scientists from many nations agreed to an international collaboration, forming the International Rice Genome Sequencing Project (IRGSP). With an estimated size of 400-430 Mb, approximately four times larger than that of A. thaliana, rice has the smallest genome of the major cereal crops.[15]

Between 2000 and 2008 a total of 10 plant genomes were published, while in 2012 alone 13 plant genomes were published. The number has increased steadily since then, and more than 400 plant genomes are now available in the NCBI genome database, of which 72 were re-annotated [NCBI].

Databases

EnsemblPlants[16] is part of the Ensembl Genomes database and contains resources for a limited number of sequenced plant species (45 as of October 2017). It mainly provides genome sequences, gene models, functional annotations and polymorphic loci. For some of the plant species, additional information is provided, including population structure, individual genotypes, linkage, and phenotype data.

Gramene[17] is an online web database resource for plant comparative genomics and pathway analysis based on Ensembl technology.

Plant Genome DataBase Japan[18] (PGDBj) is a website that contains information related to genomes of model and crop plants from databases. It has three main components: ortholog db, DNA marker and linkage map db, and plant resource db, where multiple plant resources accumulated by different institutes are integrated. The aim is “to provide a platform, enabling comparative searches of different resources” (pgdbj.jp).

PlantsDB[19] is a resource for analysing and storing genetic and genomic information from various plants, and offers tools to query these data and to perform comparative analysis with the help of in-house tools.

PLAZA[20][21] is another online resource for comparative genomics that integrates plant sequence data and comparative genomic methods, and performs evolutionary analysis within the green plant lineage (Viridiplantae).

The Arabidopsis Information Resource (TAIR)[22] maintains a web database of the “model higher plant Arabidopsis thaliana”.

Assembly strategies

In general, different strategies have been used to sequence and assemble large and complex genomes such as those of plants, depending on the technologies available when each project started.

Sanger clone-by-clone

Clone-by-clone sequencing strategies are based on the construction of a map for each chromosome before the sequencing, and rely on libraries made from large-insert clones. The most common type of large-insert clone is the bacterial artificial chromosome (BAC).

In a BAC-based approach, the genome is first split into smaller pieces whose locations are recorded. These pieces of DNA are inserted into BAC clones, which are multiplied by introducing them into fast-growing bacterial cells. The cloned pieces are then fragmented further into overlapping smaller pieces that are placed into a vector and sequenced. The small pieces are assembled into contigs by overlapping them and, finally, the contigs are assembled back into chromosomes using the map from the first step.
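The overlap step described above can be illustrated with a toy sketch. It is not the software used by any of the projects discussed here: it assumes error-free fragments from a single strand, and the function names `overlap` and `greedy_assemble` are hypothetical. At each round it merges the pair of sequences with the longest suffix-prefix overlap.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Real assemblers additionally have to handle sequencing errors, both DNA strands and repeats, which is why a greedy merge like this one can misassemble repetitive regions.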

The first complete plant genome assembly (also the first plant genome published) that used this type of technique was that of Arabidopsis thaliana, in 2000.[12] Different large-insert libraries, such as BACs, P1 artificial chromosomes (PACs), yeast artificial chromosomes (YACs) and transformation-competent artificial chromosomes (TACs), were combined to assemble the genome. Physical maps were constructed from restriction fragment fingerprints of the clones, by comparing the patterns, and by hybridization or polymerase chain reaction (PCR). The physical maps were integrated with genetic maps to identify contig positions and orientations. End sequences from 47,788 BAC clones were used to extend contigs from anchored BACs and to select a minimum tiling path. A total of 1,569 clones in the minimum tiling path were selected and sequenced. Direct PCR products were used to clone the remaining gaps, and YACs allowed the characterization of telomere sequences. The sequenced regions covered 115.4 Mb of the predicted 125 Mb genome and contained a total of 25,498 protein-coding genes.
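The idea of selecting a minimum tiling path can be sketched as a classic interval-cover problem. The following is a minimal illustration, not the procedure used by the AGI: it assumes each clone is already placed on the physical map as a hypothetical `(name, start, end)` tuple, and it greedily picks, among the clones that start inside the region covered so far, the one that extends furthest.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Return the names of a smallest set of clones covering the region.

    `clones` is a list of (name, start, end) tuples with map coordinates.
    Raises ValueError if the clones leave a gap in the region.
    """
    path = []
    covered_to = region_start
    remaining = sorted(clones, key=lambda c: c[1])  # sort by start position
    i = 0
    while covered_to < region_end:
        best = None
        # Consider every clone that starts within the covered part.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    # Hypothetical clone coordinates (in kb) on one chromosome arm.
    clones = [
        ("A", 0, 100),
        ("B", 50, 120),
        ("C", 90, 200),
        ("D", 150, 260),
        ("E", 190, 300),
    ]
    print(minimum_tiling_path(clones, 0, 300))
```

With the example coordinates, clones B and D are redundant, so only A, C and E would need to be sequenced.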

To sequence and assemble the genome of Oryza sativa (japonica),[15] the same strategy was used. For Oryza sativa a total of 3,401 mapped clones in a minimum tiling path were selected from the physical map and assembled.

Maize (Zea mays), one of the most important crops in the world, is the last plant genome project based primarily on the Sanger BAC-by-BAC strategy.[23] The maize genome, at 2.3 Gb across 10 chromosomes,[23] is significantly larger than those of rice and Arabidopsis.[23] To assemble it, a set of 16,848 minimally overlapping BAC clones derived from a combination of physical and genetic maps was selected and sequenced. The assembly was supported by additional external data: cDNA sequences, sequences from methyl-filtered DNA libraries (which exploit the fact that bases in genic sequences tend to be less heavily methylated than those in non-genic regions), and sequences obtained with high-C0t techniques.

The Sanger clone-by-clone strategy has the advantage of working in small units, which reduces complexity and computational requirements and minimizes problems associated with the misassembly of highly repetitive DNA; it is therefore an attractive solution for assembling plant genomes and other complex eukaryotic genomes. The main disadvantages of this method are the cost and the resources required. The cost of the first plant genome assemblies was estimated at between 70 million dollars[24] and 200 million dollars per assembly.[25]

Sanger whole-genome shotgun (WGS)

In the WGS strategy, the sequenced fragments have no predetermined order. The DNA is randomly sheared, and the cloned fragments are sequenced and assembled using computational methods. This approach reduces the cost and time associated with constructing maps and relies instead on computational resources.
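Because the fragments are sampled at random, how much of a genome a WGS project covers depends on sequencing depth. The sketch below uses the standard idealized Poisson (Lander-Waterman) model, which is not part of the text above and is introduced here as an assumption: with fold coverage c, a given base is expected to be missed with probability e^(-c). The function names and the example numbers are hypothetical.

```python
import math


def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_covered_fraction(coverage: float) -> float:
    """Expected fraction of the genome hit by at least one read,
    assuming reads land uniformly at random (idealized model)."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: one million 700 bp reads on a 100 Mb genome.
    c = fold_coverage(1_000_000, 700, 100_000_000)
    print(f"coverage: {c:.1f}x")
    print(f"expected fraction covered: {expected_covered_fraction(c):.4f}")
```

The model ignores repeats and cloning bias, both of which matter in plant genomes, so real projects usually need more depth than it suggests.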

A considerable number of important plant genomes, such as those of grapevine (Vitis vinifera),[26] papaya (Carica papaya)[27] and cottonwood (Populus trichocarpa),[28] were sequenced and assembled with the Sanger WGS strategy.

The draft genome of grapevine[26] was the fourth genome published for a flowering plant and the first for a fruit crop. The sequences were obtained from different types of libraries, such as plasmids, fosmids and BACs. All the data were generated by paired-end sequencing of cloned inserts using Sanger technology on ABI 3730xl sequencers. The reads were assembled with Arachne (2002),[29] software designed to analyze reads obtained from both ends of plasmid clones. In total, 6.2 million paired-end tag reads were produced. The software produced 20,784 contigs that were combined into 3,830 supercontigs, having an N50 value of 64 kb. The supercontigs had a total size of 498 Mb.
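The N50 value quoted above is a standard contiguity statistic: it is the length L such that sequences of length L or longer together contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name `n50` and the example lengths are illustrative and are not taken from the grapevine assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


if __name__ == "__main__":
    # Toy assembly of six contigs (lengths in kb), 290 kb in total.
    print(n50([80, 70, 50, 40, 30, 20]))
```

In the toy example the two longest contigs (80 and 70 kb) already hold more than half of the 290 kb, so the N50 is 70 kb.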

The supercontigs were anchored along the genome first by joining them using paired BAC end sequences. The resulting ultracontigs and the remaining supercontigs were then aligned along the genetic map of the genome. Later improvements of this strategy enabled the sequencing of Brachypodium distachyon,[30] Sorghum bicolor[31] and soybean.[32]

Next-generation sequencing

Because it is relatively cheap compared with earlier methods, next-generation sequencing (NGS) technology has supplied the data for most recently sequenced and assembled plant genomes. In general, NGS data are used in combination with Sanger sequencing or with long reads from third-generation sequencing.

The genome of the cucumber (Cucumis sativus)[33] was one of the plant genomes assembled from Illumina NGS reads in combination with Sanger sequences. High-quality bases amounting to 72.2-fold genome coverage were generated, of which Sanger reads provided 3.9-fold coverage and Illumina GA reads 68.3-fold coverage. Two assemblies were produced, one per sequencing technology, and the resulting contigs were compared with each other, giving a total assembled genome length of 243.5 Mb. This is about 30% smaller than the genome size estimated by flow cytometry of isolated nuclei stained with propidium iodide (367 Mb). A genetic map was constructed to anchor the assembled genome, and 72.8% of the assembled sequences were successfully anchored onto the seven chromosomes.

Another genome that combined NGS with Sanger sequencing was that of Theobroma cacao (2010),[34] an economically important tropical fruit tree crop and the primary source of cocoa. The genome was sequenced by the International Cocoa Genome Sequencing consortium (ICGS), which produced a total of 17.6 million 454 single-end reads, 8.8 million 454 paired-end reads, 398.0 million Illumina paired-end reads and about 88,000 Sanger BAC reads. First, the Roche/454 and Sanger raw data were assembled with the genome assembly software Newbler into 25,912 contigs and 4,792 scaffolds. This assembly had a total length of 326.9 Mb, which represents 76% of the estimated genome size. The Illumina reads were then used to complement the 454 assembly by aligning the short reads to the cocoa genome assembly with the SOAP software.

A similar strategy combining NGS reads and Sanger sequencing was used for other important plant species, such as the first published apple genome (Malus domestica),[35] cotton (Gossypium raimondii),[36] the draft genome of sweet orange (Citrus sinensis)[37] and the domesticated tomato (Solanum lycopersicum) genome.[38]
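The coverage and size figures quoted for cucumber and cacao can be cross-checked with simple arithmetic. The short sketch below only recomputes numbers from the values stated in this section; the helper names are hypothetical.

```python
def assembly_shortfall(assembled_mb: float, estimated_mb: float) -> float:
    """Fraction of the estimated genome size missing from the assembly."""
    return 1.0 - assembled_mb / estimated_mb


def implied_genome_size(assembled_mb: float, fraction_assembled: float) -> float:
    """Genome size implied by an assembly length and the fraction it represents."""
    return assembled_mb / fraction_assembled


if __name__ == "__main__":
    # Cucumber: Sanger plus Illumina coverage should add up to the total.
    total_coverage = 3.9 + 68.3
    print(f"cucumber total coverage: {total_coverage:.1f}x")

    # Cucumber: 243.5 Mb assembled versus 367 Mb estimated by flow cytometry.
    gap = assembly_shortfall(243.5, 367.0)
    print(f"cucumber assembly is {gap:.1%} smaller than the estimate")

    # Cacao: 326.9 Mb is stated to be 76% of the estimated genome size.
    size = implied_genome_size(326.9, 0.76)
    print(f"cacao estimated genome size: {size:.1f} Mb")
```

The two coverage values sum to the stated 72.2-fold total, the cucumber shortfall works out to roughly 34% (consistent with the rounded "about 30%"), and the cacao figures imply an estimated genome size of about 430 Mb.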

Third-generation

With the emergence of third-generation sequencing (TGS), some of the limitations of previous methods of sequencing and assembling plant genomes have started to be addressed. This technology is characterized by the parallel sequencing of single molecules of DNA, which yields reads up to 54 kbp in length (PacBio RS II).[39] In general, long reads from TGS have relatively high error rates (~10% on average),[40] so repeated sequencing of the same DNA fragments is required. The technology is still quite expensive and is therefore generally used in combination with short reads from NGS.

One of the first plant genomes assembled from TGS long reads (Pacific Biosciences) in combination with NGS short reads was that of spinach,[41] with an estimated genome size of 989 Mb. For this project, 60× coverage of the genome was generated, with 20% of the reads longer than 20 kb. The data were assembled using PacBio’s hierarchical genome assembly process (HGAP),[42] and the long-read assembly showed a 63-fold improvement in contig size over an Illumina-only assembly.

Another recently published plant genome that used long reads in combination with short reads is the improved assembly of the apple genome.[43] This project used a hybrid approach that combined data from several sequencing technologies: PacBio RS II reads, Illumina paired-end (PE) reads and Illumina mate-pair (MP) reads. As a first step, the Illumina paired-end reads were assembled with the well-known de novo assembly software SOAPdenovo.[44] The contigs from this step were then combined with the PacBio long reads using the hybrid assembly pipeline DBG2OLC.[45] The assembly was polished by mapping the Illumina paired-end reads to the contigs with BWA-MEM,[46] and the corrected contigs were scaffolded by mapping the mate-pair reads onto them. Next, BioNano (https://bionanogenomics.com/) optical maps with a total length of 649.7 Mb were used in the hybrid assembly pipeline together with the scaffolds from the previous step. The resulting scaffolds were anchored to a genetic map constructed from 15,417 single-nucleotide polymorphism (SNP) markers. RNA sequencing (RNA-seq) data were used to better characterize the number and diversity of the genes identified. The resulting genome is 643.2 Mb in size, closer to the estimated genome size than the previously published assembly,[35] and has a smaller number of protein-coding genes.
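Why repeated sequencing of the same fragment compensates for a ~10% per-read error rate can be illustrated with a simplified model. The sketch assumes, as a simplification not stated in the sources, that errors in different passes are independent and that the consensus base is wrong only when more than half of the passes are wrong at that position; the function name `consensus_error` is hypothetical.

```python
from math import comb


def consensus_error(per_read_error: float, passes: int) -> float:
    """Probability that a strict majority of `passes` independent reads
    are wrong at a given position (binomial tail)."""
    need = passes // 2 + 1  # smallest number of wrong passes forming a majority
    return sum(
        comb(passes, k)
        * per_read_error ** k
        * (1.0 - per_read_error) ** (passes - k)
        for k in range(need, passes + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 9, 15):
        print(f"{n:2d} passes: consensus error ~ {consensus_error(0.10, n):.2e}")
```

Under these assumptions a 10% per-read error rate falls to 2.8% with three passes and to below 1% with five. Real error profiles (for example, systematic or insertion/deletion errors) are not independent in this way, so the numbers are indicative only.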

The use of long reads in plant genome assemblies has become more popular because it reduces the number of scaffolds and increases the quality of the genome by improving assembly and coverage in regions that are not clearly resolved by NGS assemblies.

References

  1. "Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage". Genome Biology 13 (8): R74. August 2012. doi:10.1186/gb-2012-13-8-r74. PMID 22925495. 
  2. "The C-value enigma in plants and animals: a review of parallels and an appeal for partnership". Annals of Botany 95 (1): 133–146. January 2005. doi:10.1093/aob/mci009. PMID 15596463. 
  3. "Sequencing and assembly of the 22-gb loblolly pine genome". Genetics 196 (3): 875–890. March 2014. doi:10.1534/genetics.113.159715. PMID 24653210. 
  4. "Utilization of next-generation sequencing platforms in plant genomics and genetic variant discovery". Molecular Breeding 25 (4): 553–570. 2010-04-01. doi:10.1007/s11032-009-9357-9. 
  5. "Next-generation DNA sequencing". Nature Biotechnology 26 (10): 1135–1145. October 2008. doi:10.1038/nbt1486. PMID 18846087. 
  6. "Repetitive DNA and next-generation sequencing: computational challenges and solutions". Nature Reviews. Genetics 13 (1): 36–46. November 2011. doi:10.1038/nrg3117. PMID 22124482. 
  7. "Centromeric repetitive DNA sequences in the genus Brassica". Theoretical and Applied Genetics 90 (2): 157–165. February 1995. doi:10.1007/BF00222197. PMID 24173886. 
  8. "Sequencing the extrachromosomal circular mobilome reveals retrotransposon activity in plants". PLOS Genetics 13 (2): e1006630. February 2017. doi:10.1371/journal.pgen.1006630. PMID 28212378. 
  9. "Progress, challenges and the future of crop genomes". Current Opinion in Plant Biology 24: 71–81. April 2015. doi:10.1016/j.pbi.2015.02.002. PMID 25703261. 
  10. "Molecular organization of genes and repeats in the large cereal genomes and implications for the isolation of genes by chromosome walking". Chromosomes Today. Dordrecht: Springer. 1993. pp. 199–213. doi:10.1007/978-94-011-1510-0_16. ISBN 9789401046602. 
  11. "On the abundance of polyploids in flowering plants". Evolution; International Journal of Organic Evolution 60 (6): 1198–1206. June 2006. doi:10.1554/05-629.1. PMID 16892970. 
  12. The Arabidopsis Genome Initiative (December 2000). "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana". Nature 408 (6814): 796–815. doi:10.1038/35048692. PMID 11130711. Bibcode: 2000Natur.408..796T.
  13. "Genome sequence of the nematode C. elegans: a platform for investigating biology". Science 282 (5396): 2012–2018. December 1998. doi:10.1126/science.282.5396.2012. PMID 9851916. Bibcode: 1998Sci...282.2012..
  14. "The genome sequence of Drosophila melanogaster". Science 287 (5461): 2185–2195. March 2000. doi:10.1126/science.287.5461.2185. PMID 10731132. Bibcode: 2000Sci...287.2185..
  15. "A draft sequence of the rice genome (Oryza sativa L. ssp. japonica)". Science 296 (5565): 92–100. April 2002. doi:10.1126/science.1068275. PMID 11935018. Bibcode: 2002Sci...296...92G.
  16. "Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data". Plant Bioinformatics. Methods in Molecular Biology. 1374. Humana Press, New York, NY. 2016. pp. 115–140. doi:10.1007/978-1-4939-3167-5_6. ISBN 9781493931668. 
  17. "Gramene Database: Navigating Plant Comparative Genomics Resources". Current Plant Biology 7-8: 10–15. November 2016. doi:10.1016/j.cpb.2016.12.005. PMID 28713666. 
  18. "Plant Genome DataBase Japan (PGDBJ)". Plant Genomics Databases. Methods in Molecular Biology. 1533. New York, NY: Humana Press. 2017. pp. 45–77. doi:10.1007/978-1-4939-6658-5_3. ISBN 9781493966561. 
  19. "PGSB/MIPS PlantsDB Database Framework for the Integration and Analysis of Plant Genome Data". Plant Genomics Databases. Methods in Molecular Biology. 1533. New York, NY: Humana Press. 2017. pp. 33–44. doi:10.1007/978-1-4939-6658-5_2. ISBN 9781493966561. 
  20. "A Guide to the PLAZA 3.0 Plant Comparative Genomic Database". Plant Genomics Databases. Methods in Molecular Biology. 1533. Humana Press, New York, NY. 2017. pp. 183–200. doi:10.1007/978-1-4939-6658-5_10. ISBN 9781493966561. 
  21. "PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants". Nucleic Acids Research 50 (D1): D1468–D1474. January 2022. doi:10.1093/nar/gkab1024. PMID 34747486. 
  22. "Sustainable funding for biocuration: The Arabidopsis Information Resource (TAIR) as a case study of a subscription-based funding model". Database 2016: baw018. 2016-01-01. doi:10.1093/database/baw018. PMID 26989150. 
  23. "The B73 maize genome: complexity, diversity, and dynamics". Science 326 (5956): 1112–1115. November 2009. doi:10.1126/science.1178534. PMID 19965430. Bibcode: 2009Sci...326.1112S.
  24. "Crop genome sequencing: lessons and rationales". Trends in Plant Science 16 (2): 77–88. February 2011. doi:10.1016/j.tplants.2010.10.005. PMID 21081278. 
  25. "US firm's bid to sequence rice genome causes stir in Japan". Nature 398 (6728): 545. April 1999. doi:10.1038/19123. PMID 10217128. 
  26. "The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla". Nature 449 (7161): 463–467. September 2007. doi:10.1038/nature06148. PMID 17721507. Bibcode: 2007Natur.449..463J.
  27. "The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus)". Nature 452 (7190): 991–996. April 2008. doi:10.1038/nature06856. PMID 18432245. Bibcode: 2008Natur.452..991M.
  28. "The genome of black cottonwood, Populus trichocarpa (Torr. & Gray)". Science 313 (5793): 1596–1604. September 2006. doi:10.1126/science.1128691. PMID 16973872. Bibcode: 2006Sci...313.1596T. https://escholarship.org/uc/item/3101x2rn.
  29. "High-throughput gene mapping in Caenorhabditis elegans". Genome Research 12 (7): 1100–1105. July 2002. doi:10.1101/gr.208902. PMID 12097347. 
  30. The International Brachypodium Initiative et al. (February 2010). "Genome sequencing and analysis of the model grass Brachypodium distachyon". Nature 463 (7282): 763–768. doi:10.1038/nature08747. PMID 20148030. Bibcode: 2010Natur.463..763T.
  31. "The Sorghum bicolor genome and the diversification of grasses". Nature 457 (7229): 551–556. January 2009. doi:10.1038/nature07723. PMID 19189423. Bibcode: 2009Natur.457..551P.
  32. "Genome sequence of the palaeopolyploid soybean". Nature 463 (7278): 178–183. January 2010. doi:10.1038/nature08670. PMID 20075913. 
  33. "The genome of the cucumber, Cucumis sativus L". Nature Genetics 41 (12): 1275–1281. December 2009. doi:10.1038/ng.475. PMID 19881527. 
  34. "The genome of Theobroma cacao". Nature Genetics 43 (2): 101–108. February 2011. doi:10.1038/ng.736. PMID 21186351. 
  35. "The genome of the domesticated apple (Malus × domestica Borkh.)". Nature Genetics 42 (10): 833–839. October 2010. doi:10.1038/ng.654. PMID 20802477.
  36. "The draft genome of a diploid cotton Gossypium raimondii". Nature Genetics 44 (10): 1098–1103. October 2012. doi:10.1038/ng.2371. PMID 22922876. 
  37. "The draft genome of sweet orange (Citrus sinensis)". Nature Genetics 45 (1): 59–66. January 2013. doi:10.1038/ng.2472. PMID 23179022. 
  38. "The tomato genome sequence provides insights into fleshy fruit evolution". Nature 485 (7400): 635–641. May 2012. doi:10.1038/nature11119. PMID 22660326. Bibcode: 2012Natur.485..635T.
  39. "Third generation sequencing: technology and its potential impact on evolutionary biodiversity research". Systematics and Biodiversity. 2015. 
  40. "Third-generation sequencing and the future of genomics". bioRxiv: 048603. 2016-04-13. doi:10.1101/048603. 
  41. "Using spinach to compare technologies for whole genome assemblies". Plant & Animal Genomics XXIII Conference. 2015. 
  42. "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data". Nature Methods 10 (6): 563–569. June 2013. doi:10.1038/nmeth.2474. PMID 23644548. 
  43. "High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development". Nature Genetics 49 (7): 1099–1106. July 2017. doi:10.1038/ng.3886. PMID 28581499. 
  44. "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler". GigaScience 1 (1): 18. December 2012. doi:10.1186/2047-217X-1-18. PMID 23587118. 
  45. "DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies". Scientific Reports 6 (1): 31900. August 2016. doi:10.1038/srep31900. PMID 27573208. Bibcode: 2016NatSR...631900Y.
  46. Li H (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM". arXiv:1303.3997 [q-bio.GN].