Biology:Rfam

Short description: Online database of non-coding RNA and other RNA elements
Rfam
Content
Description: The Rfam database provides alignments, consensus secondary structures and covariance models for RNA families.
Data types captured: RNA families
Organisms: all
Contact
Research centre: EBI
Primary citation: PMID 33211869
Access
Data format: Stockholm format
Website: rfam.org
Download URL: FTP
Miscellaneous
Software license: Public domain
Bookmarkable entities: yes

Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm,[1][2][3][4] and currently hosted at the European Bioinformatics Institute.[5] Rfam is designed to be similar to the Pfam database for annotating protein families.

Unlike proteins, ncRNAs often share a similar secondary structure without sharing much similarity in primary sequence. Rfam groups ncRNAs into families based on descent from a common ancestor. As with protein families, multiple sequence alignments (MSAs) of these families can provide insight into their structure and function, and the alignments become more useful when secondary structure information is added. Rfam researchers also contribute to Wikipedia's RNA WikiProject.[4][6]
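The point about structure being conserved while sequence diverges can be illustrated with a toy example. The two hairpin sequences and the stem positions below are hypothetical, chosen only for illustration; they are not taken from any Rfam family. The sketch counts G-U wobble pairs as valid alongside Watson-Crick pairs, which is an assumption of this example.

```python
from typing import List, Tuple

# Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which both sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def pairs_supported(seq: str, pairs: List[Tuple[int, int]]) -> int:
    """Count how many of the (i, j) positions can form a canonical pair."""
    return sum((seq[i], seq[j]) in CANONICAL for i, j in pairs)


# Hypothetical 4-bp stem closing a 4-nt loop (0-based positions).
stem_pairs = [(0, 11), (1, 10), (2, 9), (3, 8)]
seq_a = "GGCAGAAAUGCC"  # hypothetical sequence
seq_b = "CAGUGAAAACUG"  # hypothetical sequence

print(f"identity: {percent_identity(seq_a, seq_b):.1f}%")
print("pairs supported in seq_a:", pairs_supported(seq_a, stem_pairs))
print("pairs supported in seq_b:", pairs_supported(seq_b, stem_pairs))
```

Here the two sequences agree only in the loop, yet every stem position can still pair in both, which is the kind of signal a sequence-only comparison would miss.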

Uses

The Rfam database supports several tasks. For each ncRNA family, the interface allows users to view and download multiple sequence alignments, read annotation, and examine the species distribution of family members. Entries also link to literature references and other RNA databases, and to Wikipedia, so that users can create or edit the corresponding articles.

The interface at the Rfam website allows users to search for ncRNAs by keyword, family name, or genome, as well as by ncRNA sequence or EMBL accession number.[7] The database can also be downloaded and used locally with the INFERNAL software package.[8][9][10] Used with Rfam, INFERNAL can annotate sequences (including complete genomes) with homologues of known ncRNAs.
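As a rough sketch of what local annotation might look like, the snippet below assembles a command line for Infernal's `cmscan` program. The file names (`Rfam.cm`, `genome.fa`, `genome.tblout`) and the helper `build_cmscan_command` are hypothetical. The options shown are from Infernal 1.1, so check `cmscan -h` for the version you have installed. The sketch only builds and prints the command; running it would require Infernal and a covariance model file that has already been indexed with `cmpress`.

```python
import shlex


def build_cmscan_command(cm_db: str, fasta: str, tblout: str) -> list:
    """Return an argument list for a cmscan run (hypothetical helper)."""
    return [
        "cmscan",
        "--cut_ga",          # apply each family's gathering threshold
        "--rfam",            # filter settings intended for large Rfam-style searches
        "--nohmmonly",       # always use covariance models, never HMM-only mode
        "--tblout", tblout,  # write a tabular summary of hits to this file
        cm_db,               # covariance model database (indexed with cmpress)
        fasta,               # sequences to annotate
    ]


# Hypothetical file names, for illustration only.
cmd = build_cmscan_command("Rfam.cm", "genome.fa", "genome.tblout")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means it can later be passed to `subprocess.run(cmd, check=True)` without shell quoting problems.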

Methods

A theoretical ncRNA alignment from 6 species. Secondary structure base pairs are coloured in blocks and identified in the secondary structure consensus sequence (bottom line) by the < and > symbols.

In the database, the secondary structure and primary sequence information represented by the MSA is combined in statistical models called profile stochastic context-free grammars (SCFGs), also known as covariance models. These are analogous to the hidden Markov models used for protein family annotation in the Pfam database.[1] Each family in the database is represented by two multiple sequence alignments in Stockholm format and an SCFG.
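To make the structure annotation concrete, here is a minimal sketch that reads a small Stockholm-style alignment and recovers base pairs from its consensus structure line by matching `<` and `>` with a stack. The alignment is a hypothetical toy, and the parser is deliberately simplified: it handles only `<`, `>` and unpaired columns, whereas real Rfam alignments use the richer WUSS notation with several bracket types and many more annotation lines.

```python
def parse_stockholm(text):
    """Return (sequences, ss_cons) from a simple Stockholm alignment."""
    sequences = {}
    ss_cons = ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GC SS_cons"):
            ss_cons += line.split()[2]
        elif line.startswith("#"):
            continue  # other annotation lines are ignored in this sketch
        else:
            name, seq = line.split()
            sequences[name] = sequences.get(name, "") + seq
    return sequences, ss_cons


def base_pairs(ss_cons):
    """Match each '<' with its closing '>' and return sorted (i, j) pairs."""
    stack, pairs = [], []
    for i, ch in enumerate(ss_cons):
        if ch == "<":
            stack.append(i)
        elif ch == ">":
            if not stack:
                raise ValueError("unbalanced '>' at column %d" % i)
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced '<' in structure line")
    return sorted(pairs)


# Hypothetical toy alignment, not an Rfam family.
toy = """# STOCKHOLM 1.0
species_1    GGCAGAAAUGCC
species_2    CAGUGAAAACUG
#=GC SS_cons <<<<....>>>>
//
"""

sequences, ss_cons = parse_stockholm(toy)
pairs = base_pairs(ss_cons)
print(pairs)
```

Each pair links two alignment columns; a covariance model scores those columns jointly rather than independently, which is what distinguishes it from a sequence-only profile.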

The first MSA is the "seed" alignment. It is a hand-curated alignment that contains representative members of the ncRNA family and is annotated with structural information. The seed alignment is used to build the SCFG, which is then run with the INFERNAL software to identify additional family members and add them to the alignment. A family-specific threshold value is chosen to avoid false positives.
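The role of the family-specific threshold can be sketched as a simple filter over search hits. The sketch assumes the threshold is a bit-score cutoff and that a hit scoring exactly at the cutoff is accepted. The `Hit` type, the sequence names, the scores and the cutoff of 40.0 are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_id: str
    bit_score: float


def apply_family_threshold(hits, threshold):
    """Keep hits scoring at or above the threshold, best first."""
    kept = [h for h in hits if h.bit_score >= threshold]
    return sorted(kept, key=lambda h: h.bit_score, reverse=True)


# Hypothetical hits and cutoff, for illustration only.
hits = [
    Hit("seq_A", 81.3),
    Hit("seq_B", 24.9),
    Hit("seq_C", 52.0),
    Hit("seq_D", 39.9),
]
accepted = apply_family_threshold(hits, threshold=40.0)
print([h.sequence_id for h in accepted])  # ['seq_A', 'seq_C']
```

Because the cutoff is set per family, the same score could be accepted for one family and rejected for another.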

Until release 12, Rfam used an initial BLAST filtering step because profile SCFGs were too computationally expensive. However, the latest versions of INFERNAL are fast enough[10] that the BLAST step is no longer necessary.[11]

The second MSA is the "full" alignment, produced by searching the sequence database with the covariance model. All detected homologues are aligned to the model, giving the automatically generated full alignment.

History

Version 1.0 of Rfam was launched in 2003; it contained 25 ncRNA families and annotated about 50 000 ncRNA genes. In 2005, version 6.1 was released with 379 families annotating over 280 000 genes. In August 2012, version 11.0 contained 2208 RNA families, while the current version (14.9, released in November 2022) contains 4108 families.[7]

Major releases and publications

  • 2003 - Rfam: an RNA family database.[1]
  • 2005 - Rfam: annotating non-coding RNAs in complete genomes.[2]
  • 2008 - The RNA WikiProject: community annotation of RNA families.[6]
  • 2008 - Rfam: updates to the RNA families database.[3]
  • 2011 - Rfam: Wikipedia, clans and the “decimal” release.[4]
  • 2012 - Rfam 11.0: 10 years of RNA families.[12]
  • 2014 - Rfam 12.0: updates to the RNA families database.[11]
  • 2017 - Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.[13]
  • 2020 - Rfam 14: expanded coverage of metagenomic, viral and microRNA families.[14]

Problems

  1. The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats. Distinguishing these non-functional copies from functional ncRNA is a formidable challenge.[2]
  2. Introns are not modeled by covariance models.

References

  1. "Rfam: an RNA family database". Nucleic Acids Research 31 (1): 439–41. 2003. doi:10.1093/nar/gkg006. PMID 12520045.
  2. "Rfam: annotating non-coding RNAs in complete genomes". Nucleic Acids Research 33 (Database issue): D121–4. 2005. doi:10.1093/nar/gki081. PMID 15608160.
  3. "Rfam: updates to the RNA families database". Nucleic Acids Research 37 (Database issue): D136–D140. October 2008. doi:10.1093/nar/gkn766. PMID 18953034.
  4. "Rfam: Wikipedia, clans and the "decimal" release". Nucleic Acids Research 39 (Database issue): D141–5. 2011. doi:10.1093/nar/gkq1129. PMID 21062808.
  5. "Moving to xfam.org". Xfam Blog. http://xfam.wordpress.com/2014/05/01/moving-to-xfam-org/.
  6. "The RNA WikiProject: community annotation of RNA families". RNA 14 (12): 2462–4. December 2008. doi:10.1261/rna.1200508. PMID 18945806.
  7. "Rfam families". http://rfam.xfam.org/.
  8. "RNA sequence analysis using covariance models". Nucleic Acids Research 22 (11): 2079–88. June 1994. doi:10.1093/nar/22.11.2079. PMID 8029015.
  9. Eddy SR (2002). "A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure". BMC Bioinformatics 3: 18. doi:10.1186/1471-2105-3-18. PMID 12095421.
  10. "Infernal 1.1: 100-fold faster RNA homology searches". Bioinformatics 29 (22): 2933–5. 2013. doi:10.1093/bioinformatics/btt509. PMID 24008419.
  11. "Rfam 12.0: updates to the RNA families database". Nucleic Acids Research 43 (Database issue): D130–7. January 2015. doi:10.1093/nar/gku1063. PMID 25392425.
  12. "Rfam 11.0: 10 years of RNA families". Nucleic Acids Research 41 (Database issue): D226–32. January 2013. doi:10.1093/nar/gks1005. PMID 23125362.
  13. "Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families". Nucleic Acids Research 46 (D1): D335–D342. January 2018. doi:10.1093/nar/gkx1038. PMID 29112718.
  14. "Rfam 14: expanded coverage of metagenomic, viral and microRNA families". Nucleic Acids Research 49 (D1): D192–D200. January 2021. doi:10.1093/nar/gkaa1047. PMID 33211869.
