Software:Snakemake

Snakemake
Repository: github.com/snakemake/snakemake
Written in: Python, YAML
Platform: Windows, Linux, macOS
License: MIT License
Website: snakemake.github.io

Snakemake is a scientific workflow management framework that uses a domain-specific language (DSL) to specify allowed data transformations, from which processing pipelines are generated automatically. Following the declarative paradigm, users state the desired products rather than an explicit course of action. Individual steps, however, must still be defined as rules, using a syntax inspired by that of GNU Make. Since Snakemake's DSL is implemented as an extension of Python,[1] these rules are written in a Snakefile using YAML and Python syntax. The generated pipeline is executed by selecting and running each step as its prerequisites become satisfied. Steps with no interdependence may be executed in parallel.

The goal of Snakemake is to enhance reproducibility, transparency, and adaptability in bioinformatic analysis.[2] These goals manifest in the following characteristics. Native integration with external tools such as Conda and Singularity provides dependency resolution and containerization. Being platform agnostic, Snakemake can run the same workflow in local or high-performance computing (HPC) environments. Reports can be generated to describe and log the conditions and actions performed at each step.

Operation

Rules

The set of actions that Snakemake can take towards generating a desired data product is specified in a file called the Snakefile. These actions are called "rules", and each includes a name, inputs, outputs, and a way to start the represented action.[1] Inputs and outputs are declared as explicit file paths, cloud storage locations, or sets of locations described through patterns and wildcards. Several methods exist to start an action, including a shell command, a Python script, or a Snakemake wrapper.[3] A wrapper is a predefined configuration that allows Snakemake to directly use prevalent software such as Samtools.[4] In the Snakefile, YAML adhering to Snakemake's DSL appears alongside native Python,[3] but upon execution, the rules are translated into Python via a finite automaton.[1]
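How a pattern with wildcards can stand for a set of file locations may be easier to see in code. The following is a minimal sketch in plain Python, not Snakemake's actual implementation: the helper names `pattern_to_regex`, `match_wildcards`, and `fill` are hypothetical, and each wildcard is assumed to match any non-empty string.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern such as 'mapped/{sample}.bam' into a regex.

    Hypothetical helper for illustration only.
    """
    # Splitting on a single capture group alternates literal text
    # (even positions) and wildcard names (odd positions).
    parts = re.split(r"\{(\w+)\}", pattern)
    seen = set()
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        elif part in seen:
            # A repeated wildcard must take the same value again.
            regex += f"(?P={part})"
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"
    return re.compile(regex)


def match_wildcards(pattern, path):
    """Return the wildcard values that make `pattern` equal `path`, or None."""
    match = pattern_to_regex(pattern).fullmatch(path)
    return match.groupdict() if match else None


def fill(pattern, wildcards):
    """Substitute wildcard values into another pattern, e.g. a rule's input."""
    return pattern.format(**wildcards)


# A requested file is matched against an output pattern, and the values
# found are carried over to the corresponding input pattern.
values = match_wildcards("mapped/{sample}.bam", "mapped/A.bam")
print(values)
print(fill("raw/{sample}.fastq", values))
```

In this sketch, a request for `mapped/A.bam` binds `sample` to `A`, which in turn determines the concrete input path `raw/A.fastq`.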

[Figure: Example workflow]

Example rule:

rule user_defined_name:
    input:
        "path/to/input.file"
    output:
        "path/to/output.file"
    shell:
        "shell_command {input} > {output}"
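The `{input}` and `{output}` placeholders in the shell string are replaced with the rule's declared files before the command runs. A rough analogy in plain Python, assuming simple string substitution (Snakemake's own formatting supports more than this), with the hypothetical helper `render_shell`:

```python
def render_shell(template, inputs, outputs):
    """Fill the {input} and {output} placeholders of a shell template.

    Hypothetical helper: several files are joined with spaces so that
    they appear as separate arguments on the command line.
    """
    return template.format(input=" ".join(inputs), output=" ".join(outputs))


command = render_shell(
    "shell_command {input} > {output}",
    inputs=["path/to/input.file"],
    outputs=["path/to/output.file"],
)
print(command)  # shell_command path/to/input.file > path/to/output.file
```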

Workflows and Execution

When given a set of rules and target outputs, Snakemake first looks for existing files that match the targets or searches for defined rules that produce them. It then redefines the targets as the inputs of the rules it found and repeats the process recursively until all inputs are existing files. The result is an implicit workflow in the form of a directed acyclic graph (DAG) of steps.[2] The execution order of these steps is computed by optimizing a mixed integer linear program over parameters such as runtime, total disk space required for temporary files, and user-defined priority.[5] Where possible, intermediate data products from previous runs are reused to skip ahead if the prerequisite steps are found to be equivalent. Equivalence is determined by comparing the prior steps used to generate the data, which are recorded in a ledger system analogous to that of blockchains.[5]
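The recursive resolution from targets back to existing files can be sketched in plain Python. This is an illustrative simplification under several assumptions: rules use fixed file names rather than wildcards, the set `existing` stands in for files on disk, and the names `Rule`, `build_dag`, and `parallel_batches` are hypothetical. The batching step is a simple level-by-level ordering and does not reproduce the mixed integer linear program described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    inputs: tuple
    outputs: tuple


def build_dag(rules, targets, existing):
    """Map each needed rule name to the set of rule names it depends on."""
    producer = {out: rule for rule in rules for out in rule.outputs}
    dag = {}

    def visit(path, stack):
        if path in existing:
            return None  # the file is already there, no step required
        rule = producer.get(path)
        if rule is None:
            raise ValueError(f"no rule produces {path!r}")
        if rule.name in stack:
            raise ValueError(f"cyclic dependency through rule {rule.name!r}")
        if rule.name not in dag:
            dag[rule.name] = set()
            # The rule's inputs become the new targets.
            for needed in rule.inputs:
                dependency = visit(needed, stack | {rule.name})
                if dependency is not None:
                    dag[rule.name].add(dependency)
        return rule.name

    for target in targets:
        visit(target, frozenset())
    return dag


def parallel_batches(dag):
    """Group steps into batches whose members do not depend on each other."""
    remaining = {step: set(deps) for step, deps in dag.items()}
    batches = []
    while remaining:
        ready = sorted(step for step, deps in remaining.items() if not deps)
        batches.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


rules = [
    Rule("trim_a", ("a.raw",), ("a.trimmed",)),
    Rule("trim_b", ("b.raw",), ("b.trimmed",)),
    Rule("merge", ("a.trimmed", "b.trimmed"), ("merged.out",)),
]
dag = build_dag(rules, ["merged.out"], existing={"a.raw", "b.raw"})
print(parallel_batches(dag))  # [['trim_a', 'trim_b'], ['merge']]
```

In this example the two trimming steps share no dependency and land in the same batch, so they could run in parallel, while `merge` waits for both. If `a.trimmed` were already listed in `existing`, the `trim_a` step would be left out of the graph entirely.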

Goals

[Figure: Example Snakefile and tool wrapper reuse]

Reproducibility

A cornerstone of science is the need to validate experimental results so that confidence in their correctness can be established. Snakemake uses Conda to reconstruct the compute environment with the required software dependencies[6] and supports containerization via Docker or Singularity.[2] Reports can also be generated to summarize the various parameters and events at each processing step, including an image of the computed DAG.[2]

Transparency

Information gained from science must be shared so that findings can be built upon and used; the experiment and its results must therefore be comprehensible. In addition to the report generation mentioned above, the developers also claim that a typical Snakefile contains a high proportion of commonly understood terms rather than technical or Snakemake-specific language.[2] This supports transparency by reducing the amount of domain knowledge required to understand what an experiment did and what the results mean.

Adaptability

Experimental methods themselves can contribute to science if they can be repurposed, in whole or in part, for new research. To this end, previously written Snakefiles can be referenced in new Snakefiles, thus allowing rules to be reused.[2] Certain rules, called tool wrappers, fully integrate external software as a set of possible steps for workflow generation and are publicly available.[4]

Comparisons

Snakemake belongs to a family of implicit convention frameworks,[7] members of which include Make and Nextflow. The common characteristic of this group of tools is the ability to generate specific workflows from a collection of discrete steps. Specifically, GNU Make, which is used for building source code, is considered to be the source of inspiration for Snakemake and contributed to its model of operation and to certain features such as wildcards.[8] Computationally, Snakemake was found to have a competitive runtime but failed to utilize the CPU consistently while allowing excess context switching.[9] In terms of usage, Snakemake opts for text-based configuration instead of a GUI like that of Galaxy, and attempts to be environment agnostic.[10][2]

Usage Examples

Bioinformatics

  • SnakeChunks is a set of Snakemake rules used for processing next-generation sequencing data, including quality control, peak calling, and visualization steps. The authors demonstrated its use on transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data.[11]
  • Recount3 is a database of human and mouse RNA-seq data that was curated and annotated with a Snakemake-based workflow.[12]
  • VIPER is a Snakemake-based pipeline for producing graphical reports of RNA-seq data.[13]
  • RiboDoc is a ribosomal profiling package that also involves complex steps of quality control and data-intensive statistical analysis managed by Snakemake and Docker.[14]
  • Dadasnake integrates amplicon sequencing analysis with DADA2 using Snakemake. The implied workflow is for phylogenetic analysis of specific genes.[15]
  • To compile timelapses of selective plane illumination microscopy images of cells, Snakemake was used to parallelize the task on high-performance computing clusters.[16]

Other

  • HEXME is a dataset of tetrahedral meshes generated from CAD models, using Snakemake. Rules were specified to generate three variants for each input.[17]

References

  1. Köster, Johannes; Rahmann, Sven (2012) (in en). Building and Documenting Workflows with Python-Based Snakemake. Marc Herbstritt. pp. 8 pages. doi:10.4230/OASICS.GCB.2012.49. http://drops.dagstuhl.de/opus/volltexte/2012/3717/. 
  2. Mölder, Felix; Jablonski, Kim Philipp; Letcher, Brice; Hall, Michael B.; Tomkins-Tinch, Christopher H.; Sochat, Vanessa; Forster, Jan; Lee, Soohyun et al. (2021-04-19). "Sustainable data analysis with Snakemake" (in en). F1000Research 10: 33. doi:10.12688/f1000research.29032.2. ISSN 2046-1402. PMID 34035898. PMC 8114187. https://f1000research.com/articles/10-33/v2. 
  3. "Snakemake — Snakemake 7.0.0 documentation" (in en). https://snakemake.readthedocs.io/. 
  4. "The Snakemake Wrappers repository — Snakemake Wrappers tags/v1.2.0 documentation". https://snakemake-wrappers.readthedocs.io/en/stable/. 
  5. Köster, Johannes; Rahmann, Sven (2012-10-01). "Snakemake—a scalable bioinformatics workflow engine". Bioinformatics 28 (19): 2520–2522. doi:10.1093/bioinformatics/bts480. ISSN 1367-4803. https://doi.org/10.1093/bioinformatics/bts480. 
  6. Strozzi, Francesco; Janssen, Roel; Wurmus, Ricardo; Crusoe, Michael R.; Githinji, George; Di Tommaso, Paolo; Belhachemi, Dominique; Möller, Steffen et al. (2019), Anisimova, Maria, ed., "Scalable Workflows and Reproducible Data Analysis for Genomics" (in en), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology (New York, NY: Springer): pp. 723–745, doi:10.1007/978-1-4939-9074-0_24, ISBN 978-1-4939-9074-0, https://doi.org/10.1007/978-1-4939-9074-0_24, retrieved 2022-02-27 
  7. Leipzig, Jeremy (2017-05-01). "A review of bioinformatic pipeline frameworks". Briefings in Bioinformatics 18 (3): 530–536. doi:10.1093/bib/bbw020. ISSN 1467-5463. PMID 27013646. PMC 5429012. https://doi.org/10.1093/bib/bbw020. 
  8. "Using prototyping to choose a bioinformatics workflow management system" (in en). PLOS Computational Biology 17 (2): e1008622. 2021-02-25. doi:10.1371/journal.pcbi.1008622. ISSN 1553-7358. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008622. 
  9. Larsonneur, Elise; Mercier, Jonathan; Wiart, Nicolas; Floch, Edith Le; Delhomme, Olivier; Meyer, Vincent (2018-12-06). "Evaluating Workflow Management Systems: A Bioinformatics Use Case". 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): 2773–2775. doi:10.1109/BIBM.2018.8621141. https://ieeexplore.ieee.org/document/8621141. 
  10. Spjuth, Ola; Bongcam-Rudloff, Erik; Hernández, Guillermo Carrasco; Forer, Lukas; Giovacchini, Mario; Guimera, Roman Valls; Kallio, Aleksi; Korpelainen, Eija et al. (2015-08-19). "Experiences with workflows for automating data-intensive bioinformatics". Biology Direct 10 (1): 43. doi:10.1186/s13062-015-0071-8. ISSN 1745-6150. PMID 26282399. PMC 4539931. https://doi.org/10.1186/s13062-015-0071-8. 
  11. Rioualen, Claire; Charbonnier-Khamvongsa, Lucie; Helden, Jacques van (2017-07-19) (in en). SnakeChunks: modular blocks to build Snakemake workflows for reproducible NGS analyses. pp. 165191. doi:10.1101/165191. https://www.biorxiv.org/content/10.1101/165191v1. 
  12. Wilks, Christopher; Zheng, Shijie C.; Chen, Feng Yong; Charles, Rone; Solomon, Brad; Ling, Jonathan P.; Imada, Eddie Luidy; Zhang, David et al. (2021-11-29). "recount3: summaries and queries for large-scale RNA-seq expression and splicing". Genome Biology 22 (1): 323. doi:10.1186/s13059-021-02533-6. ISSN 1474-760X. PMID 34844637. PMC 8628444. https://doi.org/10.1186/s13059-021-02533-6. 
  13. Cornwell, MacIntosh; Vangala, Mahesh; Taing, Len; Herbert, Zachary; Köster, Johannes; Li, Bo; Sun, Hanfei; Li, Taiwen et al. (2018-04-12). "VIPER: Visualization Pipeline for RNA-seq, a Snakemake workflow for efficient and complete RNA-seq analysis". BMC Bioinformatics 19 (1): 135. doi:10.1186/s12859-018-2139-9. ISSN 1471-2105. PMID 29649993. PMC 5897949. https://doi.org/10.1186/s12859-018-2139-9. 
  14. François, Pauline; Arbes, Hugo; Demais, Stéphane; Baudin-Baillieu, Agnès; Namy, Olivier (2021-01-01). "RiboDoc: A Docker-based package for ribosome profiling analysis" (in en). Computational and Structural Biotechnology Journal 19: 2851–2860. doi:10.1016/j.csbj.2021.05.014. ISSN 2001-0370. PMID 34093996. PMC 8141510. https://www.sciencedirect.com/science/article/pii/S2001037021001951. 
  15. Weißbecker, Christina; Schnabel, Beatrix; Heintz-Buschart, Anna (2020-11-30). "Dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology". GigaScience 9 (12): giaa135. doi:10.1093/gigascience/giaa135. ISSN 2047-217X. https://doi.org/10.1093/gigascience/giaa135. 
  16. Schmied, Christopher; Steinbach, Peter; Pietzsch, Tobias; Preibisch, Stephan; Tomancak, Pavel (2016-04-01). "An automated workflow for parallel processing of large multiview SPIM recordings". Bioinformatics 32 (7): 1112–1114. doi:10.1093/bioinformatics/btv706. ISSN 1367-4803. https://doi.org/10.1093/bioinformatics/btv706. 
  17. Beaufort, Pierre-Alexandre; Reberol, Maxence; Liu, Heng; Ledoux, Franck; Bommes, David (2021-11-19). "Hex Me If You Can". arXiv:2111.10295 [cs]. doi:10.48550/arxiv.2111.10295. http://arxiv.org/abs/2111.10295.