Scientific workflow system
A scientific workflow system is a specialized form of workflow management system designed to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application.
Applications
Geographically distributed scientists can collaborate on large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision.
More specialized scientific workflow systems, such as Discovery Net, Apache Taverna and Kepler, provide a visual programming front end that enables users to construct their applications as a visual graph by connecting nodes together; tools have also been developed to build such applications in a platform-independent manner.[1] Each directed edge in the graph of a workflow typically represents a connection from the output of one application to the input of the next. A sequence of such edges may be called a pipeline.
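As a minimal, system-agnostic sketch (the step names and functions below are hypothetical, not taken from any of the systems above), such a graph can be represented as a mapping from each node to its dependencies and executed in topological order, with each edge carrying one step's output to the next step's input:

```python
# Minimal sketch of a workflow as a directed acyclic graph (DAG).
# Step names and functions are hypothetical, for illustration only.
from graphlib import TopologicalSorter

def fetch():          return [1.0, 2.0, 3.0, 4.0]   # acquire raw data
def clean(data):      return [x for x in data if x > 1.0]
def analyse(data):    return sum(data) / len(data)
def report(result):   print(f"mean = {result}")

steps = {"fetch": fetch, "clean": clean, "analyse": analyse, "report": report}
# Each edge connects the output of one step to the input of the next.
deps = {"clean": {"fetch"}, "analyse": {"clean"}, "report": {"analyse"}}

results = {}
for node in TopologicalSorter(deps).static_order():   # dependency order
    inputs = [results[d] for d in deps.get(node, ())]
    results[node] = steps[node](*inputs)
```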
A bioinformatics workflow management system is a specialized scientific workflow system focused on bioinformatics.
Scientific workflows
The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, or using a scripting language such as Python or Perl with a command-line interface.
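A minimal sketch of such a script, with hypothetical input and output file names and a hypothetical 'value' column, might read data from a CSV file, compute summary statistics, and write an analytical result:

```python
#!/usr/bin/env python3
# Sketch of a script-style workflow: read data in, analyse it, write results out.
# The file names and the 'value' column are hypothetical examples.
import argparse, csv, statistics

parser = argparse.ArgumentParser(description="Toy analysis workflow")
parser.add_argument("input_csv")    # e.g. measurements.csv with a 'value' column
parser.add_argument("output_txt")
args = parser.parse_args()

with open(args.input_csv, newline="") as f:
    values = [float(row["value"]) for row in csv.DictReader(f)]

with open(args.output_txt, "w") as f:
    f.write(f"n={len(values)} mean={statistics.mean(values):.3f} "
            f"stdev={statistics.stdev(values):.3f}\n")
```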
There are many motives for differentiating scientific workflows from traditional business process workflows. These include:
- providing an easy-to-use environment for individual application scientists themselves to create their own workflows
- providing interactive tools that enable scientists to execute their workflows and view their results in real time
- simplifying the process of sharing and reusing workflows among scientists
- enabling scientists to track the provenance of workflow execution results and of the workflow creation steps
By focusing on the scientists, the design of a scientific workflow system shifts away from workflow scheduling activities, typically handled by grid computing environments to optimize the execution of complex computations on predefined resources, towards a domain-specific view of which data types, tools and distributed resources should be made available to the scientists, and how they can be made easily accessible with specific quality-of-service requirements.[2]
Scientific workflows are now recognized as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result, how it was obtained, the experimental methods used, machine calibrations and parameters, and so on. The same holds in e-Science, where provenance data are a record of the workflow activities invoked, the services and databases accessed, the data sets used, and so forth. Such information is useful for a scientist to interpret their workflow results and for other scientists to establish trust in the experimental result.[3]
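A provenance record of this kind can be as simple as a structured log of each step's inputs, parameters and outputs. The following is a hedged sketch, not tied to any particular provenance standard; the field names are illustrative:

```python
# Hypothetical sketch of a provenance entry captured for one workflow step.
# Field names are illustrative and not drawn from any specific standard.
import datetime, hashlib, json

def provenance_entry(step, inputs, output, params):
    """Record what was run, on which data, with which parameters, and when."""
    return {
        "step": step,
        "started": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "inputs": [{"name": n, "sha256": hashlib.sha256(d).hexdigest()}
                   for n, d in inputs],
        "parameters": params,
        "output_sha256": hashlib.sha256(output).hexdigest(),
    }

log = [provenance_entry("clean", [("raw.csv", b"1,2,3,4")], b"2,3,4", {"threshold": 1})]
print(json.dumps(log, indent=2))
```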
Examples
There are many examples of scientific workflow systems:[4]
- AiiDA, developed primarily for use in computational materials science
- Anduril, bioinformatics and image analysis
- ASKALON, a workflow system for Cloud and Grid executions of workflows[5]
- Apache Airavata, a general purpose workflow management system[6][7]
- Apache Taverna, widely used in bioinformatics, astronomy and biodiversity research
- Autosubmit, a Python-based tool for creating, launching and monitoring weather, air quality and climate experiments
- BioBIKE, a cloud-based bioinformatics platform
- Bioclipse, a graphical workbench with a scripting environment that lets users perform complex actions as a kind of workflow
- Collective Knowledge, a Python-based general workflow and experiment crowdsourcing framework with JSON API and cross-platform package manager
- Common Workflow Language, a community-developed, YAML-based workflow language supported by multiple engine implementations
- Cuneiform, a functional workflow language
- Cylc, a workflow engine for cycling systems, with extensive support for research experiments and production systems in the atmospheric and related sciences
- Discovery Net, one of the earliest examples of a scientific workflow system
- Ergatis, workflow creation and monitoring interface
- FireWorks,[8] a system for defining, managing, and executing workflows
- Galaxy, initially targeted at genomics
- GenePattern, a scientific workflow system that provides access to hundreds of genomic analysis tools[9]
- HyperFlow,[10] a scientific workflow system combining a simple workflow description with low-level scripting programming, based on Node.js
- JS4Cloud (JavaScript for Cloud),[11] a scripting language for programming data analysis scientific workflows based on JavaScript
- Kepler scientific workflow system
- KNIME, an open-source data analytics platform
- Nextflow, a DSL for data-driven computational pipelines
- Nipype,[12] a Python-based workflow system with specific support for brain imaging
- OnlineHPC, online scientific workflow designer and high performance computing toolkit
- OpenMOLE,[13] a scientific workflow system with transparent scaling from a multi-threaded execution up to grid computing execution
- Orange, open source data visualization and analysis
- Pegasus Workflow Management System[14][15]
- Pipeline Pilot, a graphical programming environment with many tools for cheminformatics workflows[16]
- SciCumulus, a scientific workflow system for HPC environments that stores provenance data in a structured database queryable at runtime; used in bioinformatics, astronomy, seismology and deep-water oil exploitation
- Swift parallel scripting language, a scripting language with many of the capabilities of scientific workflow systems built in
- Tavaxy,[17] a cloud-based workflow system that integrates features from both Taverna and Galaxy
- TimeStudio,[18] a general-purpose, agile scientific workflow system fully implemented in MATLAB
- VisTrails, a scientific workflow system developed in Python
- Workspace, a cross-platform workflow framework by CSIRO
- Yabi, a Python-based general workflow system integrating any command line tool
A survey and comparison of some of the above systems can be found in the paper, "Scientific workflow systems – can one size fit all?"[19]
Sharing workflows
Social networking communities such as myExperiment have developed to facilitate sharing and collaborative development of scientific workflows. Galaxy provides collaborative mechanisms for editing and publishing workflow definitions and workflow results directly on a Galaxy installation.
Analysis
A key assumption underlying all scientific workflow systems is that the scientists themselves will be able to use a workflow system to develop their applications based on visual flowcharting, logic diagramming, or, as a last resort, writing code to describe the workflow logic. Powerful workflow systems make it easy for non-programmers to first sketch out workflow steps using simple flowcharting tools, and then hook in various data acquisition, analysis, and reporting tools. For maximum productivity, details of the underlying programming code should normally be hidden.
Workflow analysis techniques can be used to analyze the properties of such workflows to verify certain properties before executing them. An example of a theoretical formal analysis framework for the verification and profiling of the control-flow aspects of scientific workflows and their data flow aspects for the Discovery Net system is described in the paper, "The design and implementation of a workflow analysis tool" by Curcin et al.[20]
The authors note that introducing program analysis and verification into the workflow world requires a detailed understanding of the execution semantics of the workflow language, including the execution properties of nodes and arcs in the workflow graph, of functional equivalences between workflow patterns, and of many other issues. Such analysis is difficult, and addressing these issues requires building on formal methods from computer science research (e.g. Petri nets) to develop user-level tools for reasoning about the properties of both workflows and workflow systems. The lack of such tools in the past prevented automated workflow management solutions from maturing from nice-to-have academic toys into production-level tools used outside the narrow circle of early adopters and workflow enthusiasts.
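As a minimal illustration of the kind of static check such tools perform before execution (a hedged sketch, not the Discovery Net analysis tool itself), a verifier can at least confirm that a workflow graph references only defined steps and contains no cycles:

```python
# Sketch of a static pre-execution check on a workflow graph: reject references
# to undefined steps and reject cycles. Illustrative only; real analysis tools
# reason about far richer execution semantics (e.g. via Petri nets).
from graphlib import TopologicalSorter, CycleError

def verify(steps, deps):
    """steps: set of defined step names; deps: node -> set of its dependencies."""
    missing = {d for ds in deps.values() for d in ds} - steps
    if missing:
        return False, f"edges reference undefined steps: {sorted(missing)}"
    try:
        list(TopologicalSorter(deps).static_order())   # forces cycle detection
        return True, "ok"
    except CycleError as err:
        return False, f"cycle detected: {err.args[1]}"

print(verify({"fetch", "clean", "analyse"},
             {"clean": {"fetch"}, "analyse": {"clean"}}))   # (True, 'ok')
```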
See also
- Bioinformatics workflow management systems
- e-Science
- Grid computing
- Workflow engine
References
- Belhajjame, Khalid; Wolstencroft, Katy; Corcho, Óscar; Oinn, Tom; Tanoh, Franck; Williams, Alan; Goble, Carole A.. "Metadata Management in the Taverna Workflow System". CCGRID 2008: 651–656. doi:10.1109/CCGRID.2008.17. http://doi.ieeecomputersociety.org/10.1109/CCGRID.2008.17.
- ↑ D. Johnson (December 2009). "A middleware independent Grid workflow builder for scientific applications". 2009 5th IEEE International Conference on E-Science Workshops (IEEE): 86–91. doi:10.1109/ESCIW.2009.5407993. https://dx.doi.org/10.1109/ESCIW.2009.5407993.
- ↑ "An innovative workflow mapping mechanism for Grids in the frame of Quality of Service". Future Generation Computer Systems 24: 498–511. doi:10.1016/j.future.2007.07.009.
- ↑ Automatic capture and efficient storage of e-Science experiment provenance. Concurrency Computat.: Pract. Exper. 2008; 20:419–429
- ↑ Barker, Adam; Van Hemert, Jano (2008), Scientific Workflow: A Survey and Research Directions, Lecture Notes in Computer Science, 4967, Gdansk, Poland: Springer Berlin / Heidelberg, pp. 746–753, doi:10.1007/978-3-540-68111-3_78, ISBN 978-3-540-68105-2, http://www.springerlink.com/content/52053063k0708658/
- ↑ "Askalon Programming Environment for Cloud and Grid Computing". 2013-01-10. http://www.askalon.org. Retrieved 2016-12-04.
- ↑ "Apache Airavata". http://airavata.apache.org/. Retrieved 2016-12-04.
- ↑ "Apache airavata". 2011-11-18. doi:10.1145/2110486.2110490. http://dl.acm.org/citation.cfm?id=2110490. Retrieved 2016-12-04.
- ↑ "Introduction to FireWorks (workflow software) — FireWorks 1.3.9 documentation". https://pythonhosted.org/FireWorks/. Retrieved 2016-12-04.
- ↑ Reich, Michael; Liefeld, Ted; Gould, Joshua; Lerner, Jim; Tamayo, Pablo; Mesirov, Jill P. "GenePattern 2.0". Nature Genetics 38 (5): 500–501. doi:10.1038/ng0506-500. http://www.nature.com/doifinder/10.1038/ng0506-500.
- ↑ "dice-cyfronet/hyperflow: HyperFlow: a distributed workflow engine". https://github.com/dice-cyfronet/hyperflow. Retrieved 2016-12-04.
- ↑ Marozzo, Fabrizio; Talia, Domenico; Trunfio, Paolo (2015), "JS4Cloud: Script-based Workflow Programming for Scalable Data Analysis on Cloud Platforms", Concurrency and Computation: Practice and Experience (Wiley InterScience) 27: 5214–5237, doi:10.1002/cpe.3563, http://onlinelibrary.wiley.com/doi/10.1002/cpe.3563/abstract
- ↑ "Nipype : Neuroimaging in Python Pipelines and Surfaces". http://nipy.org/nipype. Retrieved 2016-12-04.
- ↑ "scientific workflow, distributed computing, parameter tuning". OpenMOLE. http://openmole.org/. Retrieved 2016-12-04.
- ↑ Vahi, Karan. "Pegasus WMS – Automate, recover, and debug scientific computations". http://pegasus.isi.edu/. Retrieved 2016-12-04.
- ↑ "Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems". Scientific Programming 13: 219–237. doi:10.1155/2005/128026.
- ↑ "BIOVIA Pipeline Pilot | Scientific Workflow Authoring Application for Data Analysis". http://accelrys.com/products/pipeline-pilot/. Retrieved 2016-12-04.
- ↑ Abouelhoda, M.; Issa, S.; Ghanem, M. (2012). "Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support". BMC Bioinformatics 13: 77. doi:10.1186/1471-2105-13-77. PMID 22559942.
- ↑ "Distributable transparent science". Time Studio Project. http://timestudioproject.com/. Retrieved 2016-12-04.
- ↑ Curcin, V; Ghanem, M (2008), Scientific workflow systems – can one size fit all?, Biomedical Engineering Conference, 2008. CIBEC 2008, IEEE, doi:10.1109/CIBEC.2008.4786077, ISBN 978-1-4244-2695-9, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4786077
- ↑ Curcin, V.; Ghanem, M.; Guo, Y. (2010). "The design and implementation of a workflow analysis tool". Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368 (1926): 4193–4208. doi:10.1098/rsta.2010.0157. Bibcode: 2010RSPTA.368.4193C.
External links
- "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record 34: 44. doi:10.1145/1084805.1084814.
- Scientific workflow systems - can one size fit all? paper in CIBEC'08 comparing the features of multiple scientific workflow systems.
- List of software tools related to scientific workflows on the DataONE website
- Cylc
- Ergatis
- Mobyle
- OnlineHPC
- Pipeline Pilot
- Swift
- Tavaxy
- Triana