Scientific workflow system

{{short description|Specialized form of workflow management in a scientific environment}}

A scientific workflow system is a specialized workflow management system designed to compose and execute a series of computational or data manipulation steps (a workflow) in a scientific application.{{Cite journal|last1=Liew|first1=Chee Sun|last2=Atkinson|first2=Malcolm P.|last3=Galea|first3=Michelle|last4=Ang|first4=Tan Fong|last5=Martin|first5=Paul|last6=Van Hemert|first6=Jano I.|date=2016-12-12|title=Scientific Workflows|url=https://dl.acm.org/doi/abs/10.1145/3012429|journal=ACM Computing Surveys|volume=49|issue=4|pages=1–39|language=EN|doi=10.1145/3012429|hdl=20.500.11820/774ef69e-a499-4bd2-a609-09f050e682ae|s2cid=9408644|hdl-access=free}} Scientific workflow systems are generally developed for use by scientists from different disciplines such as astronomy, earth science, and bioinformatics. All such systems are based on an abstract representation of how a computation proceeds in the form of a directed graph, where each node represents a task to be executed and each edge represents either data flow or an execution dependency between tasks. Each system typically provides a visual front-end, allowing the user to build and modify complex applications with little or no programming expertise.{{Cite journal |last1=Oinn |first1=T. |last2=Greenwood |first2=M. |last3=Addis |first3=M. |last4=Alpdemir |first4=M. N. |last5=Ferris |first5=J. |last6=Glover |first6=K. |last7=Goble |first7=C. |author-link7=Carole Goble |last8=Goderis |first8=A. |last9=Hull |first9=D. |last10=Marvin |first10=D. |last11=Li |first11=P. |last12=Lord |first12=P. |last13=Pocock |first13=M. R. |last14=Senger |first14=M. |last15=Stevens |first15=R. 
|year=2006 |title=Taverna: Lessons in creating a workflow environment for the life sciences |url=https://eprints.soton.ac.uk/260908/1/taverna-ccpe-reviewed.pdf |journal=Concurrency and Computation: Practice and Experience |volume=18 |issue=10 |pages=1067–1100 |doi=10.1002/cpe.993 |s2cid=10219281 |last16=Wipat |first16=A. |last17=Wroe |first17=C.}}{{Cite journal |last1=Yu |first1=J. |last2=Buyya |first2=R. |year=2005 |title=A taxonomy of scientific workflow systems for grid computing |journal=ACM SIGMOD Record |volume=34 |issue=3 |pages=44 |citeseerx=10.1.1.63.3176 |doi=10.1145/1084805.1084814 |s2cid=538714}}{{Cite book |last1=Curcin |first1=V. |title=2008 Cairo International Biomedical Engineering Conference |last2=Ghanem |first2=M. |year=2008 |isbn=978-1-4244-2694-2 |pages=1–9 |chapter=Scientific workflow systems - can one size fit all? |doi=10.1109/CIBEC.2008.4786077 |s2cid=1885579}}

Applications

Geographically distributed scientists can collaborate on large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision.

More specialized scientific workflow systems provide a visual programming front end enabling users to easily construct their applications as a visual graph by connecting nodes together, and tools have also been developed to build such applications in a platform-independent manner.{{cite book | author = D. Johnson| title = 2009 5th IEEE International Conference on E-Science Workshops | chapter = A middleware independent Grid workflow builder for scientific applications | pages = 86–91 | date = December 2009 | doi = 10.1109/ESCIW.2009.5407993|display-authors=etal| isbn = 978-1-4244-5946-9 | s2cid = 3339794 | chapter-url = https://eprints.soton.ac.uk/268539/2/2009_-_18539.pdf }} Each directed edge in the graph of a workflow typically represents a connection from the output of one application to the input of the next. A sequence of such edges may be called a pipeline.
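The graph model described above can be illustrated with a minimal Python sketch. It is not taken from any particular workflow system: the task names, the `TASKS` and `DEPENDENCIES` tables, and the `run_workflow` helper are all hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) is used only to order the tasks so that each runs after the tasks whose output it consumes.

```python
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline. Each task receives a dict holding the
# outputs of its upstream tasks and returns its own output.
def load(_inputs):
    return [3, 1, 2]

def sort_values(inputs):
    return sorted(inputs["load"])

def summarize(inputs):
    values = inputs["sort_values"]
    return {"min": values[0], "max": values[-1]}

TASKS = {"load": load, "sort_values": sort_values, "summarize": summarize}

# Directed edges: each task maps to the set of tasks whose output it consumes.
DEPENDENCIES = {
    "load": set(),
    "sort_values": {"load"},
    "summarize": {"sort_values"},
}

def run_workflow(tasks, dependencies):
    """Run every task after its predecessors and return all outputs by name."""
    outputs = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: outputs[dep] for dep in dependencies[name]}
        outputs[name] = tasks[name](upstream)
    return outputs

results = run_workflow(TASKS, DEPENDENCIES)
print(results["summarize"])
```

Here the chain of edges from `load` through `sort_values` to `summarize` forms a pipeline in the sense used above. A real system would add scheduling, remote execution, and error handling on top of this ordering step.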

Scientific workflows

The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, using a scripting language such as Python with a command-line interface, or more recently using open-source web applications such as Jupyter Notebook.
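As a rough illustration of such a script-level workflow, the following Python sketch reads data, analyzes it, and produces a short report. The inline CSV text stands in for an input file, and the column names and function names are assumptions made for the example only.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for an input file on disk.
RAW_CSV = """sample,temperature
a,20.5
b,21.0
c,19.5
"""

def read_measurements(text):
    """Step 1: call in the data and parse the temperature column."""
    reader = csv.DictReader(io.StringIO(text))
    return [float(row["temperature"]) for row in reader]

def analyze(values):
    """Step 2: compute simple analytical results."""
    return {"n": len(values), "mean": statistics.mean(values)}

def report(summary):
    """Step 3: format the results as a one-line text output."""
    return f"n={summary['n']} mean={summary['mean']:.2f}"

measurements = read_measurements(RAW_CSV)
summary = analyze(measurements)
report_line = report(summary)
print(report_line)
```

The same three steps could equally be written as notebook cells. The point is only that the order of steps is fixed by the script itself rather than by a separate workflow description.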

There are many motives for differentiating scientific workflows from traditional business process workflows. These include:

  • providing an easy-to-use environment in which individual application scientists can create their own workflows.
  • providing interactive tools that let scientists execute their workflows and view the results in real time.
  • simplifying the sharing and reuse of workflows among scientists.
  • enabling scientists to track the provenance of workflow execution results and of the workflow creation steps.

With the focus on the scientists, the design of a scientific workflow system shifts away from workflow scheduling, which grid computing environments typically use to optimize the execution of complex computations on predefined resources, and toward a domain-specific view: which data types, tools, and distributed resources should be made available to scientists, and how they can be made easily accessible while meeting specific quality of service requirements.{{cite journal | doi=10.1016/j.future.2007.07.009 | volume=24 | issue=6 | title=An innovative workflow mapping mechanism for Grids in the frame of Quality of Service | journal=Future Generation Computer Systems | pages=498–511| year=2008 | last1=Kyriazis | first1=Dimosthenis | last2=Tserpes | first2=Konstantinos | last3=Menychtas | first3=Andreas | last4=Litke | first4=Antonis | last5=Varvarigou | first5=Theodora }}

Scientific workflows are now recognized{{by whom|date=January 2014}} as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure, and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result, how it was obtained, experimental methods used, machine calibrations and parameters, etc. It is the same in e-Science, except provenance data are a record of the workflow activities invoked, services and databases accessed, data sets used, and so forth. Such information is useful for a scientist to interpret their workflow results and for other scientists to establish trust in the experimental result.{{cite journal |title=Automatic capture and efficient storage of e-Science experiment provenance |journal=Concurrency and Computation: Practice and Experience |year=2008 |volume=20 |pages=419–429}}
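One way to picture provenance capture is a wrapper that records, for every step, which activity was invoked, with which parameters, and on which data. The Python sketch below is an illustrative assumption, not the mechanism of any specific system: `run_step`, `fingerprint`, and the record fields are hypothetical names, and content hashes stand in for references to stored data sets.

```python
import hashlib
import json
from datetime import datetime, timezone

provenance_log = []

def fingerprint(value):
    """Return a short, stable hash of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]

def run_step(name, func, data, **params):
    """Run one workflow step and append a provenance record for it."""
    result = func(data, **params)
    provenance_log.append({
        "step": name,
        "parameters": params,
        "input_hash": fingerprint(data),
        "output_hash": fingerprint(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result

# Two hypothetical analysis steps.
def scale(values, factor):
    return [v * factor for v in values]

def threshold(values, minimum):
    return [v for v in values if v >= minimum]

scaled = run_step("scale", scale, [1, 2, 3], factor=10)
kept = run_step("threshold", threshold, scaled, minimum=20)

for record in provenance_log:
    print(record["step"], record["parameters"], record["input_hash"], "->", record["output_hash"])
```

Because the output hash of one record matches the input hash of the next, the log can be followed backwards from a result to the data and parameters that produced it.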

Sharing workflows

Social networking communities such as myExperiment have been developed to facilitate sharing and collaborative development of scientific workflows. Galaxy provides collaborative mechanisms for editing and publishing workflow definitions and workflow results directly on the Galaxy installation.

Analysis

A key assumption underlying all scientific workflow systems is that the scientists themselves will be able to use a workflow system to develop their applications based on visual flowcharting, logic diagramming, or, as a last resort, writing code to describe the workflow logic. Powerful workflow systems make it easy for non-programmers to first sketch out workflow steps using simple flowcharting tools, and then hook in various data acquisition, analysis, and reporting tools. For maximum productivity, details of the underlying programming code should normally be hidden.

Workflow analysis techniques can be used to verify certain properties of such workflows before executing them. One example is a theoretical formal analysis framework for verifying and profiling the control-flow and data-flow aspects of scientific workflows in the Discovery Net system, described in the paper "The design and implementation of a workflow analysis tool" by Curcin et al.{{Cite journal | last1 = Curcin | first1 = V. | last2 = Ghanem | first2 = M. | last3 = Guo | first3 = Y. | doi = 10.1098/rsta.2010.0157 | title = The design and implementation of a workflow analysis tool | journal = Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences | volume = 368 | issue = 1926 | pages = 4193–4208 | year = 2010 | pmid = 20679131|bibcode = 2010RSPTA.368.4193C | doi-access = free }}

The authors note that introducing program analysis and verification into the workflow world requires a detailed understanding of the execution semantics of the workflow language, including the execution properties of nodes and arcs in the workflow graph, an understanding of functional equivalences between workflow patterns, and many other issues. Such analysis is difficult. Addressing these issues requires building on formal methods used in computer science research (e.g. Petri nets) to develop user-level tools for reasoning about the properties of both workflows and workflow systems. In the past, the lack of such tools stopped automated workflow management solutions from maturing from nice-to-have academic toys into production-level tools used outside the narrow circle of early adopters and workflow enthusiasts.
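The formal frameworks mentioned above are far richer than anything shown here, but the basic idea of checking a workflow before running it can be sketched with a simple structural check. The Python example below is an assumption-laden illustration, not the Discovery Net analysis and not a Petri net method: the hypothetical `check_workflow` function only looks for dependencies on undefined tasks and for cycles, using the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

def check_workflow(dependencies):
    """Return a list of structural problems found without running any task.

    `dependencies` maps each task name to the set of task names it depends on.
    """
    problems = []
    # Every dependency must refer to a task that is actually defined.
    for task, upstream in dependencies.items():
        for dep in sorted(upstream):
            if dep not in dependencies:
                problems.append(f"{task} depends on undefined task {dep}")
    # The graph must be acyclic, otherwise no valid execution order exists.
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as err:
        cycle = " -> ".join(err.args[1])
        problems.append(f"cycle detected: {cycle}")
    return problems

valid = {"load": set(), "clean": {"load"}, "plot": {"clean"}}
cyclic = {"a": {"b"}, "b": {"a"}}
dangling = {"plot": {"clean"}}

valid_problems = check_workflow(valid)
cyclic_problems = check_workflow(cyclic)
dangling_problems = check_workflow(dangling)
print(valid_problems, cyclic_problems, dangling_problems)
```

Checks on execution semantics, data types flowing along edges, or equivalence between workflow patterns would need a formal model of the workflow language and are outside the scope of this sketch.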

Notable systems

Notable scientific workflow systems include:{{Citation | last1 = Barker | first1 = Adam | last2 = Van Hemert | first2 = Jano |chapter=Scientific Workflow: A Survey and Research Directions |title = Parallel Processing and Applied Mathematics, 7th International Conference, PPAM 2007, Revised Selected Papers | series = Lecture Notes in Computer Science | year = 2008 | volume = 4967 | place = Gdansk, Poland | publisher = Springer Berlin / Heidelberg | doi = 10.1007/978-3-540-68111-3_78 | isbn = 978-3-540-68105-2 | pages = 746–753 | citeseerx = 10.1.1.105.4605}}

  • Anduril, bioinformatics and image analysis{{Cite web |title=Anduril workflow website |url=http://www.anduril.org}}{{Cite journal |last1=Ovaska |first1=Kristian |last2=Laakso |first2=Marko |last3=Haapa-Paananen |first3=Saija |last4=Louhimo |first4=Riku |last5=Chen |first5=Ping |last6=Aittomäki |first6=Viljami |last7=Valo |first7=Erkka |last8=Núñez-Fontarnau |first8=Javier |last9=Rantanen |first9=Ville |date=2010-09-07 |title=Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme |journal=Genome Medicine |volume=2 |issue=9 |pages=65 |doi=10.1186/gm186 |issn=1756-994X |pmc=3092116 |pmid=20822536 |doi-access=free}}
  • Apache Airavata, a general purpose workflow management system{{cite book|title=Proceedings of the 2011 ACM workshop on Gateway computing environments - GCE '11 |pages=21 |doi=10.1145/2110486.2110490 |date=2011-11-18 |last1=Marru |first1=Suresh |last2=Gardler |first2=Ross |last3=Slominski |first3=Aleksander |last4=Douma |first4=Ate |last5=Perera |first5=Srinath |last6=Weerawarana |first6=Sanjiva |last7=Gunathilake |first7=Lahiru |last8=Herath |first8=Chathura |last9=Tangchaisin |first9=Patanachai |last10=Pierce |first10=Marlon |last11=Mattmann |first11=Chris |last12=Singh |first12=Raminder |last13=Gunarathne |first13=Thilina |last14=Chinthaka |first14=Eran |isbn=9781450311236 |s2cid=18341808 }}
  • Apache Airflow, a general purpose workflow management system
  • Apache Taverna, widely used in bioinformatics, astronomy, biodiversity
  • BioBIKE, a cloud-based bioinformatics platform
  • Bioclipse, a graphical workbench with a scripting environment that lets users perform complex actions as a kind of workflow.
  • Collective Knowledge, a Python-based general workflow and experiment crowdsourcing framework with JSON API and cross-platform package manager
  • Common Workflow Language, a community-developed YAML-based workflow language, supported by multiple engine implementations.
  • Cuneiform, a functional workflow language.{{Cite journal |last1=Brandt |first1=Jörgen |last2=Bux |first2=Marc N. |last3=Leser |first3=Ulf |year=2015 |title=Cuneiform: A functional language for large scale scientific data analysis |url=http://ceur-ws.org/Vol-1330/paper-03.pdf |journal=Proceedings of the Workshops of the EDBT/ICDT |volume=1330 |pages=17–26}}
  • Clone Manager from Sci-Ed.
  • CLC bio, a bioinformatics analysis and workflow management platform from QIAGEN Digital Insights.
  • Discovery Net, one of the earliest examples of a scientific workflow system
  • Galaxy, initially targeted at genomics{{Cite journal |last1=Goecks |first1=J. |last2=Nekrutenko |first2=A. |last3=Taylor |first3=J. |last4=Galaxy Team |first4=T. |year=2010 |title=Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences |journal=Genome Biology |volume=11 |issue=8 |pages=R86 |doi=10.1186/gb-2010-11-8-r86 |pmc=2945788 |pmid=20738864 |doi-access=free}}
  • GenePattern, a scientific workflow system that provides access to hundreds of genomic analysis tools.{{Cite journal|last1=Reich|first1=Michael|last2=Liefeld|first2=Ted|last3=Gould|first3=Joshua|last4=Lerner|first4=Jim|last5=Tamayo|first5=Pablo|last6=Mesirov|first6=Jill P|title=GenePattern 2.0|journal=Nature Genetics|volume=38|issue=5|pages=500–501|doi=10.1038/ng0506-500|pmid=16642009|year=2006|s2cid=5503897}}
  • Kepler, a scientific workflow management system
  • KNIME, an open-source data analytics platform{{Cite journal |last1=Tiwari |first1=Abhishek |last2=Sekhar |first2=Arvind K.T. |year=2007 |title=Workflow based framework for life science informatics |journal=Computational Biology and Chemistry |volume=31 |issue=5–6 |pages=305–319 |doi=10.1016/j.compbiolchem.2007.08.009 |pmid=17931570}}
  • Nextflow, a bioinformatic data analysis workflow system
  • OnlineHPC, online scientific workflow designer and high performance computing toolkit
  • Orange, open source data visualization and analysis
  • Pegasus, an open-source scientific workflow management system{{Cite journal|last1=Deelman|first1=Ewa|author1-link= Ewa Deelman |last2=Vahi|first2=Karan|last3=Juve|first3=Gideon|last4=Rynge|first4=Mats|last5=Callaghan|first5=Scott|last6=Maechling|first6=Philip J.|last7=Mayani|first7=Rajiv|last8=Chen|first8=Weiwei|last9=Ferreira da Silva|first9=Rafael|last10=Livny|first10=Miron|last11=Wenger|first11=Kent|date=May 2015|title=Pegasus, a workflow management system for science automation|journal=Future Generation Computer Systems|language=en|volume=46|pages=17–35|doi=10.1016/j.future.2014.10.008|doi-access=free}}
  • Pipeline Pilot, graphical programming with many tools to address cheminformatics workflows{{cite web|url=http://accelrys.com/products/pipeline-pilot/ |title=BIOVIA Pipeline Pilot {{!}} Scientific Workflow Authoring Application for Data Analysis |website=Accelrys.com |access-date=2016-12-04}}
  • Swift parallel scripting language, a scripting language with many of the capabilities of scientific workflow systems built-in.
  • UGENE provides a workflow management system that is installed on a local computer{{Cite journal |last1=Okonechnikov |first1=K |last2=Golosova |first2=O |last3=Fursov |first3=M |last4=Ugene |first4=Team |year=2012 |title=Unipro UGENE: A unified bioinformatics toolkit |journal=Bioinformatics |volume=28 |issue=8 |pages=1166–7 |doi=10.1093/bioinformatics/bts091 |pmid=22368248 |doi-access=free}}
  • VisTrails, a scientific workflow system developed in Python{{Cite book |last1=Bavoil |first1=L. |title=VIS 05. IEEE Visualization, 2005 |last2=Callahan |first2=S.P. |last3=Crossno |first3=P.J. |last4=Freire |first4=J. |last5=Scheidegger |first5=C.E. |last6=Silva |first6=C.T. |last7=Vo |first7=H.T. |year=2005 |isbn=978-0-7803-9462-9 |pages=135–142 |chapter=VisTrails: Enabling Interactive Multiple-View Visualizations |doi=10.1109/VISUAL.2005.1532788}}

More than 280 computational data analysis workflow systems have been identified,{{cite web|url=https://s.apache.org/existing-workflow-systems|title=Existing Workflow systems|website=Common Workflow Language wiki|archive-url=https://web.archive.org/web/20191017094453/https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems|archive-date=2019-10-17|url-status=live}} although the distinction between data analysis workflows and scientific workflows is fluid, as not all analysis workflow systems are used for scientific purposes.

= Comparisons between bioinformatics workflow systems =

With a large number of bioinformatics workflow systems to choose from,{{cite web |title=Existing Workflow systems |url=https://s.apache.org/existing-workflow-systems |url-status=live |archive-url=https://web.archive.org/web/20191017094453/https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems |archive-date=2019-10-17 |access-date=2019-10-17 |website=Common Workflow Language wiki}} it is difficult to understand and compare the features of the different systems. Little work has been done to evaluate and compare the systems from a bioinformatician's perspective, especially regarding the data types they can handle, the built-in functionality they provide to the user, or their performance and usability. Examples of existing comparisons include:

  • The paper "Scientific workflow systems - can one size fit all?", which provides a high-level framework for comparing workflow systems based on their control flow and data flow properties. The systems compared include Discovery Net, Taverna, Triana, and Kepler, as well as YAWL and BPEL.
  • The paper "Meta-workflows: pattern-based interoperability between Galaxy and Taverna",{{Cite book |last1=Abouelhoda |first1=M. |title=Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science - Wands '10 |last2=Alaa |first2=S. |last3=Ghanem |first3=M. |year=2010 |isbn=9781450301886 |pages=1 |chapter=Meta-workflows |doi=10.1145/1833398.1833400 |s2cid=17343728}} which provides a more user-oriented comparison between Taverna and Galaxy in the context of enabling interoperability between the two systems.
  • The infrastructure paper "Delivering ICT Infrastructure for Biomedical Research"{{Citation |last1=Nyrönen |first1=TH |title=Delivering ICT infrastructure for biomedical research |pages=37–44 |year=2012 |series=Proceedings of the WICSA/ECSA 2012 Companion Volume (WICSA/ECSA '12) |publisher=ACM |doi=10.1145/2361999.2362006 |isbn=9781450315685 |s2cid=18199745 |display-authors=etal |last2=Laitinen |first2=J}} compares two workflow systems, Anduril and Chipster,{{Cite journal |last1=Kallio |first1=M. A. |last2=Tuimala |first2=J. T. |last3=Hupponen |first3=T |last4=Klemelä |first4=P |last5=Gentile |first5=M |last6=Scheinin |first6=I |last7=Koski |first7=M |last8=Käki |first8=J |last9=Korpelainen |first9=E. I. |year=2011 |title=Chipster: User-friendly analysis software for microarray and other high-throughput data |journal=BMC Genomics |volume=12 |pages=507 |doi=10.1186/1471-2164-12-507 |pmc=3215701 |pmid=21999641 |doi-access=free}} in terms of infrastructure requirements in a cloud-delivery model.
  • The paper "A review of bioinformatic pipeline frameworks"{{cite journal |last=Leipzig |first=J |name-list-style=vanc |date=2016 |title=A review of bioinformatic pipeline frameworks |journal=Briefings in Bioinformatics |volume=18 |issue=3 |pages=530–536 |doi=10.1093/bib/bbw020 |pmc=5429012 |pmid=27013646}} attempts to classify workflow management systems based on three dimensions: "using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface".

References

{{Reflist|30em}}