Published: Vol. 2 (1): 108–121, 2020
FAIR Computational Workflows
Abstract & Keywords
Abstract: Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right. This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development.
Keywords: Computational workflow, Reproducibility, Software, FAIR data, Provenance
Carole Goble acknowledges funding by BioExcel2 (H2020 823830), IBISBA1.0 (H2020 730976) and EOSCLife (H2020 824087). Daniel Schober's work was financed by PhenoMeNal (H2020 654241) during the initiation phase of this effort; his current work is an in-kind contribution. Kristian Peters is funded by the German Network for Bioinformatics Infrastructure (de.NBI) and acknowledges BMBF funding under grant number 031L0107. Stian Soiland-Reyes is funded by BioExcel2 (H2020 823830). Daniel Garijo and Yolanda Gil gratefully acknowledge support from DARPA award W911NF-18-1-0027, NIH award 1R01AG059874-01, and NSF award ICER-1740683.
Article and author information
Cite As
C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_00033
Carole Goble
C. Goble (corresponding author) and D. Schober initiated the effort and conceived the paper. C. Goble co-ordinated and led the writing, and edited the manuscript. C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters and D. Schober all contributed to the concepts, arguments, and written text. All reviewed the text.
Carole Goble is Professor of Computer Science at The University of Manchester. Over the past 25 years Carole has pursued research interests in the acceleration of FAIR scientific innovation through: distributed computing, workflows and automation; knowledge management and the Semantic Web; social, virtual environments; software engineering for scientific software; and new models of scholarship for data-intensive science. Carole has served on numerous committees and currently serves in the G7 Open Science Working Group as the UK expert. In 2008 she was awarded the Microsoft Jim Gray e-Science award for contributions to e-Science and in 2010 was elected a Fellow of the Royal Academy of Engineering. In 2014 she was awarded the Commander of the Order of the British Empire for services to Science.
Sarah Cohen-Boulakia
Sarah Cohen-Boulakia is a full Professor at the Laboratoire de Recherche en Informatique at Université Paris-Sud. She holds a PhD in Computer Science and a Habilitation from the same university. She has been working for fifteen years in multi-disciplinary groups involving computer scientists and biologists of various domains. She spent two years as a postdoctoral researcher at the University of Pennsylvania, USA and 18 months at the Institute of Computational Biology (IBC) of Montpellier, France. Dr. Cohen-Boulakia's research interests include provenance and design of scientific workflows, reproducibility of scientific experiments, and integration, querying and ranking in the context of biological and biomedical databases. She currently co-leads a national working group on reproducibility of scientific experiments and is involved in the European Research Infrastructure ELIXIR.
Stian Soiland-Reyes
Stian Soiland-Reyes is a Technical Architect in the eScience Lab, based in the Department of Computer Science at The University of Manchester. Since 2006 he has worked as a software engineer and researcher focusing on reproducibility, scientific workflows, interoperability, linked data, metadata and open science. He is a persistent advocate of Open Scholarly Communication, and is on the leadership team of the Common Workflow Language and on the Project Management Committee of several open source projects at the Apache Software Foundation. He co-created the Research Object model, contributed to the W3C provenance model PROV-O and multiple Linked Data initiatives. He is co-chair of the Research Object Crate team.
Daniel Garijo
Daniel Garijo is a computer scientist at the Information Sciences Institute of the University of Southern California. His research activities focus on e-Science and the Semantic Web, specifically on how to increase the understandability of software and scientific workflows using their associated provenance, metadata and intermediate results. Daniel was a member of the W3C Provenance Working Group to develop a standard for provenance on the Web, and he is currently collaborating with domain scientists to ease the description and composition of software in environmental and social sciences.
Yolanda Gil
Yolanda Gil is Director of Knowledge Technologies and Associate Division Director at the Information Sciences Institute of the University of Southern California, and Research Professor in Computer Science and in Spatial Sciences. She is also Associate Director of Interdisciplinary Programs in Data Science at USC. She received her M.S. and PhD degrees in Computer Science from Carnegie Mellon University, with a focus on artificial intelligence. Her research is on intelligent interfaces for knowledge capture and discovery, which she investigates in a variety of projects concerning knowledge-based planning and problem solving, information analysis and assessment of trust, semantic annotation and metadata, and community-wide development of knowledge bases. Dr. Gil collaborates with scientists in different domains on semantic workflows and metadata capture, social knowledge collection, computer-mediated collaboration, and automated discovery. She initiated and chaired the W3C Provenance Group that led to a community standard in this area. Dr. Gil is a Fellow of the Association for Computing Machinery (ACM), and Past Chair of its Special Interest Group in Artificial Intelligence. She is also a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), and was elected as its 24th President in 2016.
Michael R. Crusoe
D. Schober and M.R. Crusoe contributed examples.
Michael R. Crusoe is one of the co-founders of the Common Workflow Language project and the CWL Project Lead. His facilitation, technical contributions, and training on behalf of CWL draw from his time as the former lead developer of C. Titus Brown's khmer project, his previous career as a sysadmin and programmer, and his experiences in various Free and Open Source Software communities. This is not Michael's first time working on a standards project, as he was the technical author of the International Labour Organization's Seafarers' Identity Card (2003) standard, which is in force and ratified by 32 countries. Currently based in Berlin, Germany, Michael has been living in Europe for the last 4 years, where he has enjoyed partnering with ELIXIR, ASTRON, and the EOSCPilot to build collaborations across the continent and across the world.
Kristian Peters
Kristian Peters is currently working at the Leibniz Institute of Plant Biochemistry. He is part of the German Network for Bioinformatics Infrastructure (de.NBI) and is a member of the GoFAIR metabolomics implementation network and of the societies NBS and BLAM e.V., which focus on bryophyte biology and ecology. Having studied both Information Technology and Biology, he focuses his research on interdisciplinary work that integrates biochemistry, bioinformatics and ecology. His expertise in data integration covers a wide range of topics, including cloud e-infrastructures, statistics, machine learning, chemical ecology and ecometabolomics, targeted and untargeted metabolomics, plant and vegetation ecology, bryophyte biology, macro- and microscopy and climate change biology. His current research activities focus on the integrative data analysis and characterisation of compound classes of rare species in ecological contexts, creating scientific computational workflows for use cases in ecometabolomics and biomedicine, promoting the reproducibility and interoperability of software tools, and the adoption of standardized research objects and formats.
Daniel Schober
Daniel Schober, a trained neurobiologist, did his PhD in medical knowledge engineering at Charité Hospital, Berlin. He mainly works in the areas of symbolic artificial intelligence, ontology engineering, policy management and data standard development. Aside from his contributions to a multitude of description logics ontologies, he created best practices for the OBO Foundry (naming conventions) and developed open access XML standards for nuclear magnetic resonance data (nmrML). His foundational ontology research addresses the scale-dependency of ontological top level categories, i.e. towards advanced physics concepts that emerge on the micro- and macrocosmic scale. Currently he investigates the impact of semantic and syntactic data standards in contribution to FAIR Data, in particular to Galaxy computational workflows. He worked at the European Bioinformatics Institute in Cambridge, UK, then moved to IMBI Freiburg to work on medical data integration, and until recently worked in the mass spectrometry and bioinformatics department of the Leibniz Institute for Plant Biochemistry in Halle, Germany.
Publication records
Data Intelligence