If you are a natural scientist using computational methods to study large data sets, you are likely to invest significant effort into your data analysis workflow (DAW). What happens to your DAW after your paper is published is determined by the portability, adaptability, and dependability of the workflow.
Phase I of FONDA brought together an interdisciplinary team of computer scientists and natural scientists to improve these three DAW characteristics using real data and real workflows from remote sensing, computational materials science, genomics, and biomedical image analysis. A few examples of FONDA’s contributions to each characteristic are described below.
Portability
DAW infrastructures have three basic components: the specification language, the execution engine, and the computational infrastructure. Portability, like adaptability and dependability, can be improved by targeting each of these components. Team T1 of FONDA created a DAW meta-model which can be used to improve portability between specification languages. B2 demonstrated how a hard-coded electron microscopy analysis, previously tied to a specific hardware configuration, could be reworked into a portable DAW using an execution engine. Finally, at the level of computational infrastructure, B1 developed novel methods for estimating the resource requirements of different workflow tasks and automatically selecting the best-fitting node to run each task. This was facilitated by the work of T4, which developed software packages for visually monitoring DAW execution.
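The node-selection idea can be illustrated with a minimal sketch. This is not FONDA's actual implementation; the resource fields and the best-fit heuristic (prefer the smallest node that still satisfies the task's estimated requirements, keeping large nodes free) are assumptions for illustration only.

```python
def best_fit_node(task_estimate, nodes):
    """Return the node whose free resources best fit the task's estimate.

    task_estimate: dict with estimated 'cpus' and 'mem_gb' for the task.
    nodes: list of dicts with currently free 'cpus' and 'mem_gb'.
    """
    # Keep only nodes that can satisfy the estimated requirements.
    candidates = [
        n for n in nodes
        if n["cpus"] >= task_estimate["cpus"]
        and n["mem_gb"] >= task_estimate["mem_gb"]
    ]
    if not candidates:
        return None  # no node can currently run this task
    # Best fit: minimize leftover resources so large nodes stay available.
    return min(
        candidates,
        key=lambda n: (n["cpus"] - task_estimate["cpus"],
                       n["mem_gb"] - task_estimate["mem_gb"]),
    )

nodes = [
    {"name": "small", "cpus": 4, "mem_gb": 16},
    {"name": "large", "cpus": 64, "mem_gb": 512},
]
task = {"cpus": 4, "mem_gb": 8}
print(best_fit_node(task, nodes)["name"])  # -> small
```

A real scheduler would also account for queueing, data locality, and prediction error in the estimates, but the core matching step looks much like this.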
If your workflow is portable at each of these levels, your colleagues will be able to implement your workflow in a different specification language, use a different execution engine, or run your workflow on their hardware without significant modifications. Furthermore, you will not have to modify your workflow significantly if you move to a new institution.
Adaptability
Many workflows are developed ad hoc to handle data generated during a particular project. As a result, they tend to be tightly bound not only to the infrastructure they were developed on, but also to specific types of input data.
To address adaptation to different infrastructures, A2 defined a workflow template language which allows users to mix-and-match equivalent tools to tailor their genomics workflows to different infrastructures. B5 developed a location-aware scheduling method which allows workflows to run effectively despite different underlying file systems and network configurations. Additionally, B4 deployed monitoring tools on FONDA’s infrastructure which identified configuration lapses that are common on unfamiliar infrastructure.
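The mix-and-match idea behind such a template language can be sketched in a few lines. This is a hypothetical illustration, not A2's actual language: each abstract step lists equivalent tools, and a concrete workflow is instantiated by picking, per step, a tool the target infrastructure provides. The tool names are examples from genomics, chosen for illustration.

```python
# Abstract workflow template: each step maps to interchangeable tools.
TEMPLATE = {
    "align_reads": ["bwa-mem2", "bowtie2", "minimap2"],
    "call_variants": ["gatk", "bcftools"],
}

def instantiate(template, available_tools):
    """Pick, for each step, the first equivalent tool available on the target system."""
    concrete = {}
    for step, alternatives in template.items():
        for tool in alternatives:
            if tool in available_tools:
                concrete[step] = tool
                break
        else:
            raise RuntimeError(f"no tool available for step {step!r}")
    return concrete

# The same template yields different concrete workflows on different clusters.
cluster_a = {"bowtie2", "bcftools"}
print(instantiate(TEMPLATE, cluster_a))
# -> {'align_reads': 'bowtie2', 'call_variants': 'bcftools'}
```

The key property is that the scientific intent lives in the template, while infrastructure-specific choices are deferred to instantiation time.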
Meanwhile, A5 created a set of modular tools which improve adaptability to new datasets by eliminating scanner- and protocol-related differences in the input data of neuroimaging workflows. Finally, A6 showed how DAWs can be adapted to handle interactive input for exploratory analysis.
If your workflow is adaptable, you and your colleagues can use it as a starting point for further analysis. You can adapt it to take advantage of slightly different tools, use different file structures, and accommodate variations in the input data.
Dependability
We consider a workflow dependable when it contains built-in controls to check that all prerequisites for a complete and accurate run are met. These checks make implicit assumptions explicit, prevent runtime errors, and detect incorrect results.
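A minimal sketch of such built-in controls, for a hypothetical CSV-processing task (the check names, column names, and value range are illustrative assumptions, not a FONDA API):

```python
def check_input(csv_text, required_columns):
    """Make implicit assumptions about the input explicit before running."""
    header = csv_text.splitlines()[0].split(",")
    missing = set(required_columns) - set(header)
    if missing:
        raise ValueError(f"input lacks required columns: {sorted(missing)}")

def check_output(values):
    """Detect incorrect results, e.g. probabilities outside [0, 1]."""
    if any(not (0.0 <= v <= 1.0) for v in values):
        raise ValueError("result outside expected range")

check_input("sample_id,score\ns1,0.9", ["sample_id", "score"])  # passes
check_output([0.1, 0.9])                                        # passes
```

Failing fast at the input check prevents a long run that was doomed from the start; the output check catches silent errors that would otherwise propagate downstream.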
In the first phase of FONDA, T2 developed tools for managing metadata from DAW executions. A1 compared provenance traces from valid and invalid DAW executions to identify which implicit assumptions must be fulfilled for a DAW to run successfully. B6 established monitoring techniques that adapt to fluctuating event rates and enable proactive error detection and correction.
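The trace-comparison idea can be sketched simply: properties that held in every successful run but are violated in a failing run are candidate implicit assumptions. This is an illustrative simplification, not A1's actual method, and the trace keys and values below are hypothetical.

```python
def candidate_assumptions(valid_traces, invalid_trace):
    """Return key/value pairs shared by all valid runs that the invalid run violates."""
    # Properties common to every successful execution.
    shared = set(valid_traces[0].items())
    for trace in valid_traces[1:]:
        shared &= set(trace.items())
    # Of those, keep the ones the failing execution does not satisfy.
    return {k: v for k, v in shared if invalid_trace.get(k) != v}

valid = [
    {"input_format": "fastq.gz", "ref_genome": "GRCh38", "threads": 8},
    {"input_format": "fastq.gz", "ref_genome": "GRCh38", "threads": 16},
]
failing = {"input_format": "fastq", "ref_genome": "GRCh38", "threads": 8}
print(candidate_assumptions(valid, failing))  # -> {'input_format': 'fastq.gz'}
```

Note how the thread count is correctly ruled out (it varied across valid runs), while the compressed input format surfaces as a previously implicit requirement.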
In A3, researchers identified similar and implausible results in the materials science database NOMAD. This allows methods and computations that produce implausible results to be audited for dependability and, if they systematically produce errors, avoided. B3 developed systematic debugging methods to help developers and domain scientists determine the root cause of runtime failures and of unexpected or implausible results. These subprojects also collaborated with teams T3 and T5, which implemented validity constraints in DAW specification languages and provided benchmarks for measuring dependability, respectively.
If your DAW is dependable, it will prompt users to meet the necessary preconditions for a successful run and inform users as soon as possible if the completeness or correctness of the run is threatened. You and your colleagues will be able to trust the output of your DAW and avoid wasting time and resources on runs that cannot produce valid results.
Building on the successes of phase I, the second phase of FONDA focuses on how to make DAWs more sustainable, usable, and multi-site (SUM).