General Information

Essentially all scientific disciplines are generating ever-increasing amounts of data. To derive scientific discoveries, these data sets are analyzed by complex data analysis workflows (DAWs), which are series of discrete analysis programs arranged in (often non-linear) pipelines. Because they usually deal with very large data sets, DAWs must be executed on distributed and/or parallel computational infrastructures. Traditionally, DAWs are optimized for speed, which leads to solutions that are hard to reproduce and share and that are tightly bound to exactly one type of input. However, as summarized at a recent NSF/DOE workshop that brought together the workflow and HPC communities, “… human productivity arguably still is the most expensive resource, trumping power, performance, and other factors …” [1].

The proposed CRC FONDA – “Foundations of workflows for large-scale scientific data analysis” – will take up this observation and investigate methods for increasing productivity in the development, execution, and maintenance of DAWs for large scientific data sets. Our long-term goal is to develop methods and tools that achieve substantial reductions in the development time and development cost of DAWs. We will approach this goal from a fundamental perspective, i.e., we aim to find new abstractions, models, and algorithms that can eventually form the basis of a new class of DAW infrastructures.

Phase I

To advance these goals, the first phase of FONDA focused on three critical properties of DAWs and DAW engines: portability, adaptability, and dependability (PAD). By focusing on these properties, researchers made progress towards answering questions such as:

  • How can we build DAWs and DAW engines that enable portability of analysis across different infrastructures?
  • How must DAWs be designed to adapt to changing input data or slightly changing requirements?
  • How can we build dependable DAW systems that are aware of and control their own limitations and preconditions?

Phase II

In FONDA’s second phase, we lift several assumptions that were made to narrow the scope of the first phase, and we introduce three new themes. Specifically, we intend to improve the sustainability and usability of DAWs and to enable their multi-site execution (SUM). This will allow us to address the following questions:

  • How can DAWs be made technologically and environmentally sustainable?
  • How can we make DAWs more usable for non-developers?
  • How can we execute DAWs effectively on compute infrastructure and data distributed across multiple sites?

Data Analysis Workflows are bridges between two worlds

DAWs are bridges between two worlds: first, the specific scientific discipline using a DAW, and second, Computer Science, which builds the infrastructures necessary for developing and executing DAWs. Developing novel foundations for scientific DAWs thus requires close interaction between these two worlds. FONDA implements this idea by building on an interdisciplinary group of PIs from Computer Science, Materials Science, Geosciences, and the Life Sciences. Through these collaborations, FONDA’s research results will be continuously validated against relevant, current scientific problems from different fields of the natural sciences.

References

  1. Ewa Deelman, T. Peterka, Ilkay Altintas, C. Carothers, K. K. Dam, K. Moreland, M. Parashar, L. Ramakrishnan, M. Taufer, and J. Vetter (2015): The future of scientific workflows. Report of the DOE NGNS/CS scientific workflows workshop.

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 414984028 – SFB 1404 FONDA

For Further Information: