Research Area A: Abstractions for DAW Specification Languages

This research area comprises six subprojects that research novel abstractions for DAW specification languages and how they are turned into logical and physical plans by a DAW engine. The aim of all subprojects is to increase the level of portability, adaptability, and dependability of DAWs, DAW specification languages, and DAW engines. All subprojects within this area (except A1) are interdisciplinary, i.e., have one PI from Computer Science and one from a Natural Science.

A1: Foundations of Data Analysis Workflow Validation (Schweikardt, Weidlich)
An important aspect of DAW dependability is the systematic detection and avoidance of misguided executions. Subproject A1 will approach this problem as a query discovery problem: Given a set of execution traces of a DAW or a family of DAWs, find a set of concise queries over the log stream that separate runs that succeeded from those that failed. In contrast to statistical methods for failure prediction, query discovery has the advantage that the DAW developer can understand the resulting queries more easily, which makes it possible to adapt DAWs so that they avoid problematic situations. An important cooperation will be with B6, which focuses on detecting abnormal behavior at runtime. The project is carried out in cooperation between Prof. Schweikardt, an expert in logic and database theory, and Prof. Weidlich, an expert in event stream processing.
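The query discovery problem stated above can be illustrated with a minimal sketch. It is not the method of subproject A1: the event names, the toy traces, and the restriction to short ordered event patterns ("a is later followed by b") are all assumptions made for illustration only.

```python
from itertools import product


def matches(trace, pattern):
    """True if the events of `pattern` occur in `trace` in this order
    (not necessarily adjacent)."""
    remaining = iter(trace)
    # `event in remaining` consumes the iterator up to the first hit,
    # so later pattern events are only searched in the rest of the trace.
    return all(event in remaining for event in pattern)


def discover_queries(succeeded, failed, max_len=2):
    """Return the shortest patterns that match every failed trace
    and no succeeded trace (brute force over all candidates)."""
    alphabet = sorted({e for trace in succeeded + failed for e in trace})
    for length in range(1, max_len + 1):
        found = [
            pattern
            for pattern in product(alphabet, repeat=length)
            if all(matches(t, pattern) for t in failed)
            and not any(matches(t, pattern) for t in succeeded)
        ]
        if found:
            return found
    return []


# Hypothetical log streams, one list of event types per DAW run.
succeeded = [
    ["start", "stage_in", "align", "done"],
    ["start", "stage_in", "retry", "align", "done"],
]
failed = [
    ["start", "stage_in", "align", "retry"],
    ["start", "align", "retry", "stage_in"],
]

queries = discover_queries(succeeded, failed)
print(queries)  # [('align', 'retry')]
```

In this toy data no single event separates the two groups, since a retry also occurs in a successful run. The discovered query, "an align event is later followed by a retry", is the kind of concise and readable condition that a developer could act on directly.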

A2: Adapting Genomic Data Analysis Workflows for Different Data Access Patterns (Leser, Reinert)
DAW runtime in distributed infrastructures is often dominated by the time required for data access and data exchange (DADE), which in turn depends on the data being analyzed, the tasks being executed, and the infrastructure on which a DAW runs. Changes in any of these aspects can quickly lead to deteriorating runtimes when a DAW is not adapted properly. Subproject A2 investigates methods that can adapt a given DAW to new input data or a different infrastructure with the goal of keeping runtime low. A2 is an interdisciplinary project; it will develop its research using DAWs for large-scale genome data analysis, which are typically I/O-heavy and thus particularly depend on proper DADE operations. It will cooperate intensively with subproject A6 by also testing its newly developed methods on DAWs for finding structural genomic variations, and it will use the hardware abstractions developed in B1. It will be carried out by Prof. Reinert, an expert in data structures and algorithms for genomic data, and Prof. Leser, an expert in optimization of UDF-heavy DAWs.

A3: Deriving Trust Levels for Multi-Choice Data Analysis Workflows (Draxl, Grunske)
DAWs often contain series of analysis tasks where a multitude of possible programs exists for each task (multi-choice DAW). A comprehensive characterization of the input data often requires executing several of these programs to combine their respective strengths and avoid their specific weaknesses. However, running all configurations is often not feasible, which raises the problem of finding the best possible combination to increase dependability and to maximize accuracy. A3 will approach this problem by adapting proven quality assurance and software testing techniques to the specific issues in multi-choice DAWs. Important cooperations exist with A6, which also analyzes variants of DAWs, and with B3 in the field of software test techniques for DAWs. A3 is an interdisciplinary project and will develop its methods for the problem of computing physical properties of simulated new materials at extremely large scale. The subproject will be carried out by Prof. Draxl, one of the main initiators of the NOMAD project, a large-scale repository of comprehensively characterized materials that is unique worldwide, and Prof. Grunske, an expert in search-based derivation of test cases and robustness estimates for complex software systems.

A5: Dependability, Adaptability and Uncertainty Quantification for Data Analysis Workflows in Large-Scale Biomedical Image Analysis (Kainmüller, Ritter)
Scientific data analysis workflows (DAWs) increasingly include tasks that implement some form of large-scale machine learning (ML). This is particularly true in biomedical image analysis, where many recent breakthroughs are rooted in the application of advanced ML for image analysis based on increasingly large training data sets. Such ML-heavy DAWs have the disadvantage of not being dependable in terms of the quality of predictions on real-life test data, one reason being a lack of adaptability to varying data distributions. In this subproject, we aim at improving the abilities of DAWs for ML-based biomedical image analysis by means of automated assessment of the suitability of their models for new data sets, and adaptation of these models, whenever possible and appropriate, to test data that is different from the respective training data.

A6: Data Analysis Workflows for Interactive Scientific Exploration (Kehr, Weidlich)
DAWs for scientific discoveries are often exploratory. Furthermore, the process of specifying a DAW is itself exploratory, involving the repeated adaptation of the current DAW specification based on results of previous executions or on refined requirements. The interdisciplinary subproject A6 will investigate means to systematically support the exploratory process of DAW specification by developing a specification model for exploratory DAWs, its mapping to distributed DAW infrastructures, and abstractions for interactive exploratory DAWs that connect exploration spaces with states of DAW executions. It focuses on DAWs for genome analysis, which are often long and complex and whose development involves numerous design choices and time-consuming trial-and-error phases. It will team up especially with subproject A1 on analyzing traces of DAW executions and with B3 regarding the problem of mapping logged events back to the abstract tasks that produced them. The subproject addresses the phases of DAW modification, specification, and deployment. The subproject will be carried out jointly by Dr. Kehr, an expert in large-scale genome analysis methods, and Prof. Weidlich, an expert in workflow management and mining.