This research area comprises six subprojects that investigate novel abstractions for the computational infrastructure underlying DAW execution, which encompasses DAW execution engines, schedulers, resource managers, and the configuration of the underlying network. The subprojects in this area focus on improving the PAD compliance of infrastructures and DAW execution engines.
B1: Scheduling and Adaptive Execution of Data Analysis Workflows across Heterogeneous Infrastructures (Kao, Meyerhenke)
The efficient implementation of complex DAWs in various scientific disciplines requires deep knowledge of a large stack consisting of an abstract DAW description, the compilation of a logical plan, the mapping onto the currently available infrastructure, and the appropriate configuration of execution engines. Components and configurations developed for one computational infrastructure are often unsuitable for another, leading either to an undesirable platform lock-in or to a considerable loss of efficiency.
The goal of subproject B1 is therefore to improve portability. To this end, we
- compare DAW requirements with declarative descriptions of the available infrastructure,
- profile both DAWs and infrastructure as needed, and
- then map the DAWs onto the infrastructure using novel scheduling and load balancing (SLB) techniques to automatically optimize efficiency.
Ultimately, we aim to allow scientists to focus on the domain-specific challenges in their DAWs, while our new components automatically select and efficiently use the available computing infrastructure.
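The three steps above (compare requirements with infrastructure descriptions, profile, then map) can be illustrated with a minimal sketch. It is not the SLB technique developed in B1. It assumes independent tasks, one task per node at a time, and a plain greedy earliest-finish heuristic. All names (`Task`, `Node`, `feasible_nodes`, `greedy_schedule`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cores: int
    memory_gb: float
    work: float  # abstract work units, e.g. estimated by profiling (assumption)


@dataclass(frozen=True)
class Node:
    name: str
    cores: int
    memory_gb: float
    speed: float  # work units per second, e.g. from a benchmark (assumption)


def feasible_nodes(task, nodes):
    """Compare the task's requirements with the declared node capacities."""
    return [n for n in nodes
            if n.cores >= task.cores and n.memory_gb >= task.memory_gb]


def greedy_schedule(tasks, nodes):
    """Place the largest tasks first on the node that finishes them earliest."""
    busy_until = {n.name: 0.0 for n in nodes}
    placement = {}
    for task in sorted(tasks, key=lambda t: t.work, reverse=True):
        candidates = feasible_nodes(task, nodes)
        if not candidates:
            raise ValueError(f"no node satisfies the requirements of {task.name}")
        best = min(candidates,
                   key=lambda n: busy_until[n.name] + task.work / n.speed)
        busy_until[best.name] += task.work / best.speed
        placement[task.name] = best.name
    return placement, max(busy_until.values())


# Hypothetical infrastructure description and DAW tasks.
nodes = [Node("small", cores=4, memory_gb=16, speed=1.0),
         Node("big", cores=32, memory_gb=256, speed=2.0)]
tasks = [Task("align", cores=16, memory_gb=128, work=100),
         Task("filter", cores=2, memory_gb=4, work=20),
         Task("plot", cores=1, memory_gb=2, work=10)]

placement, makespan = greedy_schedule(tasks, nodes)
print(placement, makespan)
```

Because the infrastructure is passed in as data rather than hard-coded, the same DAW description can be mapped onto a different set of nodes without modification, which is the portability aspect this sketch is meant to convey.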
B2: Portable and Adaptive Data Analysis Workflows for Real-Time 3D Vision (Eisert, Koch)
Within this project, we plan to establish an abstract description of common components of 3D vision data analysis workflows (DAWs) that allows for efficient distribution across different computing hardware infrastructures as well as simple adaptation to different experimental settings and sensors. The work includes the analysis, modularization, and optimization of DAWs and vision algorithms with respect to computational and memory demands, scalability, data dependencies, and adaptability. The focus will initially be on three specific types of 3D vision / 3D reconstruction DAWs in the area of microscopy with visible light and high-energy electrons. These DAWs differ widely in their memory use, data transfer rates, and suitable parallelization schemes: optical reflection tomography of fossils captured in amber on a microscopic scale, 3D atomic-resolution electron ptychography in the transmission electron microscope, and real-time tracking of microscopic FIB lamellae within a focused ion beam instrument.
B3: Debugging Distributed Data Analysis Workflows (Kehrer, Markl)
Like other software, DAWs may show unexpected behavior or even crash for various reasons. Debugging aims at establishing a cause-effect relationship between the observable problem and the actual error. Such error identification is the first step toward a reliable problem resolution, and debugging is therefore indispensable for increasing the dependability of DAWs. However, debugging DAWs is particularly challenging due to the heterogeneous nature of the involved tasks and the distributed nature of the execution engine. The central research question addressed in this subproject is how to enable domain scientists to efficiently formulate, test, and refine a debugging hypothesis in the context of scientific software engineering. The subproject will primarily work together with A3 on the adaptation of software test technologies to distributed DAWs and with B6 on the distributed monitoring of DAW executions. The subproject will be coordinated by Prof. Kehrer, an expert in model-based software development, and Prof. Markl, an expert in large-scale distributed data analytics.
B4: Exploiting Software-Defined Networks for Efficient Data Management in Next-Generation Data Analysis Workflows (Reinefeld, Scheuermann)
Running a given DAW on a computational infrastructure other than the one it was developed for often incurs severe performance penalties. One reason is that DAWs are typically designed for specific infrastructures, which leads to hard-coded decisions regarding file locations, file movement, or means of network-based data exchange between tasks. This subproject will investigate the use of software-defined networks (SDNs) to bring the data access requirements of the DAW and the capabilities of the underlying physical infrastructure closer together. It thus aims at improving the portability and adaptability of DAW execution engines by adapting the underlying infrastructure. Technically, it will develop a lightweight declarative specification language for annotating DAWs with their communication and computation demands, which connects to the work of A2 in the related field of data access patterns. It will furthermore cooperate with A2 on annotations for specifying data access properties and with B1 on the interplay of file placement and scheduling. The subproject will be led by Prof. Reinefeld, an expert in distributed management of large scientific data sets and high-performance computing, and Prof. Scheuermann, an expert in network protocols and communication systems.
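To make the idea of declarative demand annotations more concrete, the following sketch shows one way such annotations might look and how they could be checked against network capacities. The format is purely hypothetical and is not the specification language to be developed in B4. The names `TransferDemand` and `unmet_demands`, the host names, and all numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferDemand:
    """Hypothetical annotation on a DAW edge: data volume and time budget."""
    producer: str
    consumer: str
    volume_gb: float   # data exchanged between the two tasks
    deadline_s: float  # time within which the transfer should complete

    def required_gbit_per_s(self) -> float:
        # 1 gigabyte = 8 gigabit
        return self.volume_gb * 8 / self.deadline_s


def unmet_demands(demands, placement, link_gbit_per_s):
    """Return the demands that the current links cannot satisfy."""
    unmet = []
    for d in demands:
        src, dst = placement[d.producer], placement[d.consumer]
        if src == dst:
            continue  # both tasks run on the same host: no network transfer
        capacity = link_gbit_per_s.get((src, dst), 0.0)
        if capacity < d.required_gbit_per_s():
            unmet.append(d)
    return unmet


# Hypothetical DAW annotations, task placement, and link capacities.
demands = [TransferDemand("preprocess", "train", volume_gb=200, deadline_s=400),
           TransferDemand("train", "report", volume_gb=1, deadline_s=100)]
placement = {"preprocess": "hostA", "train": "hostB", "report": "hostB"}
links = {("hostA", "hostB"): 1.0}

unmet = unmet_demands(demands, placement, links)
print([(d.producer, d.consumer, d.required_gbit_per_s()) for d in unmet])
```

In this sketch the DAW only states what it needs, not where files live or how they move. A list of unmet demands like the one computed here is the kind of information that could be used to reconfigure the network or to revise task placement, rather than hard-coding such decisions into the DAW.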
B5: Adaptive, Distributed and Scalable Analysis of Massive Satellite Data (Hostert, Leser)
Subproject B5 investigates means to improve the adaptability and portability of DAWs for large-scale analysis of satellite data. DAWs for such problems are often long and involved and include tasks with very heterogeneous resource requirements (in terms of memory, runtime, and bandwidth). As the manner in which the concrete requirements of tools depend on the input data is unknown a priori, these DAWs are very difficult to schedule. Today, this problem is typically solved by hardwiring data- and infrastructure-dependent scheduling decisions into the DAW. B5 will research new methods for adaptive scheduling of DAWs that adapt to the concrete remote sensing scenario. B5 is an interdisciplinary subproject, led by Prof. Hostert, an expert in large-scale satellite data analysis, and Prof. Leser, an expert in workflow management systems for large-scale scientific data analysis.
B6: Distributed Run-Time Monitoring and Control of Data Analysis Workflows (Grunske, Rabl)
Executions of DAWs are driven by specifications of the individual steps of a data-processing pipeline and the data it processes. However, for a multitude of reasons, an execution may not function as desired, especially when distributed systems are used. This calls for proactive runtime monitoring of executions to detect and possibly resolve problems early on. B6 will investigate the foundations of distributed monitoring and of control systems for DAWs, which is a prerequisite for the design of dependable DAWs. Cooperation will be established especially with A1, on the notion of “abnormal” versus “normal” behavior of an execution, and with B3 regarding the monitoring of distributed executions. The subproject will be led by Prof. Grunske, an expert in automated software analysis for error detection, and Prof. Rabl, an expert in efficient distributed streaming engines.