B1: Scheduling and Adaptive Execution of Data Analysis Workflows across Heterogeneous Infrastructures


The efficient implementation of complex data analysis workflows (DAWs) in various scientific disciplines requires deep knowledge of a large stack: an abstract DAW description, compilation of a logical plan, mapping onto the currently available infrastructure, and appropriate configuration of execution engines. Components and configurations developed for one computational infrastructure are often unsuitable for another, leading either to undesirable platform lock-in or to a considerable loss of efficiency.

The goal of subproject B1 is therefore to improve portability. To this end, we

  • compare DAW requirements with declarative descriptions of the available infrastructure,
  • profile both DAWs and infrastructure as needed, and
  • then map the DAWs onto the infrastructure using novel scheduling and load balancing (SLB) techniques to automatically optimize efficiency.
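As a rough illustration of these three steps, the following Python sketch matches declared task requirements against declared node capacities and then places each task greedily on the feasible node that would finish it earliest. It is not the subproject's actual SLB technique. All names (`Task`, `Node`, `fits`, `schedule`) and all numbers are hypothetical, and the sketch assumes independent tasks, one running task per node, and that `work` and `speed` have already been obtained by profiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Declared requirements of one workflow task (hypothetical model)."""
    name: str
    cpus: int
    memory_gb: float
    work: float  # abstract work units, assumed to come from profiling


@dataclass
class Node:
    """Declarative description of one machine (hypothetical model)."""
    name: str
    cpus: int
    memory_gb: float
    speed: float  # work units per second, assumed to come from a benchmark
    busy_until: float = 0.0  # time at which the node becomes free


def fits(task: Task, node: Node) -> bool:
    """Compare task requirements with the node description."""
    return task.cpus <= node.cpus and task.memory_gb <= node.memory_gb


def schedule(tasks: list[Task], nodes: list[Node]) -> dict[str, str]:
    """Greedy placement: largest tasks first, each on the feasible node
    with the earliest estimated finish time. Ignores task dependencies."""
    plan: dict[str, str] = {}
    for task in sorted(tasks, key=lambda t: t.work, reverse=True):
        candidates = [n for n in nodes if fits(task, n)]
        if not candidates:
            raise ValueError(f"no node satisfies the requirements of {task.name}")
        best = min(candidates, key=lambda n: n.busy_until + task.work / n.speed)
        best.busy_until += task.work / best.speed
        plan[task.name] = best.name
    return plan


tasks = [
    Task("align", cpus=8, memory_gb=32, work=400),
    Task("sort", cpus=2, memory_gb=4, work=100),
    Task("index", cpus=2, memory_gb=4, work=100),
]
nodes = [
    Node("big", cpus=16, memory_gb=64, speed=4.0),
    Node("small", cpus=4, memory_gb=8, speed=1.0),
]
plan = schedule(tasks, nodes)
print(plan)  # {'align': 'big', 'sort': 'small', 'index': 'big'}
```

In this toy setup, "align" only fits on the larger node, "sort" goes to the idle smaller node, and "index" returns to the larger node because it would still finish there first. A real scheduler would additionally have to respect the dependencies between DAW tasks and adapt to changing load.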

Ultimately, we aim to let scientists focus on the domain-specific challenges in their DAWs, while our new components automatically select and efficiently use the available computing infrastructure.





Jonathan Bader; Fabian Lehmann; Lauritz Thamsen; Jonathan Will; Ulf Leser; Odej Kao

Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters Inproceedings

In: 34th International Conference on Scientific and Statistical Database Management (SSDBM 2022), to appear, ACM, 2022.

Lauritz Thamsen; Dominik Scheinert; Jonathan Will; Jonathan Bader; Odej Kao

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview Journal Article

In: Datenbank-Spektrum, 2022.

Jonathan Will; Lauritz Thamsen; Jonathan Bader; Dominik Scheinert; Odej Kao

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing Miscellaneous



Jonathan Bader; Fabian Lehmann; Alexander Groth; Lauritz Thamsen; Dominik Scheinert; Jonathan Will; Ulf Leser; Odej Kao

Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures Inproceedings

In: 41st International Performance Computing and Communications Conference 2022, IEEE, 2022.



Jonathan Will; Lauritz Thamsen; Dominik Scheinert; Jonathan Bader; Odej Kao

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds Inproceedings

In: 2021 IEEE International Conference on Cloud Engineering (IC2E), pp. 43-52, 2021.


Dominik Scheinert; Alireza Alamgiralem; Jonathan Bader; Jonathan Will; Thorsten Wittkopp; Lauritz Thamsen

On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds Inproceedings

In: 2021 IEEE International Conference on Big Data (Big Data), pp. 3113-3118, 2021.

Jonathan Will; Onur Arslan; Jonathan Bader; Dominik Scheinert; Lauritz Thamsen

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud Inproceedings

In: 2021 IEEE International Conference on Big Data (Big Data), pp. 3141-3146, 2021.

Jonathan Bader; Lauritz Thamsen; Svetlana Kulagina; Jonathan Will; Henning Meyerhenke; Odej Kao

Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters Inproceedings

In: 2021 IEEE International Conference on Big Data (Big Data), pp. 65-75, 2021.


Jonathan Will; Jonathan Bader; Lauritz Thamsen

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs Inproceedings

In: 2020 IEEE International Conference on Big Data (Big Data), pp. 2851-2856, 2020.


Marcus Hilbrich; Sebastian Müller; Svetlana Kulagina; Christopher Lazik; Ninon De Mecquenem; Lars Grunske

A Consolidated View on Specification Languages for Data Analysis Workflows Proceeding Forthcoming

In: Automated Software Re-Engineering (ISoLA2022 · ASRE) (accepted), Forthcoming.