Continued from A2: Adapting Genomic Data Analysis Workflows for Different Data Access Patterns
Description
Energy efficiency in big data processing is becoming increasingly important for environmental and sustainability reasons. We will investigate new techniques and algorithms to estimate the energy consumption of a Data Analysis Workflow (DAW) on a given infrastructure, and apply multi-objective optimization techniques to rewrite DAWs so that they consume less energy.
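To illustrate the multi-objective view taken here, the following sketch computes the Pareto front over candidate DAW configurations that trade estimated energy consumption against makespan. All configuration names and numbers are invented for illustration; this is a minimal sketch of Pareto dominance, not any method from the publications below.

```python
# Hypothetical illustration: picking Pareto-optimal DAW configurations
# when minimizing both estimated energy consumption and makespan.
# All candidate names and values are invented for this example.

def pareto_front(candidates):
    """Return the candidates not dominated in (energy, makespan).

    A candidate dominates another if it is no worse in both
    objectives and strictly better in at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["energy"] <= c["energy"]
            and o["makespan"] <= c["makespan"]
            and (o["energy"] < c["energy"] or o["makespan"] < c["makespan"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Invented candidate DAW configurations (energy in kJ, makespan in min).
candidates = [
    {"name": "baseline",    "energy": 120.0, "makespan": 40.0},
    {"name": "fewer-nodes", "energy": 80.0,  "makespan": 55.0},
    {"name": "dvfs-capped", "energy": 90.0,  "makespan": 50.0},
    {"name": "wasteful",    "energy": 130.0, "makespan": 45.0},  # dominated by baseline
]
front = pareto_front(candidates)
```

A DAW rewriter would then present the front to the user, or pick one point from it according to a policy such as a hard deadline or an energy budget.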

Scientists
- Somayeh Mohammadi
- Philipp Thamm
Publications
2025
Lehmann, Fabian; Bader, Jonathan; Tschirpke, Friedrich; Mecquenem, Ninon De; Lößer, Ansgar; Becker, Sören; Lewińska, Katarzyna Ewa; Thamsen, Lauritz; Leser, Ulf
WOW: Workflow-Aware Data Movement and Task Scheduling for Dynamic Scientific Workflows Proceedings Article
In: 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Tromsø, Norway, 2025.
@inproceedings{lehmannWOW2025,
title = {WOW: Workflow-Aware Data Movement and Task Scheduling for Dynamic Scientific Workflows},
author = {Fabian Lehmann and Jonathan Bader and Friedrich Tschirpke and Ninon De Mecquenem and Ansgar Lößer and Sören Becker and Katarzyna Ewa Lewińska and Lauritz Thamsen and Ulf Leser},
year = {2025},
date = {2025-05-01},
urldate = {2025-05-01},
booktitle = {2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)},
address = {Tromsø, Norway},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
2024
Mohammadi, Somayeh; PourKarimi, Latif; Zschäbitz, Manuel; Aretz, Tristan; Mecquenem, Ninon De; Leser, Ulf; Reinert, Knut
Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures Workshop
2024.
@workshop{mohammadi2024granularity,
title = {Optimizing Job/Task Granularity for Metagenomic Workflows in Heterogeneous Cluster Infrastructures},
author = {Somayeh Mohammadi and Latif PourKarimi and Manuel Zschäbitz and Tristan Aretz and Ninon De Mecquenem and Ulf Leser and Knut Reinert},
editor = {George Fletcher and Verena Kantere},
url = {https://dastlab.github.io/edbticdt2024/?contents=main.html},
year = {2024},
date = {2024-03-25},
urldate = {2024-03-25},
abstract = {Data analysis workflows are popular for sequencing activities in large-scale and complex scientific processes. Scheduling approaches attempt to find an appropriate assignment of workflow tasks to the computing nodes for minimizing the makespan in heterogeneous cluster infrastructures. A common feature of these approaches is that they already know the structure of the workflow. However, for many workflows, a high degree of parallelization can be achieved by splitting the large input data of a single task into chunks and processing them independently. We call this problem task granularity which involves finding an assignment of tasks to computing nodes and simultaneously optimizing the structure of a bag of tasks. Accordingly, this paper addresses the problem of task granularity for metagenomic workflows. To this end, we first formulated the problem as a mathematical model. We then solved the proposed model using the genetic algorithm. To overcome the challenge of not knowing the number of tasks, we adjusted the number of tasks as a factor of the number of computing nodes. The procedure of increasing the number of tasks is performed interactively and evolutionarily. Experimental results showed that a desirable makespan value can be achieved after a few steps of the increase.},
howpublished = {Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference, Paestum, Italy, March 25--28, 2024},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
Pan, Chenxu; Reinert, Knut
A simple refined DNA minimizer operator enables 2-fold faster computation Journal Article
In: Bioinformatics, vol. 40, no. 2, pp. btae045, 2024, ISSN: 1367-4811.
@article{10.1093/bioinformatics/btae045,
title = {A simple refined DNA minimizer operator enables 2-fold faster computation},
author = {Chenxu Pan and Knut Reinert},
url = {https://doi.org/10.1093/bioinformatics/btae045},
doi = {10.1093/bioinformatics/btae045},
issn = {1367-4811},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Bioinformatics},
volume = {40},
number = {2},
pages = {btae045},
abstract = {The minimizer concept is a data structure for sequence sketching. The standard canonical minimizer selects a subset of k-mers from the given DNA sequence by comparing the forward and reverse k-mers in a window simultaneously according to a predefined selection scheme. It is widely employed by sequence analysis such as read mapping and assembly. k-mer density, k-mer repetitiveness (e.g. k-mer bias), and computational efficiency are three critical measurements for minimizer selection schemes. However, there exist trade-offs between kinds of minimizer variants. Generic, effective, and efficient are always the requirements for high-performance minimizer algorithms. We propose a simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute. However, it can improve the k-mer repetitiveness, especially for the lexicographic order. It applies to other selection schemes of total orders (e.g. random orders). Moreover, it is computationally efficient and the density is close to that of the standard minimizer. The refined minimizer may benefit high-performance applications like binning and read mapping. The source code of the benchmark in this work is available at the GitHub repository https://github.com/xp3i4/mini_benchmark},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Lehmann, Fabian; Bader, Jonathan; Mecquenem, Ninon De; Wang, Xing; Bountris, Vasilis; Friederici, Florian; Leser, Ulf; Thamsen, Lauritz
Ponder: Online Prediction of Task Memory Requirements for Scientific Workflows Proceedings Article
In: 2024 IEEE 20th International Conference on e-Science (e-Science), pp. 1-10, 2024.
@inproceedings{lehmannPonder2024,
title = {Ponder: Online Prediction of Task Memory Requirements for Scientific Workflows},
author = {Fabian Lehmann and Jonathan Bader and Ninon De Mecquenem and Xing Wang and Vasilis Bountris and Florian Friederici and Ulf Leser and Lauritz Thamsen},
url = {https://ieeexplore.ieee.org/document/10678682},
doi = {10.1109/e-Science62913.2024.10678682},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
booktitle = {2024 IEEE 20th International Conference on e-Science (e-Science)},
pages = {1-10},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Sänger, Mario; Mecquenem, Ninon De; Lewińska, Katarzyna Ewa; Bountris, Vasilis; Lehmann, Fabian; Leser, Ulf; Kosch, Thomas
A Qualitative Assessment of Using ChatGPT as Large Language Model for Scientific Workflow Development Journal Article
In: GigaScience, 2024, ISSN: 2047-217X.
@article{saenger2024a,
title = {A Qualitative Assessment of Using ChatGPT as Large Language Model for Scientific Workflow Development},
author = {Mario Sänger and Ninon De Mecquenem and Katarzyna Ewa Lewińska and Vasilis Bountris and Fabian Lehmann and Ulf Leser and Thomas Kosch},
url = {https://doi.org/10.1093/gigascience/giae030},
doi = {10.1093/gigascience/giae030},
issn = {2047-217X},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {GigaScience},
abstract = {Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets, as they offer reproducibility, dependability, and scalability of analyses by automatic parallelization on large compute clusters. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution. Simultaneously, user-supporting tools are rare, and the number of available examples is much lower than in classical programming languages. To address these challenges, we investigate the efficiency of large language models (LLMs), specifically ChatGPT, to support users when dealing with scientific workflows. We performed 3 user studies in 2 scientific domains to evaluate ChatGPT for comprehending, adapting, and extending workflows. Our results indicate that LLMs efficiently interpret workflows but achieve lower performance for exchanging components or purposeful workflow extensions. We characterize their limitations in these challenging scenarios and suggest future research directions. Our results show a high accuracy for comprehending and explaining scientific workflows while achieving a reduced performance for modifying and extending workflow descriptions. These findings clearly illustrate the need for further research in this area.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Schintke, Florian; Belhajjame, Khalid; Mecquenem, Ninon De; Frantz, David; Guarino, Vanessa Emanuela; Hilbrich, Marcus; Lehmann, Fabian; Missier, Paolo; Sattler, Rebecca; Sparka, Jan Arne; Speckhard, Daniel T.; Stolte, Hermann; Vu, Anh Duc; Leser, Ulf
Validity constraints for data analysis workflows Journal Article
In: Future Generation Computer Systems, vol. 157, pp. 82–97, 2024, ISSN: 0167-739X.
@article{SCHINTKE2024,
title = {Validity constraints for data analysis workflows},
author = {Florian Schintke and Khalid Belhajjame and Ninon De Mecquenem and David Frantz and Vanessa Emanuela Guarino and Marcus Hilbrich and Fabian Lehmann and Paolo Missier and Rebecca Sattler and Jan Arne Sparka and Daniel T. Speckhard and Hermann Stolte and Anh Duc Vu and Ulf Leser},
url = {https://www.sciencedirect.com/science/article/pii/S0167739X24001079},
doi = {10.1016/j.future.2024.03.037},
issn = {0167-739X},
year = {2024},
date = {2024-01-01},
urldate = {2024-01-01},
journal = {Future Generation Computer Systems},
volume = {157},
pages = {82--97},
abstract = {Porting a scientific data analysis workflow (DAW) to a cluster infrastructure, a new software stack, or even only a new dataset with some notably different properties is often challenging. Despite the structured definition of the steps (tasks) and their interdependencies during a complex data analysis in the DAW specification, relevant assumptions may remain unspecified and implicit. Such hidden assumptions often lead to crashing tasks without a reasonable error message, poor performance in general, non-terminating executions, or silent wrong results of the DAW, to name only a few possible consequences. Searching for the causes of such errors and drawbacks in a distributed compute cluster managed by a complex infrastructure stack, where DAWs for large datasets typically are executed, can be tedious and time-consuming. We propose validity constraints (VCs) as a new concept for DAW languages to alleviate this situation. A VC is a constraint specifying logical conditions that must be fulfilled at certain times for DAW executions to be valid. When defined together with a DAW, VCs help to improve the portability, adaptability, and reusability of DAWs by making implicit assumptions explicit. Once specified, VCs can be controlled automatically by the DAW infrastructure, and violations can lead to meaningful error messages and graceful behaviour (e.g., termination or invocation of repair mechanisms). We provide a broad list of possible VCs, classify them along multiple dimensions, and compare them to similar concepts one can find in related fields. We also provide a proof-of-concept implementation for the workflow system Nextflow.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
2023
Vu, Anh Duc; Sparka, Jan Arne; De Mecquenem, Ninon; Kehrer, Timo; Leser, Ulf; Grunske, Lars
Contract-Driven Design of Scientific Data Analysis Workflows Proceedings Article
In: 2023 IEEE 19th International Conference on e-Science (e-Science), pp. 1-10, IEEE, 2023, ISSN: 2325-3703.
@inproceedings{vu2023contractdriven,
title = {Contract-Driven Design of Scientific Data Analysis Workflows},
author = {Anh Duc Vu and Jan Arne Sparka and De Mecquenem, Ninon and Timo Kehrer and Ulf Leser and Lars Grunske},
doi = {10.1109/e-Science58273.2023.10254898},
issn = {2325-3703},
year = {2023},
date = {2023-10-11},
urldate = {2023-10-11},
booktitle = {2023 IEEE 19th International Conference on e-Science (e-Science)},
pages = {1-10},
publisher = {IEEE},
abstract = {Software systems enabling large-scale data analysis workflows (DAWs) are a key technology for many scientific disciplines, as they allow extracting new insights from experimental results. DAWs are (non-)linear pipelines composed of multiple interdependent tasks that are executed in a distributed fashion on large compute clusters. In science, the individual task implementations are developed by research groups all over the world and usually not tested outside a narrow scope of possible inputs, parameters, and infrastructures. As a result, the operations' correctness depends on many implicit assumptions. Among others this includes the completeness and suitability of input data, infrastructure properties such as available cores or main memory, etc. This combination of complexity, distribution and untested components makes quality assurance of DAWs a critical issue. In this paper, we propose to address this problem by introducing a contract-driven approach to DAW design and implementation. Following this method, DAW developers specify contracts in the form of requirements and promises for each task of a DAW. These contracts serve as guards to ensure that tasks run in a proper environment and produce correct results. We provide the first formal definition of contracts for DAWs and show how they are connected to DAW scheduling and execution. As a proof of concept, we extended Nextflow, a popular scientific workflow system, with contracts and defined a light-weight DSL for their specification. We exemplify the power of a contract-driven approach to DAW development by enhancing several real-world DAWs from Bioinformatics to capture typical problems during their execution and show how the specific notifications issued by broken contracts help debugging the DAWs.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Vu, Anh Duc; Sparka, Jan Arne; De Mecquenem, Ninon; Kehrer, Timo; Leser, Ulf; Grunske, Lars
Design by Contract Revisited in the Context of Scientific Data Analysis Workflows Proceedings Article
In: 2023 IEEE 19th International Conference on e-Science (e-Science), pp. 1-2, IEEE, 2023, ISSN: 2325-3703.
@inproceedings{vu2023designbycontract,
title = {Design by Contract Revisited in the Context of Scientific Data Analysis Workflows},
author = {Anh Duc Vu and Jan Arne Sparka and De Mecquenem, Ninon and Timo Kehrer and Ulf Leser and Lars Grunske},
doi = {10.1109/e-Science58273.2023.10254869},
issn = {2325-3703},
year = {2023},
date = {2023-10-11},
urldate = {2023-10-11},
booktitle = {2023 IEEE 19th International Conference on e-Science (e-Science)},
pages = {1-2},
publisher = {IEEE},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Schintke, Florian; De Mecquenem, Ninon; Frantz, David; Guarino, Vanessa Emanuela; Hilbrich, Marcus; Lehmann, Fabian; Sattler, Rebecca; Sparka, Jan Arne; Speckhard, Daniel T.; Stolte, Hermann; Vu, Anh Duc; Leser, Ulf
Validity Constraints for Data Analysis Workflows Miscellaneous
2023.
@misc{schintke2023validity,
title = {Validity Constraints for Data Analysis Workflows},
author = {Florian Schintke and De Mecquenem, Ninon and David Frantz and Vanessa Emanuela Guarino and Marcus Hilbrich and Fabian Lehmann and Rebecca Sattler and Jan Arne Sparka and Daniel T. Speckhard and Hermann Stolte and Anh Duc Vu and Ulf Leser},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
keywords = {},
pubstate = {published},
tppubtype = {misc}
}
Mohammadi, Somayeh; PourKarimi, Latif; Droop, Felix; De Mecquenem, Ninon; Leser, Ulf; Reinert, Knut
A mathematical programming approach for resource allocation of data analysis workflows on heterogeneous clusters Journal Article
In: The Journal of Supercomputing, pp. 1–30, 2023.
@article{mohammadi2023mathematical,
title = {A mathematical programming approach for resource allocation of data analysis workflows on heterogeneous clusters},
author = {Somayeh Mohammadi and Latif PourKarimi and Felix Droop and De Mecquenem, Ninon and Ulf Leser and Knut Reinert},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {The Journal of Supercomputing},
pages = {1–30},
publisher = {Springer},
keywords = {},
pubstate = {published},
tppubtype = {article}
}