Description
DAWs for scientific discoveries are often exploratory. The process of specifying a DAW is exploratory as well: the current DAW specification is adapted repeatedly based on the results of previous executions or on refined requirements. The interdisciplinary subproject A6 will investigate means to systematically support the exploratory process of DAW specification by developing a specification model for exploratory DAWs, its mapping to distributed DAW infrastructures, and abstractions for interactive exploratory DAWs that connect exploration spaces with states of DAW executions. It focuses on DAWs for genome analysis, which are often long and complex and whose development involves numerous design choices and time-consuming trial-and-error phases. It will team up especially with subproject A1 on analyzing traces of DAW executions and with B3 on the problem of mapping logged events back to the abstract tasks that produced them. The subproject addresses the phases of DAW modification, specification, and deployment. It will be carried out jointly by Dr. Kehr, an expert in large-scale genome analysis methods, and Prof. Weidlich, an expert in workflow management and mining.
PIs
Publications
2022
Nourhan Elfaramawy
Interactive Workflows for Exploratory Data Analysis Inproceedings
In: Bao, Zhifeng; Sellis, Timos (Eds.): Proceedings of the VLDB 2022 PhD Workshop co-located with the 48th International Conference on Very Large Databases (VLDB 2022), Sydney, Australia, September 5, 2022, CEUR-WS.org, 2022.
@inproceedings{elfaramawy2022Interactive,
title = {Interactive Workflows for Exploratory Data Analysis},
author = {Nourhan Elfaramawy},
editor = {Zhifeng Bao and Timos Sellis},
url = {http://ceur-ws.org/Vol-3186/paper_2.pdf},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Proceedings of the VLDB 2022 PhD Workshop co-located with the 48th
International Conference on Very Large Databases (VLDB 2022), Sydney,
Australia, September 5, 2022},
volume = {3186},
publisher = {CEUR-WS.org},
series = {CEUR Workshop Proceedings},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
2021
Thomas Krannich; W Timothy J White; Sebastian Niehus; Guillaume Holley; Bjarni V Halldórsson; Birte Kehr
Population-scale detection of non-reference sequence variants using colored de Bruijn graphs Journal Article
In: Bioinformatics, vol. 38, no. 3, pp. 604-611, 2021, ISSN: 1367-4803.
@article{10.1093/bioinformatics/btab749,
title = {Population-scale detection of non-reference sequence variants using colored de Bruijn graphs},
author = {Thomas Krannich and W Timothy J White and Sebastian Niehus and Guillaume Holley and Bjarni V Halldórsson and Birte Kehr},
url = {https://doi.org/10.1093/bioinformatics/btab749},
doi = {10.1093/bioinformatics/btab749},
issn = {1367-4803},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
journal = {Bioinformatics},
volume = {38},
number = {3},
pages = {604--611},
abstract = {With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. Supplementary data are available at Bioinformatics online.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}