PhD defence, Computational mathematics

Thesis defence

Date: Wednesday 27 August 2025

Time: 13.00 – 17.00

Location: Room 10, 2nd Floor, House 2, Albano Campus, Stockholm University

Alexander Petri will defend his PhD thesis in computational mathematics, titled "Computational methods for long-read sequencing data analysis", on Wednesday 27 August.

Respondent: Alexander Petri (Computational Mathematics)

Supervisor: Kristoffer Sahlin (Stockholm University)

Opponent: Shilpa Garg (University of Manchester)

Thesis Title: Computational methods for long-read sequencing data analysis

Abstract

This thesis presents algorithms developed for long-read sequencing techniques, which, since their introduction in the 2010s, have become increasingly important in modern bioscientific research. The first two papers cover the development of algorithms for our de novo transcriptome prediction pipeline, the isON pipeline, while the third paper describes an algorithm used for biotechnological analysis of ligated fragments.

Paper I introduces isONform, an algorithm that predicts different gene products, called isoforms, from a set of long reads sequenced from complementary DNA, without relying on a reference or annotation. The tool is part of a larger long-read transcriptome pipeline, the isON pipeline, in which clustering and error-correction steps precede isoform prediction. The isONform algorithm constructs a directed acyclic graph whose nodes are minimizer pairs and whose edges connect minimizer pairs that are adjacent on the reads. It then employs an iterative bubble-popping scheme to merge nodes and finally follows all distinct paths through the graph to generate the isoform predictions. The algorithm has been shown to outperform existing state-of-the-art algorithms, while achieving results comparable to approaches that require a reference genome and an annotation.
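The graph construction described above can be illustrated with a minimal sketch. It is not isONform's implementation: the function names, the parameters `k` and `w`, and the lexicographic minimizer ordering are assumptions made for illustration, and the sketch neither enforces acyclicity nor performs bubble popping.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Return (position, k-mer) for the smallest k-mer in each window of w consecutive k-mers.

    Assumption: plain lexicographic order, ties broken by leftmost position.
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        kmer, pos = min(kmers[start:start + w])
        # Consecutive windows often select the same minimizer; keep it once.
        if not picked or picked[-1] != (pos, kmer):
            picked.append((pos, kmer))
    return picked


def build_pair_graph(reads, k, w):
    """Nodes are pairs of consecutive minimizers; edges link pairs adjacent on a read."""
    edges = defaultdict(set)
    for read in reads:
        mins = [kmer for _, kmer in minimizers(read, k, w)]
        pairs = list(zip(mins, mins[1:]))
        for a, b in zip(pairs, pairs[1:]):
            edges[a].add(b)
    return edges


graph = build_pair_graph(["GATTACA"], k=3, w=2)
```

Reads that share a stretch of sequence produce the same minimizer pairs there, so their paths through the graph coincide in shared regions and diverge where the isoforms differ.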

Paper II introduces isONclust3, an algorithm for clustering transcriptomic reads by gene family. It constitutes the first step in pipelines for reference-free prediction of isoforms. The algorithm is based on the minimizer indexing scheme; its novelties are a dynamic clustering approach, assessing and storing minimizers by confidence, and an iterative post-cluster merging step. The algorithm has been shown to scale better than existing methods on large datasets, in terms of runtime and memory usage, while yielding comparable or even better clustering quality. We demonstrate that isONclust3 is the only algorithm able to cluster PacBio’s new Revio datasets, with tens of millions of reads, using typical cluster computing resources (256 GB RAM).
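The general idea of minimizer-based read clustering can be sketched as follows. This is a simplified greedy scheme under stated assumptions, not isONclust3 itself: the names `minimizer_set` and `greedy_cluster`, the parameters `k`, `w` and `min_shared`, and the shared-fraction criterion are hypothetical, and the confidence-based minimizer storage and post-cluster merging described above are omitted.

```python
def minimizer_set(seq, k=4, w=3):
    """Set of lexicographically smallest k-mers over all windows of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    # Sequences with fewer than w k-mers are treated as a single window.
    return {min(kmers[s:s + w]) for s in range(max(1, len(kmers) - w + 1))}


def greedy_cluster(reads, k=4, w=3, min_shared=0.5):
    """Assign each read to the cluster sharing the largest fraction of its minimizers.

    A read opens a new cluster when no existing cluster reaches min_shared.
    """
    clusters = []  # each cluster: {"minimizers": set of k-mers, "reads": list of read indices}
    for idx, read in enumerate(reads):
        mins = minimizer_set(read, k, w)
        best, best_frac = None, 0.0
        for cluster in clusters:
            frac = len(mins & cluster["minimizers"]) / len(mins) if mins else 0.0
            if frac > best_frac:
                best, best_frac = cluster, frac
        if best is not None and best_frac >= min_shared:
            best["reads"].append(idx)
            best["minimizers"] |= mins  # the cluster representation grows with each read
        else:
            clusters.append({"minimizers": set(mins), "reads": [idx]})
    return [c["reads"] for c in clusters]


groups = greedy_cluster(["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"])
```

Comparing small minimizer sets instead of full sequences is what keeps this kind of clustering cheap; the loop over all clusters in the sketch would be replaced by an index lookup in a scalable tool.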

These algorithms help to improve the accuracy and efficiency of transcriptomic analysis based on long-read techniques, which is crucial for understanding complex biological systems and diseases.

Paper III presents cONcat, an algorithmic solution for detecting concatenated fragments in long-read sequencing reads with typical error profiles. The algorithm is based on a greedy heuristic that uses edit distance to find the best-fitting fragments and then splits the sequence around those hits to search for further fragment hits in the remaining regions of the read. The algorithm has been shown to be resilient to errors in the data and to scale to large numbers of reads.
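The greedy split-and-search idea can be sketched as below. This is an illustration under assumptions, not the cONcat implementation: `best_match`, `find_fragments` and `max_dist` are hypothetical names, the fragment set is assumed to be known in advance, and fragments are assumed to be longer than `max_dist`. Approximate matching uses the standard semi-global edit-distance recurrence (Sellers' algorithm).

```python
def best_match(fragment, text):
    """Return (distance, start, end) of the substring text[start:end] closest to fragment."""
    m = len(fragment)
    # prev[i] = (cost, start) for aligning fragment[:i] to a substring ending at the previous column
    prev = [(i, 0) for i in range(m + 1)]
    best = (m, 0, 0)  # matching the empty substring costs m deletions
    for j, ch in enumerate(text, start=1):
        cur = [(0, j)]  # a match may begin at any text position at no cost
        for i in range(1, m + 1):
            sub = (prev[i - 1][0] + (fragment[i - 1] != ch), prev[i - 1][1])
            skip_frag = (cur[i - 1][0] + 1, cur[i - 1][1])  # fragment character unmatched
            skip_text = (prev[i][0] + 1, prev[i][1])        # text character unmatched
            cur.append(min(sub, skip_frag, skip_text))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best


def find_fragments(read, fragments, max_dist=1, offset=0):
    """Take the best hit over all fragments, then search the read to its left and right."""
    if not read or not fragments:
        return []
    candidates = [(best_match(seq, read), name) for name, seq in fragments.items()]
    (dist, start, end), name = min(candidates)
    if dist > max_dist or end == start:
        return []
    left = find_fragments(read[:start], fragments, max_dist, offset)
    right = find_fragments(read[end:], fragments, max_dist, offset + end)
    return left + [(offset + start, offset + end, name, dist)] + right


hits = find_fragments("ACGTACGGTAGGACGTAC", {"A": "ACGTAC", "B": "GGTTGG"})
```

Each hit is reported as (start, end, fragment name, edit distance) in read coordinates; the middle fragment in the example is found despite one substituted base.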

Last updated: June 26, 2025

Source: Department of Mathematics