- Professor at the Department of Computer and Systems Sciences at Stockholm University, Sweden
- Adjunct Professor at the Department of Computer Science at Aalto University, Finland
- Board member of the Swedish Artificial Intelligence Society (SAIS)
More about me here.
A selection from Stockholm University publication database
Learning from heterogeneous temporal data from electronic health records
2017. Jing Zhao (et al.). Journal of Biomedical Informatics 65, 105-119Article
Electronic health records contain large amounts of longitudinal data that are valuable for biomedical informatics research. The application of machine learning is a promising alternative to manual analysis of such data. However, the complex structure of the data, which includes clinical events that are unevenly distributed over time, poses a challenge for standard learning algorithms. Some approaches to modeling temporal data rely on extracting single values from time series; however, this leads to the loss of potentially valuable sequential information. How to better account for the temporality of clinical data, hence, remains an important research question. In this study, novel representations of temporal data in electronic health records are explored. These representations retain the sequential information, and are directly compatible with standard machine learning algorithms. The explored methods are based on symbolic sequence representations of time series data, which are utilized in a number of different ways. An empirical investigation, using 19 datasets comprising clinical measurements observed over time from a real database of electronic health records, shows that using a distance measure to random subsequences leads to substantial improvements in predictive performance compared to using the original sequences or clustering the sequences. Evidence is moreover provided on the quality of the symbolic sequence representation by comparing it to sequences that are generated using domain knowledge by clinical experts. The proposed method creates representations that better account for the temporality of clinical events, which is often key to prediction tasks in the biomedical domain.
On Searching and Indexing Sequences of Temporal Intervals
2017. Orestis Kostakis, Panagiotis Papapetrou. Data mining and knowledge discovery 31 (3), 809-850Article
In several application domains, including sign language, sensor networks, and medicine, events are not necessarily instantaneous but they may have a time duration. Such events build sequences of temporal intervals, which may convey useful domain knowledge; thus, searching and indexing these sequences is crucial. We formulate the problem of comparing sequences of labeled temporal intervals and present a distance measure that can be computed in polynomial time. We prove that the distance measure is metric and satisfies the triangle inequality. For speeding up search in large databases of sequences of temporal intervals, we propose an approximate indexing method that is based on embeddings. The proposed indexing framework is shown to be contractive and can guarantee no false dismissal. The distance measure is tested and benchmarked through rigorous experimentation on real data taken from several application domains, including: American Sign Language annotated video recordings, robot sensor data, and Hepatitis patient data. In addition, the indexing scheme is tested on a large synthetic dataset. Our experiments show that speedups of over an order of magnitude can be achieved while maintaining high levels of accuracy. As a result of our work, it becomes possible to implement recommender systems, search engines and assistive applications for the fields that employ sequences of temporal intervals.
A Multi-granularity Pattern-based Sequence Classification Framework for Educational Data
2016. Mohammad Jaber (et al.). 3rd IEEE International Conference on Data Science and Advanced Analytics, 370-378Conference
In many application domains, such as education, sequences of events occurring over time need to be studied in order to understand the generative process behind these sequences, and hence classify new examples. In this paper, we propose a novel multi-granularity sequence classification framework that generates features based on frequent patterns at multiple levels of time granularity. Feature selection techniques are applied to identify the most informative features that are then used to construct the classification model. We show the applicability and suitability of the proposed framework to the area of educational data mining by experimenting on an educational dataset collected from an asynchronous communication tool in which students interact to accomplish an underlying group project. The experimental results showed that our model can achieve competitive performance in detecting the students' roles in their corresponding projects, compared to a baseline similarity-based approach.
Advances in Intelligent Data Analysis XV
2016. Henrik Boström (et al.).Conference
This book constitutes the refereed conference proceedings of the 15th International Conference on Intelligent Data Analysis, which was held in October 2016 in Stockholm, Sweden. The 36 revised full papers presented were carefully reviewed and selected from 75 submissions. The traditional focus of the IDA symposium series is on end-to-end intelligent support for data analysis. The symposium aims to provide a forum for inspiring research contributions that might be considered preliminary in other leading conferences and journals, but that have a potentially dramatic impact.
Clustering with Confidence
2016. Andreas Henelius (et al.).Article
Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or re-running a clustering algorithm involving some stochastic component may lead to completely different clusters. There is, hence, a need for techniques that can quantify the instability of the generated clusters. In this study, we propose a technique for quantifying the instability of a clustering solution and for finding robust clusters, termed core clusters, which correspond to clusters where the co-occurrence probability of each data item within a cluster is at least 1−α . We demonstrate how solving the core clustering problem is linked to finding the largest maximal cliques in a graph. We show that the method can be used with both clustering and classification algorithms. The proposed method is tested on both simulated and real datasets. The results show that the obtained clusters indeed meet the guarantees on robustness.
Early Random Shapelet Forest
2016. Isak Karlsson, Panagiotis Papapetrou, Henrik Boström. Discovery Science, 261-276Conference
Early classification of time series has emerged as an increasingly important and challenging problem within signal processing, especially in domains where timely decisions are critical, such as medical diagnosis in health-care. Shapelets, i.e., discriminative sub-sequences, have been proposed for time series classification as a means to capture local and phase independent information. Recently, forests of randomized shapelet trees have been shown to produce state-of-the-art predictive performance at a low computational cost. In this work, they are extended to allow for early classification of time series. An extensive empirical investigation is presented, showing that the proposed algorithm is superior to alternative state-of-the-art approaches, in case predictive performance is considered to be more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.
Entropy-based Prediction of Network Protocols in the Forensic Analysis of DNS Tunnels
2016. Irvin Homem, Panagiotis Papapetrou, Spyridon Dosis.Conference
DNS tunneling techniques are often used for malicious purposes but network security mechanisms have struggled to detect these. Network forensic analysis has thus been used but has proved slow and effort intensive as Network Forensics Analysis Tools struggle to deal with undocumented or new network tunneling techniques. In this paper we present a method to aid forensic analysis through automating the inference of protocols tunneled within DNS tunneling techniques. We analyze the internal packet structure of DNS tunneling techniques and characterize the information entropy of different network protocols and their DNS tunneled equivalents. From this, we present our protocol prediction method that uses entropy distribution averaging. Finally we apply our method on a dataset to measure its performance and show that it has a prediction accuracy of 75%. Our method also preserves privacy as it does not parse the actual tunneled content, rather it only calculates the information entropy.
2016. Myrsini Glinos (et al.). Proceedings of the 9th ACM International Conference on Pervasive Technologies Related to Assistive EnvironmentsConference
A variety of eHealth apps exist today ranging from ovulation calculators, such as Glow, to more sophisticated systems for determining the right therapist via semantic analysis, such as Talkspace, or ZocDoc. Despite their promising functionality, existing systems offer limited capabilities in terms of search filters, reviews, and doctor recommendations. In this paper, we propose FindMyDoc, a novel peer-to-peer healthcare platform that goes beyond existing traditional healthcare models. It provides doctor recommendations by allowing proper filtering based on treatment procedures, quality of treatment, and reviews of healthcare providers. In addition, the search results are refined using a recommendation engine that employs user-based collaborative filtering and exploits a set of predefined review options provided by the patients in order to match them with doctors.
Generalized random shapelet forests
2016. Isak Karlsson, Panagiotis Papapetrou, Henrik Boström. Data mining and knowledge discovery 30 (5), 1053-1085Article
Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which in addition to the inherent difficulty of the decision tree learning algorithm to effectively handle high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.
Identifying Factors for the Effectiveness of Treatment of Heart Failure
2016. Lars Asker (et al.). IEEE 29th International Symposiumon Computer-Based Medical SystemsConference
An administrative health register containing health care data for over 2 million patients will be used to search for factors that can affect the treatment of heart failure. In the study, we will measure the effects of employed treatment for various groups of heart failure patients, using different measures of effectiveness. Significant deviations in effectiveness of treatments of the various patient groups will be reported and factors that may help explaining the effect of treatment will be analyzed. Identification of the most important factors that may help explain the observed deviations between the different groups will be derived through generation of predictive models, for which variable importance can be calculated. The findings may affect recommended treatments as well as high-lighting deviations from national guidelines.
Learning from Swedish Healthcare Data
2016. Lars Asker, Panagiotis Papapetrou, Henrik Boström. Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive EnvironmentsConference
We present two ongoing projects aimed at learning from health care records. The first project, DADEL, is focusing on high-performance data mining for detrecting adverse drug events in healthcare, and uses electronic patient records covering seven years of patient record data from the Stockholm region in Sweden. The second project is focusing on heart failure and on understanding the differences in treatment between various groups of patients. It uses a Swedish administrative health register containing health care data for over two million patients.
Query-sensitive Distance Measure Selection for Time Series Nearest Neighbor Classification
2016. Alexios Kotsifakos, Vassilis Athitsos, Panagiotis Papapetrou. Intelligent Data Analysis 20 (1), 5-27Article
Many distance or similarity measures have been proposed for time series similarity search. However, none of these measures is guaranteed to be optimal when used for 1-Nearest Neighbor (NN) classification. In this paper we study the problem of selecting the most appropriate distance measure, given a pool of time series distance measures and a query, so as to perform NN classification of the query. We propose a framework for solving this problem, by identifying, given the query, the distance measure most likely to produce the correct classification result for that query. From this proposed framework, we derive three specific methods, that differ from each other in the way they estimate the probability that a distance measure correctly classifies a query object. In our experiments, our pool of measures consists of Dynamic TimeWarping (DTW), Move-Split-Merge (MSM), and Edit distance with Real Penalty (ERP). Based on experimental evaluation with 45 datasets, the best-performing of the three proposed methods provides the best results in terms of classification error rate, compared to the competitors, which include using the Cross Validation method for selecting the distance measure in each dataset, as well as using a single specific distance measure (DTW, MSM, or ERP) across all datasets.
STIFE: A Framework for Feature-based Classification of Sequences of Temporal Intervals
2016. Leon Bornemann, Jason Lecerf, Panagiotis Papapetrou. Discovery Science, 85-100Conference
In this paper, we study the problem of classification of sequences of temporal intervals. Our main contribution is the STIFE framework for extracting relevant features from interval sequences to build feature-based classifiers. STIFE uses a combination of basic static metrics, shapelet discovery and selection, as well as distance-based approaches. Additionally, we propose an improved way of computing the state of the art IBSM distance measure between two interval sequences, that reduces both runtime and memory needs from pseudo-polynomial to fully polynomial, which greatly reduces the runtime of distance based classification approaches. Our empirical evaluation not only shows that STIFE provides a very fast classification time in all evaluated scenarios but also reveals that a random forests using STIFE achieves similar or better accuracy than the state of the art k-NN classifier.
Semigeometric Tiling of Event Sequences
2016. Andreas Henelius (et al.). Machine Learning and Knowledge Discovery in Databases, 329-344Conference
Event sequences are ubiquitous, e.g., in finance, medicine, and social media. Often the same underlying phenomenon, such as television advertisements during Superbowl, is reflected in independent event sequences, like different Twitter users. It is hence of interest to find combinations of temporal segments and subsets of sequences where an event of interest, like a particular hashtag, has an increased occurrence probability. Such patterns allow exploration of the event sequences in terms of their evolving temporal dynamics, and provide more fine-grained insights to the data than what for example straightforward clustering can reveal. We formulate the task of finding such patterns as a novel matrix tiling problem, and propose two algorithms for solving it. Our first algorithm is a greedy set-cover heuristic, while in the second approach we view the problem as time-series segmentation. We apply the algorithms on real and artificial datasets and obtain promising results. The software related to this paper is available at https://github.com/bwrc/semigeom-r.
Significance testing of word frequencies in corpora
2016. Jefrey Lijffijt (et al.). Digital Scholarship in the Humanities 31 (2), 374-397Article
Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the χ2 and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
2015. Alexios Kotsifakos (et al.). Data mining and knowledge discovery 29 (5), 1280-1311Article
Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.
Embedding-based subsequence matching with gaps-range-tolerances
2015. Alexios Kotsifakos (et al.). The VLDB journal 24 (4), 519-536Article
We present a subsequence matching framework that allows for gaps in both query and target sequences, employs variable matching tolerance efficiently tuned for each query and target sequence, and constrains the maximum matching range. Using this framework, a dynamic programming method is proposed, called SMBGT, that, given a short query sequence Q and a large database, identifies in quadratic time the subsequence of the database that best matches Q. SMBGT is highly applicable to music retrieval. However, in Query-By-Humming applications, runtime is critical. Hence, we propose a novel embedding-based approach, called ISMBGT, for speeding up search under SMBGT. Using a set of reference sequences, ISMBGT maps both Q and each position of each database sequence into vectors. The database vectors closest to the query vector are identified, and SMBGT is then applied between Q and the subsequences that correspond to those database vectors. The key novelties of ISMBGT are that it does not require training, it is query sensitive, and it exploits the flexibility of SMBGT. We present an extensive experimental evaluation using synthetic and hummed queries on a large music database. Our findings show that ISMBGT can achieve speedups of up to an order of magnitude against brute-force search and over an order of magnitude against cDTW, while maintaining a retrieval accuracy very close to that of brute-force search.
Extracting news text from web pages
2015. Erik Lundgren, Lars Asker, Panagiotis Papapetrou. Proceeding PETRA '15 Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive EnvironmentsConference
Finding the longest common sub-pattern in sequences of temporal intervals
2015. Orestis Kostakis, Panagiotis Papapetrou. Data mining and knowledge discovery 29 (5), 1178-1210Article
We study the problem of finding the longest common sub-pattern (LCSP) shared by two sequences of temporal intervals. In particular we are interested in finding the LCSP of the corresponding arrangements. Arrangements of temporal intervals are a powerful way to encode multiple concurrent labeled events that have a time duration. Discovering commonalities among such arrangements is useful for a wide range of scientific fields and applications, as it can be seen by the number and diversity of the datasets we use in our experiments. In this paper, we define the problem of LCSP and prove that it is NP-complete by demonstrating a connection between graphs and arrangements of temporal intervals. This connection leads to a series of interesting open problems. In addition, we provide an exact algorithm to solve the LCSP problem, and also propose and experiment with three polynomial time and space under-approximation techniques. Finally, we introduce two upper bounds for LCSP and study their suitability for speeding up 1-NN search. Experiments are performed on seven datasets taken from a wide range of real application domains, plus two synthetic datasets. Lastly, we describe several application cases that demonstrate the need and suitability of LCSP.
Forests of Randomized Shapelet Trees
2015. Isak Karlsson, Panagiotis Papapetrou, Henrik Boström. Statistical Learning and Data Sciences, 126-136Conference
Shapelets have recently been proposed for data series classification, due to their ability to capture phase independent and local information. Decision trees based on shapelets have been shown to provide not only interpretable models, but also, in many cases, state-of-the-art predictive performance. Shapelet discovery is however computationally costly, and although several techniques for speeding up the technique have been proposed, the computational cost is still in many cases prohibitive. In this work, an ensemble based method, referred to as Random Shapelet Forest (RSF), is proposed, which builds on the success of the random forest algorithm, and which is shown to have a lower computational complexity than the original shapelet tree learning algorithm. An extensive empirical investigation shows that the algorithm provides competitive predictive performance and that a proposed way of calculating importance scores can be used to successfully identify influential regions.
Multi-channel ECG classification using forests of randomized shapelet trees
2015. Isak Karlsson, Panagiotis Papapetrou, Lars Asker. Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive EnvironmentsConference
Data series of multiple channels occur at high rates and in massive quantities in several application domains, such as healthcare. In this paper, we study the problem of multi-channel ECG classification. We map this problem to multivariate data series classification and propose five methods for solving it, using a split-and-combine approach. The proposed framework is evaluated using three base-classifiers on real-world data for detecting Myocardial Infarction. Extensive experiments are performed on real ECG data extracted from the Physiobank data repository. Our findings emphasize the importance of selecting an appropriate base-classifier for multivariate data series classification, while demonstrating the superiority of the Random Shapelet Forest (0.825 accuracy) against competitor methods (0.664 accuracy for 1-NN under cDTW).
Optimizing Hashing Functions for Similarity Indexing in Arbitrary Metric and Nonmetric Spaces
2015. Pat Jangyodsuk, Panagiotis Papapetrou, Vassilis Athitsos. Proceedings of the SIAM International Conference on Data Mining, 828-836Conference
Proceedings of the ECMLPKDD 2015 Doctoral Consortium
2015. Jaakko Hollmén, Panagiotis Papapetrou.Conference
2015. Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamaki. Data mining and knowledge discovery 29 (6), 1838-1864Article
In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.
A peek into the black box
2014. Andreas Henelius (et al.). Data mining and knowledge discovery 28 (5-6), 1503-1529Article
Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm is demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
A statistical significance testing approach to mining the most informative set of patterns
2014. jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki. Data mining and knowledge discovery 28 (1), 238-263Article
Benchmarking Dynamic Time Warping on Nearest Neighbor Classification of Electrocardiograms
2014. Nikolaos Tselas, Panagiotis Papapetrou. Proceedings of the 7th International Conference on PErvasive Technologies Related to Assistive EnvironmentsConference
The human cardiovascular system is a complicated structure that has been the focus of research in many different domains, such as medicine, biology, as well as computer science. Due to the complexity of the heart, even nowadays some of the most common disorders are still hard to identify. In this paper, we map each ECG to a time series or set of time series and explore the applicability of two common time series similarity matching methods, namely, DTW and cDTW, to the problem of ECG classification. We benchmark the two methods on four different datasets in terms of accuracy. In addition, we explore their predictive performance when various ECG channels are taken into account. The latter is performed using a dataset taken from Physiobank. Our findings suggest that different ECG channels are more appropriate for different cardiovascular malfunctions.
Inferring Offline Hierarchical Ties from Online Social Networks
2014. Mohammad Jaber (et al.). Proceedings of the companion publication of the 23rd international conference on World wide web companion, 1261-1266Conference
Social networks can represent many different types of relationships between actors, some explicit and some implicit. For example, email communications between users may be represented explicitly in a network, while managerial relationships may not. In this paper we focus on analyzing explicit interactions among actors in order to detect hierarchical social relationships that may be implicit. We start by employing three well-known ranking-based methods, PageRank, Degree Centrality, and Rooted-PageRank (RPR) to infer such implicit relationships from interactions between actors. Then we propose two novel approaches which take into account the time-dimension of interactions in the process of detecting hierarchical ties. We experiment on two datasets, the Enron email dataset to infer manager-subordinate relationships from email exchanges, and a scientific publication co-authorship dataset to detect PhD advisor-advisee relationships from paper co-authorships. Our experiments show that time-based methods perform considerably better than ranking-based methods. In the Enron dataset, they detect 48% of manager-subordinate ties versus 32% found by Rooted-PageRank. Similarly, in co-author dataset, they detect 62% of advisor-advisee ties compared to only 39% by Rooted-PageRank.
Mining Candidates for Adverse Drug Interactions in Electronic Patient Records
2014. Lars Asker (et al.). PETRA '14 Proceedings of the 7th International Conference on Pervasive Technologies Related to Assistive Environments, PETRA’14Conference
Electronic patient records provide a valuable source of information for detecting adverse drug events. In this paper, we explore two different but complementary approaches to extracting useful information from electronic patient records with the goal of identifying candidate drugs, or combinations of drugs, to be further investigated for suspected adverse drug events. We propose a novel filter-and-refine approach that combines sequential pattern mining and disproportionality analysis. The proposed method is expected to identify groups of possibly interacting drugs suspected for causing certain adverse drug events. We perform an empirical investigation of the proposed method using a subset of the Stockholm electronic patient record corpus. The data used in this study consists of all diagnoses and medications for a group of patients diagnoses with at least one heart related diagnosis during the period 2008--2010. The study shows that the method indeed is able to detect combinations of drugs that occur more frequently for patients with cardiovascular diseases than for patients in a control group, providing opportunities for finding candidate drugs that cause adverse drug effects through interaction.
Model-Based Time Series Classification
2014. Alexios Kotsifakos, Panagiotis Papapetrou. Advances in Intelligent Data Analysis XIII, 179-191Conference
We propose MTSC, a filter-and-refine framework for time series Nearest Neighbor (NN) classification. Training time series belonging to certain classes are first modeled through Hidden Markov Models (HMMs). Given an unlabeled query, and at the filter step, we identify the top K models that have most likely produced the query. At the refine step, a distance measure is applied between the query and all training time series of the top K models. The query is then assigned with the class of the NN. In our experiments, we first evaluated the NN classification error rate of HMMs compared to three state-of-the-art distance measures on 45 time series datasets of the UCR archive, and showed that modeling time series with HMMs achieves lower error rates in 30 datasets and equal error rates in 4. Secondly, we compared MTSC with Cross Validation defined over the three measures on 33 datasets, and we observed that MTSC is at least as good as the competitor method in 23 datasets, while achieving competitive speedups, showing its effectiveness and efficiency.
Social Context Discovery from Temporal App Use Patterns
2014. Panagiotis Papapetrou, George Roussos. Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 397-402Conference
Using Time-Sensitive Rooted PageRank to Detect Hierarchical Social Relationships
2014. Mohammad Jaber (et al.). Advances in Intelligent Data Analysis XIII, 143-154Conference
We study the problem of detecting hierarchical ties in a social network by exploiting the interaction patterns between the actors (members) involved in the network. Motivated by earlier work using a rank-based approach, i.e., Rooted-PageRank, we introduce a novel time-sensitive method, called T-RPR, that captures and exploits the dynamics and evolution of the interaction patterns in the network in order to identify the underlying hierarchical ties. Experiments on two real datasets demonstrate the performance of T-RPR in terms of recall and show its superiority over a recent competitor method
Mining poly-regions in DNA
2012. Panagiotis Papapetrou, Gary Benson, George Kollios. International journal on data mining and bioinformatics 6 (4), 406-428Article
We study the problem of mining poly-regions in DNA. A poly-region is defined as a bursty DNA area, i.e., area of elevated frequency of a DNA pattern. We introduce a general formulation that covers a range of meaningful types of poly-regions and develop three efficient detection methods. The first applies recursive segmentation and is entropy-based. The second uses a set of sliding windows that summarize each sequence segment using several statistics. Finally, the third employs a technique based on majority vote. The proposed algorithms are tested on DNA sequences of four different organisms in terms of recall and runtime.
2012. Alexios Kotsifakos (et al.). Proceedings of the VLDB Endowment 5 (12), 1930-1933Article
We present "Hum-a-song", a system built for music retrieval, and particularly for the Query-By-Humming (QBH) application. According to QBH, the user is able to hum a part of a song that she recalls and would like to learn what this song is, or find other songs similar to it in a large music repository. We present a simple yet efficient approach that maps the problem to time series subsequence matching. The query and the database songs are represented as 2-dimensional time series conveying information about the pitch and the duration of the notes. Then, since the query is a short sequence and we want to find its best match that may start and end anywhere in the database, subsequence matching methods are suitable for this task. In this demo, we present a system that employs and exposes to the user a variety of state-of-the-art dynamic programming methods, including a newly proposed efficient method named SMBGT that is robust to noise and considers all intrinsic problems in QBH; it allows variable tolerance levels when matching elements, where tolerances are defined as functions of the compared sequences, gaps in both the query and target sequences, and bounds the matching length and (optionally) the minimum number of matched elements. Our system is intended to become open source, which is to the best of our knowledge the first non-commercial effort trying to solve QBH with a variety of methods, and that also approaches the problem from the time series perspective.