# Panagiotis PapapetrouProfessor, deputy head of department

## Teaching

- DAMI: Data Mining (Fall I)

- ML: Machine Learning (Spring I)

## Publications

A selection from Stockholm University publication database

• # SMILE

2021. Jonathan Rebane (et al.). Data mining and knowledge discovery 35 (1), 372-399

Article

In this paper, we study the problem of classification of sequences of temporal intervals. Our main contribution is a novel framework, which we call SMILE, for extracting relevant features from interval sequences to construct classifiers.SMILE introduces the notion of utilizing random temporal abstraction features, we define as e-lets, as a means to capture information pertaining to class-discriminatory events which occur across the span of complete interval sequences. Our empirical evaluation is applied to a wide array of benchmark data sets and fourteen novel datasets for adverse drug event detection. We demonstrate how the introduction of simple sequential features, followed by progressively more complex features each improve classification performance. Importantly, this investigation demonstrates that SMILE significantly improves AUC performance over the current state-of-the-art. The investigation also reveals that the selection of underlying classification algorithm is important to achieve superior predictive performance, and how the number of features influences the performance of our framework.

• # A clustering framework for patient phenotyping with application to adverse drug events

2020. Maria Bampa, Panagiotis Papapetrou, Jaakko Hollmén. 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 177-182

Conference

We present a clustering framework for identifying patient groups with Adverse Drug Reactions from Electronic Health Records (EHRs). The increased adoption of EHRs has brought changes in the way drug safety surveillance is carried out and plays an important role in effective drug regulation. Unsupervised machine learning methods using EHRs as their input can identify patients that share common meaningful information, without the need for expert input. In this work, we propose a generalized framework that exploits the strengths of different clustering algorithms and via clustering aggregation identifies consensus patient cluster profiles. Moreover, the inherent hierarchical structure of diagnoses and medication codes is exploited. We assess the statistical significance of the produced clusterings by applying a randomization technique that keeps the data distribution margins fixed, as we are interested in evaluating information that is not conveyed by the marginal distributions. The experimental findings suggest that the framework produces medically meaningful patient groups with regard to adverse drug events by investigating two use-cases, i.e., aplastic anaemia and drug-induced skin eruption.

• # Clinical predictive keyboard using statistical and neural language modeling

2020. Ioannis Pavlopoulos, Panagiotis Papapetrou. 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 293-296

Conference

A language model can be used to predict the next word during authoring, to correct spelling or to accelerate writing (e.g., in sms or emails). Language models, however, have only been applied in a very small scale to assist physicians during authoring (e.g., discharge summaries or radiology reports). But along with the assistance to the physician, computer-based systems which expedite the patient's exit also assist in decreasing the hospital infections. We employed statistical and neural language modeling to predict the next word of a clinical text and assess all the models in terms of accuracy and keystroke discount in two datasets with radiology reports. We show that a neural language model can achieve as high as 51.3% accuracy in radiology reports (one out of two words predicted correctly). We also show that even when the models are employed only for frequent words, the physician can save valuable time.

Read more about Clinical predictive keyboard using statistical and neural language modeling
• # Evaluating local interpretable model-agnostic explanations on clinical machine learning classification models

2020. Nesaretnam Barr Kumarakulasinghe (et al.). 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 7-12

Conference

The usage of black-box classification models within the healthcare field is highly dependent on being interpretable by the receiver. Local Interpretable Model-Agnostic Explanation (LIME) provides a patient-specific explanation for a given classification, thus enhancing the possibility for any complex classifier to serve as a safety aid within a clinical setting. However, research on if the explanation provided by LIME is relevant to clinicians is limited and there is no current framework for how an evaluation of LIME is to be performed. To evaluate the clinical relevance of the explanations provided by LIME, this study has investigated how physician's independent explanations for classified observations compare with explanations provided by LIME. Additionally, the clinical relevance and the experienced reliance on the explanations provided by LIME have been evaluated by field experts. The results indicate that the explanation provided by LIME is clinically relevant and has a very high concordance with the explanations provided by physicians. Furthermore, trust and reliance on LIME are fairly high amongst clinicians. The study proposes a framework for further research within the area.

Read more about Evaluating local interpretable model-agnostic explanations on clinical machine learning classification models
• # Exploiting complex medical data with interpretable deep learning for adverse drug event prediction

2020. Jonathan Rebane, Isak Samsten, Panagiotis Papapetrou. Artificial Intelligence in Medicine 109

Article

A variety of deep learning architectures have been developed for the goal of predictive modelling and knowledge extraction from medical records. Several models have placed strong emphasis on temporal attention mechanisms and decay factors as a means to include highly temporally relevant information regarding the recency of medical event occurrence while facilitating medical code-level interpretability. In this study we utilise such models with a large Electronic Patient Record (EPR) data set consisting of diagnoses, medication, and clinical text data for the purpose of adverse drug event (ADE) prediction. The first contribution of this work is an empirical evaluation of two state-of-the-art medical-code based models in terms of objective performance metrics for ADE prediction on diagnosis and medication data. Secondly, as an extension of previous work, we augment an interpretable deep learning architecture to permit numerical risk and clinical text features and demonstrate how this approach yields improved predictive performance compared to the other baselines. Finally, we assess the importance of attention mechanisms in regards to their usefulness for medical code-level and text-level interpretability, which may facilitate novel insights pertaining to the nature of ADE occurrence within the health care domain.

• # Guiding principles for the use of knowledge bases and real-world data in clinical decision support systems

2020. Mikael Hoffmann (et al.). Expert Review of Clinical Pharmacology 13 (9), 925-934

Article

Introduction: Technical and logical breakthroughs have provided new opportunities in medicine to use knowledge bases and large-scale clinical data (real-world) at point-of-care as part of a learning healthcare system to diminish the knowledge-practice gap.

Areas covered: The article is based on presentations, discussions and recommendations from an international scientific workshop. Value, research needs and funding avenues of knowledge bases and access to real-world data as well as transparency and incorporation of patient perspectives are discussed.

Expert opinion: Evidence-based, publicly funded, well-structured and curated knowledge bases are of global importance. They ought to be considered as a public responsibility requiring transparency and handling of conflicts of interest. Information has to be made accessible for clinical decision support systems (CDSS) for healthcare staff and patients. Access to rich and real-world data is essential for a learning health care ecosystem and can be augmented by data on patient-reported outcomes and preferences. This field can progress by the establishment of an international policy group for developing a best practice guideline on the development, maintenance, governance, evaluation principles and financing of open-source knowledge bases and handling of real-world data.

Read more about Guiding principles for the use of knowledge bases and real-world data in clinical decision support systems
• # Locally and globally explainable time series tweaking

2020. Isak Karlsson (et al.). Knowledge and Information Systems 62 (5), 1671-1700

Article

Time series classification has received great attention over the past decade with a wide range of methods focusing on predictive performance by exploiting various types of temporal features. Nonetheless, little emphasis has been placed on interpretability and explainability. In this paper, we formulate the novel problem of explainable time series tweaking, where, given a time series and an opaque classifier that provides a particular classification decision for the time series, we want to find the changes to be performed to the given time series so that the classifier changes its decision to another class. We show that the problem is NP -hard, and focus on three instantiations of the problem using global and local transformations. In the former case, we investigate the k-nearest neighbor classifier and provide an algorithmic solution to the global time series tweaking problem. In the latter case, we investigate the random shapelet forest classifier and focus on two instantiations of the local time series tweaking problem, which we refer to as reversible and irreversible time series tweaking, and propose two algorithmic solutions for the two problems along with simple optimizations. An extensive experimental evaluation on a variety of real datasets demonstrates the usefulness and effectiveness of our problem formulation and solutions.

• # Mining disproportional frequent arrangements of event intervals for investigating adverse drug events

2020. Zed Lee, Jonathan Rebane, Panagiotis Papapetrou. 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 289-292

Conference

Adverse drug events are pervasive and costly medical conditions, in which novel research approaches are needed to investigate the nature of such events further and ultimately achieve early detection and prevention. In this paper, we seek to characterize patients who experience an adverse drug event, represented as a case group, by contrasting them to similar control group patients who do not experience such an event. To achieve this goal, we utilize an extensive electronic patient record database and apply a combination of frequent arrangement mining and disproportionality analysis. Our results have identified how several adverse drug events are characterized in regards to frequent disproportional arrangements, where we highlight how such arrangements can provide additional temporal-based information compared to similar approaches.

• # User Traffic Prediction for Proactive Resource Management

2020. Amin Azari (et al.). IEEE Global Communications Conference (GLOBECOM), 1-6

Conference

Traffic prediction plays a vital role in efficient planning and usage of network resources in wireless networks. While traffic prediction in wired networks is an established field, there is a lack of research on the analysis of traffic in cellular networks, especially in a content-blind manner at the user level. Here, we shed light into this problem by designing traffic prediction tools that employ either statistical, rule-based, or deep machine learning methods. First, we present an extensive experimental evaluation of the designed tools over a real traffic dataset. Within this analysis, the impact of different parameters, such as length of prediction, feature set used in analyses, and granularity of data, on accuracy of prediction are investigated. Second, regarding the coupling observed between behavior of traffic and its generating application, we extend our analysis to the blind classification of applications generating the traffic based on the statistics of traffic arrival/departure. The results demonstrate presence of a threshold number of previous observations, beyond which, deep machine learning can outperform linear statistical learning, and before which, statistical learning outperforms deep learning approaches. Further analysis of this threshold value represents a strong coupling between this threshold, the length of future prediction, and the feature set in use. Finally, through a case study, we present how the experienced delay could be decreased by traffic arrival prediction.

• # Z-Miner

2020. Zed Lee, Tony Lindgren, Panagiotis Papapetrou. KDD '20, 524-534

Conference

Mining frequent patterns of event intervals from a large collection of interval sequences is a problem that appears in several application domains. In this paper, we propose Z-Miner, a novel algorithm for solving this problem that addresses the deficiencies of existing competitors by employing two novel data structures: Z-Table, a hierarchical hash-based data structure for time-efficient candidate generation and support count, and Z-Arrangement, a data structure for efficient memory consumption. The proposed algorithm is able to handle patterns with repetitions of the same event label, allowing for gap and error tolerance constraints, as well as keeping track of the exact occurrences of the extracted frequent patterns. Our experimental evaluation on eight real-world and six synthetic datasets demonstrates the superiority of Z-Miner against four state-of-the-art competitors in terms of runtime efficiency and memory footprint.

• # A classification framework for exploiting sparse multi-variate temporal features with application to adverse drug event detection in medical records

2019. Francesco Bagattini (et al.). BMC Medical Informatics and Decision Making 19

Article

Background: Adverse drug events (ADEs) as well as other preventable adverse events in the hospital setting incur a yearly monetary cost of approximately \$3.5 billion, in the United States alone. Therefore, it is of paramount importance to reduce the impact and prevalence of ADEs within the healthcare sector, not only since it will result in reducing human suffering, but also as a means to substantially reduce economical strains on the healthcare system. One approach to mitigate this problem is to employ predictive models. While existing methods have been focusing on the exploitation of static features, limited attention has been given to temporal features.

Methods: In this paper, we present a novel classification framework for detecting ADEs in complex Electronic health records (EHRs) by exploiting the temporality and sparsity of the underlying features. The proposed framework consists of three phases for transforming sparse and multi-variate time series features into a single-valued feature representation, which can then be used by any classifier. Moreover, we propose and evaluate three different strategies for leveraging feature sparsity by incorporating it into the new representation.

Results: A large-scale evaluation on 15 ADE datasets extracted from a real-world EHR system shows that the proposed framework achieves significantly improved predictive performance compared to state-of-the-art. Moreover, our framework can reveal features that are clinically consistent with medical findings on ADE detection.

Conclusions: Our study and experimental findings demonstrate that temporal multi-variate features of variable length and with high sparsity can be effectively utilized to predict ADEs from EHRs. Two key advantages of our framework are that it is method agnostic, i.e., versatile, and of low computational cost, i.e., fast; hence providing an important building block for future exploitation within the domain of machine learning from EHRs.

Read more about A classification framework for exploiting sparse multi-variate temporal features with application to adverse drug event detection in medical records
• # Aggregate-Eliminate-Predict

2019. Maria Bampa, Panagiotis Papapetrou.

We study the problem of detecting adverse drug events in electronic healthcare records. The challenge in this work is to aggregate heterogeneous data types involving diagnosis codes, drug codes, as well as lab measurements. An earlier framework proposed for the same problem demonstrated promising predictive performance for the random forest classifier by using only lab measurements as data features. We extend this framework, by additionally including diagnosis and drug prescription codes, concurrently. In addition, we employ a recursive feature selection mechanism on top, that extracts the top-k most important features. Our experimental evaluation on five medical datasets of adverse drug events and six different classifiers, suggests that the integration of these additional features provides substantial and statistically significant improvements in terms of AUC, while employing medically relevant features.

• # An Investigation of Interpretable Deep Learning for Adverse Drug Event Prediction

2019. Jonathan Rebane, Isak Karlsson, Panagiotis Papapetrou. 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems

Conference

A variety of deep learning architectures have been developed for the goal of predictive modelling in regards to detecting health diagnoses in medical records. Several models have placed strong emphases on temporal attention mechanisms and decay factors as a means to include highly temporally relevant information regarding the recency of medical event occurrence while facilitating medical code-level interpretability. In this study we utilise such models with a novel Electronic Patient Record (EPR) data set consisting of both diagnoses and medication data for the purpose of Adverse Drug Event (ADE) prediction. As such, a main contribution of this work is an empirical evaluation of two state-of-the-art deep learning architectures in terms of objective performance metrics for ADE prediction. We also assess the importance of attention mechanisms in regards to their usefulness for medical code-level interpretability, which may facilitate novel insights pertaining to the nature of ADE occurrence within the health care domain.

• # Cellular Traffic Prediction and Classification

2019. Amin Azari (et al.). Discovery Science, 129-144

Conference

Prediction of user traffic in cellular networks has attracted profound attention for improving the reliability and efficiency of network resource utilization. In this paper, we study the problem of cellular network traffic prediction and classification by employing standard machine learning and statistical learning time series prediction methods, including long short-term memory (LSTM) and autoregressive integrated moving average (ARIMA), respectively. We present an extensive experimental evaluation of the designed tools over a real network traffic dataset. Within this analysis, we explore the impact of different parameters on the effectiveness of the predictions. We further extend our analysis to the problem of network traffic classification and prediction of traffic bursts. The results, on the one hand, demonstrate the superior performance of LSTM over ARIMA in general, especially when the length of the training dataset is large enough and its granularity is fine enough. On the other hand, the results shed light onto the circumstances in which, ARIMA performs close to the optimal with lower complexity.

• # Clustering Diagnostic Profiles of Patients

2019. Jaakko Hollmén, Panagiotis Papapetrou. Artificial Intelligence Applications and Innovations, 120-126

Conference

Electronic Health Records provide a wealth of information about the care of patients and can be used for checking the conformity of planned care, computing statistics of disease prevalence, or predicting diagnoses based on observed symptoms, for instance. In this paper, we explore and analyze the recorded diagnoses of patients in a hospital database in retrospect, in order to derive profiles of diagnoses in the patient database. We develop a data representation compatible with a clustering approach and present our clustering approach to perform the exploration. We use a k-means clustering model for identifying groups in our binary vector representation of diagnoses and present appropriate model selection techniques to select the number of clusters. Furthermore, we discuss possibilities for interpretation in terms of diagnosis probabilities, in the light of external variables and with the common diagnoses occurring together.

• # Example-Based Feature Tweaking Using Random Forests

2019. Tony Lindgren (et al.). 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science

Conference

In certain application areas when using predictive models, it is not enough to make an accurate prediction for an example, instead it might be more important to change a prediction from an undesired class into a desired class. In this paper we investigate methods for changing predictions of examples. To this end, we introduce a novel algorithm for changing predictions of examples and we compare this novel method to an existing method and a baseline method. In an empirical evaluation we compare the three methods on a total of 22 datasets. The results show that the novel method and the baseline method can change an example from an undesired class into a desired class in more cases than the competitor method (and in some cases this difference is statistically significant). We also show that the distance, as measured by the euclidean norm, is higher for the novel and baseline methods (and in some cases this difference is statistically significantly) than for state-of-the-art. The methods and their proposed changes are also evaluated subjectively in a medical domain with interesting results.

• # FISUL: A Framework for Detecting Adverse Drug Events from Heterogeneous Medical Sources Using Feature Importance

2019. Corinne G. Allaart, Lena Mondrejevski, Panagiotis Papapetrou. Artificial Intelligence Applications and Innovations, 139-151

Conference

Adverse drug events (ADEs) are considered to be highly important and critical conditions, while accounting for around 3.7% of hospital admissions all over the world. Several studies have applied predictive models for ADE detection; nonetheless, only a restricted number and type of features has been used. In the paper, we propose a framework for identifying ADEs in medical records, by first applying the Boruta feature importance criterion, and then using the top-ranked features for building a predictive model as well as for clustering. We provide an experimental evaluation on the MIMIC-III database by considering 7 types of ADEs illustrating the benefit of the Boruta criterion for the task of ADE detection.

Read more about FISUL: A Framework for Detecting Adverse Drug Events from Heterogeneous Medical Sources Using Feature Importance
• # Implementation of Mobile-Based Real-Time Heart Rate Variability Detection for Personalized Healthcare

2019. Luis Quintero (et al.). 2019 International Conference on Data Mining Workshops (ICDMW)

Conference

The ubiquity of wearable devices together with areas like internet of things, big data and machine learning have promoted the development of solutions for personalized healthcare that use digital sensors. However, there is a lack of an implemented framework that is technically feasible, easily scalable and that provides meaningful variables to be used in applications for translational medicine. This paper describes the implementation and early evaluation of a physiological sensing tool that collects and processes photoplethysmography data from a wearable smartwatch to calculate heart rate variability in real-time. A technical open-source framework is outlined, involving mobile devices for collection of heart rate data, feature extraction and execution of data mining or machine learning algorithms that ultimately deliver mobile health interventions tailored to the users. Eleven volunteers participated in the empirical evaluation that was carried out using an existing mobile virtual reality application for mental health and under controlled slow-paced breathing exercises. The results validated the feasibility of implementation of the proposed framework in the stages of signal acquisition and real-time calculation of heart rate variability (HRV). The analysis of data regarding packet loss, peak detection and overall system performance provided considerations to enhance the real-time calculation of HRV features. Further studies are planned to validate all the stages of the proposed framework.

Read more about Implementation of Mobile-Based Real-Time Heart Rate Variability Detection for Personalized Healthcare
• # Method in interdisciplinary research: data science for digital art history

2019. Ewa Machotka, Panagiotis Papapetrou. International Journal for Digital Art History (4)

Article

This paper creates a conceptual frame and explanatory point of reference for the collection of papers presented at the exploratory workshop “Data Science for Digital Art History: Tackling Big Data Challenges, Algorithms, and Systems” organized at the KDD 2018 Conference in Data Mining and Knowledge Discovery held in London in August 2018. The goal of the workshop was to probe the field and to build a constructive interdisciplinary dialogue between two research areas: Data Science and Art History. The workshop’s chairs and the authors of this paper share the conviction that Data Science can enrich art studies while analysis of visual data can have a positive impact on Data Science. Thus, the research initiative tried to critically reflect on the interdisciplinary collaboration between diverse research communities and its epistemological and ontological effects.

Read more about Method in interdisciplinary research: data science for digital art history
• # Mining Adverse Drug Events Using Multiple Feature Hierarchies and Patient History Windows

2019. Maria Bampa, Panagiotis Papapetrou. 19th IEEE International Conference on Data Mining Workshops

Conference

We study the problem of detecting adverse drug events in electronic health records. The challenge is this work is to aggregate heterogeneous data types involving lab measurements, diagnoses codes and medications codes. An earlier framework proposed for the same problem demonstrated promising predictive performance for the random forest classifier by using only lab measurements as data features. We extend this framework, by additionally including diagnosis and drug prescription codes, concurrently. In addition, we employ the concept of hierarchies of clinical codes as proposed by another work, in order to exploit the inherently complex nature of the medical data. Moreover, we extended the state-of-the-art by considering variable patient history lengths before the occurrence of an ADE event rather than a patient history of an arbitrary length. Our experimental evaluation on eight medical datasets of adverse drug events, five different patient history lengths, and six different classifiers, suggests that the integration of these additional features on the different window lengths provides significant improvements in terms of AUC while employing medically relevant features.

• # Mining and Model Understanding on Medical Data

2019. Myra Spiliopoulou, Panagiotis Papapetrou. KDD '19, 3223-3224

Conference

What are the basic forms of healthcare data? How are Electronic Health Records and Cohorts structured? How can we identify the key variables in such data and how important are temporal abstractions? What are the main challenges in knowledge extraction from medical data sources? What are the key machine algorithms used for this purpose? What are the main questions that clinicians and medical experts pose to machine learning researchers?

In this tutorial, we provide answers to these questions by presenting state-of-the-art methods, workflows, and tools for mining and understanding medical data. Particular emphasis is given on temporal abstractions, knowledge extraction from cohorts, machine learning model interpretability, and mHealth.

• # Open-Source Physiological Computing Framework using Heart Rate Variability in Mobile Virtual Reality Applications

2019. Luis Quintero, Panagiotis Papapetrou, John E. Muñoz. 2019 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR)

Conference

Electronic and mobile health technologies are posed as a tool that can promote self-care and extend coverage to bridge the gap in accessibility to mental care services between low-and high-income communities. However, the current technology-based mental health interventions use systems that are either cumbersome, expensive or require specialized knowledge to be operated. This paper describes the open-source framework PARE-VR, which provides heart rate variability (HRV) analysis to mobile virtual reality (VR) applications. It further outlines the advantages of the presented architecture as an initial step to provide more scalable mental health therapies in comparison to current technical setups; and as an approach with the capability to merge physiological data and artificial intelligence agents to provide computing systems with user understanding and adaptive functionalities. Furthermore, PARE-VR is evaluated with a feasibility study using a specific relaxation exercise with slow-paced breathing. The aim of the study is to get insights of the system performance, its capability to detect HRV metrics in real-time, as well as to identify changes between normal and slow-paced breathing using the HRV data. Preliminary results of the study, with the participation of eleven volunteers, showed high engagement of users towards the VR activity, and demonstrated technical potentialities of the framework to create physiological computing systems using mobile VR and wearable smartwatches for scalable health interventions. Several insights and recommendations were concluded from the study for enhancing the HRV analysis in real-time and conducting future similar studies.

Read more about Open-Source Physiological Computing Framework using Heart Rate Variability in Mobile Virtual Reality Applications
• # Discovering, selecting and exploiting feature sequence records of study participants for the classification of epidemiological data on hepatic steatosis

2018. Tommy Hielscher (et al.). Proceedings of the 33rd Annual ACM Symposium on Applied Computing, 6-13

Conference

In longitudinal epidemiological studies, participants undergo repeated medical examinations and are thus represented by a potentially large number of short examination outcome sequences. Some of those sequences may contain important information in various forms, such as patterns, with respect to the disease under study, while others may be on features of little relevance to the outcome. In this work, we propose a framework for Discovery, Selection and Exploitation (DiSelEx) of longitudinal epidemiological data, aiming to identify informative patterns among these sequences. DiSelEx combines sequence clustering with supervised learning to identify sequence groups that contribute to class separation. Newly derived and old features are evaluated and selected according to their redundancy and informativeness regarding the target variable. The selected feature set is then used to learn a classification model on the study data. We evaluate DiSelEx on cohort participants for the disorder "hepatic steatosis" and report on the impact on predictive performance when using sequential data in comparison to utilizing only the basic classifier.

Read more about Discovering, selecting and exploiting feature sequence records of study participants for the classification of epidemiological data on hepatic steatosis
• # Explainable predictions of adverse drug events from electronic health records via oracle coaching

2018. Loes Crielaard, Panagiotis Papapetrou. 2018 IEEE International Conference on Data Mining Workshops (ICDMW), 707-714

Conference

Information about drug efficacy and safety is limited despite current research on adverse drug events (ADEs). Electronic health records (EHRs) may be an overcoming medium, however the application and evaluation of predictive models for ADE detection based on EHRs focus primarily on predictive performance with little emphasis on explainability and clinical relevance of the obtained predictions. This paper therefore aims to provide new means for obtaining explainable and clinically relevant predictions and medical pathways underlying ADEs, by deriving sets of rules leading to explainable ADE predictions via oracle coaching and indirect rule induction. This is achieved by mapping opaque random forest models to explainable decision trees without compromising predictive performance. The results suggest that the average performance of decision trees with oracle coaching exceeds that of random forests for all considered metrics for the task of ADE detection. Relationships between many patient features present in the rulesets and the ADEs appear to exist, however not conforming to the causal pathways implied by the models - which emphasises the need for explainable predictions.

• # Explainable time series tweaking via irreversible and reversible temporal transformations

2018. Isak Karlsson (et al.). 2018 IEEE International Conference on Data Mining (ICDM), 207-216

Conference

Time series classification has received great attention over the past decade with a wide range of methods focusing on predictive performance by exploiting various types of temporal features. Nonetheless, little emphasis has been placed on interpretability and explainability. In this paper, we formulate the novel problem of explainable time series tweaking, where, given a time series and an opaque classifier that provides a particular classification decision for the time series, we want to find the minimum number of changes to be performed to the given time series so that the classifier changes its decision to another class. We show that the problem is NP-hard, and focus on two instantiations of the problem, which we refer to as reversible and irreversible time series tweaking. The classifier under investigation is the random shapelet forest classifier. Moreover, we propose two algorithmic solutions for the two problems along with simple optimizations, as well as a baseline solution using the nearest neighbor classifier. An extensive experimental evaluation on a variety of real datasets demonstrates the usefulness and effectiveness of our problem formulation and solutions.

Read more about Explainable time series tweaking via irreversible and reversible temporal transformations
• # Exploring epistaxis as an adverse effect of anti-thrombotic drugs and outdoor temperature

2018. Jaakko Hollmén (et al.). Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference (PETRA), 1-4

Conference

Electronic health records contain a wealth of epidemiological information about diseases at the population level. Using a database of medical diagnoses and drug prescriptions in electronic health records, we investigate the correlation between outdoor temperature and the incidence of epistaxis over time for two groups of patients. One group consists of patients that had been diagnosed with epistaxis and also been prescribed at least one of the three anti-thrombotic agents: Warfarin, Apixaban, or Rivaroxaban. The other group consists of patients that had been diagnosed with epistaxis and not been prescribed any of the three anti-thrombotic drugs. We find a strong negative correlation between the incidence of epistaxis and outdoor temperature for the group that had not been prescribed any of the three anti-thrombotic drugs, while there is a weaker correlation between incidence of epistaxis and outdoor temperature for the other group. It is, however, clear that both groups are affected in a similar way, such that the incidence of epistaxis increases with colder temperatures.

• # Seq2Seq RNNs and ARIMA models for Cryptocurrency Prediction

2018. Jonathan Rebane (et al.). Proceedings of SIGKDD Workshop on Fintech (SIGKDD Fintech’18)

Conference

Cyrptocurrency price prediction has recently become an alluring topic, attracting massive media and investor interest. Traditional models, such as Autoregressive Integrated Moving Average models (ARIMA) and models with more modern popularity, such as Recurrent Neural Networks (RNN’s) can be considered candidates for such financial prediction problems, with RNN’s being capable of utilizing various endogenous and exogenous input sources. This study compares the model performance of ARIMA to that of a seq2seq recurrent deep multi-layer neural network (seq2seq) utilizing a varied selection of inputs types. The results demonstrate superior performance of seq2seq over ARIMA, for models generated throughout most of bitcoin price history, with additional data sources leading to better performance during less volatile price periods.

• # ABIDE: Querying Time-Evolving Sequences of Temporal Intervals

2017. Orestis Kostakis, Panagiotis Papapetrou. Advances in Intelligent Data Analysis XVI, 173-185

Conference

We study the problem of online similarity search in sequences of temporal intervals; given a standing query and a time-evolving sequence of event-intervals, we want to assess the existence of the query in the sequence over time. Since indexing is inapplicable to our problem, the goal is to reduce runtime without sacrificing retrieval accuracy. We present three lower-bounding and two early-abandon methods for speeding up search, while guaranteeing no false dismissals. We present a framework for combining lower bounds with early abandoning, called ABIDE. Empirical evaluation on eight real datasets and two synthetic datasets suggests that ABIDE provides speedups of at least an order of magnitude and up to 6977 times on average, compared to existing approaches and a baseline. We conclude that ABIDE is more powerful than existing methods, while we can attain the same pruning power with less CPU computations.

• # Conformal prediction using random survival forests

2017. Henrik Boström (et al.). 16th IEEE International Conference on Machine Learning and Applications, 812-817

Conference

Random survival forests constitute a robust approach to survival modeling, i.e., predicting the probability that an event will occur before or on a given point in time. Similar to most standard predictive models, no guarantee for the prediction error is provided for this model, which instead typically is empirically evaluated. Conformal prediction is a rather recent framework, which allows the error of a model to be determined by a user specified confidence level, something which is achieved by considering set rather than point predictions. The framework, which has been applied to some of the most popular classification and regression techniques, is here for the first time applied to survival modeling, through random survival forests. An empirical investigation is presented where the technique is evaluated on datasets from two real-world applications; predicting component failure in trucks using operational data and predicting survival and treatment of heart failure patients from administrative healthcare data. The experimental results show that the error levels indeed are very close to the provided confidence levels, as guaranteed by the conformal prediction framework, and that the error for predicting each outcome, i.e., event or no-event, can be controlled separately. The latter may, however, lead to less informative predictions, i.e., larger prediction sets, in case the class distribution is heavily imbalanced.

• # Detecting Hierarchical Ties Using Link-Analysis Ranking at Different Levels of Time Granularity

2017. Hend Kareem, Lars Asker, Panagiotis Papapetrou.

Social networks contain implicit knowledge that can be used to infer hierarchical relations that are not explicitly present in the available data. Interaction patterns are typically affected by users' social relations. We present an approach to inferring such information that applies a link-analysis ranking algorithm at different levels of time granularity. In addition, a voting scheme is employed for obtaining the hierarchical relations. The approach is evaluated on two datasets: the Enron email data set, where the goal is to infer manager-subordinate relationships, and the Co-author data set, where the goal is to infer PhD advisor-advisee relations. The experimental results indicate that the proposed approach outperforms more traditional approaches to inferring hierarchical relations from social networks.

• # Harnessing Predictive Models for Assisting Network Forensic Investigations of DNS Tunnels

2017. Irvin Homem, Panagiotis Papapetrou. Annual ADFSL Conference on Digital Forensics, Security and Law

Conference

In recent times, DNS tunneling techniques have been used for malicious purposes, however network security mechanisms struggle to detect them. Network forensic analysis has been proven effective, but is slow and effort intensive as Network Forensics Analysis Tools struggle to deal with undocumented or new network tunneling techniques. In this paper, we present a machine learning approach, based on feature subsets of network traffic evidence, to aid forensic analysis through automating the inference of protocols carried within DNS tunneling techniques. We explore four network traffic protocols, namely, HTTP, HTTPS, FTP, and POP3. Three features are extracted from the DNS tunneled traffic: IP packet length, DNS Query Name Entropy and DNS Query Name Length. We benchmark the performance of four classification models, i.e., decision trees, support vector machines, k-nearest neighbours, and neural networks, on a data set of DNS tunneled traffic. Classification accuracy of 95% is achieved and the feature set reduces the original evidence data size by a factor of 74%. More importantly, our findings provide strong evidence that predictive modeling machine learning techniques can be used to identify network protocols within DNS tunneled traffic in real-time with high accuracy from a relatively small-sized feature-set, without necessarily infringing on privacy from the outset, nor having to collect complete DNS Tunneling sessions.

Read more about Harnessing Predictive Models for Assisting Network Forensic Investigations of DNS Tunnels
• # KAPMiner: Mining Ordered Association Rules with Constraints

2017. Isak Karlsson, Panagiotis Papapetrou, Lars Asker. Advances in Intelligent Data Analysis XVI, 149-161

Conference

We study the problem of mining ordered association rules from event sequences. Ordered association rules differ from regular association rules in that the events occurring in the antecedent (left hand side) of the rule are temporally constrained to occur strictly before the events in the consequent (right hand side). We argue that such constraints can provide more meaningful rules in particular application domains, such as health care. The importance and interestingness of the extracted rules are quantified by adapting existing rule mining metrics. Our experimental evaluation on real data sets demonstrates the descriptive power of ordered association rules against ordinary association rules.

• # Learning from Administrative Health Registries

2017. Jonathan Rebane (et al.). SoGood 2017: Data Science for Social Good

Conference

Over the last decades the healthcare domain has seen a tremendous increase and interest in methods for making inference about patient care using large quantities of medical data. Such data is often stored in electronic health records and administrative health registries. As these data sources have grown increasingly complex, with millions of patients represented by thousands of attributes, static or time evolving, finding relevant and accurate patterns that can be used for predictive or descriptive modelling is impractical for human experts. In this paper, we concentrate our review on Swedish Administrative Health Registries (AHRs) and Electronic Health Records (EHRs) and provide an overview of recent and ongoing work in the area with focus on adverse drug events (ADEs) and heart failure.

• # Learning from heterogeneous temporal data from electronic health records

2017. Jing Zhao (et al.). Journal of Biomedical Informatics 65, 105-119

Article

Electronic health records contain large amounts of longitudinal data that are valuable for biomedical informatics research. The application of machine learning is a promising alternative to manual analysis of such data. However, the complex structure of the data, which includes clinical events that are unevenly distributed over time, poses a challenge for standard learning algorithms. Some approaches to modeling temporal data rely on extracting single values from time series; however, this leads to the loss of potentially valuable sequential information. How to better account for the temporality of clinical data, hence, remains an important research question. In this study, novel representations of temporal data in electronic health records are explored. These representations retain the sequential information, and are directly compatible with standard machine learning algorithms. The explored methods are based on symbolic sequence representations of time series data, which are utilized in a number of different ways. An empirical investigation, using 19 datasets comprising clinical measurements observed over time from a real database of electronic health records, shows that using a distance measure to random subsequences leads to substantial improvements in predictive performance compared to using the original sequences or clustering the sequences. Evidence is moreover provided on the quality of the symbolic sequence representation by comparing it to sequences that are generated using domain knowledge by clinical experts. The proposed method creates representations that better account for the temporality of clinical events, which is often key to prediction tasks in the biomedical domain.

Read more about Learning from heterogeneous temporal data from electronic health records
• # Mining disproportional itemsets for characterizing groups of heart failure patients from administrative health records

2017. Isak Karlsson (et al.). Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, 394-398

Conference

Heart failure is a serious medical conditions involving decreased quality of life and an increased risk of premature death. A recent evaluation by the Swedish National Board of Health and Welfare shows that Swedish heart failure patients are often undertreated and do not receive basic medication as recommended by the national guidelines for treatment of heart failure. The objective of this paper is to use registry data to characterize groups of heart failure patients, with an emphasis on basic treatment. Towards this end, we explore the applicability of frequent itemset mining and disproportionality analysis for finding interesting and distinctive characterizations of a target group of patients, e.g., those who have received basic treatment, against a control group, e.g., those who have not received basic treatment. Our empirical evaluation is performed on data extracted from administrative health records from the Stockholm County covering the years 2010--2016. Our findings suggest that frequency is not always the most appropriate measure of importance for frequent itemsets, while itemset disproportionality against a control group provides alternative rankings of the extracted itemsets leading to some medically intuitive characterizations of the target groups.

Read more about Mining disproportional itemsets for characterizing groups of heart failure patients from administrative health records
• # On Searching and Indexing Sequences of Temporal Intervals

2017. Orestis Kostakis, Panagiotis Papapetrou. Data mining and knowledge discovery 31 (3), 809-850

Article

In several application domains, including sign language, sensor networks, and medicine, events are not necessarily instantaneous but they may have a time duration. Such events build sequences of temporal intervals, which may convey useful domain knowledge; thus, searching and indexing these sequences is crucial. We formulate the problem of comparing sequences of labeled temporal intervals and present a distance measure that can be computed in polynomial time. We prove that the distance measure is metric and satisfies the triangle inequality. For speeding up search in large databases of sequences of temporal intervals, we propose an approximate indexing method that is based on embeddings. The proposed indexing framework is shown to be contractive and can guarantee no false dismissal. The distance measure is tested and benchmarked through rigorous experimentation on real data taken from several application domains, including: American Sign Language annotated video recordings, robot sensor data, and Hepatitis patient data. In addition, the indexing scheme is tested on a large synthetic dataset. Our experiments show that speedups of over an order of magnitude can be achieved while maintaining high levels of accuracy. As a result of our work, it becomes possible to implement recommender systems, search engines and assistive applications for the fields that employ sequences of temporal intervals.

• # A Multi-granularity Pattern-based Sequence Classification Framework for Educational Data

2016. Mohammad Jaber (et al.). 3rd IEEE International Conference on Data Science and Advanced Analytics, 370-378

Conference

In many application domains, such as education, sequences of events occurring over time need to be studied in order to understand the generative process behind these sequences, and hence classify new examples. In this paper, we propose a novel multi-granularity sequence classification framework that generates features based on frequent patterns at multiple levels of time granularity. Feature selection techniques are applied to identify the most informative features that are then used to construct the classification model. We show the applicability and suitability of the proposed framework to the area of educational data mining by experimenting on an educational dataset collected from an asynchronous communication tool in which students interact to accomplish an underlying group project. The experimental results showed that our model can achieve competitive performance in detecting the students' roles in their corresponding projects, compared to a baseline similarity-based approach.

Read more about A Multi-granularity Pattern-based Sequence Classification Framework for Educational Data
• # Advances in Intelligent Data Analysis XV

2016. Henrik Boström (et al.).

Conference

This book constitutes the refereed conference proceedings of the 15th International Conference on Intelligent Data Analysis, which was held in October 2016 in Stockholm, Sweden. The 36 revised full papers presented were carefully reviewed and selected from 75 submissions. The traditional focus of the IDA symposium series is on end-to-end intelligent support for data analysis. The symposium aims to provide a forum for inspiring research contributions that might be considered preliminary in other leading conferences and journals, but that have a potentially dramatic impact.

• # Clustering with Confidence

2016. Andreas Henelius (et al.).

Article

Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or re-running a clustering algorithm involving some stochastic component may lead to completely different clusters. There is, hence, a need for techniques that can quantify the instability of the generated clusters. In this study, we propose a technique for quantifying the instability of a clustering solution and for finding robust clusters, termed core clusters, which correspond to clusters where the co-occurrence probability of each data item within a cluster is at least 1−α  . We demonstrate how solving the core clustering problem is linked to finding the largest maximal cliques in a graph. We show that the method can be used with both clustering and classification algorithms. The proposed method is tested on both simulated and real datasets. The results show that the obtained clusters indeed meet the guarantees on robustness.

• # Early Random Shapelet Forest

2016. Isak Karlsson, Panagiotis Papapetrou, Henrik Boström. Discovery Science, 261-276

Conference

Early classification of time series has emerged as an increasingly important and challenging problem within signal processing, especially in domains where timely decisions are critical, such as medical diagnosis in health-care. Shapelets, i.e., discriminative sub-sequences, have been proposed for time series classification as a means to capture local and phase independent information. Recently, forests of randomized shapelet trees have been shown to produce state-of-the-art predictive performance at a low computational cost. In this work, they are extended to allow for early classification of time series. An extensive empirical investigation is presented, showing that the proposed algorithm is superior to alternative state-of-the-art approaches, in case predictive performance is considered to be more important than earliness. The algorithm allows for tuning the trade-off between accuracy and earliness, thereby supporting the generation of early classifiers that can be dynamically adapted to specific needs at low computational cost.

• # Entropy-based Prediction of Network Protocols in the Forensic Analysis of DNS Tunnels

2016. Irvin Homem, Panagiotis Papapetrou, Spyridon Dosis.

Conference

DNS tunneling techniques are often used for malicious purposes but network security mechanisms have struggled to detect these. Network forensic analysis has thus been used but has proved slow and effort intensive as Network Forensics Analysis Tools struggle to deal with undocumented or new network tunneling techniques. In this paper we present a method to aid forensic analysis through automating the inference of protocols tunneled within DNS tunneling techniques. We analyze the internal packet structure of DNS tunneling techniques and characterize the information entropy of different network protocols and their DNS tunneled equivalents. From this, we present our protocol prediction method that uses entropy distribution averaging. Finally we apply our method on a dataset to measure its performance and show that it has a prediction accuracy of 75%. Our method also preserves privacy as it does not parse the actual tunneled content, rather it only calculates the information entropy.

Read more about Entropy-based Prediction of Network Protocols in the Forensic Analysis of DNS Tunnels
• # FindMyDoc

2016. Myrsini Glinos (et al.). Proceedings of the 9th ACM International Conference on Pervasive Technologies Related to Assistive Environments

Conference

A variety of eHealth apps exist today ranging from ovulation calculators, such as Glow, to more sophisticated systems for determining the right therapist via semantic analysis, such as Talkspace, or ZocDoc. Despite their promising functionality, existing systems offer limited capabilities in terms of search filters, reviews, and doctor recommendations. In this paper, we propose FindMyDoc, a novel peer-to-peer healthcare platform that goes beyond existing traditional healthcare models. It provides doctor recommendations by allowing proper filtering based on treatment procedures, quality of treatment, and reviews of healthcare providers. In addition, the search results are refined using a recommendation engine that employs user-based collaborative filtering and exploits a set of predefined review options provided by the patients in order to match them with doctors.

• # Generalized random shapelet forests

2016. Isak Karlsson, Panagiotis Papapetrou, Henrik Boström. Data mining and knowledge discovery 30 (5), 1053-1085

Article

Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which in addition to the inherent difficulty of the decision tree learning algorithm to effectively handle high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.

• # Identifying Factors for the Effectiveness of Treatment of Heart Failure

2016. Lars Asker (et al.). IEEE 29th International Symposiumon Computer-Based Medical Systems

Conference

An administrative health register containing health care data for over 2 million patients will be used to search for factors that can affect the treatment of heart failure. In the study, we will measure the effects of employed treatment for various groups of heart failure patients, using different measures of effectiveness. Significant deviations in effectiveness of treatments of the various patient groups will be reported and factors that may help explaining the effect of treatment will be analyzed. Identification of the most important factors that may help explain the observed deviations between the different groups will be derived through generation of predictive models, for which variable importance can be calculated. The findings may affect recommended treatments as well as high-lighting deviations from national guidelines.

Read more about Identifying Factors for the Effectiveness of Treatment of Heart Failure
• # Learning from Swedish Healthcare Data

2016. Lars Asker, Panagiotis Papapetrou, Henrik Boström. Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments

Conference

We present two ongoing projects aimed at learning from health care records. The first project, DADEL, is focusing on high-performance data mining for detrecting adverse drug events in healthcare, and uses electronic patient records covering seven years of patient record data from the Stockholm region in Sweden. The second project is focusing on heart failure and on understanding the differences in treatment between various groups of patients. It uses a Swedish administrative health register containing health care data for over two million patients.

• # Query-sensitive Distance Measure Selection for Time Series Nearest Neighbor Classification

2016. Alexios Kotsifakos, Vassilis Athitsos, Panagiotis Papapetrou. Intelligent Data Analysis 20 (1), 5-27

Article

Many distance or similarity measures have been proposed for time series similarity search. However, none of these measures is guaranteed to be optimal when used for 1-Nearest Neighbor (NN) classification. In this paper we study the problem of selecting the most appropriate distance measure, given a pool of time series distance measures and a query, so as to perform NN classification of the query. We propose a framework for solving this problem, by identifying, given the query, the distance measure most likely to produce the correct classification result for that query. From this proposed framework, we derive three specific methods, that differ from each other in the way they estimate the probability that a distance measure correctly classifies a query object. In our experiments, our pool of measures consists of Dynamic TimeWarping (DTW), Move-Split-Merge (MSM), and Edit distance with Real Penalty (ERP). Based on experimental evaluation with 45 datasets, the best-performing of the three proposed methods provides the best results in terms of classification error rate, compared to the competitors, which include using the Cross Validation method for selecting the distance measure in each dataset, as well as using a single specific distance measure (DTW, MSM, or ERP) across all datasets.

Read more about Query-sensitive Distance Measure Selection for Time Series Nearest Neighbor Classification
• # STIFE: A Framework for Feature-based Classification of Sequences of Temporal Intervals

2016. Leon Bornemann, Jason Lecerf, Panagiotis Papapetrou. Discovery Science, 85-100

Conference

In this paper, we study the problem of classification of sequences of temporal intervals. Our main contribution is the STIFE framework for extracting relevant features from interval sequences to build feature-based classifiers. STIFE uses a combination of basic static metrics, shapelet discovery and selection, as well as distance-based approaches. Additionally, we propose an improved way of computing the state of the art IBSM distance measure between two interval sequences, that reduces both runtime and memory needs from pseudo-polynomial to fully polynomial, which greatly reduces the runtime of distance based classification approaches. Our empirical evaluation not only shows that STIFE provides a very fast classification time in all evaluated scenarios but also reveals that a random forests using STIFE achieves similar or better accuracy than the state of the art k-NN classifier.

Read more about STIFE: A Framework for Feature-based Classification of Sequences of Temporal Intervals
• # Semigeometric Tiling of Event Sequences

2016. Andreas Henelius (et al.). Machine Learning and Knowledge Discovery in Databases, 329-344

Conference

Event sequences are ubiquitous, e.g., in finance, medicine, and social media. Often the same underlying phenomenon, such as television advertisements during Superbowl, is reflected in independent event sequences, like different Twitter users. It is hence of interest to find combinations of temporal segments and subsets of sequences where an event of interest, like a particular hashtag, has an increased occurrence probability. Such patterns allow exploration of the event sequences in terms of their evolving temporal dynamics, and provide more fine-grained insights to the data than what for example straightforward clustering can reveal. We formulate the task of finding such patterns as a novel matrix tiling problem, and propose two algorithms for solving it. Our first algorithm is a greedy set-cover heuristic, while in the second approach we view the problem as time-series segmentation. We apply the algorithms on real and artificial datasets and obtain promising results. The software related to this paper is available at https://github.com/bwrc/semigeom-r.

• # Significance testing of word frequencies in corpora

2016. Jefrey Lijffijt (et al.). Digital Scholarship in the Humanities 31 (2), 374-397

Article

Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the χ2 and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.

• # Analysing Online Education-based Asynchronous Communication Tools to Detect Students' Roles

2015. Mohammad Jaber (et al.). CSEDU 2015 - Proceedings of the 7th International Conference on Computer Supported Education, 416-424

Conference

This paper studies the application of Educational Data Mining to examine the online communication behaviour of students working together on the same project in order to identify the different roles played by the students. Analysis was carried out using real data from students' participation in project communication tools. Several sets of features including individual attributes and information about the interactions between the project members were used to train different classification algorithms. The results show that considering the individual attributes of students provided regular classification performance. The inclusion of information about the reply relationships among the project members generally improved mapping students to their roles. However, it was necessary to add time-based'' features in order to achieve the best classification results, which showed both precision and recall of over 95\% for a number of algorithms. Most of these time-based'' features coincided with the first weeks of the experience, which indicates the importance of initial interactions between project members.

Read more about Analysing Online Education-based Asynchronous Communication Tools to Detect Students' Roles
• # DRESS

2015. Alexios Kotsifakos (et al.). Data mining and knowledge discovery 29 (5), 1280-1311

Article

Similarity search in large sequence databases is a problem ubiquitous in a wide range of application domains, including searching biological sequences. In this paper we focus on protein and DNA data, and we propose a novel approximate method method for speeding up range queries under the edit distance. Our method works in a filter-and-refine manner, and its key novelty is a query-sensitive mapping that transforms the original string space to a new string space of reduced dimensionality. Specifically, it first identifies the most frequent codewords in the query, and then uses these codewords to convert both the query and the database to a more compact representation. This is achieved by replacing every occurrence of each codeword with a new letter and by removing the remaining parts of the strings. Using this new representation, our method identifies a set of candidate matches that are likely to satisfy the range query, and finally refines these candidates in the original space. The main advantage of our method, compared to alternative methods for whole sequence matching under the edit distance, is that it does not require any training to create the mapping, and it can handle large query lengths with negligible losses in accuracy. Our experimental evaluation demonstrates that, for higher range values and large query sizes, our method produces significantly lower costs and runtimes compared to two state-of-the-art competitor methods.

• # Embedding-based subsequence matching with gaps-range-tolerances

2015. Alexios Kotsifakos (et al.). The VLDB journal 24 (4), 519-536

Article

We present a subsequence matching framework that allows for gaps in both query and target sequences, employs variable matching tolerance efficiently tuned for each query and target sequence, and constrains the maximum matching range. Using this framework, a dynamic programming method is proposed, called SMBGT, that, given a short query sequence Q and a large database, identifies in quadratic time the subsequence of the database that best matches Q. SMBGT is highly applicable to music retrieval. However, in Query-By-Humming applications, runtime is critical. Hence, we propose a novel embedding-based approach, called ISMBGT, for speeding up search under SMBGT. Using a set of reference sequences, ISMBGT maps both Q and each position of each database sequence into vectors. The database vectors closest to the query vector are identified, and SMBGT is then applied between Q and the subsequences that correspond to those database vectors. The key novelties of ISMBGT are that it does not require training, it is query sensitive, and it exploits the flexibility of SMBGT. We present an extensive experimental evaluation using synthetic and hummed queries on a large music database. Our findings show that ISMBGT can achieve speedups of up to an order of magnitude against brute-force search and over an order of magnitude against cDTW, while maintaining a retrieval accuracy very close to that of brute-force search.

• # Extracting news text from web pages

2015. Erik Lundgren, Lars Asker, Panagiotis Papapetrou. Proceeding PETRA '15 Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments

Conference
• # Finding the longest common sub-pattern in sequences of temporal intervals

2015. Orestis Kostakis, Panagiotis Papapetrou. Data mining and knowledge discovery 29 (5), 1178-1210

Article

We study the problem of finding the longest common sub-pattern (LCSP) shared by two sequences of temporal intervals. In particular we are interested in finding the LCSP of the corresponding arrangements. Arrangements of temporal intervals are a powerful way to encode multiple concurrent labeled events that have a time duration. Discovering commonalities among such arrangements is useful for a wide range of scientific fields and applications, as it can be seen by the number and diversity of the datasets we use in our experiments. In this paper, we define the problem of LCSP and prove that it is NP-complete by demonstrating a connection between graphs and arrangements of temporal intervals. This connection leads to a series of interesting open problems. In addition, we provide an exact algorithm to solve the LCSP problem, and also propose and experiment with three polynomial time and space under-approximation techniques. Finally, we introduce two upper bounds for LCSP and study their suitability for speeding up 1-NN search. Experiments are performed on seven datasets taken from a wide range of real application domains, plus two synthetic datasets. Lastly, we describe several application cases that demonstrate the need and suitability of LCSP.

Read more about Finding the longest common sub-pattern in sequences of temporal intervals
• # Forests of Randomized Shapelet Trees

2015. Isak Karlsson, Panagiotis Papapetrou, Henrik Boström. Statistical Learning and Data Sciences, 126-136

Conference

Shapelets have recently been proposed for data series classification, due to their ability to capture phase independent and local information. Decision trees based on shapelets have been shown to provide not only interpretable models, but also, in many cases, state-of-the-art predictive performance. Shapelet discovery is however computationally costly, and although several techniques for speeding up the technique have been proposed, the computational cost is still in many cases prohibitive. In this work, an ensemble based method, referred to as Random Shapelet Forest (RSF), is proposed, which builds on the success of the random forest algorithm, and which is shown to have a lower computational complexity than the original shapelet tree learning algorithm. An extensive empirical investigation shows that the algorithm provides competitive predictive performance and that a proposed way of calculating importance scores can be used to successfully identify influential regions.

• # GoldenEye++

2015. Andreas Henelius (et al.). Statistical Learning and Data Sciences, 96-105

Conference

Models with high predictive performance are often opaque, i.e., they do not allow for direct interpretation, and are hence of limited value when the goal is to understand the reasoning behind predictions. A recently proposed algorithm, GoldenEye, allows detection of groups of interacting variables exploited by a model. We employed this technique in conjunction with random forests generated from data obtained from electronic patient records for the task of detecting adverse drug events (ADEs). We propose a refined version of the GoldenEye algorithm, called GoldenEye++, utilizing a more sensitive grouping metric. An empirical investigation comparing the two algorithms on 27 datasets related to detecting ADEs shows that the new version of the algorithm in several cases finds groups of medically relevant interacting attributes, corresponding to prescribed drugs, undetected by the previous version. This suggests that the GoldenEye++ algorithm can be a useful tool for finding novel (adverse) drug interactions.

• # Multi-channel ECG classification using forests of randomized shapelet trees

2015. Isak Karlsson, Panagiotis Papapetrou, Lars Asker. Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments

Conference

Data series of multiple channels occur at high rates and in massive quantities in several application domains, such as healthcare. In this paper, we study the problem of multi-channel ECG classification. We map this problem to multivariate data series classification and propose five methods for solving it, using a split-and-combine approach. The proposed framework is evaluated using three base-classifiers on real-world data for detecting Myocardial Infarction. Extensive experiments are performed on real ECG data extracted from the Physiobank data repository. Our findings emphasize the importance of selecting an appropriate base-classifier for multivariate data series classification, while demonstrating the superiority of the Random Shapelet Forest (0.825 accuracy) against competitor methods (0.664 accuracy for 1-NN under cDTW).

Read more about Multi-channel ECG classification using forests of randomized shapelet trees
• # Size matters

2015. Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamaki. Data mining and knowledge discovery 29 (6), 1838-1864

Article

In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.

• # A peek into the black box

2014. Andreas Henelius (et al.). Data mining and knowledge discovery 28 (5-6), 1503-1529

Article

Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm is demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.

• # Benchmarking Dynamic Time Warping on Nearest Neighbor Classification of Electrocardiograms

2014. Nikolaos Tselas, Panagiotis Papapetrou. Proceedings of the 7th International Conference on PErvasive Technologies Related to Assistive Environments

Conference

The human cardiovascular system is a complicated structure that has been the focus of research in many different domains, such as medicine, biology, as well as computer science. Due to the complexity of the heart, even nowadays some of the most common disorders are still hard to identify. In this paper, we map each ECG to a time series or set of time series and explore the applicability of two common time series similarity matching methods, namely, DTW and cDTW, to the problem of ECG classification. We benchmark the two methods on four different datasets in terms of accuracy. In addition, we explore their predictive performance when various ECG channels are taken into account. The latter is performed using a dataset taken from Physiobank. Our findings suggest that different ECG channels are more appropriate for different cardiovascular malfunctions.

Read more about Benchmarking Dynamic Time Warping on Nearest Neighbor Classification of Electrocardiograms
• # Inferring Offline Hierarchical Ties from Online Social Networks

2014. Mohammad Jaber (et al.). Proceedings of the companion publication of the 23rd international conference on World wide web companion, 1261-1266

Conference

Social networks can represent many different types of relationships between actors, some explicit and some implicit. For example, email communications between users may be represented explicitly in a network, while managerial relationships may not. In this paper we focus on analyzing explicit interactions among actors in order to detect hierarchical social relationships that may be implicit. We start by employing three well-known ranking-based methods, PageRank, Degree Centrality, and Rooted-PageRank (RPR) to infer such implicit relationships from interactions between actors. Then we propose two novel approaches which take into account the time-dimension of interactions in the process of detecting hierarchical ties. We experiment on two datasets, the Enron email dataset to infer manager-subordinate relationships from email exchanges, and a scientific publication co-authorship dataset to detect PhD advisor-advisee relationships from paper co-authorships. Our experiments show that time-based methods perform considerably better than ranking-based methods. In the Enron dataset, they detect 48% of manager-subordinate ties versus 32% found by Rooted-PageRank. Similarly, in co-author dataset, they detect 62% of advisor-advisee ties compared to only 39% by Rooted-PageRank.

• # Mining Candidates for Adverse Drug Interactions in Electronic Patient Records

2014. Lars Asker (et al.). PETRA '14 Proceedings of the 7th International Conference on Pervasive Technologies Related to Assistive Environments, PETRA’14

Conference

Electronic patient records provide a valuable source of information for detecting adverse drug events. In this paper, we explore two different but complementary approaches to extracting useful information from electronic patient records with the goal of identifying candidate drugs, or combinations of drugs, to be further investigated for suspected adverse drug events. We propose a novel filter-and-refine approach that combines sequential pattern mining and disproportionality analysis. The proposed method is expected to identify groups of possibly interacting drugs suspected for causing certain adverse drug events. We perform an empirical investigation of the proposed method using a subset of the Stockholm electronic patient record corpus. The data used in this study consists of all diagnoses and medications for a group of patients diagnoses with at least one heart related diagnosis during the period 2008--2010. The study shows that the method indeed is able to detect combinations of drugs that occur more frequently for patients with cardiovascular diseases than for patients in a control group, providing opportunities for finding candidate drugs that cause adverse drug effects through interaction.

• # Model-Based Time Series Classification

2014. Alexios Kotsifakos, Panagiotis Papapetrou. Advances in Intelligent Data Analysis XIII, 179-191

Conference

We propose MTSC, a filter-and-refine framework for time series Nearest Neighbor (NN) classification. Training time series belonging to certain classes are first modeled through Hidden Markov Models (HMMs). Given an unlabeled query, and at the filter step, we identify the top K models that have most likely produced the query. At the refine step, a distance measure is applied between the query and all training time series of the top K models. The query is then assigned with the class of the NN. In our experiments, we first evaluated the NN classification error rate of HMMs compared to three state-of-the-art distance measures on 45 time series datasets of the UCR archive, and showed that modeling time series with HMMs achieves lower error rates in 30 datasets and equal error rates in 4. Secondly, we compared MTSC with Cross Validation defined over the three measures on 33 datasets, and we observed that MTSC is at least as good as the competitor method in 23 datasets, while achieving competitive speedups, showing its effectiveness and efficiency.

• # Using Time-Sensitive Rooted PageRank to Detect Hierarchical Social Relationships

2014. Mohammad Jaber (et al.). Advances in Intelligent Data Analysis XIII, 143-154

Conference

We study the problem of detecting hierarchical ties in a social network by exploiting the interaction patterns between the actors (members) involved in the network. Motivated by earlier work using a rank-based approach, i.e., Rooted-PageRank, we introduce a novel time-sensitive method, called T-RPR, that captures and exploits the dynamics and evolution of the interaction patterns in the network in order to identify the underlying hierarchical ties. Experiments on two real datasets demonstrate the performance of T-RPR in terms of recall and show its superiority over a recent competitor method

Read more about Using Time-Sensitive Rooted PageRank to Detect Hierarchical Social Relationships
• # Genre classification of symbolic music with SMBGT

2013. Alexios Kotsifakos (et al.). Proceedings of the 6th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA)

Conference

Automatic music genre classification is a task that has attracted the interest of the music community for more than two decades. Music can be of high importance within the area of assistive technologies as it can be seen as an assistive technology with high therapeutic and educational functionality for children and adults with disabilities. Several similarity methods and machine learning techniques have been applied in the literature to deal with music genre classification, and as a result data mining and Music Information Retrieval (MIR) are strongly interconnected. In this paper, we deal with music genre classification for symbolic music, and specifically MIDI, by combining the recently proposed novel similarity measure for sequences, SMBGT, with the k-Nearest Neighbor (k-NN) classifier. For all MIDI songs we first extract all of their channels and then transform each channel into a sequence of 2D points, providing information for pitch and duration of their music notes. The similarity between two songs is found by computing the SMBGT for all pairs of the songs' channels and getting the maximum pairwise channel score as their similarity. Each song is treated as a query to which k-NN is applied, and the returned genre of the classifier is the one with the majority of votes in the k neighbors. Classification accuracy results indicate that there is room for improvement, especially due to the ambiguous definitions of music genres that make it hard to clearly discriminate them. Using this framework can also help us analyze and understand potential disadvantages of SMBGT, and thus identify how it can be improved when used for classification of real-time sequences.

• # Hum-a-song

2012. Alexios Kotsifakos (et al.). Proceedings of the VLDB Endowment 5 (12), 1930-1933

Article

We present "Hum-a-song", a system built for music retrieval, and particularly for the Query-By-Humming (QBH) application. According to QBH, the user is able to hum a part of a song that she recalls and would like to learn what this song is, or find other songs similar to it in a large music repository. We present a simple yet efficient approach that maps the problem to time series subsequence matching. The query and the database songs are represented as 2-dimensional time series conveying information about the pitch and the duration of the notes. Then, since the query is a short sequence and we want to find its best match that may start and end anywhere in the database, subsequence matching methods are suitable for this task. In this demo, we present a system that employs and exposes to the user a variety of state-of-the-art dynamic programming methods, including a newly proposed efficient method named SMBGT that is robust to noise and considers all intrinsic problems in QBH; it allows variable tolerance levels when matching elements, where tolerances are defined as functions of the compared sequences, gaps in both the query and target sequences, and bounds the matching length and (optionally) the minimum number of matched elements. Our system is intended to become open source, which is to the best of our knowledge the first non-commercial effort trying to solve QBH with a variety of methods, and that also approaches the problem from the time series perspective.