Stockholm University

Hercules Dalianis, Professor

About me

Hercules Dalianis holds a master's degree (civ.ing.) in electrical engineering from KTH, Stockholm (1984). He was awarded a doctorate of technology at KTH in 1996 and became professor of computer and systems sciences at Stockholm University in 2011.

Dalianis's specialization is natural language processing of text, mainly Swedish. He has carried out research in natural language generation, automatic text summarization, information retrieval and, for the last 15 years, clinical text mining, an area popularly referred to as artificial intelligence. Dalianis has published a textbook in the field, Clinical Text Mining: Secondary Use of Electronic Patient Records, which is published under Open Access.

Hercules Dalianis is director of the research infrastructure Health Bank - Swedish Health Record Research Bank.

See Hercules Dalianis's homepage at DSV to read more.

Publications

A selection from the Stockholm University publication database

  • Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

    2022. Thomas Vakili (et al.). Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 4245-4252

    Conference

    Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
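
    As an illustration of the two corpus-construction strategies described above, the following sketch contrasts sentence removal with surrogate replacement over already-detected PHI spans. All names and data structures here are hypothetical; the paper's actual pipeline is not reproduced.

    ```python
    import random

    # Hypothetical surrogate pools; a real system would draw from realistic
    # name/place/date generators of the kind the paper describes.
    SURROGATES = {
        "FIRST_NAME": ["Anna", "Erik", "Maria"],
        "LAST_NAME": ["Svensson", "Lindqvist"],
        "LOCATION": ["Uppsala", "Lund"],
    }

    def remove_sentences(sentences, phi_spans):
        """Strategy 1: drop every sentence containing at least one PHI span."""
        return [s for i, s in enumerate(sentences) if not phi_spans[i]]

    def replace_with_surrogates(sentences, phi_spans):
        """Strategy 2: replace each detected PHI token with a realistic surrogate."""
        out = []
        for sent, spans in zip(sentences, phi_spans):
            tokens = sent.split()
            for idx, label in spans:  # each span: (token index, PHI class)
                tokens[idx] = random.choice(SURROGATES.get(label, ["[REDACTED]"]))
            out.append(" ".join(tokens))
        return out
    ```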

  • Implementation of specialised attention mechanisms: ICD-10 classification of Gastrointestinal discharge summaries in English, Spanish and Swedish

    2022. Alberto Blanco (et al.). Journal of Biomedical Informatics 130

    Article

    Multi-label classification according to the International Classification of Diseases (ICD) is an Extreme Multi-label Classification task aiming to categorise health records according to a set of relevant ICD codes. We implemented PlaBERT, a new multi-label text classification head with per-label attention, on top of a BERT model. The model assessment is conducted on Electronic Health Records, conveying Discharge Summaries in three languages – English, Spanish, and Swedish. The study focuses on 157 diagnostic codes from the ICD. We additionally measure the labelling noise to estimate the consistency of the gold standard. Our specialised attention mechanism computes attention weights for each input token and label pair, obtaining the specific relevance of every word concerning each ICD code. The PlaBERT model outputs the computed attention importance for each token and label, allowing for visualisation. Our best results are 40.65, 38.36, and 41.13 F1-Score points on the English, Spanish and Swedish datasets, respectively, for the 157 gastrointestinal codes. Besides, Precision is the metric that most significantly improves owing to the attention mechanism of PlaBERT, with an increase of 44.63, 40.93, and 12.92 points, respectively, for the Spanish, Swedish and English datasets.
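
    The core idea of a per-label attention head can be sketched in a few lines of PyTorch. This is a minimal illustration of the mechanism, not the published PlaBERT implementation; layer names and shapes are assumptions.

    ```python
    import torch
    import torch.nn as nn

    class PerLabelAttentionHead(nn.Module):
        """Multi-label head with one attention distribution per label."""

        def __init__(self, hidden_size: int, num_labels: int):
            super().__init__()
            self.label_queries = nn.Linear(hidden_size, num_labels)  # one scoring vector per label
            self.classifier = nn.Linear(hidden_size, num_labels)

        def forward(self, token_states, attention_mask):
            # token_states: (batch, seq, hidden); attention_mask: (batch, seq)
            scores = self.label_queries(token_states)                 # (batch, seq, labels)
            scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
            attn = torch.softmax(scores, dim=1)                       # attention over tokens, per label
            # Per-label weighted sum of token states: (batch, labels, hidden)
            label_repr = torch.einsum("bsl,bsh->blh", attn, token_states)
            logits = (label_repr * self.classifier.weight).sum(-1) + self.classifier.bias
            return logits, attn   # attn gives token/label relevance for visualisation
    ```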

  • Utility Preservation of Clinical Text After De-Identification

    2022. Thomas Vakili, Hercules Dalianis. Proceedings of the 21st Workshop on Biomedical Language Processing, 383-388

    Conference

    Electronic health records contain valuable information about symptoms, diagnosis, treatment and outcomes of the treatments of individual patients. However, the records may also contain information that can reveal the identity of the patients. Removing these identifiers - the Protected Health Information (PHI) - can protect the identity of the patient. Automatic de-identification is a process which employs machine learning techniques to detect and remove PHI. However, automatic techniques are imperfect in their precision and introduce noise into the data. This study examines the impact of this noise on the utility of Swedish de-identified clinical data by using human evaluators and by training and testing BERT models. Our results indicate that de-identification does not harm the utility for clinical NLP and that human evaluators are less sensitive to noise from de-identification than expected.

  • Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

    2021. Thomas Vakili, Hercules Dalianis. Proceedings of the AAAI 2021 Fall Symposium on Human Partnership with Medical AI

    Conference

    Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine if there is a correlation between the degree of privacy leaked and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.
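
    Nucleus (top-p) sampling, one of the two generation techniques studied, keeps only the smallest set of tokens whose cumulative probability exceeds p. A minimal sketch, assuming a 1-D tensor of vocabulary logits:

    ```python
    import torch

    def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
        """Sample one token id with nucleus (top-p) sampling: restrict to the
        smallest set of tokens whose cumulative probability exceeds p, then
        renormalise and sample. `logits` is 1-D over the vocabulary."""
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        cutoff = int((cumulative < p).sum().item()) + 1  # first index crossing p
        kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
        choice = torch.multinomial(kept, num_samples=1).item()
        return int(sorted_ids[choice].item())
    ```

    Top-k sampling works analogously, truncating to a fixed number of tokens instead of a probability mass.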

  • De-identification of Clinical Text for Secondary Use: Research Issues

    2021. Hanna Berg (et al.). Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies - (Volume 5), 592-599

    Conference

    Privacy is challenged by both advances in AI-related technologies and recently introduced legal regulations. The problem of privacy has been extensively studied within the privacy community, but has largely focused on methods for protecting and assessing the privacy of structured data. Research aiming to protect the integrity of patients based on clinical text has primarily referred to US law and relied on automatically recognising predetermined, both direct and indirect, identifiers. This article discusses the various challenges concerning the re-use of unstructured clinical data, in particular in the form of clinical text, and focuses on ambiguous and vague terminology, how different legislation affects the requirements for de-identification, differences between methods for unstructured and structured data, the impact of approaches based on named entity recognition and replacing sensitive data with surrogates, as well as the lack of measures for usability and re-identification risk.

  • Multi-label Diagnosis Classification of Swedish Discharge Summaries – ICD-10 Code Assignment Using KB-BERT

2021. Sonja Remmer, Anastasios Lamproudis, Hercules Dalianis. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 1158-1166

    Conference

The International Classification of Diseases (ICD) is a system for systematically recording patients’ diagnoses. Clinicians or professional coders assign ICD codes to patients’ medical records to facilitate funding, research, and administration. In most health facilities, clinical coding is a manual, time-demanding task that is prone to errors. A tool that automatically assigns ICD codes to free-text clinical notes could save time and reduce erroneous coding. While many previous studies have focused on ICD coding, research on Swedish patient records is scarce. This study explored different approaches to pairing Swedish clinical notes with ICD codes. KB-BERT, a BERT model pre-trained on Swedish text, was compared to the traditional supervised learning models Support Vector Machines, Decision Trees, and K-nearest Neighbours used as the baseline. When considering ICD codes grouped into ten blocks, the KB-BERT was superior to the baseline models, obtaining an F1-micro of 0.80 and an F1-macro of 0.58. When considering the 263 full ICD codes, the KB-BERT was outperformed by all baseline models at an F1-micro and F1-macro of zero. Wilcoxon signed-rank tests showed that the performance differences between the KB-BERT and the baseline models were statistically significant.
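
    A rough sketch of a classical baseline of the kind compared against here: TF-IDF features with a one-vs-rest linear SVM for multi-label ICD classification. The toy notes and labels are invented, and scikit-learn stands in for whatever toolkit was actually used.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    # Toy stand-ins for discharge summaries and their ICD-10 block labels.
    notes = ["buksmärta och illamående", "pneumoni med feber", "buksmärta efter operation"]
    labels = [["XI"], ["X"], ["XI", "XIX"]]

    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(labels)          # multi-label indicator matrix

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          OneVsRestClassifier(LinearSVC()))
    model.fit(notes, y)
    pred = model.predict(notes)
    print(f1_score(y, pred, average="micro"), f1_score(y, pred, average="macro"))
    ```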

  • Terminology Expansion with Prototype Embeddings: Extracting Symptoms of Urinary Tract Infection from Clinical Text

    2021. Mahbub Ul Alam (et al.). Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021) - Volume 5: HEALTHINF, 47-57

    Conference

Many natural language processing applications rely on the availability of domain-specific terminologies containing synonyms. To that end, semi-automatic methods for extracting additional synonyms of a given concept from corpora are useful, especially in low-resource domains and noisy genres such as clinical text, where nonstandard language use and misspellings are prevalent. In this study, prototype embeddings based on seed words were used to create representations for (i) specific urinary tract infection (UTI) symptoms and (ii) UTI symptoms in general. Four word embedding methods and two phrase detection methods were evaluated using clinical data from Karolinska University Hospital. It is shown that prototype embeddings can effectively capture semantic information related to UTI symptoms. Using prototype embeddings for specific UTI symptoms led to the extraction of more symptom terms compared to using prototype embeddings for UTI symptoms in general. Overall, 142 additional UTI symptom terms were identified, yielding a more than 100% increment compared to the initial seed set. The mean average precision across all UTI symptoms was 0.51, and as high as 0.86 for one specific UTI symptom. This study provides an effective and cost-effective solution to terminology expansion with small amounts of labeled data.
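
    The prototype-embedding step can be sketched as follows: average the seed-term vectors into a prototype, then rank the vocabulary by cosine similarity to it. A minimal sketch assuming word vectors are available as a plain dict; the published method's phrase detection and evaluation are omitted.

    ```python
    import numpy as np

    def prototype(embeddings: dict, seeds: list) -> np.ndarray:
        """Average the vectors of the seed terms into a prototype embedding."""
        vecs = np.stack([embeddings[w] for w in seeds if w in embeddings])
        return vecs.mean(axis=0)

    def expand_terminology(embeddings, seeds, top_n=20):
        """Rank the vocabulary by cosine similarity to the prototype and
        return the closest candidate terms (excluding the seeds)."""
        proto = prototype(embeddings, seeds)
        proto = proto / np.linalg.norm(proto)
        scored = []
        for word, vec in embeddings.items():
            if word in seeds:
                continue
            sim = float(vec @ proto / np.linalg.norm(vec))
            scored.append((sim, word))
        return [w for _, w in sorted(scored, reverse=True)[:top_n]]
    ```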

  • A Semi-supervised Approach for De-identification of Swedish Clinical Text

    2020. Hanna Berg, Hercules Dalianis. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 4444-4450

    Conference

An abundance of electronic health records (EHRs) is produced every day within healthcare. The records possess valuable information for research and future improvement of healthcare. Multiple efforts have been made to protect the integrity of patients while making electronic health records usable for research by removing personally identifiable information from patient records. Supervised machine learning approaches for de-identification of EHRs need annotated data for training, and such annotations are costly in time and human resources. Annotating clinical text is even more costly, as the process must be carried out in a protected environment with a limited number of annotators who must have signed confidentiality agreements. In this paper, a semi-supervised method is therefore proposed for automatically creating high-quality training data. The study shows that the method can be used to improve recall from 84.75% to 89.20% without sacrificing precision to the same extent, which drops from 95.73% to 94.20%. The model’s recall is arguably more important for de-identification than precision.
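
    The abstract does not spell out the exact procedure, but a generic self-training loop of the kind described, where high-confidence predictions on unlabeled data are added to the training set, looks roughly like this. A scikit-learn classifier stands in for the de-identification model; thresholds are illustrative.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=3):
        """Iteratively pseudo-label unlabeled examples the current model is
        confident about, and retrain on the enlarged training set."""
        model = LogisticRegression(max_iter=1000)
        X, y = np.asarray(X_labeled), np.asarray(y_labeled)
        pool = np.asarray(X_unlabeled)
        for _ in range(rounds):
            model.fit(X, y)
            if len(pool) == 0:
                break
            probs = model.predict_proba(pool)
            confident = probs.max(axis=1) >= threshold
            if not confident.any():
                break
            X = np.vstack([X, pool[confident]])
            y = np.concatenate([y, model.classes_[probs[confident].argmax(axis=1)]])
            pool = pool[~confident]            # shrink the unlabeled pool
        return model
    ```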

  • De-Identifying Swedish EHR Text Using Public Resources in the General Domain

    2020. Taridzo Chomutare (et al.). Digital Personalized Health and Medicine, 148-152

    Conference

    Sensitive data is normally required to develop rule-based or train machine learning-based models for de-identifying electronic health record (EHR) clinical notes; and this presents important problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data; (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model using recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed improved precision and recall from 55.62% and 80.02% with the base EHR embedding layer, to 85.01% and 87.15% when Wikipedia word vectors are added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text; and this could be useful in cases where the data is both sensitive and in low-resource languages.
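
    The embedding-layer idea can be sketched as follows: initialise the embedding matrix of the recurrent model from general-domain (e.g. Wikipedia) word vectors where available, falling back to random vectors otherwise. A minimal PyTorch sketch with assumed inputs; the paper's actual architecture is not reproduced.

    ```python
    import torch
    import torch.nn as nn

    def build_embedding(vocab: dict, pretrained: dict, dim: int = 100) -> nn.Embedding:
        """vocab maps word -> row index; pretrained maps word -> vector.
        Words covered by the general-domain vectors reuse them; the rest
        keep small random initialisations."""
        weight = torch.randn(len(vocab), dim) * 0.1
        for word, idx in vocab.items():
            if word in pretrained:
                weight[idx] = torch.tensor(pretrained[word])
        return nn.Embedding.from_pretrained(weight, freeze=False)
    ```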

  • Deep Learning from Heterogeneous Sequences of Sparse Medical Data for Early Prediction of Sepsis

    2020. Mahbub Ul Alam (et al.). Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020), 45-55

    Conference

    Sepsis is a life-threatening complication to infections, and early treatment is key for survival. Symptoms of sepsis are difficult to recognize, but prediction models using data from electronic health records (EHRs) can facilitate early detection and intervention. Recently, deep learning architectures have been proposed for the early prediction of sepsis. However, most efforts rely on high-resolution data from intensive care units (ICUs). Prediction of sepsis in the non-ICU setting, where hospitalization periods vary greatly in length and data is more sparse, is not as well studied. It is also not clear how to learn effectively from longitudinal EHR data, which can be represented as a sequence of time windows. In this article, we evaluate the use of an LSTM network for early prediction of sepsis according to Sepsis-3 criteria in a general hospital population. An empirical investigation using six different time window sizes is conducted. The best model uses a two-hour window and assumes data is missing not at random, clearly outperforming scoring systems commonly used in healthcare today. It is concluded that the size of the time window has a considerable impact on predictive performance when learning from heterogeneous sequences of sparse medical data for early prediction of sepsis.
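
    One way to realise the time-window representation described above is to bin events into fixed-size windows and pair every value with an explicit observed/missing flag, so the model can exploit a missing-not-at-random assumption. A simplified sketch; the actual feature set and encoding are assumptions.

    ```python
    from datetime import timedelta

    def to_windows(events, start, end, hours=2):
        """Aggregate (timestamp, feature, value) events into fixed-size windows.
        Missing features get an explicit missingness flag rather than being
        imputed, reflecting a missing-not-at-random assumption."""
        n = int((end - start) / timedelta(hours=hours)) + 1
        features = sorted({f for _, f, _ in events})
        windows = [{f: None for f in features} for _ in range(n)]
        for ts, feature, value in events:
            idx = int((ts - start) / timedelta(hours=hours))
            windows[idx][feature] = value      # keep the latest value per window
        # Encode each window as (value-or-0, observed-flag) pairs for the LSTM.
        return [[(w[f] if w[f] is not None else 0.0,
                  1.0 if w[f] is not None else 0.0) for f in features]
                for w in windows]
    ```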

  • Evaluating Pretraining Strategies for Clinical BERT Models

    2022. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 410-416

    Conference

    Research suggests that using generic language models in specialized domains may be sub-optimal due to significant domain differences. As a result, various strategies for developing domain-specific language models have been proposed, including techniques for adapting an existing generic language model to the target domain, e.g. through various forms of vocabulary modifications and continued domain-adaptive pretraining with in-domain data. Here, an empirical investigation is carried out in which various strategies for adapting a generic language model to the clinical domain are compared to pretraining a pure clinical language model. Three clinical language models for Swedish, pretrained for up to ten epochs, are fine-tuned and evaluated on several downstream tasks in the clinical domain. A comparison of the language models’ downstream performance over the training epochs is conducted. The results show that the domain-specific language models outperform a general-domain language model, although there is little difference in performance between the various clinical language models. However, compared to pretraining a pure clinical language model with only in-domain data, leveraging and adapting an existing general-domain language model requires fewer epochs of pretraining with in-domain data.

  • Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models

    2022. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis. Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF, 180-188

    Conference

    Research has shown that using generic language models – specifically, BERT models – in specialized domains may be sub-optimal due to domain differences in language use and vocabulary. There are several techniques for developing domain-specific language models that leverage the use of existing generic language models, including continued and domain-adaptive pretraining with in-domain data. Here, we investigate a strategy based on using a domain-specific vocabulary, while leveraging a generic language model for initialization. The results demonstrate that domain-adaptive pretraining, in combination with a domain-specific vocabulary – as opposed to a general-domain vocabulary – yields improvements on two downstream clinical NLP tasks for Swedish. The results highlight the value of domain-adaptive pretraining when developing specialized language models and indicate that it is beneficial to adapt the vocabulary of the language model to the target domain prior to continued, domain-adaptive pretraining of a generic language model.
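
    One plausible way to implement the vocabulary swap with the Hugging Face transformers library: train a new in-domain tokenizer, then initialise embeddings for overlapping tokens from the generic model. This is a rough sketch, not the paper's implementation; the model id and corpus path are illustrative.

    ```python
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    base = "KB/bert-base-swedish-cased"        # illustrative generic model
    old_tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    with open("clinical_notes.txt") as f:      # illustrative in-domain corpus
        corpus = f.readlines()
    new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=30_000)

    old_vocab = old_tokenizer.get_vocab()
    old_emb = model.get_input_embeddings().weight.data.clone()
    model.resize_token_embeddings(len(new_tokenizer))
    new_emb = model.get_input_embeddings().weight.data
    new_emb.normal_(mean=0.0, std=model.config.initializer_range)  # re-init all rows

    for token, new_id in new_tokenizer.get_vocab().items():
        if token in old_vocab:                 # reuse the pretrained embedding
            new_emb[new_id] = old_emb[old_vocab[token]]
    # Tokens unique to the clinical vocabulary keep their fresh initialisations
    # and are learned during continued domain-adaptive pretraining.
    ```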

  • Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data

2021. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 790-797

    Conference

    The use of pretrained language models, finetuned to perform a specific downstream task, has become widespread in NLP. Using a generic language model in specialized domains may, however, be sub-optimal due to differences in language use and vocabulary. In this paper, it is investigated whether an existing, generic language model for Swedish can be improved for the clinical domain through continued pretraining with clinical text.

    The generic and domain-specific language models are fine-tuned and evaluated on three representative clinical NLP tasks: (i) identifying protected health information, (ii) assigning ICD-10 diagnosis codes to discharge summaries, and (iii) sentence-level uncertainty prediction. The results show that continued pretraining on in-domain data leads to improved performance on all three downstream tasks, indicating that there is a potential added value of domain-specific language models for clinical NLP.
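
    Continued domain-adaptive pretraining of this kind can be set up with the Hugging Face transformers library roughly as follows. The model id, file paths and hyperparameters are illustrative; this is a sketch rather than the paper's actual training setup.

    ```python
    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    # Continue masked-language-model pretraining of a generic Swedish BERT
    # on in-domain (clinical) text.
    tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
    model = AutoModelForMaskedLM.from_pretrained("KB/bert-base-swedish-cased")

    dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})
    tokenized = dataset["train"].map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="clinical-bert", num_train_epochs=10),
        train_dataset=tokenized,
        data_collator=collator,
    )
    trainer.train()
    ```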

  • The accuracy of fully automated algorithms for surveillance of healthcare-associated urinary tract infections in hospitalized patients

    2021. S. D. van der Werff (et al.). Journal of Hospital Infection 110, 139-147

    Article

Background: Surveillance for healthcare-associated infections such as healthcare-associated urinary tract infections (HA-UTI) is important for directing resources and evaluating interventions. However, traditional surveillance methods are resource-intensive and subject to bias.

    Aim: To develop and validate a fully automated surveillance algorithm for HA-UTI using electronic health record (EHR) data.

Methods: Five algorithms were developed using EHR data from 2979 admissions at Karolinska University Hospital from 2010 to 2011: (1) positive urine culture (UCx); (2) positive UCx + UTI codes (International Statistical Classification of Diseases and Related Health Problems, 10th revision); (3) positive UCx + UTI-specific antibiotics; (4) positive UCx + fever and/or UTI symptoms; (5) algorithm 4 with negation for fever without UTI symptoms. Natural language processing (NLP) was used for processing free-text medical notes. The algorithms were validated in 1258 potential UTI episodes from January to March 2012 and results extrapolated to all UTI episodes within this period (N = 16,712). The reference standard for HA-UTIs was manual record review according to the European Centre for Disease Prevention and Control (and US Centers for Disease Control and Prevention) definitions by trained healthcare personnel.

    Findings: Of the 1258 UTI episodes, 163 fulfilled the ECDC HA-UTI definition and the algorithms classified 391, 150, 189, 194, and 153 UTI episodes, respectively, as HA-UTI. Algorithms 1, 2, and 3 had insufficient performances. Algorithm 4 achieved better performance and algorithm 5 performed best for surveillance purposes with sensitivity 0.667 (95% confidence interval: 0.594-0.733), specificity 0.997 (0.996-0.998), positive predictive value 0.719 (0.624-0.807) and negative predictive value 0.997 (0.996-0.997).

    Conclusion: A fully automated surveillance algorithm based on NLP to find UTI symptoms in free-text had acceptable performance to detect HA-UTI compared to manual record review. Algorithms based on administrative and microbiology data only were not sufficient.

  • HAI-Proactive: Development of an Automated Surveillance System for Healthcare-Associated Infections in Sweden

    2020. Pontus Nauclér (et al.). Infection control and hospital epidemiology 41 (S1), 39-39

    Article

Background: Healthcare-associated infection (HAI) surveillance is essential for most infection prevention programs, and continuous epidemiological data can be used to inform healthcare personnel, allocate resources, and evaluate interventions to prevent HAIs. Many HAI surveillance systems today are based on time-consuming and resource-intensive manual reviews of patient records. The objective of HAI-Proactive, a Swedish triple-helix innovation project, is to develop and implement a fully automated HAI surveillance system based on electronic health record data. Furthermore, the project aims to develop machine-learning-based screening algorithms for early prediction of HAI at the individual patient level.

    Methods: The project is performed with support from Sweden’s Innovation Agency in collaboration among academic, health, and industry partners. Development of rule-based and machine-learning algorithms is performed within a research database, which consists of all electronic health record data from patients admitted to the Karolinska University Hospital. Natural language processing is used for processing free-text medical notes. To validate algorithm performance, manual annotation was performed based on international HAI definitions from the European Centre for Disease Prevention and Control, the Centers for Disease Control and Prevention, and the Sepsis-3 criteria. Currently, the project is building a platform for real-time data access to implement the algorithms within Region Stockholm.

    Results: The project has developed a rule-based surveillance algorithm for sepsis that continuously monitors patients admitted to the hospital, with a sensitivity of 0.89 (95% CI, 0.85–0.93), a specificity of 0.99 (0.98–0.99), a positive predictive value of 0.88 (0.83–0.93), and a negative predictive value of 0.99 (0.98–0.99). The healthcare-associated urinary tract infection surveillance algorithm, which is based on free-text analysis and negations to define symptoms, had a sensitivity of 0.73 (0.66–0.80) and a positive predictive value of 0.68 (0.61–0.75). The sensitivity and positive predictive value of an algorithm based on significant bacterial growth in urine culture only were 0.99 (0.97–1.00) and 0.39 (0.34–0.44), respectively. The surveillance system detected differences in incidence between hospital wards and over time. Development of surveillance algorithms for pneumonia, catheter-related infections and Clostridioides difficile infections, as well as machine-learning-based models for early prediction, is ongoing. We intend to present results from all algorithms.

    Conclusions: With access to electronic health record data, we have shown that it is feasible to develop a fully automated HAI surveillance system based on algorithms using both structured data and free text for the main healthcare-associated infections.

  • Improving Named Entity Recognition and Classification in Class Imbalanced Swedish Electronic Patient Records through Resampling

    2020. Mila Grancharova, Hanna Berg, Hercules Dalianis.

    Conference

A key step in the de-identification of sensitive information in natural language text is the detection and identification of sensitive entities through Named Entity Recognition and Classification (NERC). Natural language data is often class imbalanced in two ways: first, with respect to the majority negative class consisting of all tokens other than named entities, and second, between the different classes of named entities. NERC of class imbalanced data often suffers in recall. This is an issue in de-identification systems, where recall is essential in ensuring protection of sensitive information.

    This study attempts to improve NERC, with a focus on improving recall, in Swedish electronic patient records through resampling strategies involving negative class undersampling and minority class oversampling. The methods are evaluated in two NERC models based on machine learning methods. In both models, an increase in recall is achieved through undersampling, oversampling and combinations thereof. Undersampling, however, has negative effects on precision.
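
    The resampling strategies can be sketched at the sentence level: drop a fraction of sentences with no entities and duplicate sentences containing rare entity classes. A minimal sketch over BIO-encoded data; the keep ratio and duplication factors are illustrative.

    ```python
    import random

    def resample_sentences(sentences, keep_negative=0.5, oversample=None):
        """`sentences` is a list of (tokens, labels) pairs in BIO encoding;
        `oversample` maps an entity class to a duplication factor."""
        oversample = oversample or {}
        out = []
        for tokens, labels in sentences:
            entity_classes = {l.split("-", 1)[1] for l in labels if l != "O"}
            if not entity_classes:
                # Negative-class undersampling: keep only a fraction of all-O sentences.
                if random.random() < keep_negative:
                    out.append((tokens, labels))
                continue
            # Minority-class oversampling: duplicate sentences with rare classes.
            copies = max(oversample.get(c, 1) for c in entity_classes)
            out.extend([(tokens, labels)] * copies)
        return out
    ```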

  • The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text

    2020. Hanna Berg, Aron Henriksson, Hercules Dalianis. The 11th International Workshop on Health Text Mining and Information Analysis LOUHI 2020, 1-11

    Conference

    The impact of de-identification on data quality and, in particular, utility for developing models for downstream tasks has been more thoroughly studied for structured data than for unstructured text. While previous studies indicate that text de-identification has a limited impact on models for downstream tasks, it remains unclear what the impact is with various levels and forms of de-identification, in particular concerning the trade-off between precision and recall. In this paper, the impact of de-identification is studied on downstream named entity recognition in Swedish clinical text. The results indicate that de-identification models with moderate to high precision lead to similar downstream performance, while low precision has a substantial negative impact. Furthermore, different strategies for concealing sensitive information affect performance to different degrees, ranging from pseudonymisation having a low impact to the removal of entire sentences with sensitive information having a high impact. This study indicates that it is possible to increase the recall of models for identifying sensitive information without negatively affecting the use of de-identified text data for training models for clinical named entity recognition; however, there is ultimately a trade-off between the level of de-identification and the subsequent utility of the data.

  • Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep Learning

    2019. Hanna Berg, Hercules Dalianis. Proceedings of the Workshop on NLP and Pseudonymisation, 8-15

    Conference

    Electronic patient records are produced in abundance every day and there is a demand to use them for research or management purposes. The records, however, contain information in the free text that can identify the patient and therefore tools are needed to identify this sensitive information. The aim is to compare two machine learning algorithms, Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF) applied to a Swedish clinical data set annotated for de-identification. The results show that CRF performs better than deep learning with LSTM, with CRF giving the best results with an F1 score of 0.91 when adding more data from within the same domain. Adding general open data did, on the other hand, not improve the results.

  • Knowledge patterns for online health portal development

    2019. Andrea Andrenucci, Hercules Dalianis, Sumithra Velupillai. Health Informatics Journal 25 (4), 1779-1799

    Article

    This article describes the development and evaluation of a set of knowledge patterns that provide guidelines and implications of design for developers of mental health portals. The knowledge patterns were based on three foundations: 1) Knowledge integration of language technology approaches; 2) Experiments with language technology applications and 3) User studies of portal interaction. A mixed-methods approach was employed for the evaluation of the knowledge patterns: formative workshops with knowledge pattern experts and summative surveys with experts in specific domains. The formative evaluation improved the cohesion of the patterns. The results of the summative evaluation showed that the problems discussed in the patterns were relevant for the domain and that the knowledge embedded was useful to solve them. Ten patterns out of thirteen achieved an average score above 4.0, which is a positive result that leads us to conclude that they can be used as guidelines for developing health portals.

Slutrapport KVALPA: Vilka KVaLitetsindikatorer i PAtientjournalens fria text behövs för att kunna mäta kvalitén på vården? Skapandet av en automatisk metod genom maskininlärning [Final report KVALPA: Which quality indicators in the free text of the patient record are needed to measure the quality of care? Creating an automatic method through machine learning]

    2019. Hercules Dalianis.

    Report

This is a pilot study on automatically finding quality indicators in the free text of electronic patient records from Karolinska University Hospital. The quality indicators studied concern urinary tract infections, sepsis, fall injuries, pressure ulcers, nutrition and adverse drug events. An interview study was carried out to understand the problem area, and a rule-based system was implemented in the programming language Python. The system, called KVALPA, uses trigger words and was applied to 100 patient records from five different clinical units. 102 quality indicators were found, of which 26 were negated, and further indicators were found through manual analysis. The negated indicators show that indicators of poor quality are lacking, except in the case of nutrition. Future developments include extending the trigger list with automatically generated synonyms, but also annotating a gold standard that can be used to evaluate the precision and recall of the system.

  • Clinical Natural Language Processing in languages other than English: opportunities and challenges

    2018. Aurélie Névéol (et al.). Journal of Biomedical Semantics 9

    Article

    Background: Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main Body: We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion: We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.

  • Detecting hospital-acquired infections: A document classification approach using support vector machines and gradient tree boosting

    2018. Claudia Ehrentraut (et al.). Health Informatics Journal 24 (1), 24-42

    Article

Hospital-acquired infections pose a significant risk to patient health, while their surveillance is an additional workload for hospital staff. Our overall aim is to build a surveillance system that reliably detects all patient records that potentially include hospital-acquired infections. This is to reduce the burden of having the hospital staff manually check patient records. This study focuses on the application of text classification using support vector machines and gradient tree boosting to the problem. Support vector machines and gradient tree boosting have never been applied to the problem of detecting hospital-acquired infections in Swedish patient records, and according to our experiments, they lead to encouraging results. The best result is yielded by gradient tree boosting, at 93.7% recall, 79.7% precision and an 85.7% F1 score when using stemming. We can show that simple preprocessing techniques and parameter tuning can lead to high recall (which we aim for in screening patient records) with appropriate precision for this task.

  • Ensembles for clinical entity extraction

    2018. Rebecka Weegar (et al.). Revista de Procesamiento de Lenguaje Natural (SEPLN) (60), 13-20

    Article

Health records are a valuable source of clinical knowledge and Natural Language Processing techniques have previously been applied to the text in health records for a number of applications. Often, a first step in clinical text processing is clinical entity recognition; identifying, for example, drugs, disorders, and body parts in clinical text. However, most of this work has focused on records in English. Therefore, this work aims to improve clinical entity recognition for languages other than English by comparing the same methods on two different languages, specifically by employing ensemble methods. Models were created for Spanish and Swedish health records using SVM, Perceptron, and CRF and four different feature sets, including unsupervised features. Finally, the models were combined in ensembles. Weighted voting was applied according to the models’ individual F-scores. In conclusion, the ensembles improved the overall performance for Spanish and the precision for Swedish.
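
    The F-score-weighted voting step is straightforward to sketch: each model votes for its predicted label for a token, and votes are weighted by the model's development-set F-score. A minimal illustration with assumed inputs.

    ```python
    from collections import defaultdict

    def weighted_vote(predictions, f_scores):
        """Combine per-token label predictions from several models, weighting
        each model's vote by its individual F-score.

        `predictions` maps model name -> list of labels (one per token);
        `f_scores` maps model name -> F-score obtained on development data."""
        n_tokens = len(next(iter(predictions.values())))
        combined = []
        for i in range(n_tokens):
            votes = defaultdict(float)
            for model, labels in predictions.items():
                votes[labels[i]] += f_scores[model]
            combined.append(max(votes, key=votes.get))
        return combined
    ```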

  • Efficient Encoding of Pathology Reports Using Natural Language Processing

    2017. Rebecka Weegar, Jan F. Nygård, Hercules Dalianis. Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 778-783

    Conference

    In this article we present a system that extracts information from pathology reports. The reports are written in Norwegian and contain free text describing prostate biopsies. Currently, these reports are manually coded for research and statistical purposes by trained experts at the Cancer Registry of Norway where the coders extract values for a set of predefined fields that are specific for prostate cancer. The presented system is rule based and achieves an average F-score of 0.91 for the fields Gleason grade, Gleason score, the number of biopsies that contain tumor tissue, and the orientation of the biopsies. The system also identifies reports that contain ambiguity or other content that should be reviewed by an expert. The system shows potential to encode the reports considerably faster, with less resources, and similar high quality to the manual encoding.
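
    A flavour of the rule-based extraction, sketched as a single regular expression for Gleason patterns. The pattern is illustrative only; the published system uses a considerably richer rule set.

    ```python
    import re

    GLEASON_SCORE = re.compile(
        r"gleason\s+score\s*:?\s*(\d)\s*\+\s*(\d)\s*=?\s*(\d{1,2})?",
        re.IGNORECASE)

    def extract_gleason(text: str):
        """Return (primary grade, secondary grade, total score) if present."""
        m = GLEASON_SCORE.search(text)
        if not m:
            return None
        primary, secondary = int(m.group(1)), int(m.group(2))
        total = int(m.group(3)) if m.group(3) else primary + secondary
        return primary, secondary, total

    print(extract_gleason("Gleason score 3 + 4 = 7 i to av seks biopsier."))
    ```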

  • Peripheral Oxygen Saturation Facilitates Assessment of Respiratory Dysfunction in the Sequential Organ Failure Assessment Score With Implications for the Sepsis-3 Criteria

    2022. John Karlsson Valik (et al.). Critical Care Medicine 50 (3), e272-e283

    Article

    OBJECTIVES:

Sequential Organ Failure Assessment score is the basis of the Sepsis-3 criteria and requires arterial blood gas analysis to assess respiratory function. Peripheral oxygen saturation is a noninvasive alternative but is not included in either the Sequential Organ Failure Assessment score or Sepsis-3. We aimed to assess the association between the worst peripheral oxygen saturation during onset of suspected infection and mortality.

    DESIGN:

    Cohort study of hospital admissions from a main cohort and emergency department visits from four external validation cohorts between year 2011 and 2018. Data were collected from electronic health records and prospectively by study investigators.

    SETTING:

    Eight academic and community hospitals in Sweden and Canada.

    PATIENTS:

    Adult patients with suspected infection episodes.

    INTERVENTIONS:

    None.

    MEASUREMENTS AND MAIN RESULTS:

The main cohort included 19,396 episodes (median age, 67.0 [53.0–77.0]; 9,007 [46.4%] women; 1,044 [5.4%] died). The validation cohorts included 10,586 episodes (range of median age, 61.0–76.0; women 42.1–50.2%; mortality 2.3–13.3%). Peripheral oxygen saturation levels 96–95% were not significantly associated with increased mortality in the main or pooled validation cohorts. At peripheral oxygen saturation 94%, the adjusted odds ratio of death was 1.56 (95% CI, 1.10–2.23) in the main cohort and 1.36 (95% CI, 1.00–1.85) in the pooled validation cohorts and increased gradually below this level. Respiratory assessment using peripheral oxygen saturation 94–91% and less than 91% to generate 1 and 2 Sequential Organ Failure Assessment points, respectively, improved the discrimination of the Sequential Organ Failure Assessment score from an area under the receiver operating characteristic curve of 0.75 (95% CI, 0.74–0.77) to 0.78 (95% CI, 0.77–0.80; p < 0.001). The peripheral oxygen saturation/Fio2 ratio had slightly better predictive performance compared with peripheral oxygen saturation alone, but the clinical impact was minor.

    CONCLUSIONS:

    These findings provide evidence for assessing respiratory function with peripheral oxygen saturation in the Sequential Organ Failure Assessment score and the Sepsis-3 criteria. Our data support using peripheral oxygen saturation thresholds 94% and 90% to get 1 and 2 Sequential Organ Failure Assessment respiratory points, respectively. This has important implications primarily for emergency practice, rapid response teams, surveillance, research, and resource-limited settings.
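
    The proposed scoring rule is simple to state in code. A minimal sketch of the respiratory points suggested by the conclusions above (thresholds 94% and 90%):

    ```python
    def spo2_sofa_points(spo2: float) -> int:
        """Respiratory SOFA points from peripheral oxygen saturation, using the
        thresholds the study supports: <=90% gives 2 points, 91-94% gives 1."""
        if spo2 < 91:
            return 2
        if spo2 <= 94:
            return 1
        return 0
    ```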

  • Challenges and opportunities beyond structured data in analysis of electronic health records

    2021. Maryam Tayefi (et al.). Wiley Interdisciplinary Reviews 13 (6)

    Article

Electronic health records (EHR) contain a lot of valuable information about individual patients and the whole population. Besides structured data, unstructured data in EHRs can provide extra, valuable information, but the analytics processes are complex, time-consuming, and often require excessive manual effort. Among unstructured data, clinical text and images are the two most popular and important sources of information. Advanced statistical algorithms in natural language processing, machine learning, deep learning, and radiomics have increasingly been used for analyzing clinical text and images. Although there exist many challenges that have not been fully addressed, which can hinder the use of unstructured data, there are clear opportunities for well-designed diagnosis and decision support tools that efficiently incorporate both structured and unstructured data for extracting useful information and provide better outcomes. However, access to clinical data is still very restricted due to data sensitivity and ethical issues. Data quality is also an important challenge, for which methods that improve data completeness, conformity and plausibility are needed. Further, generalizing and explaining the results of machine learning models are important open challenges for healthcare. A possible solution to improve data quality and accessibility of unstructured data is developing machine learning methods that can generate clinically relevant synthetic data, and accelerating further research on privacy-preserving techniques such as de-identification and pseudonymization of clinical text.

  • On the Contribution of Per-ICD Attention Mechanisms to Classify Health Records in Languages With Fewer Resources than English

2021. Alberto Blanco (et al.). Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 165-172

    Conference

We introduce a multi-label text classifier with per-label attention for the classification of Electronic Health Records according to the International Classification of Diseases. We apply the model on two Electronic Health Records datasets with Discharge Summaries in two languages with fewer resources than English, Spanish and Swedish. Our model leverages the BERT Multilingual model (specifically the Wikipedia-based model, as it has been trained on 104 languages, including Spanish and Swedish, with the largest Wikipedia dumps) to share the language modelling capabilities across the languages. With the per-label attention, the model can compute the relevance of each word from the EHR towards the prediction of each label. For the experimental framework, we apply 157 labels from Chapter XI – Diseases of the Digestive System of the ICD, which makes the attention especially important as the model has to discriminate between similar diseases.

  • Detecting Adverse Drug Events from Swedish Electronic Health Records using Text Mining

    2020. Maria Bampa, Hercules Dalianis. Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020), 1-8

    Conference

Electronic Health Records are a valuable source of patient information which can be leveraged to detect Adverse Drug Events (ADEs) and aid post-market drug surveillance. The overall aim of this study is to scrutinize text written by clinicians in the EHRs and build a model for ADE detection that produces medically relevant predictions. Natural Language Processing techniques are exploited to create important predictors and incorporate them into the learning process. The study focuses on the five most frequent ADE cases found in a Swedish electronic patient record corpus. The results indicate that considering textual features, rather than only structured ones, can improve the classification performance by 15% in some ADE cases. Additionally, variable patient history lengths are incorporated in the models, demonstrating the importance of this choice over using an arbitrary history length. The experimental findings suggest that the clinical text in EHRs includes information that can capture data beyond what is found in a structured format.

  • Natural language processing and machine learning to enable automatic extraction and classification of patients' smoking status from electronic medical records

    2020. Andrea Caccamisi (et al.). Upsala Journal of Medical Sciences 125 (4), 316-324

    Article

    Background: The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available due to its unstructured properties. We applied text mining using machine learning to enable automatic classification of unstructured information on smoking status from Swedish EMR data.

    Methods: Data on patients' smoking status from EMRs were used to develop 32 different predictive models that were trained using Weka, changing sentence frequency, classifier type, tokenization, and attribute selection in a database of 85,000 classified sentences. The models were evaluated using F-score and accuracy based on out-of-sample test data including 8500 sentences. The error weight matrix was used to select the best model, assigning a weight to each type of misclassification and applying it to the model confusion matrices. The best performing model was then compared to a rule-based method.

    Results: The best performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a combination of unigrams and bigrams as tokens. Sentence frequency and attributes selection did not improve model performance. SMO achieved 98.14% accuracy and 0.981 F-score versus 79.32% and 0.756 for the rule-based model.

    Conclusion: A model using machine-learning algorithms to automatically classify patients' smoking status was successfully developed. Such algorithms may enable automatic assessment of smoking status and other unstructured data directly from EMRs without manual classification of complete case notes.
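
    The error-weight-matrix selection step works by penalising each type of misclassification differently and summing over the confusion matrix. A small sketch with an invented three-class example; the actual weights used in the study are not reproduced here.

    ```python
    import numpy as np

    def weighted_cost(confusion: np.ndarray, weights: np.ndarray) -> float:
        """Apply an error-weight matrix to a confusion matrix: each kind of
        misclassification gets its own penalty, and the model with the
        lowest total weighted cost is selected."""
        return float((confusion * weights).sum())

    # Invented 3-class example (smoker / non-smoker / unknown): confusing a
    # smoker with a non-smoker is penalised harder than predicting 'unknown'.
    weights = np.array([[0, 3, 1],
                        [3, 0, 1],
                        [1, 1, 0]])
    confusions = {"SMO": np.array([[90, 4, 6], [3, 85, 12], [5, 8, 87]]),
                  "rules": np.array([[70, 20, 10], [15, 70, 15], [10, 10, 80]])}
    best = min(confusions, key=lambda m: weighted_cost(confusions[m], weights))
    print(best)
    ```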

  • Validation of automated sepsis surveillance based on the Sepsis-3 clinical criteria against physician record review in a general hospital population: observational study using electronic health records data

    2020. John Karlsson Valik (et al.). BMJ Quality and Safety 29 (9), 735-745

    Article

    Background Surveillance of sepsis incidence is important for directing resources and evaluating quality-of-care interventions. The aim was to develop and validate a fully-automated Sepsis-3 based surveillance system in non-intensive care wards using electronic health record (EHR) data, and demonstrate utility by determining the burden of hospital-onset sepsis and variations between wards.

Methods A rule-based algorithm was developed using EHR data from a cohort of all adult patients admitted at an academic centre between July 2012 and December 2013. Time in intensive care units was censored. To validate algorithm performance, a stratified random sample of 1000 hospital admissions (674 with and 326 without suspected infection) was classified according to the Sepsis-3 clinical criteria (suspected infection defined as having any culture taken and at least two doses of antimicrobials administered, and an increase in Sequential Organ Failure Assessment (SOFA) score by ≥2 points) and the likelihood of infection by physician medical record review.

    Results In total 82 653 hospital admissions were included. The Sepsis-3 clinical criteria determined by physician review were met in 343 of 1000 episodes. Among them, 313 (91%) had possible, probable or definite infection. Based on this reference, the algorithm achieved sensitivity 0.887 (95% CI: 0.799 to 0.964), specificity 0.985 (95% CI: 0.978 to 0.991), positive predictive value 0.881 (95% CI: 0.833 to 0.926) and negative predictive value 0.986 (95% CI: 0.973 to 0.996). When applied to the total cohort taking into account the sampling proportions of those with and without suspected infection, the algorithm identified 8599 (10.4%) sepsis episodes. The burden of hospital-onset sepsis (>48 hour after admission) and related in-hospital mortality varied between wards.

    Conclusions A fully-automated Sepsis-3 based surveillance algorithm using EHR data performed well compared with physician medical record review in non-intensive care wards, and exposed variations in hospital-onset sepsis incidence between wards.
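
    The suspected-infection and SOFA components of the algorithm combine into a simple boolean rule. A minimal sketch of the criteria as stated above; the real algorithm operates on time-stamped EHR events and is considerably more involved.

    ```python
    def sepsis3_episode(cultures_taken: bool, antimicrobial_doses: int,
                        sofa_increase: int) -> bool:
        """Sepsis-3 check as described above: suspected infection (any culture
        taken and at least two antimicrobial doses administered) combined with
        an acute increase in SOFA score of 2 or more points."""
        suspected_infection = cultures_taken and antimicrobial_doses >= 2
        return suspected_infection and sofa_increase >= 2
    ```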

  • Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text

2019. Hanna Berg, Taridzo Chomutare, Hercules Dalianis. Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 118-125

    Conference

This article presents experiments with pseudonymised Swedish clinical text used as training data to de-identify real clinical text, with the future aim of transferring non-sensitive training data to other hospitals. Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) machine learning algorithms were used to train de-identification models. The two models were trained on pseudonymised data and evaluated on real data. For benchmarking, models were also trained and evaluated on real data, as well as trained and evaluated on pseudonymised data. CRF showed better performance for some PHI classes, such as Date Part, First Name and Last Name, consistent with some reports in the literature. In contrast, poor performance on Location and Health Care Unit information was noted, partially due to the constrained vocabulary in the pseudonymised training data. It is concluded that it is possible to train transferable models based on pseudonymised Swedish clinical data, but that even small narrative and distributional variation could negatively impact performance.

  • Pseudonymisation of Swedish Electronic Patient Records Using a Rule-based Approach

    2019. Hercules Dalianis. Proceedings of the Workshop on NLP and Pseudonymisation, 16-23

    Conference

This study describes a rule-based pseudonymisation system for Swedish clinical text and its evaluation. The pseudonymisation system replaces already tagged Protected Health Information (PHI) with realistic surrogates. There are eight types of manually annotated PHI in the electronic patient records, including personal first and last names, phone numbers, locations, dates, ages and healthcare units. Two evaluators, both computer scientists, one junior and one senior, judged for each record in a set of 98 electronic patient records whether it was pseudonymised or not. Only 3.5 percent of the records were correctly judged as pseudonymised and 1.5 percent of the real records were wrongly judged as pseudonymised, meaning that on average 91 percent of the pseudonymised records were judged as real.

  • Special Issue of Selected Contributions from the Seventh Swedish Language Technology Conference (SLTC 2018)

2019.

    Conference

This Special Issue contains three papers that are extended versions of abstracts presented at the Seventh Swedish Language Technology Conference (SLTC 2018), held at Stockholm University 8–9 November 2018. SLTC 2018 received 34 submissions, of which 31 were accepted for presentation. The number of registered participants was 113, including attendees at both SLTC 2018 and two co-located workshops that took place on 7 November. 32 participants were internationally affiliated, of which 14 were from outside the Nordic countries. Overall participation was thus on a par with previous editions of SLTC, but international participation was higher.

  • Clinical Text Mining: Secondary Use of Electronic Patient Records

    2018. Hercules Dalianis.

    Book

Patient records are written by the physician during the treatment of the patient, for mnemonic reasons and for internal use within the clinical unit, but also for legal reasons. Today a very large number of patient records are produced in the healthcare system. The records are mostly in electronic form and are written by health personnel. They describe initial symptoms, diagnosis, treatment and outcomes of the treatment, but they may also contain nursing narratives or daily notes. In addition, patient records contain valuable structured information such as laboratory results, blood tests and drugs. These records are seldom reused, most likely because of ignorance, but also due to a lack of tools to process them adequately, and, last but not least, because there are ethical policies that make the records difficult to use for research and for developing tools for physicians and researchers. There is a plethora of reasons to unlock and reuse the content of electronic patient records, since they contain valuable information about a vast number of patients who have been treated by highly skilled physicians and taken care of by well-trained and experienced nurses. Over time a massive amount of patient record data is accumulated, in which old knowledge can be confirmed and new knowledge can be obtained.

    This book was written because there was a lack of a textbook describing the area of clinical text mining. The healthcare domain is complex and can be difficult to apprehend, with its many specialised disciplines, and applying text mining and natural language processing to health records needs special care and an understanding of the domain. This book will help the reader to quickly and easily understand the healthcare domain. Some issues that are treated in this book are: What are the problems in clinical text mining and what are their solutions? Which are the coding and classification systems in the healthcare domain, what do they actually contain and how are they used? How do physicians reason to make a diagnosis? What is their typical jargon when writing in the patient record, and does jargon differ between medical specialities? The book gives the reader background knowledge on the research front in clinical text mining and health informatics, and specifically in healthcare analytics. It is valuable for a researcher or a student who needs to learn the clinical research area in a fast and efficient way, and it is also a valuable resource for targeting a new natural language in the domain, since each additional language adds a piece to the whole equation.

    The experiences described in this book originate mainly from research that utilised over two million Swedish hospital records from the Karolinska University Hospital during the years 2007–2014. The general aim was to build basic tools for clinical text mining for Swedish patient records and to address specific issues. These tools were used to automatically:

    • detect and predict healthcare-associated infections;
    • find adverse (drug) events; and
    • detect early symptoms of cancer.

    To accomplish this, the text in the patient records was manually annotated by physicians, and different machine learning tools were then trained on these annotated texts to simulate the physicians’ skills, knowledge and intelligence. The book is also based on the extensive scientific literature from the large research community in clinical text mining, which has been compiled and explained here. It also describes how to get access to patient records, the ethical problems involved, how to de-identify the patient records automatically before using them, and, finally, methods for building tools that will improve healthcare.

    The research questions of this 10-year research project are manifold, starting with the general question: Using artificial intelligence to analyse patient records, is it possible and will it improve healthcare? This can be distilled into several research questions, of which one is of special interest: Can one process clinical text written in Swedish with natural language processing tools developed for standard Swedish, such as newspaper and web texts, to extract named entities such as symptoms, diagnoses, drugs and body parts from clinical text? This major issue can then be subdivided into the following questions:

    • Can one decide the factuality of a diagnosis found in a clinical text? What do "Pneumonia?", "Angina pectoris cannot be excluded" or just "No signs of pneumonia" really mean?
    • Can one determine the temporal order of clinical events? Have the symptoms occurred a week ago or two years ago?
    • Can new adverse drug effects be found by extracting relations between drug intake and adverse drug effects?
    • How much clinical text must be annotated manually to obtain correct and useful results?
    • How can patient privacy be maintained while carrying out research in clinical text mining?

    Read more about Clinical Text Mining
  • Negation detection in Norwegian medical text: Porting a Swedish NegEx to Norwegian. Work in progress

    2018. Andrius Budrionis (et al.).

    Conference

    This paper presents an initial effort in developing a negation detection algorithm for Norwegian clinical text. An evaluated version of NegEx for Swedish was extended to support Norwegian clinical text by translating the negation triggers, adding more negation rules, and using a pre-processed Norwegian ICD-10 diagnosis code list to detect symptoms and diagnoses. Due to limited access to Norwegian clinical text, the Norwegian NegEx was tested on Norwegian medical scientific text. NegEx found 70 negated symptoms/diagnoses in a text collection comprising 170 publications in the medical domain. The results are not completely evaluated due to the lack of a gold standard. Some challenging erroneous tokenizations of Norwegian words were found, in addition to the need for improved preprocessing and matching techniques for the Norwegian ICD-10 code list. This work pointed out the weaknesses of the current implementation and provided insights for future work.
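
    The trigger-based mechanism of the NegEx family can be sketched in a few lines of Python. This is a minimal illustration only: the tiny Norwegian trigger list and the term set below are invented stand-ins for the translated trigger lexicon and the ICD-10-derived term list described above.

        import re

        # Hypothetical Norwegian negation triggers; the real ported lexicon is larger.
        NEGATION_TRIGGERS = {"ikke", "ingen", "uten"}
        SCOPE = 5  # NegEx-style limit on how far after a trigger negation reaches

        def negated_terms(sentence, terms):
            """Return the terms that fall within the scope of a negation trigger."""
            tokens = re.findall(r"\w+", sentence.lower())
            hits = set()
            for i, tok in enumerate(tokens):
                if tok in NEGATION_TRIGGERS:
                    hits |= set(tokens[i + 1 : i + 1 + SCOPE]) & terms
            return hits

        print(negated_terms("Pasienten har ikke pneumoni.", {"pneumoni"}))  # {'pneumoni'}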

    Read more about Negation detection in Norwegian medical text
  • Detecting Protected Health Information in Heterogeneous Clinical Notes

    2017. Aron Henriksson, Maria Kvist, Hercules Dalianis. MEDINFO 2017, 393-397

    Conference

    To enable secondary use of healthcare data in a privacy-preserving manner, there is a need for methods capable of automatically identifying protected health information (PHI) in clinical text. To that end, learning predictive models from labeled examples has emerged as a promising alternative to rule-based systems. However, little is known about differences with respect to PHI prevalence in different types of clinical notes and how potential domain differences may affect the performance of predictive models trained on one particular type of note and applied to another. In this study, we analyze the performance of a predictive model trained on an existing PHI corpus of Swedish clinical notes and applied to a variety of clinical notes: written (i) in different clinical specialties, (ii) under different headings, and (iii) by persons in different professions. The results indicate that domain adaptation is needed for effective detection of PHI in heterogeneous clinical notes.

    Read more about Detecting Protected Health Information in Heterogeneous Clinical Notes
  • Prevalence Estimation of Protected Health Information in Swedish Clinical Text

    2017. Aron Henriksson, Maria Kvist, Hercules Dalianis. Informatics for Health, 216-220

    Conference

    Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text and whether there is a need for domain adaptation when learning predictive models from one particular domain and applying them to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. It is demonstrated that the overall PHI density is 1.57%; however, substantial differences exist across domains.

    Read more about Prevalence Estimation of Protected Health Information in Swedish Clinical Text
  • Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora

    2017. Alicia Pérez (et al.). Journal of Biomedical Informatics 71, 16-30

    Article

    Objective: The goal of this study is to investigate entity recognition within Electronic Health Records (EHRs) focusing on Spanish and Swedish. Of particular importance is a robust representation of the entities. In our case, we utilized unsupervised methods to generate such representations. Methods: The significance of this work rests on its experimental layout. The experiments were carried out under the same conditions for both languages. Several classification approaches were explored: maximum probability, CRF, Perceptron and SVM. The classifiers were enhanced by means of ensembles of semantic spaces and ensembles of Brown trees. To mitigate data sparsity without a significant increase in the dimension of the decision space, we propose clustered approaches: hierarchical Brown clustering represented by trees, and vector quantization for each semantic space. Results: The results showed that the semi-supervised approaches significantly improved standard supervised techniques for both languages. Moreover, clustering the semantic spaces contributed to the quality of the entity recognition while keeping the dimension of the feature space two orders of magnitude lower than when directly using the semantic spaces. Conclusions: The contributions of this study are: (a) a set of thorough experiments that enable comparisons regarding the influence of different types of features on different classifiers, exploring two languages other than English; and (b) the use of ensembles of clusters of Brown trees and semantic spaces on EHRs to tackle the problem of scarcity of available annotated data.
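
    As a concrete illustration of how clustered representations can enter a classifier, the sketch below derives features from Brown-cluster bit-string prefixes at several depths. The cluster paths are invented stand-ins, not the clusters induced in the study, which additionally used vector-quantized semantic spaces.

        # Hypothetical Brown-cluster bit-string paths, not induced from a clinical corpus.
        BROWN_PATHS = {"pneumoni": "0010110", "feber": "0010111", "waran": "1100101"}

        def brown_features(word):
            """Return cluster-path prefixes of several lengths as classifier features."""
            path = BROWN_PATHS.get(word.lower(), "")
            return {f"brown_{k}": path[:k] for k in (2, 4, 6) if len(path) >= k}

        print(brown_features("pneumoni"))
        # {'brown_2': '00', 'brown_4': '0010', 'brown_6': '001011'}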

    Read more about Semi-supervised medical entity recognition
  • Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections

    2016. Olof Jacobson, Hercules Dalianis. Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 191-195

    Conference

    Detecting healthcare-associated infections poses a major challenge in healthcare. Applying natural language processing and machine learning to electronic patient records is one approach that has been shown to work. However, the results indicated that there was room for improvement, and we have therefore applied deep learning methods. Specifically, we implemented a network of stacked sparse autoencoders and a network of stacked restricted Boltzmann machines. Our best results were obtained using the stacked restricted Boltzmann machines, with a precision of 0.79 and a recall of 0.88.
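
    A rough idea of a stacked restricted Boltzmann machine setup can be given with scikit-learn's BernoulliRBM, here pre-training two layers in front of a logistic regression head on synthetic binary data. This is an illustrative sketch of the general architecture, not the network or data of the paper.

        import numpy as np
        from sklearn.neural_network import BernoulliRBM
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        rng = np.random.default_rng(0)
        X = (rng.random((300, 100)) > 0.7).astype(float)  # toy binary feature vectors
        y = rng.integers(0, 2, size=300)                  # toy labels

        model = make_pipeline(
            BernoulliRBM(n_components=64, n_iter=20, random_state=0),  # first RBM layer
            BernoulliRBM(n_components=32, n_iter=20, random_state=0),  # second RBM layer
            LogisticRegression(max_iter=500),                          # supervised head
        )
        model.fit(X, y)
        print(model.predict(X[:5]))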

    Read more about Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections
  • Development and Enhancement of a Stemmer for the Greek Language

    2016. Georgios Ntais (et al.). Proceedings of the 20th Pan-Hellenic Conference on Informatics

    Conference

    Although three stemmers have been published for the Greek language, only the one presented in this paper, called Ntais’ stemmer, is freely open and available, together with its enhancements and extensions according to Saroukos’ algorithm. The primary algorithm (Ntais’ algorithm) uses only capital letters and performs better than previous stemming algorithms for the Greek language, giving 92.1 percent correct results. Further extensions of the proposed stemming system (e.g. from capital to small letters) and more evaluation methods are presented according to a new and improved algorithm, Saroukos’ algorithm. Stemmer performance metrics are further used for evaluating the existing stemming system and algorithm, and show how its accuracy and completeness are enhanced. The improvements were made possible by providing an alternative implementation in the programming language PHP, which allowed more syntactical rules and exceptions to be covered. The two versions of the stemming algorithm are tested and compared.
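
    The flavour of a rule-based, capital-letter Greek stemmer can be shown with a short suffix-stripping sketch. The suffix table below is purely illustrative and does not reproduce the rule sets of Ntais’ or Saroukos’ algorithms.

        # Illustrative suffix table only; not the actual rules of the published stemmers.
        SUFFIXES = ["ΜΑΤΩΝ", "ΜΑΤΑ", "ΟΥΣ", "ΕΙΣ", "ΩΝ", "ΕΣ", "ΟΣ", "Α"]

        def stem(word, min_stem=3):
            """Strip the longest matching suffix while keeping a minimum stem length."""
            for suffix in sorted(SUFFIXES, key=len, reverse=True):
                if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                    return word[: -len(suffix)]
            return word

        print(stem("ΓΡΑΜΜΑΤΑ"))  # ΓΡΑΜ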

    Read more about Development and Enhancement of a Stemmer for the Greek Language
  • Ensembles of randomized trees using diverse distributed representations of clinical events

    2016. Aron Henriksson (et al.). BMC Medical Informatics and Decision Making 16

    Article

    Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling.

    Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size.

    Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases.

    Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.
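
    The core idea, letting each tree in the forest see the full feature set represented through one randomly drawn semantic space, can be sketched as follows. The random projection matrices stand in for real distributed representations, the data is synthetic, and a single data type is used instead of the four in the paper.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        n_records, vocab, dim = 200, 50, 16
        X_counts = rng.integers(0, 3, size=(n_records, vocab)).astype(float)  # toy counts
        y = rng.integers(0, 2, size=n_records)
        # Hypothetical ensemble of semantic spaces, one per context window size.
        spaces = {w: rng.normal(size=(vocab, dim)) for w in (2, 4, 8, 16)}

        forest = []
        for _ in range(25):
            boot = rng.integers(0, n_records, size=n_records)  # bootstrap replicate
            w = int(rng.choice(list(spaces)))                  # randomly drawn space
            tree = DecisionTreeClassifier(max_features="sqrt")
            tree.fit(X_counts[boot] @ spaces[w], y[boot])      # all features, one space
            forest.append((w, tree))

        # Majority vote; each tree sees the data through its own semantic space.
        votes = np.mean([t.predict(X_counts @ spaces[w]) for w, t in forest], axis=0)
        predictions = (votes >= 0.5).astype(int)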

    Read more about Ensembles of randomized trees using diverse distributed representations of clinical events
  • Minimum data set and standards in the context of nosocomial infection surveillance and antimicrobial stewardship

    2016. Thomas Karopka (et al.).

    Conference

    Antimicrobial resistance (AMR), i.e., the ability of microbes such as bacteria, viruses, fungi and parasites to resist the actions of one or more antimicrobial drugs or agents, is a serious global threat. Bacterial antibiotic resistance poses the largest threat to public health. The prevention of antimicrobial infections and their spread relies heavily on infection control management, and requires urgent, coordinated action by many stakeholders. This is especially true for nosocomial infections, also known as healthcare-associated infections (HAIs), i.e., infections that are acquired in healthcare settings. It is known that continuous, systematic collection, analysis and interpretation of data relevant to nosocomial infections and feedback for the use by doctors and nurses can reduce the frequency of these infections. Data from one hospital are more valid and more effective when they are compared with those from other hospitals. In order to avoid false conclusions, comparisons are only possible when identical methods of data collection with fixed diagnostic definitions are used. The automatic aggregation of standardized data using data from electronic medical records (EMRs), lab data, surveillance data and data on antibiotic use would greatly enhance comparison and computerized decision support systems (CDSSs). Once standardized, data can be aggregated from unit to institutional, regional, national and EU level, analysed and fed back to enhance local decision support on antibiotic use and detection of nosocomial infections.

    Read more about Minimum data set and standards in the context of nosocomial infection surveillance and antimicrobial stewardship
  • PRM92 - Automatic Extraction and Classification of Patients’ Smoking Status from Free Text Using Natural Language Processing

    2016. Andrea Caccamisi (et al.). Value in Health 19 (7)

    Article

    Objectives

    To develop a machine learning algorithm for automatic classification of smoking status (smoker, ex-smoker, non-smoker and unknown status) in EMRs, and validate the predictive accuracy compared to a rule-based method. Smoking is a leading cause of death worldwide and may introduce confounding in research based on real world data (RWD). Information on smoking is often documented in free text fields in Electronic Medical Records (EMRs), but structured RWD on smoking is sparse.

    Methods

    32 predictive models were trained with the Weka machine learning suite, tweaking sentence frequency, classifier type, tokenization and attribute selection, using a database of 85,000 classified sentences. The models were evaluated using F-Score and Accuracy based on out-of-sample test data including 8,500 sentences. The error weight matrix was used to select the best model, assigning a weight to each type of misclassification and applying it to the models’ confusion matrices.

    Results

    The best performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a polynomial kernel with parameter C equal to 6 and a combination of unigrams and bigrams as tokens. Sentence frequency and attribute selection did not improve model performance. SMO achieved 98.25% accuracy and 0.982 F-Score versus 79.32% and 0.756, respectively, for the rule-based model.

    Conclusions

    A model using machine learning algorithms to automatically classify patients’ smoking status was successfully developed. This algorithm would enable automatic assessment of smoking status directly from EMRs, obviating the need to extract complete case notes and classify them manually.
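
    The winning configuration translates naturally into a few lines of scikit-learn, used here in place of Weka's SMO implementation. The four example sentences are invented; the real model was trained on 85,000 classified sentences.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        # Invented training sentences, one per smoking-status class.
        sentences = ["patient smokes daily", "quit smoking in 2010",
                     "has never smoked", "no information on tobacco use"]
        labels = ["smoker", "ex-smoker", "non-smoker", "unknown"]

        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams as tokens
            SVC(kernel="poly", C=6.0),            # polynomial kernel with C = 6
        )
        model.fit(sentences, labels)
        print(model.predict(["never smoked according to the notes"]))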

    Read more about PRM92 - Automatic Extraction and Classification of Patients’ Smoking Status from Free Text Using Natural Language Processing
  • Pathology text mining - on Norwegian prostate cancer reports

    2016. Anders Dahl, Atilla Özkan, Hercules Dalianis. 2016 IEEE 32nd International Conference on Data Engineering Workshops (ICDEW), 84-87

    Conference

    Pathology reports are written by pathologists, skilled physicians who know how to interpret disorders in various tissue samples from the human body. To obtain valuable statistics on the outcome of disorders, for example cancer, and on the effect of treatment, cancer pathology reports are interpreted and coded into databases at cancer registries. In Norway this task is carried out by 25 different human coders at the Cancer Registry of Norway (Kreftregisteret), and there is a need to automate this process. The authors of this article received 25 prostate cancer pathology reports written in Norwegian from the Cancer Registry of Norway, each documenting various stages of prostate cancer together with the corresponding correct manual coding. A rule-based algorithm that processed the reports was produced in order to prototype automation, and its output was compared to the output of the manual coding. The evaluation showed an average F-score of 0.94 on four of the data points, namely Total Malign, Primary Gleason, Secondary Gleason and Total Gleason, and a lower result, with an average F-score of 0.76, on all ten data points. The results are in line with previous research.
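
    A rule in such a system can be as simple as a regular expression over the report text. The pattern and the snippet below are illustrative only, not the actual rules used for the Norwegian reports.

        import re

        text = "Gleason score 3 + 4 = 7"  # invented report snippet
        m = re.search(r"gleason\s*(?:score)?\s*(\d)\s*\+\s*(\d)\s*=?\s*(\d+)?",
                      text, re.IGNORECASE)
        if m:
            primary, secondary = int(m.group(1)), int(m.group(2))
            total = int(m.group(3)) if m.group(3) else primary + secondary
            print({"Primary Gleason": primary, "Secondary Gleason": secondary,
                   "Total Gleason": total})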

    Read more about Pathology text mining - on Norwegian prostate cancer reports
  • Adverse drug event classification of health records using dictionary-based pre-processing and machine learning

    2015. Stefanie Friedrich, Hercules Dalianis. Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis, 121-130

    Conference

    A method to find adverse drug reactions in electronic health records written in Swedish is presented. A total of 14,751 health records were manually classified into four groups. The records are normalised by pre-processing using both dictionaries and manually created word lists. Three different supervised machine learning algorithms were used to find the best results: decision tree, random forest and LibSVM. The best performance on a test dataset was obtained with LibSVM, with a precision of 0.69, a recall of 0.66 and an F-score of 0.67. Our method found 865 of 981 true positives (88.2%) in a 3-class dataset, which is an improvement of 49.5% over previous approaches.

    Read more about Adverse drug event classification of health records using dictionary-based pre-processing and machine learning
  • Creating a rule based system for text mining of Norwegian breast cancer pathology reports

    2015. Rebecka Weegar, Hercules Dalianis. Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis, 73-78

    Conference

    National cancer registries collect cancer related information from multiple sources and make it available for research. Part of this information originates from pathology reports, and in this pre-study the possibility of a system for automatic extraction of information from Norwegian pathology reports is investigated. A set of 40 pathology reports describing breast cancer tissue samples has been used to develop a rule based system for information extraction. To validate the performance of this system its output has been compared to the data produced by experts doing manual encoding of the same pathology reports. On average, a precision of 80%, a recall of 98% and an F-score of 86% have been achieved, showing that such a system is indeed feasible.

    Read more about Creating a rule based system for text mining of Norwegian breast cancer pathology reports
  • Exploration of known and unknown early symptoms of cervical cancer and development of a symptom spectrum: Outline of a data and text mining based approach

    2015. Claudia Ehrentraut, Karin Sundström, Hercules Dalianis. Proceedings of the CAiSE-2015 Industry Track, 34-44

    Conference

    This position paper outlines a set of experiments to detect early symptoms of cervical cancer. We use a large corpus of electronic patient record text in Swedish from Karolinska University Hospital from the years 2009–2010, from which we extracted in total 1,660 patients with the diagnosis code C53. We used a named entity recogniser called the Clinical Entity Finder to detect the diagnoses and symptoms expressed in these clinical texts, which contain in total 2,988,118 words. We found 28,218 symptoms and diagnoses for these 1,660 patients. We present some initial findings, discuss them, and propose a set of experiments to find possible early symptoms, or at least a spectrum or fingerprint of early symptoms, of cervical cancer.

    Read more about Exploration of known and unknown early symptoms of cervical cancer and development of a symptom spectrum
  • Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx

    2015. Rebecka Weegar (et al.). AMIA Annual Symposium Proceedings, 1296-1305

    Conference

    Detection of early symptoms in cervical cancer is crucial for early treatment and survival. To find symptoms of cervical cancer in clinical text, Named Entity Recognition is needed. In this paper the Clinical Entity Finder, a machine-learning tool trained on annotated clinical text from a Swedish internal medicine emergency unit, is evaluated on cervical cancer records. The Clinical Entity Finder identifies entities of the types body part, finding and disorder and is extended with negation detection using the rule-based tool NegEx, to distinguish between negated and non-negated entities. To measure the performance of the tools on this new domain, two physicians annotated a set of clinical notes from the health records of cervical cancer patients. The inter-annotator agreement for finding, disorder and body part obtained an average F-score of 0.677 and the Clinical Entity Finder extended with NegEx had an average F-score of 0.667.

    Read more about Finding Cervical Cancer Symptoms in Swedish Clinical Text using a Machine Learning Approach and NegEx
  • HEALTH BANK - A Workbench for Data Science Applications in Healthcare

    2015. Hercules Dalianis (et al.). Industry Track Workshop, 1-18

    Conference

    The enormous amounts of data that are generated in the healthcare process and stored in electronic health record (EHR) systems are an underutilized resource that, with the use of data science applications, can be exploited to improve healthcare. To foster the development and use of data science applications in healthcare, there is a fundamental need for access to EHR data, which is typically not readily available to researchers and developers. A relatively rare exception is the large EHR database, the Stockholm EPR Corpus, comprising data from more than two million patients, which has been made available to a limited group of researchers at Stockholm University. Here, we describe a number of data science applications that have been developed using this database, demonstrating the potential reuse of EHR data to support healthcare and public health activities, as well as to facilitate medical research. However, in order to realize the full potential of this resource, it needs to be made available to a larger community of researchers, as well as to industry actors. To that end, we envision the provision of an infrastructure around this database called HEALTH BANK – the Swedish Health Record Research Bank. It will function both as a workbench for the development of data science applications and as a data exploration tool, allowing epidemiologists, pharmacologists and other medical researchers to generate and evaluate hypotheses. Aggregated data will be fed into a pipeline for open e-access, while non-aggregated data will be provided to researchers within an ethical permission framework. We believe that HEALTH BANK has the potential to promote a growing industry around the development of data science applications that will ultimately increase the efficiency and effectiveness of healthcare.

    Read more about HEALTH BANK - A Workbench for Data Science Applications in Healthcare
  • Identifying adverse drug event information in clinical notes with distributional semantic representations of context

    2015. Aron Henriksson (et al.). Journal of Biomedical Informatics 57, 333-349

    Article

    For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics – i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words – and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.

    Read more about Identifying adverse drug event information in clinical notes with distributional semantic representations of context
  • Louhi 2014: Special issue on health text mining and information analysis: introduction

    2015. Sumithra Velupillai (et al.). BMC Medical Informatics and Decision Making 2 (SI), 1-3

    Article
    Read more about Louhi 2014
  • Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection

    2015. Aron Henriksson (et al.). 2015 IEEE International Conference on Bioinformatics and Biomedicine, 343-350

    Conference

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.

    Read more about Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection
  • Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection

    2015. Aron Henriksson (et al.). 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

    Conference

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach not only reduces sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.

    Read more about Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection
  • Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis

    2015. Sumithra Velupillai (et al.). IMIA Yearbook of Medical Informatics 10 (1), 183-193

    Article

    Objectives

    We present a review of recent advances in clinical Natural Language Processing (NLP), with a focus on semantic analysis and key subtasks that support such analysis.

    Methods

    We conducted a literature review of clinical NLP research from 2008 to 2014, emphasizing recent publications (2012-2014), based on PubMed and ACL proceedings as well as relevant referenced publications from the included papers.

    Results

    Significant articles published within this time-span were included and are discussed from the perspective of semantic analysis. Three key clinical NLP subtasks that enable such analysis were identified: 1) developing more efficient methods for corpus creation (annotation and de-identification), 2) generating building blocks for extracting meaning (morphological, syntactic, and semantic subtasks), and 3) leveraging NLP for clinical utility (NLP applications and infrastructure for clinical use cases). Finally, we provide a reflection upon most recent developments and potential areas of future NLP development and applications.

    Conclusions

    There has been an increase of advances within key NLP subtasks that support semantic analysis. Performance of NLP semantic analysis is, in many cases, close to that of agreement between humans. The creation and release of corpora annotated with complex semantic information models has greatly supported the development of new tools and approaches. Research on non-English languages is continuously growing. NLP methods have sometimes been successfully employed in real-world clinical tasks. However, there is still a gap between the development of advanced resources and their utilization in clinical settings. A plethora of new clinical use cases are emerging due to established health care initiatives and additional patient-generated sources through the extensive use of social media and other devices.

    Read more about Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis
  • Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study

    2014. Maria Skeppstedt (et al.). Journal of Biomedical Informatics 49, 148-158

    Article

    Automatic recognition of clinical entities in the narrative text of health records is useful for constructing applications for documentation of patient care, as well as for secondary usage in the form of medical knowledge extraction. There are a number of named entity recognition studies on English clinical text, but less work has been carried out on clinical text in other languages. This study was performed on Swedish health records, and focused on four entities that are highly relevant for constructing a patient overview and for medical hypothesis generation, namely the entities: Disorder, Finding, Pharmaceutical Drug and Body Structure. The study had two aims: to explore how well named entity recognition methods previously applied to English clinical text perform on similar texts written in Swedish; and to evaluate whether it is meaningful to divide the more general category Medical Problem, which has been used in a number of previous studies, into the two more granular entities, Disorder and Finding. Clinical notes from a Swedish internal medicine emergency unit were annotated for the four selected entity categories, and the inter-annotator agreement between two pairs of annotators was measured, resulting in an average F-score of 0.79 for Disorder, 0.66 for Finding, 0.90 for Pharmaceutical Drug and 0.80 for Body Structure. A subset of the developed corpus was thereafter used for finding suitable features for training a conditional random fields model. Finally, a new model was trained on this subset, using the best features and settings, and its ability to generalise to held-out data was evaluated. This final model obtained an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure and 0.78 for the combined category Disorder + Finding. The obtained results, which are in line with or slightly lower than those for similar studies on English clinical text, many of them conducted using a larger training data set, show that the approaches used for English are also suitable for Swedish clinical text. However, a small proportion of the errors made by the model are less likely to occur in English text, showing that results might be improved by further tailoring the system to clinical Swedish. The entity recognition results for the individual entities Disorder and Finding show that it is meaningful to separate the general category Medical Problem into these two more granular entity types, e.g. for knowledge mining of co-morbidity relations and disorder-finding relations.
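
    A conditional random fields tagger of the kind trained in the study can be set up with the sklearn-crfsuite package. The two toy sentences, the features and the IOB labels below are invented and only show the mechanics, not the feature set or data of the paper.

        import sklearn_crfsuite

        def token_features(tokens, i):
            """A few simple orthographic and context features for token i."""
            tok = tokens[i]
            return {"lower": tok.lower(), "is_title": tok.istitle(),
                    "suffix3": tok[-3:],
                    "prev": tokens[i - 1].lower() if i else "<BOS>"}

        sents = [["Patienten", "har", "pneumoni"], ["Ingen", "feber"]]
        X = [[token_features(s, i) for i in range(len(s))] for s in sents]
        y = [["O", "O", "B-Disorder"], ["O", "B-Finding"]]

        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
        crf.fit(X, y)
        print(crf.predict(X))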

    Read more about Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text
  • Clinical Text Retrieval - An Overview of Basic Building Blocks and Applications

    2014. Hercules Dalianis. Professional search in the modern world, 147-165

    Chapter

    This article describes information retrieval, natural language processing and text mining of electronic patient record text, also called clinical text. Clinical text is written by physicians and nurses to document the health care process of the patient. First we describe some characteristics of clinical text, followed by the automatic preprocessing of the text that is necessary for making it usable for some applications. We also describe some applications for clinicians including spelling and grammar checking, ICD-10 diagnosis code assignment, as well as other applications for hospital management such as ICD-10 diagnosis code validation and detection of adverse events such as hospital acquired infections. Part of the preprocessing makes the clinical text useful for faceted search, although clinical text already has some keys for performing faceted search such as gender, age, ICD-10 diagnosis codes, ATC drug codes, etc. Preprocessing makes use of ICD-10 codes and the SNOMED-CT textual descriptions. ICD-10 codes and SNOMED-CT are available in several languages and can be considered the modern Greek or Latin of medical language. The basic research presented here has its roots in the challenges described by the health care sector. These challenges have been partially solved in academia, and we believe the solutions will be adapted to the health care sector in real world applications.

    Read more about Clinical Text Retrieval - An Overview of Basic Building Blocks and Applications
  • Cue-based assertion classification for Swedish clinical text-Developing a lexicon for pyConTextSwe

    2014. Sumithra Velupillai (et al.). Artificial Intelligence in Medicine 61 (3), 137-144

    Article

    Objective: The ability of a cue-based system to accurately assert whether a disorder is affirmed, negated, or uncertain is dependent, in part, on its cue lexicon. In this paper, we continue our study of porting an assertion system (pyConTextNLP) from English to Swedish (pyConTextSwe) by creating an optimized assertion lexicon for clinical Swedish. Methods and material: We integrated cues from four external lexicons, along with generated inflections and combinations. We used subsets of a clinical corpus in Swedish. We applied four assertion classes (definite existence, probable existence, probable negated existence and definite negated existence) and two binary classes (existence yes/no and uncertainty yes/no) to pyConTextSwe. We compared pyConTextSwe's performance with and without the added cues on a development set, and improved the lexicon further after an error analysis. On a separate evaluation set, we calculated the system's final performance. Results: Following integration steps, we added 454 cues to pyConTextSwe. The optimized lexicon developed after an error analysis resulted in statistically significant improvements on the development set (83%F-score, overall). The system's final F-scores on an evaluation set were 81% (overall). For the individual assertion classes, F-score results were 88% (definite existence), 81% (probable existence), 55% (probable negated existence), and 63% (definite negated existence). For the binary classifications existence yes/no and uncertainty yes/no, final system performance was 97%/87% and 78%/86% F-score, respectively. Conclusions: We have successfully ported pyConTextNLP to Swedish (pyConTextSwe). We have created an extensive and useful assertion lexicon for Swedish clinical text, which could form a valuable resource for similar studies, and which is publicly available.
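
    The cue-based mechanism can be reduced to a few lines for illustration. The miniature Swedish cue sets below are invented stand-ins for the 454-cue pyConTextSwe lexicon, and the real system also handles cue scope, combinations and the remaining assertion classes.

        # Hypothetical cue lists; the published lexicon is far larger.
        NEGATION_CUES = {"ingen", "inga", "utesluts"}
        UNCERTAINTY_CUES = {"möjligen", "misstänkt", "eventuellt"}

        def assert_status(sentence):
            """Assign a coarse assertion class based on the cues present."""
            tokens = set(sentence.lower().split())
            if tokens & NEGATION_CUES:
                return "definite negated existence"
            if tokens & UNCERTAINTY_CUES:
                return "probable existence"
            return "definite existence"

        print(assert_status("Misstänkt pneumoni"))  # probable existence
        print(assert_status("Ingen feber"))         # definite negated existence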

    Read more about Cue-based assertion classification for Swedish clinical text-Developing a lexicon for pyConTextSwe
  • Detecting Healthcare-Associated Infections in Electronic Health Records: Evaluation of Machine Learning and Preprocessing Techniques

    2014. Claudia Ehrentraut (et al.). Proceedings of the 6th International Symposium on Semantic Mining in Biomedicine (SMBM 2014), 3-10

    Conference

    Healthcare-associated infections (HAI) are infections that patients acquire in the course of medical treatment. As they constitute a severe public health problem, detecting and monitoring HAI in healthcare documentation is an important topic to address. Research on automated systems has increased over the past years, but performance is yet to be enhanced. The dataset in this study consists of 214 records obtained from a Point-Prevalence Survey. The records are manually classified into HAI and NoHAI records. Nine different preprocessing steps are carried out on the data. Two learning algorithms, Random Forest (RF) and Support Vector Machines (SVM), are applied to the data. The aim is to determine which of the two algorithms is more applicable to the task and whether preprocessing methods affect the performance. RF obtains the best performance results, yielding an F1-score of 85% and an AUC of 0.85 when lemmatisation is used as a preprocessing technique. Irrespective of which preprocessing method is used, RF yields higher recall values than SVM, with a statistically significant difference for all but one preprocessing method. Regarding each classifier separately, the choice of preprocessing method led to no statistically significant improvement in performance results.

    Read more about Detecting Healthcare-Associated Infections in Electronic Health Records
  • Detection of Spelling Errors in Swedish Clinical Text

    2014. Uddin Nizamuddin, Hercules Dalianis. NorWEST 2014

    Conference

    Spelling errors are common in clinical text because such text is written under time pressure and is mostly used for internal communication. To improve text mining and other types of text processing tools, spelling error detection and correction are needed. In this paper we count spelling errors in Swedish clinical text. The developed algorithm uses word lists for detection, such as a Swedish general dictionary, a medical dictionary and a list of abbreviations. The final algorithm was tested on a Swedish clinical corpus, for which we obtained a spelling error rate of 12 per cent. After an error analysis of the result, it was concluded that many words were flagged by the algorithm due to inadequate word lists and faulty preprocessing such as lemmatization and compound splitting. By manually removing these correct words from the list, the total spelling error rate decreased to 7.6 per cent.
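
    Word-list-based detection amounts to a set lookup, as in the sketch below. The toy word lists stand in for the Swedish general dictionary, medical dictionary and abbreviation list used in the study.

        # Toy word lists; invented stand-ins for the real dictionaries.
        GENERAL = {"patienten", "har", "ont", "i", "halsen"}
        MEDICAL = {"pneumoni", "angina"}
        ABBREVIATIONS = {"pat", "ua"}
        KNOWN = GENERAL | MEDICAL | ABBREVIATIONS

        def spelling_error_rate(tokens):
            """Share of tokens not found in any word list."""
            unknown = [t for t in tokens if t.lower() not in KNOWN]
            return len(unknown) / len(tokens)

        print(spelling_error_rate(["Pat", "har", "pnemoni"]))  # 1/3: only "pnemoni" is unknown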

    Read more about Detection of Spelling Errors in Swedish Clinical Text
  • Didactic Panel: clinical Natural Language Processing in Languages Other Than English

    2014. Hercules Dalianis (et al.). AMIA Annual Symposium 2014, S 84

    Conference

    Natural Language Processing (NLP) of clinical free-text has received a lot of attention from the scientific community. Clinical documents are routinely created across health care providing institutions and are generally written in the official language(s) of the country these institutions are located in. As a result, free-text clinical information is written in a large variety of languages. While most of the efforts for clinical NLP have focused on English, there is a strong need to extend this work to other languages, for instance in order to gain medical information about patient cohorts in geographical areas where English is not an official language. Furthermore, adapting current NLP methods developed for English to other languages may provide useful insight on the generalizability of algorithms and lead to increased robustness. This panel aims to provide an overview of clinical NLP for languages other than English, as for example French, Swedish and Bulgarian and discuss future methodological advances of clinical NLP in a context that encompasses English as well as other languages.

    Read more about Didactic Panel
  • Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records

    2014. Aron Henriksson, Hercules Dalianis, Stewart Kowalski. 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 450-457

    Conference

    Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize Fβ-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record de-identification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F1-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various Fβ-scores, giving some degree of control to trade off precision and recall. Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
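
    The feature-generation step can be illustrated in a few lines of NumPy: the prototype of a PHI class is the mean vector of its annotated members, and a binary feature fires when a word's cosine similarity to that prototype exceeds a threshold. The random vectors below merely stand in for a real random-indexing space.

        import numpy as np

        rng = np.random.default_rng(0)
        # Hypothetical semantic space; real systems build it with random indexing
        # over a large unannotated in-domain corpus.
        space = {w: rng.normal(size=32) for w in ("anna", "erik", "feber", "hosta")}
        annotated_first_names = ["anna", "erik"]  # annotated members of one PHI class

        prototype = np.mean([space[w] for w in annotated_first_names], axis=0)

        def phi_feature(word, threshold=0.2):
            """Binary feature: is `word` close to the first-name prototype?"""
            v = space[word]
            cos = v @ prototype / (np.linalg.norm(v) * np.linalg.norm(prototype))
            return cos > threshold

        print(phi_feature("anna"), phi_feature("feber"))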

    Read more about Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space
  • Text Analysis to support structuring and modelling a public policy problem: Outline of an algorithm to extract inferences from textual data

    2014. Claudia Ehrentraut, Osama Ibrahim, Hercules Dalianis. DSV writers hut 2014

    Conference

    Policy making situations are real-world problems that exhibit complexity in that they are composed of many interrelated problems and issues. To be effective, policies must holistically address the complexity of the situation rather than propose solutions to single problems. Formulating and understanding the situation and its complex dynamics, therefore, is a key to finding holistic solutions. Analysis of text based information on the policy problem, using Natural Language Processing (NLP) and Text analysis techniques, can support modelling of public policy problem situations in a more objective way based on domain experts’ knowledge and scientific evidence. The objective behind this study is to support modelling of public policy problem situations, using text analysis of verbal descriptions of the problem. We propose a formal methodology for analysis of qualitative data from multiple information sources on a policy problem to construct a causal diagram of the problem. The analysis process aims at identifying key variables, linking them by cause-effect relationships and mapping that structure into a graphical representation that is adequate for designing action alternatives, i.e., policy options. This study describes the outline of an algorithm used to automate the initial step of a larger methodological approach, which is so far done manually. In this initial step, inferences about key variables and their interrelationships are extracted from textual data to support a better problem structuring. A small prototype for this step is also presented.

    Read more about Text Analysis to support structuring and modelling a public policy problem
  • Automatic clinical text de-identification: is it worth it, and could it work for me?

    2013. Stéphane M. Meystre (et al.). Proceedings of the 14th World Congress on Medical and Health Informatics, 1242-1242

    Conference

    The increased use and adoption of Electronic Health Records, and the parallel growth in patient data available for secondary use by clinicians, researchers and operational purposes, all cause patient confidentiality protection to become an increasingly important requirement and expectation. The laws protecting patient confidentiality typically require the informed consent of the patient to use data for research purposes, a requirement that can be waived if the data are de-identified. Several methods to automatically remove identifying information from clinical text have been tested experimentally over the last 10 years, guided by the HIPAA “Safe Harbor” methodology. This panel will focus on the issues related to the automatic de-identification of clinical text. It will include an overview of the domain, a demonstration of good examples of such applications in English and in Swedish with their main authors sharing development and adaptation experiences, and a discussion of the HIPAA “Safe Harbor” de-identification quality and the risk of re-identification of de-identified data. The difficulties and issues related to this task will be debated, as well as the main methods used and the performance and adaptability of these methods.

    Read more about Automatic clinical text de-identification
  • Negation Scope Delimitation in Clinical Text Using Three Approaches: NegEx, PyConTextNLP and SynNeg

    2013. Hideyuki Tanushi (et al.). Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), 387-474

    Conference

    Negation detection is a key component in clinical information extraction systems, as health record text contains reasoning in which the physician excludes different diagnoses by negating them. Many systems for negation detection rely on negation cues (e.g. not), but only few studies have investigated whether the syntactic structure of the sentences can be used for determining the scope of these cues. We have in this paper compared three different systems for negation detection in Swedish clinical text (NegEx, PyConTextNLP and SynNeg), which have different approaches for determining the scope of negation cues. NegEx uses the distance between the cue and the disease, PyConTextNLP relies on a list of conjunctions limiting the scope of a cue, and in SynNeg the boundaries of the sentence units, provided by a syntactic parser, limit the scope of the cues. The three systems produced similar results, detecting negation with an F-score of around 80%, but using a parser had advantages when handling longer, complex sentences or short sentences with contradictory statements.

    Read more about Negation Scope Delimitation in Clinical Text Using Three Approaches
  • Porting a Rule-based Assertion Classifier for Clinical Text from English to Swedish

    2013. Sumithra Velupillai (et al.). Proceedings of the 4th International Louhi Workshop on Health Document Text Mining and Information Analysis - Louhi 2013

    Conference

    An existing rule-based assertion classifier is ported from English to Swedish: pyConTextSwe. Evaluation on Swedish clinical texts shows that the English lexical resources are useful, but that there are assertion cues not obtainable in existing resources. Iterative error correction of the cue lexicons improves results for the ported classifier. Overall final results are 82% F-score on a development set and 74% on a test set.

    Read more about Porting a Rule-based Assertion Classifier for Clinical Text from English to Swedish
  • Using text prediction for facilitating input and improving readability of clinical text

    2013. Magnus Ahltorp (et al.). MedInfo 2013, 1149-1149

    Conference

    Text prediction has the potential for facilitating and speeding up the documentation work within health care, making it possible for health personnel to allocate less time to documentation and more time to patient care. It also offers a way to produce clinical text with fewer misspellings and abbreviations, increasing readability. We have explored how text prediction can be used for input of clinical text, and how the specific challenges of text prediction in this domain can be addressed. A text prediction prototype was constructed using data from a medical journal and from medical terminologies. This prototype achieved keystroke savings of 26% when evaluated on texts mimicking authentic clinical text. The results are encouraging, indicating that there are feasible methods for text prediction in the clinical domain.
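
    Keystroke savings, the metric reported above, can be made concrete with a tiny prefix-completion sketch. The four-word lexicon is an invented stand-in for the medical-journal and terminology data behind the prototype.

        # Hypothetical lexicon; the prototype used a medical journal and terminologies.
        LEXICON = ["penicillin", "pneumoni", "patient", "paracetamol"]

        def complete(prefix):
            """Suggest the first lexicon word starting with `prefix`, if any."""
            matches = [w for w in LEXICON if w.startswith(prefix)]
            return matches[0] if matches else None

        def keystroke_savings(word):
            """Fraction of keystrokes saved, counting one extra key to accept."""
            for i in range(1, len(word) + 1):
                if complete(word[:i]) == word:
                    return 1 - (i + 1) / len(word)
            return 0.0

        print(keystroke_savings("penicillin"))  # 0.8: "p" plus one acceptance key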

    Read more about Using text prediction for facilitating input and improving readability of clinical text
  • De-identifying health records by means of active learning

    2012. Henrik Boström, Hercules Dalianis.  

    Conference

    An experiment on classifying words in Swedish health records as belonging to one of eight protected health information (PHI) classes, or to the non-PHI class, by means of active learning has been conducted, in which three selection strategies were evaluated in conjunction with random forests; the commonly employed approach of choosing the most uncertain examples, choosing randomly, and choosing the most certain examples. Surprisingly, random selection outperformed choosing the most uncertain examples with respect to ten considered performance metrics. Moreover, choosing the most certain examples outperformed random selection with respect to nine out of ten metrics.
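
    The three selection strategies compared in the experiment can be sketched with a random forest's class probabilities on a synthetic pool; the data below is invented, not the Swedish health-record material.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(1)
        X_pool = rng.normal(size=(500, 10))      # synthetic unlabelled pool
        y_pool = (X_pool[:, 0] > 0).astype(int)  # oracle labels for the toy task
        seed = list(range(20))                   # initially labelled examples

        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X_pool[seed], y_pool[seed])
        margin = np.abs(model.predict_proba(X_pool)[:, 1] - 0.5)  # 0 = most uncertain

        batch = 10
        most_uncertain = np.argsort(margin)[:batch]   # classic uncertainty sampling
        most_certain = np.argsort(margin)[-batch:]    # the strategy that did well here
        random_pick = rng.choice(len(X_pool), size=batch, replace=False)
        # Each strategy would label its picks, extend the seed set, and retrain.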

    Read more about De-identifying health records by means of active learning
  • Detection of Hospital Acquired Infections in sparse and noisy Swedish patient records: A machine learning approach using Naïve Bayes, Support Vector Machines and C4.5

    2012. Claudia Ehrentraut (et al.). Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data

    Conference

    Hospital Acquired Infections (HAI) pose a significant risk to patients’ health, while their surveillance is an additional workload for hospital medical staff and hospital management. Our overall aim is to build a system which reliably retrieves all patient records which potentially include HAI, to reduce the burden of manually checking patient records by the hospital staff. In other words, we emphasize recall when detecting HAI (aiming at 100%) with the highest precision possible. The present study is of experimental nature, focusing on the application of Naïve Bayes (NB), Support Vector Machines (SVM) and a C4.5 Decision Tree to the problem and the evaluation of the efficiency of this approach. The three classifiers showed an overall similar performance. SVM yielded the best recall value, 89.8%, for records that contain HAI. We present a machine learning approach as an alternative to rule-based systems which are more common in this task. The classifiers were applied on a small and noisy dataset, generating results which pinpoint the potential of using learning algorithms for detecting HAI. Further research will have to focus on optimizing the performance of the classifiers and on testing them on larger datasets.

    Read more about Detection of Hospital Acquired Infections in sparse and noisy Swedish patient records
  • Entity Recognition of Pharmaceutical Drugs in Swedish Clinical Text

    2012. Sidrat ul Muntaha (et al.). Proceedings of the Conference, 77-78

    Conference

    An entity recognition system for expressions of pharmaceutical drugs, based on vocabulary lists from FASS, the Medical Subject Headings and SNOMED CT, achieved a precision of 94% and a recall of 74% when evaluated on assessment texts from Swedish emergency unit health records.

    Read more about Entity Recognition of Pharmaceutical Drugs in Swedish Clinical Text
  • Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text

    2012. Aron Henriksson (et al.).  

    Conference

    A novel method for identifying potential side-effects of medications through large-scale analysis of clinical data is introduced and evaluated here. By calculating distributional similarities for medication-symptom pairs based on co-occurrence information in a large clinical corpus, many known adverse drug reactions are successfully identified. These preliminary results suggest that semantic vector space models of clinical text could also be used to generate hypotheses about potentially unknown adverse drug reactions. In the best model, 50% of the terms in a list of twenty are considered to be conceivable side-effects. Among the medication-symptom pairs, however, diagnostic indications and terms related to the medication in other ways also appear. These relations need to be distinguished in a more refined method for detecting adverse drug reactions.
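
    As a simplified sketch of the underlying idea (the paper builds distributional models from a large clinical corpus; the snippet below uses a plain co-occurrence matrix and cosine similarity on made-up sentences):

        import numpy as np
        from collections import defaultdict

        def cooccurrence_vectors(sentences, window=2):
            """Build plain co-occurrence count vectors over a sliding token window."""
            vocab = sorted({w for s in sentences for w in s})
            index = {w: i for i, w in enumerate(vocab)}
            vectors = defaultdict(lambda: np.zeros(len(vocab)))
            for s in sentences:
                for i, w in enumerate(s):
                    for j in range(max(0, i - window), min(len(s), i + window + 1)):
                        if j != i:
                            vectors[w][index[s[j]]] += 1
            return vectors

        def cosine(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

        # Made-up sentences; the real model was built from millions of clinical notes.
        sents = [["waran", "gave", "bleeding"], ["bleeding", "after", "waran"],
                 ["alvedon", "against", "headache"]]
        vec = cooccurrence_vectors(sents)
        print("sim(waran, bleeding) =", round(cosine(vec["waran"], vec["bleeding"]), 2))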

    Read more about Exploration of Adverse Drug Reactions in Semantic Vector Space Models of Clinical Text
  • Natural Language Generation from SNOMED Specifications

    2012. Mattias Kanhov, Xuefeng Feng, Hercules Dalianis. CLEFeHealth2012

    Conference

    SNOMED (Systematized Nomenclature of Medicine) is a comprehensive clinical terminology that contains almost 400,000 concepts. Since SNOMED is expressed in a formal language, it is hard to understand for users who are not acquainted with the formal specifications. Natural language generation (NLG) is a technique that uses computers to create natural language descriptions from formal languages. In order to generate descriptions of SNOMED concepts, two NLG tools were implemented, for the English and the Swedish version of SNOMED respectively. The tool for English used a natural language generator called ASTROGEN to produce description texts, and also applied several aggregation rules to make the texts shorter and easier to understand. The other tool was written in C#.Net and applied a template-based generation technique to create concept explanations in Swedish. As a baseline, the same SNOMED concepts were presented in a tree-structure browser. To evaluate the English NLG system, 19 SNOMED concepts were randomly chosen for text generation. Ten volunteers participated in this evaluation: five of them estimated the accuracy of the texts and the other five assessed their fluency. The sample texts received a mean score of 4.37 for accuracy and 4.47 for fluency (maximum score 5). To evaluate the Swedish NLG system, five concepts were randomly chosen for text generation. In parallel, two physicians with knowledge of SNOMED manually created natural language descriptions of the same concepts. Both the manual and the system-generated descriptions were evaluated and compared by four physicians in total. The respondents scored the manual descriptions highest, on average 83 of 100, while the system-generated texts obtained around 68 of 100. All respondents except one (who scored it 7 of 10) preferred the system-generated text. This paper presents a possible way of using natural language generation to explain the meaning of SNOMED concepts to people who are not familiar with the SNOMED formal language. The evaluation results indicate that NLG techniques can be used for this task.
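
    Template-based generation of the kind used for the Swedish tool can be sketched in a few lines. The concept structure and template below are hypothetical simplifications of SNOMED's attribute-value form, not the tools' actual representations:

        # Hypothetical, simplified SNOMED-style concept: a name plus attribute-value pairs.
        concept = {
            "name": "appendicitis",
            "is_a": "inflammatory disorder",
            "finding_site": "appendix",
        }

        def describe(concept):
            """Fill a fixed sentence template with the concept's attributes."""
            return (f"{concept['name'].capitalize()} is a kind of {concept['is_a']} "
                    f"that is located in the {concept['finding_site']}.")

        print(describe(concept))
        # -> Appendicitis is a kind of inflammatory disorder that is located in the appendix.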

    Read more about Natural Language Generation from SNOMED Specifications
  • Pseudonymisation of Personal Names and other PHIs in an Annotated Clinical Swedish Corpus

    2012. Alyaa Alfalahi, Sara Brissman, Hercules Dalianis. LREC 2012, Eighth International Conference on Language Resources and Evaluation

    Conference

    Today a large number of patient records are produced, and these records contain valuable information, often in free text, about the medical treatment of patients. Since these records contain information that can reveal the identity of patients, known as protected health information (PHI), they cannot easily be made available to the research community. In this research we have used a PHI-annotated clinical corpus, written in Swedish, that we have pseudonymised. Pseudonymisation means replacing the sensitive information with fictive information: for example, real personal names are replaced with fictive personal names that preserve the gender of the real names and family relations. We have evaluated our results, and our five respondents, of whom three were clinicians, found that the pseudonymised clinical text looks real and is readable. We have also added pseudonymisation for telephone numbers, locations, health care units, dates and ages. In this paper we also present the entire de-identification and pseudonymisation process for a sample clinical text.
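
    Surrogate replacement over annotated PHI spans is the core mechanism. A minimal sketch, assuming hypothetical surrogate lists and annotation offsets; gender-aware name substitution is shown, family relations are not:

        import random

        # Hypothetical surrogate lists; the real system covers more PHI classes.
        SURROGATES = {
            ("name", "female"): ["Anna Lind", "Karin Berg"],
            ("name", "male"): ["Erik Dahl", "Lars Holm"],
            "phone": ["08-123 456 78"],
        }

        def pseudonymise(text, annotations, seed=0):
            """Replace annotated PHI spans (start, end, key) with fictive surrogates.

            Spans are replaced right-to-left so earlier offsets stay valid.
            """
            rng = random.Random(seed)
            for start, end, key in sorted(annotations, reverse=True):
                text = text[:start] + rng.choice(SURROGATES[key]) + text[end:]
            return text

        note = "Pat Maria Svensson, tel 070-111 22 33, mår bra."
        annotations = [(4, 18, ("name", "female")), (24, 37, "phone")]
        print(pseudonymise(note, annotations))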

    Read more about Pseudonymisation of Personal Names and other PHIs in an Annotated Clinical Swedish Corpus
  • Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random Fields and Random Forests

    2012. Hercules Dalianis, Henrik Boström. Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), 45-48

    Conference

    Patient records contain valuable information in the form of both structured data and free text; however, this information is sensitive since it can reveal the identity of patients. In order to allow new methods and techniques to be developed and evaluated on real-world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names, telephone numbers, and so on. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include parts of speech, word length, and so on, from which the sensitive information cannot be derived. In order to investigate what performance losses can be expected when replacing specific words with features, an experiment with two state-of-the-art machine learning methods, conditional random fields and random forests, is presented, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark test. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.
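
    The feature-only representation can be illustrated as follows; the feature set here is a hypothetical stand-in for the paper's choices (the abstract mentions parts of speech and word length, among others):

        def word_features(token):
            """Non-identifying surface features meant to stand in for the word itself."""
            return {
                "length": len(token),
                "is_capitalized": token[:1].isupper(),
                "is_digit": token.isdigit(),
                "shape": "".join("X" if c.isupper() else "x" if c.islower()
                                 else "d" if c.isdigit() else c for c in token),
            }

        # A CRF or random forest would be trained on these dicts instead of the tokens,
        # so the released corpus never contains the sensitive words themselves.
        for tok in ["Svensson", "08-5537", "ward"]:
            print(tok, "->", word_features(tok))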

    Read more about Releasing a Swedish Clinical Corpus after Removing all Words – De-identification Experiments with Conditional Random Fields and Random Forests
  • Rule-based Entity Recognition and Coverage of SNOMED CT in Swedish Clinical Text

    2012. Maria Skeppstedt, Maria Kvist, Hercules Dalianis. LREC 2012 8th ELRA Conference on Language Resources and Evaluation, 1250-1257

    Conference

    Named entity recognition of the clinical entities disorders, findings and body structures is needed for information extraction from unstructured text in health records. Clinical notes from a Swedish emergency unit were annotated and used for evaluating a rule- and terminology-based entity recognition system. This system used different preprocessing techniques for matching terms to SNOMED CT, and, one by one, four other terminologies were added. For the class body structure, the results improved with preprocessing, whereas only small improvements were shown for the classes disorder and finding. The best average results were achieved when all terminologies were used together. The entity body structure was recognised with a precision of 0.74 and a recall of 0.80, whereas lower results were achieved for disorder (precision: 0.75, recall: 0.55) and for finding (precision: 0.57, recall: 0.30). The proportion of entities containing abbreviations was higher for false negatives than for correctly recognised entities, and no entities containing more than two tokens were recognised by the system. The low recall for disorders and findings shows both that additional methods are needed for entity recognition and that there are many expressions in clinical text that are not included in SNOMED CT.

    Read more about Rule-based Entity Recognition and Coverage of SNOMED CT in Swedish Clinical Text
  • Stockholm EPR Corpus: A Clinical Database Used to Improve Health Care

    2012. Hercules Dalianis (et al.). Proceedings of SLCT 2012, 17-18

    Conference

    The care of patients is well documented in health records. Despite being a valuable source of information that could be mined by computers and used to improve health care, health records are not readily available for research. Moreover, the narrative parts of the records are noisy and need to be interpreted by domain experts. In this abstract we describe our experiences of gaining access to a database of electronic health records for research. We also highlight some important issues in this domain and describe a number of possible applications, including comorbidity networks, detection of hospital-acquired infections and adverse drug reactions, as well as diagnosis coding support.

    Read more about Stockholm EPR Corpus
  • User Centered Development of Automatic E-mail Answering for the Public Sector

    2012. Teresa Cerratto-Pargman (et al.). Human-Computer Interaction, Tourism and Cultural Heritage, 154-156

    Conference

    In Sweden, the use of e-mail by the public sector has become a key communication service between citizens and governmental authorities. Although the integration of e-mail in the public sector has certainly brought citizens and handling officers closer, it has also introduced particular expectations of governmental authorities, such as the idea that public service and information should be available to citizens any time, anywhere. Such a belief among citizens certainly puts high demands on the quality and efficiency of the e-services governmental authorities are capable of providing. In fact, the growing number of citizens' electronic requests must be accurately answered in a limited time. In the research project IMAIL (Intelligent e-mail answering service for eGovernment) [1], we have focused on the work carried out at the Swedish Social Insurance Agency (SSIA), a governmental authority dealing with 500,000 e-mails per year on top of face-to-face meetings, phone calls and chat communication. With the objective of creating an e-mail client capable of easing and ensuring the quality of the public service provided by SSIA's handling officers, we have developed a prototype that: (1) automatically answers a large part of the simple questions in the incoming e-mail flow, (2) improves the quality of the semi-automatic answers (i.e. answer templates), and (3) reduces the workload of the handling officers. The development of the prototype is grounded in an empirical study conducted at the SSIA, comprising the analysis and clustering of 10,000 citizen e-mails and of the working activity of 15 handling officers, collected through questionnaires, interviews and workshops [2].

    Read more about User Centered Development of Automatic E-mail Answering for the Public Sector
  • Characteristics of Finnish and Swedish intensive care nursing narratives: a comparative analysis to support the development of clinical language technologies

    2011. Helen Allvin (et al.). Journal of Biomedical Semantics 2 (S1), 1-11

    Article

    Background: Free text is helpful for entering information into electronic health records, but reusing it is a challenge. The need for language technology for processing Finnish and Swedish healthcare text is therefore evident; however, Finnish and Swedish are linguistically very dissimilar. In this paper we present a comparison of characteristics in Finnish and Swedish free-text nursing narratives from intensive care. This creates a framework for characterising and comparing clinical text and lays the groundwork for developing clinical language technologies. Methods: Our material included daily nursing narratives from one intensive care unit in Finland and one in Sweden. Inclusion criteria for patients were an inpatient period of at least five days and an age of at least 16 years. We performed a comparative analysis as part of a collaborative effort between Finnish- and Swedish-speaking healthcare and language technology professionals that included both qualitative and quantitative aspects. The qualitative analysis addressed the content and structure of three average-sized health records from each country. In the quantitative analysis 514 Finnish and 379 Swedish health records were studied using various language technology tools. Results: Although the two languages are not closely related, nursing narratives in Finland and Sweden had many properties in common. Both made use of specialised jargon, and their content was very similar. However, many of these characteristics were challenging for the development of language technology to support producing and using clinical documentation. Conclusions: The way Finnish and Swedish intensive care nursing was documented was not country or language dependent, but shared a common context, principles, structural features and even similar vocabulary elements. Technology solutions are therefore likely to be applicable to a wider range of natural languages, but they need linguistic tailoring. Availability: The Finnish and Swedish data can be found at: http://www.dsv.su.se/hexanord/data/

    Read more about Characteristics of Finnish and Swedish intensive care nursing narratives
  • Comparing Manual Text Patterns and Machine Learning for Classification of E-Mails for Automatic Answering by a Government Agency

    2011. Hercules Dalianis, Jonas Sjöbergh, Eriks Sneiders. Computational Linguistics and Intelligent Text Processing

    Conference

    E-mails to government institutions as well as to large companies may contain a large proportion of queries that can be answered in a uniform way. We analysed and manually annotated 4,404 e-mails from citizens to the Swedish Social Insurance Agency, and compared two methods for detecting answerable e-mails: manually created text patterns (rule-based) and machine learning-based methods. We found that the text pattern-based method gave much higher precision, at 89 percent, than the machine learning-based method, which gave only 63 percent precision. The recall was slightly higher (66 percent) for the machine learning-based methods than for the text patterns (47 percent). We also found that 23 percent of the total e-mail flow was processed by the automatic e-mail answering system.

    Read more about Comparing Manual Text Patterns and Machine Learning for Classification of E-Mails for Automatic Answering by a Government Agency
  • Factuality Levels of Diagnoses in Swedish Clinical Text

    2011. Sumithra Velupillai, Hercules Dalianis, Maria Kvist. User Centred Networked Health Care - Proceedings of MIE 2011, 559-563

    Conference

    Different levels of knowledge certainty, or factuality levels, are expressed in clinical health record documentation. This information is currently not fully exploited, as the subtleties expressed in natural language cannot easily be machine analyzed. Extracting relevant information from knowledge-intensive resources such as electronic health records can be used for improving health care in general by e.g. building automated information access systems. We present an annotation model of six factuality levels linked to diagnoses in Swedish clinical assessments from an emergency ward. Our main findings are that overall agreement is fairly high (0.7/0.58 F-measure, 0.73/0.6 Cohen's κ, Intra/Inter). These distinctions are important for knowledge models, since only approx. 50% of the diagnoses are affirmed with certainty. Moreover, our results indicate that there are patterns inherent in the diagnosis expressions themselves conveying factuality levels, showing that certainty is not only dependent on context cues.

    Read more about Factuality Levels of Diagnoses in Swedish Clinical Text
  • Retrieving disorders and findings: Results using SNOMED CT and NegEx adapted for Swedish

    2011. Maria Skeppstedt, Hercules Dalianis, Gunnar H. Nilsson. LOUHI 2011 Health Document Text Mining and Information Analysis 2011, 11-17

    Conference

    Access to reliable data from electronic health records is of high importance in several key areas in patient care, biomedical research, and education. However, many of the clinical entities are negated in the patient record text. Detecting what is a negation and what is not is therefore a key to high-quality text mining. In this study we used the NegEx system adapted for Swedish to investigate negated clinical entities. We applied the system to a subset of free-text entries under a heading containing the word 'assessment' from the Stockholm EPR corpus, containing in total 23,171,559 tokens. Specifically, the explored entities were the SNOMED CT terms having the semantic categories 'finding' or 'disorder'. The study showed that the proportion of negated clinical entities was around 9%. The results thus support that negations are abundant in clinical text and hence negation detection is vital for high-quality text mining in the medical domain.
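
    NegEx is essentially a trigger-and-scope rule system. A minimal sketch, with an illustrative (not the adapted system's actual) Swedish trigger list and a fixed-width scope window:

        # Minimal NegEx-style rule: a trigger negates entity mentions that follow it
        # within a fixed window. The trigger words are illustrative Swedish examples.
        NEGATION_TRIGGERS = {"ingen", "inga", "ej", "utan"}
        WINDOW = 5  # tokens after a trigger that fall inside the negation scope

        def negated_entities(tokens, entity_indices):
            """Return the entity token indices that fall in the scope of a trigger."""
            negated = set()
            for i, tok in enumerate(tokens):
                if tok.lower() in NEGATION_TRIGGERS:
                    negated |= {j for j in entity_indices if i < j <= i + WINDOW}
            return negated

        tokens = "pat har ingen feber eller hosta".split()
        print(negated_entities(tokens, entity_indices={3, 5}))  # feber, hosta -> {3, 5}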

    Read more about Retrieving disorders and findings: Results using SNOMED CT and NegEx adapted for Swedish
  • Clustering E-Mails for the Swedish Social Insurance Agency - What Part of the E-Mail Thread Gives the Best Quality?

    2010. Hercules Dalianis, Magnus Rosell, Eriks Sneiders. Advances in Natural Language Processing, 115-120

    Conference

    We need to analyse a large number of e-mails sent by citizens to the customer services department of a governmental organisation based in Sweden. To carry out this analysis we clustered a large number of e-mails with the aim of automatic e-mail answering. One issue that came up was whether we should use the whole e-mail, including the thread, or just the original query for the clustering. In this paper we describe this investigation. Our results show that the query and the answering part should be used, and that the whole e-mail thread is not necessary. The results clearly show that the original question contains more useful information than the answer alone, although a combination is even better. Using the full e-mail thread does not, however, degrade the result.
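
    As a hedged sketch of the comparison (the e-mails, vectoriser and parameters below are made up, not the paper's setup), different parts of the thread can be clustered side by side with TF-IDF and k-means:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        # Made-up (query, answer) pairs standing in for citizen e-mails.
        threads = [
            ("how do I apply for parental benefit", "send the application form to your office"),
            ("when is parental benefit paid out", "payments are made on the 25th"),
            ("I lost my card, what now", "order a new card via the web service"),
        ]

        def cluster(texts, k=2):
            X = TfidfVectorizer().fit_transform(texts)
            return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

        # Compare clustering on the query alone vs. query plus answer:
        print("query only  :", cluster([q for q, a in threads]))
        print("query+answer:", cluster([q + " " + a for q, a in threads]))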

    Read more about Clustering E-Mails for the Swedish Social Insurance Agency - What Part of the E-Mail Thread Gives the Best Quality?
  • Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction

    2010. Hercules Dalianis, Haochun Xing, Xin Zhang. Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, May 19-21, 2010, 1700-1705

    Conference

    This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually translated from Chinese to English. The parallel corpus contains 104,563 Chinese characters, equivalent to 59,918 Chinese words, and the corresponding English corpus contains 75,766 English words. However, Chinese writing does not use any delimiters to mark word boundaries, so we had to carry out word segmentation as a preprocessing step on the Chinese corpus. Moreover, since the parallel corpus was downloaded from the Internet, it is noisy with respect to the alignment between corresponding translated sentences. We therefore spent 60 hours of manual work aligning the sentences in the English and Chinese parallel corpus before performing automatic word alignment with Uplug. The word alignment with Uplug was carried out from English to Chinese. Nine respondents evaluated the resulting English-Chinese word list, restricted to entries with a frequency of three or above, and we obtained an accuracy of 73.1 percent.

    Read more about Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction
  • Creating and Evaluating a Consensus for Negated and Speculative Words in a Swedish Clinical Corpus

    2010. Hercules Dalianis, Maria Skeppstedt. Proceedings of the Workshop on Negation and Speculation in Natural Language Processing ((NeSp-NLP 2010)), 5-13

    Conference

    In this paper we describe the creation of a consensus corpus, obtained by combining three individual annotations of the same clinical corpus in Swedish. We used a few basic rules, executed automatically, to create the consensus. The corpus contains negation words, speculative words, uncertain expressions and certain expressions. We evaluated the consensus by using it for negation and speculation cue detection. We used Stanford NER, which is based on the machine learning algorithm Conditional Random Fields, for training and detection. For comparison we also trained Stanford NER on the clinical part of the BioScope Corpus. For our clinical consensus corpus in Swedish we obtained a precision of 87.9 percent and a recall of 91.7 percent for negation cues, and for English with the BioScope Corpus we obtained a precision of 97.6 percent and a recall of 96.7 percent for negation cues.

    Read more about Creating and Evaluating a Consensus for Negated and Speculative Words in a Swedish Clinical Corpus
  • De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields

    2010. Hercules Dalianis, Sumithra Velupillai. Journal of Biomedical Semantics 1:6

    Article

    Background

    In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident.

    Results

    We present work on the creation of two refined variants of a manually annotated gold standard for de-identification, one created automatically and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. These variants are used for the training and evaluation of an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000-6,000 annotation instances, we obtained very promising results for both gold standards: F-scores around 0.80 for a number of experiments, with higher results for certain annotation classes. Moreover, the system found 49 instances that had been missed by the annotators and were initially counted as false positives but could be verified as true positives.

    Conclusions

    Our intention is to make this gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Despite being slightly more time-consuming to produce, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.

    Read more about De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields
  • Finding the Parallel: Automatic Dictionary Construction and Identification of Parallel Text Pairs

    2010. Sumithra Velupillai, Martin Hassel, Hercules Dalianis. Using Corpora in Contrastive and Translation Studies

    Chapter
    Read more about Finding the Parallel
  • How Certain are Clinical Assessments?: Annotating Swedish Clinical Text for (Un)certainties, Speculations and Negations

    2010. Hercules Dalianis, Sumithra Velupillai. Proceedings of the of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, 3071-3075

    Conference

    Clinical texts contain a large amount of information. Some of this information is embedded in contexts where e.g. a patient's status is reasoned about, which may lead to a considerable number of statements that indicate uncertainty and speculation. We believe that distinguishing such instances from factual statements will be very beneficial for automatic information extraction. We have annotated a subset of the Stockholm Electronic Patient Record Corpus for certain and uncertain expressions as well as speculative and negation keywords, with the purpose of creating a resource for the development of automatic detection of speculative language in Swedish clinical text. We have analyzed the results from the initial annotation trial by means of pairwise Inter-Annotator Agreement (IAA) measured with F-score. Our main findings are that IAA results for certain expressions and negations are very high, but for uncertain expressions and speculative keywords the results are less encouraging. These instances need to be defined in more detail. With this annotation trial, we have created an important resource that can be used to further analyze the properties of speculative language in Swedish clinical text. Our intention is to release this subset to other research groups in the future, after removing identifiable information.

    Read more about How Certain are Clinical Assessments?
  • Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish

    2010. Elin Carlsson, Hercules Dalianis. Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, May 19-21, 2010, 3442-3446

    Conference

    Electronic patient records (EPRs) are a valuable resource for research, but for confidentiality reasons they cannot be used freely. In order to make EPRs available to a wider group of researchers, sensitive information such as personal names has to be removed. De-identification is a process that makes this possible. Both rule-based and statistical machine learning methods exist for de-identification, but the latter require annotated training material, which exists only very sparsely for patient names. It is therefore necessary to use rule-based methods for de-identification of EPRs. Not much is known, however, about the order in which the various rules should be applied and how the different rules influence precision and recall. This paper aims to answer this research question by implementing and evaluating four common rules for de-identification of personal names in EPRs written in Swedish: (1) dictionary name matching, (2) title matching, (3) common words filtering and (4) learning from previous modules. The results show that to obtain the highest recall and precision, the rules should be applied in the following order: title matching, common words filtering and dictionary name matching.
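
    The module-order question can be made concrete with a toy pipeline. The rules below are drastic simplifications (the common-words filter is left as an identity placeholder, and the title list and name dictionary are invented), shown only to illustrate that later modules operate on earlier modules' output, which is why order matters:

        import re

        def title_matching(text):
            """Tag a capitalized word following a title as a name (simplified rule)."""
            return re.sub(r"\b(dr|doktor|ssk)\.?\s+([A-ZÅÄÖ]\w+)", r"\1 <NAME>",
                          text, flags=re.IGNORECASE)

        def common_words_filter(text):
            # Placeholder: a real module would un-tag dictionary words wrongly
            # tagged as names; kept as the identity to show the pipeline shape.
            return text

        def dictionary_name_matching(text, names=("Svensson", "Karlsson")):
            for name in names:  # hypothetical name dictionary
                text = text.replace(name, "<NAME>")
            return text

        # The order the paper found best: titles, then common-word filtering,
        # then dictionary matching.
        pipeline = [title_matching, common_words_filter, dictionary_name_matching]
        text = "Dr Svensson ordinerade vila. Karlsson informerad."
        for module in pipeline:
            text = module(text)
        print(text)  # -> Dr <NAME> ordinerade vila. <NAME> informerad.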

    Read more about Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish
  • Uncertainty Detection as Approximate Max-Margin Sequence Labelling

    2010. Oscar Täckström (et al.). CoNLL 2010: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, 2010, 84-91

    Conference

    This paper reports experiments for the CoNLL 2010 shared task on learning to detect hedges and their scope in natural language text. We have addressed the experimental tasks as supervised linear maximum margin prediction problems. For sentence-level hedge detection in the biological domain we use an L1-regularised binary support vector machine, while for sentence-level weasel detection in the Wikipedia domain we use an L2-regularised approach. We model the in-sentence uncertainty cue and scope detection task as an L2-regularised approximate maximum margin sequence labelling problem, using the BIO encoding. In addition to surface-level features, we use a variety of linguistic features based on a functional dependency analysis. A greedy forward selection strategy is used in exploring the large set of potential features. Our official results for Task 1 are an F1-score of 85.2 for the biological domain and 55.4 for the Wikipedia set. For Task 2, our official result is 2.1 for the entire task, with a score of 62.5 for cue detection. After resolving errors and final bugs, our final results are, for Task 1, biological: 86.0 and Wikipedia: 58.2; for Task 2, scopes: 39.6 and cues: 78.5.
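
    The BIO encoding mentioned above is a standard way to cast span detection as per-token sequence labelling. A minimal sketch, with a hypothetical cue span:

        def to_bio(tokens, spans, label):
            """Encode (start, end) token spans as B-/I- tags; everything else is O."""
            tags = ["O"] * len(tokens)
            for start, end in spans:        # end is exclusive
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            return tags

        tokens = "the drug may possibly cause dizziness".split()
        print(list(zip(tokens, to_bio(tokens, [(2, 4)], "CUE"))))
        # ('may', 'B-CUE') and ('possibly', 'I-CUE'); the rest are 'O'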

    Read more about Uncertainty Detection as Approximate Max-Margin Sequence Labelling
  • Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial

    2009. Sumithra Velupillai (et al.). International Journal of Medical Informatics 78 (12), e19-e26

    Article

    Background

    Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive since the free text parts may contain information that could reveal the identity of a patient. Therefore, methods for de-identifying EPRs are needed. The work presented here aims to perform a manual and automatic Protected Health Information (PHI)-annotation trial for EPRs written in Swedish.

    Methods

    This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and the porting and evaluation of an existing de-identification software written for American English to Swedish in a preliminary automatic de-identification trial. Results are measured with precision, recall and F-measure.

    Results

    This study reports fairly high Inter-Annotator Agreement (IAA) results on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was 0.65 F-measure (0.84 F-measure highest pairwise agreement). For name tags the average IAA was 0.80 F-measure (0.91 F-measure highest pairwise agreement). Directly porting de-identification software written for American English to Swedish was unfortunately non-trivial, yielding poor results.

    Conclusion

    Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussions on, and definitions of, identifiable information are needed, as well as further development of both the tag sets and the annotation guidelines, in order to obtain a reliable gold standard. Completely new de-identification software needs to be developed.

    Read more about Developing a standard for de-identifying electronic patient records written in Swedish
  • Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian

    2007. Hercules Dalianis, Martin Rimka, Viggo Kann.

    Conference

    This paper presents how we adapted a website search engine for cross-language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by visitors to the website of the Nordic Council, which contains five different languages. In order to compare how well different types of bilingual dictionaries covered the most common queries and terms on the website, we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary, and a trilingual dictionary automatically constructed from the news corpus on the website using Uplug. The precision and recall of the automatically constructed Swedish-English dictionary using Uplug were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS tags improve precision. The collection of ordinary dictionaries, consisting of about 200,000 words, covers only 41 of the top 100 search queries on the website. The automatically built trilingual dictionary, combined with the small manually built trilingual dictionary of about 2,300 words, covers 36 of the top search queries.

    Read more about Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian

Show all publications by Hercules Dalianis at Stockholm University