Stockholm University

Thomas Vakili, PhD student

About me

I am a PhD student at the Department of Computer and Systems Sciences, where I am supervised by Professor Hercules Dalianis and Associate Professor Aron Henriksson. My research concerns the intersection of language technology and privacy.

The natural language processing field has seen great advances through the introduction of pre-trained language models, such as BERT. At DSV, we have successfully applied these language models to medical applications by training them on large amounts of electronic health record data.

One of the main reasons for the success of these language models is that they are very large and that they are trained on enormous corpora. Because of this, their success comes with an important drawback: they tend to leak information about their training data. My research tackles this issue, and my goal is to find ways of creating models that preserve the privacy of the people in the training data.

I defended my licentiate thesis in May 2023 and plan to defend my doctoral dissertation at the end of 2025. You can read more about my research on my academic webpage.

Teaching

I teach several courses and also supervise bachelor's and master's theses.

Research projects

Publications

A selection from the Stockholm University publication database

  • A Pseudonymized Corpus of Occupational Health Narratives for Clinical Entity Recognition in Spanish

    2024. Jocelyn Dunstan (et al.). BMC Medical Informatics and Decision Making (24)

    Article

    Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.

  • A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks

    2024. Claudio Aracena (et al.). Proceedings of the 6th Clinical Natural Language Processing Workshop, 111-121

    Conference

    Annotated corpora are essential to reliable natural language processing. While they are expensive to create, they are indispensable for building and evaluating systems. This study introduces a new corpus of 2,869 medical and admission reports collected by an occupational insurance and health provider. The corpus has been carefully annotated for personally identifiable information (PII) and is shared with this information masked. Two annotators adhered to annotation guidelines during the annotation process, and a referee later resolved annotation conflicts in a consolidation process to build a gold standard subcorpus. The inter-annotator agreement values, measured in F1, range between 0.86 and 0.93 depending on the selected subcorpus. The value of the corpus is demonstrated by evaluating its use for NER of PII and a classification task. The evaluations find that fine-tuned models and GPT-3.5 reach F1 scores of 0.911 and 0.720 in NER of PII, respectively. In the case of the insurance coverage classification task, using the original or de-identified corpus results in similar performance. The annotated data are released in de-identified form.

  • End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models: Privacy Preservation with Maintained Data Utility

    2024. Thomas Vakili, Aron Henriksson, Hercules Dalianis. BMC Medical Informatics and Decision Making

    Article

    Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large numbers of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive.

    One privacy-preserving technique that aims to mitigate these problems is training data pseudonymization. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks.

    This study evaluates the predictive performance effects of end-to-end pseudonymization of clinical BERT models on five clinical NLP tasks compared to pre-training and fine-tuning on unaltered sensitive data. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.

  • When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification

    2024. Thomas Vakili (et al.). Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), 76-80

    Conference

    Clinical data, in the form of electronic health records, are rich resources that can be tapped using natural language processing. At the same time, they contain very sensitive information that must be protected. One strategy is to remove or obscure data using automatic de-identification. However, the detection of sensitive data can yield false positives. This is especially true for tokens that are similar in form to sensitive entities, such as eponyms. These names tend to refer to medical procedures or diagnoses rather than specific persons. Previous research has shown that automatic de-identification systems often misclassify eponyms as names, leading to a loss of valuable medical information. In this study, we estimate the prevalence of eponyms in a real Swedish clinical corpus. Furthermore, we demonstrate that modern transformer-based de-identification systems are more accurate in distinguishing between names and eponyms than previous approaches.

  • Attacking and Defending the Privacy of Clinical Language Models

    2023. Thomas Vakili.

    Thesis (Lic)

    The state-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained transformer models. The strength of the models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large machine learning models, a difficult problem arises: the unintended memorization of the training data.

    All datasets—including those based on publicly available data—can contain sensitive information about individuals. When models unintentionally memorize these sensitive data, they become vulnerable to different types of privacy attacks. Very few datasets for NLP can be guaranteed to be free from sensitive data. Thus, to varying degrees, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records. Unintentionally leaking publicly available data can be problematic, but leaking data from electronic health records is never acceptable from a privacy perspective. At the same time, clinical NLP has great potential to improve the quality and efficiency of healthcare.

    This licentiate thesis investigates how these privacy risks can be mitigated using automatic de-identification. This is done by exploring the privacy risks of pre-training using clinical data and then evaluating the impact on the model accuracy of decreasing these risks. A BERT model pre-trained using clinical data is subjected to a training data extraction attack. The same model is also used to evaluate a membership inference attack that has been proposed to quantify the privacy risks associated with masked language models. Then, the impact of automatic de-identification on the performance of BERT models is evaluated for both pre-training and fine-tuning data.

    The results show that extracting training data from BERT models is non-trivial and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models, resulting in no reduction in performance compared to models trained using unaltered data. However, we also find that the current state-of-the-art membership inference attacks are unable to quantify the privacy benefits of automatic de-identification. The results show that automatic de-identification reduces the privacy risks of using sensitive data for NLP without harming the utility of the data, but that these privacy benefits may be difficult to quantify.

  • Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data

    2023. Thomas Vakili, Hercules Dalianis. 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 318-323

    Conference

    Large pre-trained language models dominate the current state of the art for many natural language processing applications, including the field of clinical NLP. Several studies have found that these models can be susceptible to privacy attacks that are unacceptable in the clinical domain, where personally identifiable information (PII) must not be exposed.

    However, there is no consensus regarding how to quantify the privacy risks of different models. One prominent suggestion is to quantify these risks using membership inference attacks. In this study, we show that a state-of-the-art membership inference attack on a clinical BERT model fails to detect the privacy benefits from pseudonymizing data. This suggests that such attacks may be inadequate for evaluating token-level privacy preservation of PIIs.

  • Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

    2022. Thomas Vakili (et al.). Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 4245-4252

    Conference

    Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.

  • Utility Preservation of Clinical Text After De-Identification

    2022. Thomas Vakili, Hercules Dalianis. Proceedings of the 21st Workshop on Biomedical Language Processing, 383-388

    Conference

    Electronic health records contain valuable information about symptoms, diagnosis, treatment and outcomes of the treatments of individual patients. However, the records may also contain information that can reveal the identity of the patients. Removing these identifiers - the Protected Health Information (PHI) - can protect the identity of the patient. Automatic de-identification is a process which employs machine learning techniques to detect and remove PHI. However, automatic techniques are imperfect in their precision and introduce noise into the data. This study examines the impact of this noise on the utility of Swedish de-identified clinical data by using human evaluators and by training and testing BERT models. Our results indicate that de-identification does not harm the utility for clinical NLP and that human evaluators are less sensitive to noise from de-identification than expected.

  • Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

    2021. Thomas Vakili, Hercules Dalianis. Proceedings of the AAAI 2021 Fall Symposium on Human Partnership with Medical AI

    Conference

    Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine if there is a correlation between the degree of privacy leaked and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.

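Several of the publications above rely on automatic de-identification and pseudonymization: detecting personally identifiable information (PII) with a named entity recognition (NER) model and replacing it with realistic but non-sensitive surrogates. The sketch below illustrates that general idea only and is not the pipeline used in any of the papers; the NER model is a publicly available general-domain English model standing in for a clinical de-identification model, and the surrogate lists are invented placeholders.

```python
# Minimal sketch of NER-based pseudonymization: detect PII spans with a
# NER model, then replace each span with a surrogate of the same category.
# Model name and surrogate lists are illustrative placeholders only.
import random

from transformers import pipeline

# General-domain English NER model, used here as a stand-in for a
# clinical de-identification model.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

SURROGATES = {
    "PER": ["Alex Jonsson", "Maria Berg", "Sam Lindqvist"],  # invented names
    "LOC": ["Uppsala", "Lund", "Umeå"],                      # invented locations
    "ORG": ["Northside Clinic", "Lakeview Hospital"],        # invented organizations
}

def pseudonymize(text: str) -> str:
    """Replace detected PII entities with same-category surrogates."""
    entities = ner(text)
    # Replace spans from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        surrogate = random.choice(SURROGATES.get(ent["entity_group"], ["[REDACTED]"]))
        text = text[: ent["start"]] + surrogate + text[ent["end"] :]
    return text

print(pseudonymize("Patient John Smith was admitted to Karolinska Hospital in Stockholm."))
```

A real clinical pipeline would use an in-domain model covering many more PII classes (dates, ages, phone numbers, health care units) and would choose surrogates that preserve useful properties, such as shifting all dates by a consistent offset.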

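The 2021 study above probes a clinical BERT model for memorized patient information by generating large sets of sentences with top-k and nucleus (top-p) sampling. The snippet below sketches only the core filtering step that distinguishes the two strategies, applied to a toy next-token distribution in plain PyTorch; it is a simplified illustration, not the generation code used in the paper.

```python
# Top-k keeps the k most probable tokens; nucleus (top-p) keeps the smallest
# set of tokens whose cumulative probability exceeds p. Generic sketch only.
import torch

def sample_token(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> int:
    """Sample one token id from `logits` (shape: [vocab_size]) after filtering."""
    logits = logits.clone()
    if top_k > 0:
        # Mask everything scoring below the k-th highest logit.
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")
    if top_p < 1.0:
        # Keep the smallest prefix of tokens (sorted by probability)
        # whose cumulative probability exceeds top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cumulative > top_p
        remove[1:] = remove[:-1].clone()  # always keep the most probable token
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy example with a five-token vocabulary.
example_logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -3.0])
print(sample_token(example_logits, top_k=3))    # top-k sampling
print(sample_token(example_logits, top_p=0.9))  # nucleus sampling
```

In a full extraction attack, a step like this is applied repeatedly to the model's predicted token distributions to produce candidate sentences, which are then inspected for memorized patient-condition associations.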