Stockholm University

Thomas Vakili
PhD student

About me

I am a PhD student at the Department of Computer and Systems Sciences where I am supervised by Professor Hercules Dalianis. My research is about the intersection of language technology and privacy.

The natural language processing field has seen great advances through the introduction of pre-trained language models, like BERT. At DSV, we have successfully applied these language models to medical tasks by training them on large amounts of electronic health record data.

One of the main reasons for the success of these language models is that they are very large and trained on enormous corpora. However, this scale also comes with an important drawback: the models have a tendency to leak information about their training data. My research tackles this issue, and my goal is to find ways of creating models that preserve the privacy of the people represented in the training data.

You can read more about my research at my academic webpage.

Teaching

I teach several courses and also supervise bachelor's and master's theses. I am teaching, or have taught, the following courses:

Research projects

Publications

A selection from the Stockholm University publication database

  • Attacking and Defending the Privacy of Clinical Language Models

    2023. Thomas Vakili.

    Thesis (Lic)

    The state-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained transformer models. The strength of the models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large machine learning models, a difficult problem arises: the unintended memorization of the training data.

    All datasets—including those based on publicly available data—can contain sensitive information about individuals. When models unintentionally memorize these sensitive data, they become vulnerable to different types of privacy attacks. Very few datasets for NLP can be guaranteed to be free from sensitive data. Thus, to varying degrees, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records. Unintentionally leaking publicly available data can be problematic, but leaking data from electronic health records is never acceptable from a privacy perspective. At the same time, clinical NLP has great potential to improve the quality and efficiency of healthcare.

    This licentiate thesis investigates how these privacy risks can be mitigated using automatic de-identification. This is done by exploring the privacy risks of pre-training using clinical data and then evaluating the impact on the model accuracy of decreasing these risks. A BERT model pre-trained using clinical data is subjected to a training data extraction attack. The same model is also used to evaluate a membership inference attack that has been proposed to quantify the privacy risks associated with masked language models. Then, the impact of automatic de-identification on the performance of BERT models is evaluated for both pre-training and fine-tuning data.

    The results show that extracting training data from BERT models is non-trivial and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models, resulting in no reduction in performance compared to models trained using unaltered data. However, we also find that the current state-of-the-art membership inference attacks are unable to quantify the privacy benefits of automatic de-identification. The results show that automatic de-identification reduces the privacy risks of using sensitive data for NLP without harming the utility of the data, but that these privacy benefits may be difficult to quantify.

    Read more about Attacking and Defending the Privacy of Clinical Language Models
  • Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data

    2023. Thomas Vakili, Hercules Dalianis. 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 318-323

    Conference

    Large pre-trained language models dominate the current state-of-the-art for many natural language processing applications, including the field of clinical NLP. Several studies have found that these models can be susceptible to privacy attacks that are unacceptable in the clinical domain, where personally identifiable information (PII) must not be exposed.

    However, there is no consensus regarding how to quantify the privacy risks of different models. One prominent suggestion is to quantify these risks using membership inference attacks. In this study, we show that a state-of-the-art membership inference attack on a clinical BERT model fails to detect the privacy benefits from pseudonymizing data. This suggests that such attacks may be inadequate for evaluating token-level privacy preservation of PIIs. A minimal sketch of this type of membership inference test is included after the publication list below.

    Read more about Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data
  • Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

    2022. Thomas Vakili (et al.). Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 4245-4252

    Conference

    Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers. A toy sketch of this kind of surrogate replacement is included after the publication list below.

    Read more about Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data
  • Utility Preservation of Clinical Text After De-Identification

    2022. Thomas Vakili, Hercules Dalianis. Proceedings of the 21st Workshop on Biomedical Language Processing, 383-388

    Conference

    Electronic health records contain valuable information about the symptoms, diagnoses, treatments and treatment outcomes of individual patients. However, the records may also contain information that can reveal the identity of the patients. Removing these identifiers, the Protected Health Information (PHI), can protect the identity of the patient. Automatic de-identification is a process that employs machine learning techniques to detect and remove PHI. However, automatic techniques are imperfect in their precision and introduce noise into the data. This study examines the impact of this noise on the utility of Swedish de-identified clinical data by using human evaluators and by training and testing BERT models. Our results indicate that de-identification does not harm the utility for clinical NLP and that human evaluators are less sensitive to noise from de-identification than expected.

    Read more about Utility Preservation of Clinical Text After De-Identification
  • Cross-Clinic De-Identification of Swedish Electronic Health Records: Nuances and Caveats

    2022. Olle Bridal, Thomas Vakili, Marina Santini. Proceedings of the Language Resources and Evaluation Conference, 49-52

    Conference

    Privacy preservation of sensitive information is one of the main concerns in clinical text mining. Due to the inherent privacy-related problems that arise when handling clinical data, the clinical corpora used to create the clinical Named Entity Recognition (NER) models underlying clinical de-identification systems cannot be shared. This implies that clinical NER models are trained and tested on data coming from the same institution, because it is rarely possible to evaluate them on data belonging to a different institution. Given these sharing restrictions, it is very difficult to assess whether a clinical NER model has overfitted the data or whether it is driven by undetected biases. In this paper, we present the results of the first-ever cross-institution evaluation of a Swedish de-identification system on Swedish clinical data. Alongside the encouraging results, we present a discussion about differences and similarities across EHR naming conventions and NER tagsets.

    Read more about Cross-Clinic De-Identification of Swedish Electronic Health Records: Nuances and Caveats
  • Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

    2021. Thomas Vakili, Hercules Dalianis. Proceedings of the AAAI 2021 Fall Symposium on Human Partnership with Medical AI

    Conference

    Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine whether there is a correlation between the degree of privacy leaked and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small. A simplified sketch of sampling sentences from a masked language model in this way is included after the publication list below.

    Read more about Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations
  • Evaluation of LIME and SHAP in Explaining Automatic ICD-10 Classifications of Swedish Gastrointestinal Discharge Summaries

    2022. Alexander Dolk (et al.). Proceedings of the 18th Scandinavian Conference on Health Informatics, 166-173

    Conference

    A computer-assisted coding tool could alleviate the burden on medical staff of assigning ICD diagnosis codes to discharge summaries by utilising deep learning models to generate recommendations. However, the opaque nature of deep learning models makes it hard for humans to trust them. In this study, the explainable AI methods LIME and SHAP have been applied to the clinical language model SweDeClin-BERT to explain ICD-10 codes assigned to Swedish gastrointestinal discharge summaries. The explanations have been evaluated by eight medical experts, showing a statistically significant difference in explanation quality in favour of SHAP compared to LIME.

    Read more about Evaluation of LIME and SHAP in Explaining Automatic ICD-10 Classifications of Swedish Gastrointestinal Discharge Summaries
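
The membership inference attack evaluated in the NoDaLiDa 2023 paper above is not reproduced here, but a minimal sketch can make the idea concrete: a common baseline scores each candidate text with the model's pseudo-perplexity and flags texts the model finds unusually "easy" as suspected training members. The model name and threshold below are placeholders, and a real evaluation calibrates the threshold and measures attack success on known members and non-members.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder model; the clinical BERT models discussed above are not public.
MODEL_NAME = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def pseudo_perplexity(text: str, max_tokens: int = 128) -> float:
    """Mask each token in turn and exponentiate the average negative log-likelihood."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_tokens)
    input_ids = enc["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return float(torch.tensor(nlls).mean().exp())


def is_suspected_member(text: str, threshold: float = 5.0) -> bool:
    """Flag low-perplexity texts as suspected training members.
    The threshold is arbitrary here; real attacks calibrate it on held-out data."""
    return pseudo_perplexity(text) < threshold
```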
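
The de-identification systems used in the LREC 2022 and BioNLP 2022 papers above are likewise not reproduced here. The sketch below only illustrates the general shape of pseudonymization: detect PHI spans and replace each with a surrogate of the same category. The regex-based detect_phi and the surrogate lists are toy stand-ins for a trained clinical NER model and a realistic surrogate generator.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class PhiSpan:
    start: int   # character offset where the PHI begins
    end: int     # character offset one past the end of the PHI
    label: str   # e.g. "NAME" or "DATE"


# Toy stand-in for a trained de-identification NER model.
NAME_PATTERN = re.compile(r"\b(Karl Karlsson|Maria Svensson)\b")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def detect_phi(text: str) -> list[PhiSpan]:
    spans = [PhiSpan(m.start(), m.end(), "NAME") for m in NAME_PATTERN.finditer(text)]
    spans += [PhiSpan(m.start(), m.end(), "DATE") for m in DATE_PATTERN.finditer(text)]
    return sorted(spans, key=lambda s: s.start)


# Toy surrogate generators; a real system samples realistic, internally
# consistent surrogates for each PHI category.
SURROGATES = {
    "NAME": lambda: random.choice(["Anna Andersson", "Erik Eriksson"]),
    "DATE": lambda: "2020-01-01",
}


def pseudonymize(text: str) -> str:
    """Replace each detected PHI span with a surrogate of the same category."""
    result, cursor = [], 0
    for span in detect_phi(text):
        result.append(text[cursor:span.start])
        result.append(SURROGATES.get(span.label, lambda: "[REMOVED]")())
        cursor = span.end
    result.append(text[cursor:])
    return "".join(result)


print(pseudonymize("Karl Karlsson was admitted on 2019-03-14."))
```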
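
Finally, the AAAI Fall Symposium paper above studies sentences sampled from a clinical BERT model with top-k and nucleus sampling. The sketch below fills a fully masked sequence left to right using top-k sampling, which is only a simplified version of generating text from a masked language model; the paper's actual generation procedure and models may differ, and the model name is again a public placeholder.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder model; the clinical model studied in the paper is not distributed.
MODEL_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def generate_from_mlm(length: int = 20, top_k: int = 40) -> str:
    """Fill a fully masked sequence left to right, sampling each position
    from the top-k tokens of the model's predicted distribution."""
    ids = torch.tensor([[tokenizer.cls_token_id]
                        + [tokenizer.mask_token_id] * length
                        + [tokenizer.sep_token_id]])
    with torch.no_grad():
        for pos in range(1, length + 1):
            logits = model(ids).logits[0, pos]
            topk = torch.topk(logits, top_k)
            probs = torch.softmax(topk.values, dim=-1)
            choice = topk.indices[torch.multinomial(probs, 1).item()].item()
            ids[0, pos] = choice
    return tokenizer.decode(ids[0], skip_special_tokens=True)


print(generate_from_mlm())
```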

Show all publications by Thomas Vakili at Stockholm University