Thesis defence: Thomas Vakili

THESIS DEFENCE
Date: Tuesday 13 January 2026
Time: 13:30 - 17:30
Location: Room Lilla hörsalen, DSV, Borgarfjordsgatan 12, Kista

Welcome to a thesis defence at DSV! In his PhD thesis, Thomas Vakili studies how large language models can be trained without leaking private information from the data used to train them.

On January 13, 2026, Thomas Vakili will present his PhD thesis at the Department of Computer and Systems Sciences (DSV), Stockholm University. The thesis defence takes place in Lilla hörsalen in Nodhuset, Borgarfjordsgatan 12, Kista, starting at 1:30 pm.

The title of the thesis is “Preserving the Privacy of Language Models: Experiments in Clinical NLP”.

PhD student: Thomas Vakili, DSV
External reviewer: Martin Krallinger, Barcelona Supercomputing Center (BSC), Spain
Main supervisor: Hercules Dalianis, DSV
Supervisor: Aron Henriksson, DSV

The thesis can be downloaded from DiVA

Contact Thomas Vakili

Find your way to DSV

Abstract

State-of-the-art methods in natural language processing (NLP) increasingly rely on large pre-trained language models. The strength of these models stems from their large number of parameters and the enormous amounts of data used to train them. The datasets are of a scale that makes it difficult, if not impossible, to audit them manually. When unwieldy amounts of potentially sensitive data are used to train large models, an important problem arises: unwelcome memorization of the training data.

All datasets, including those based on publicly available data, can contain personally identifiable information (PII). When models memorize sensitive data, they become vulnerable to privacy attacks. Very few datasets for NLP can be guaranteed to be free of sensitive data. Consequently, most NLP models are susceptible to privacy leakage. This susceptibility is especially concerning in clinical NLP, where the data typically consist of electronic health records (EHRs). Leaking data from EHRs is never acceptable from a privacy perspective. This doctoral thesis investigates the privacy risks of using sensitive data and how these risks can be mitigated while maintaining the data's utility for training.

A BERT model pre-trained on clinical data is subjected to a training data extraction attack. The same model is used to evaluate a membership inference attack that has been proposed for quantifying the privacy risks of masked language models. Multiple experiments assess the performance gains from adapting pre-trained models to the clinical domain. The impact of automatic de-identification on the performance of BERT models is then evaluated for both pre-training and fine-tuning data. The final experiments of the thesis explore how synthetic training corpora can be generated while limiting the use of sensitive data and operating under computational constraints. The quality of these corpora, and the factors affecting their utility, are explored by training and evaluating BERT models.
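To make the attack setting concrete, the sketch below shows one common way a membership inference signal can be computed against a masked language model such as BERT: score each text by its pseudo-log-likelihood (masking one token at a time) and flag samples the model scores unusually well as likely training members. This is a minimal illustration under stated assumptions, not the specific attack evaluated in the thesis; the model name and the threshold-based decision rule are placeholders.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Illustrative model name; a clinical BERT would be the target in practice.
    MODEL = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForMaskedLM.from_pretrained(MODEL)
    model.eval()

    def pseudo_log_likelihood(text: str) -> float:
        """Mask each token in turn and sum the log-probability the
        masked LM assigns to the true token at that position."""
        input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        total = 0.0
        # Skip the special [CLS] and [SEP] tokens at the ends.
        for i in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[input_ids[i]].item()
        return total

    def looks_like_training_member(text: str, threshold: float) -> bool:
        """Hypothetical decision rule: texts with a high average
        pseudo-log-likelihood are flagged as probable training data."""
        n_tokens = len(tokenizer(text)["input_ids"]) - 2
        avg_pll = pseudo_log_likelihood(text) / max(n_tokens, 1)
        return avg_pll > threshold

Stronger attacks of this family typically calibrate the score against a reference model rather than a fixed threshold, which is one reason quantifying privacy risk is harder than this sketch suggests.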

The results show that domain adaptation leads to significantly better performance on clinical NLP tasks. They also show that extracting training data from BERT models is difficult and suggest that the risks can be further decreased by automatically de-identifying the training data. Automatic de-identification is found to preserve the utility of the data used for pre-training and fine-tuning BERT models. However, we also find that contemporary membership inference attacks are unable to quantify the privacy benefits of this technique. Similarly, high-quality synthetic corpora can be generated using limited resources, but further research is needed to determine the privacy gains from using them. The results show that automatic de-identification and training data synthesis reduce the privacy risks of using sensitive data for NLP while preserving the utility of the data. However, these benefits are difficult to quantify, and there are no rigorous methods for comparing different privacy-preserving techniques.
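For readers unfamiliar with automatic de-identification, the sketch below shows the basic idea: a named entity recognition (NER) model detects PII spans, which are then replaced with category placeholders. The NER model named here is a generic, publicly available English one chosen purely for illustration; the thesis concerns clinical text, where domain-specific de-identification tools and entity categories (e.g., dates, phone numbers) are needed.

    from transformers import pipeline

    # Illustrative general-purpose NER model, not the system used in the thesis.
    ner = pipeline("ner", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")

    def deidentify(text: str) -> str:
        """Replace detected entity spans with category placeholders,
        working right-to-left so character offsets stay valid."""
        entities = sorted(ner(text), key=lambda e: e["start"], reverse=True)
        for ent in entities:
            placeholder = f"[{ent['entity_group']}]"  # e.g. [PER], [LOC]
            text = text[: ent["start"]] + placeholder + text[ent["end"]:]
        return text

    print(deidentify("John Smith was admitted to St Mary's Hospital."))
    # Expected output along the lines of:
    # "[PER] was admitted to [ORG]."

A variant of this approach, pseudonymization, replaces each span with a realistic surrogate (a made-up name or date) instead of a tag, which tends to be gentler on downstream model training.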

Keywords: natural language processing, privacy, membership inference, training data extraction, automatic de-identification, synthetic data, named entity recognition, domain adaptation, large language models

Last updated: 2026-01-07

Source: Department of Computer and Systems Sciences