Thomas Vakili
PhD student
I am a PhD student at the Department of Computer and Systems Sciences where I am supervised by Professor Hercules Dalianis. My research is about the intersection of language technology and privacy.
The natural language processing field has seen great advances through the introduction of pre-trained language models, like BERT. At DSV, we have successfully applied these language models for medical applications by training on large amounts of electronic health record data.
One of the main reasons for the success of these language models is that they are very large and trained on enormous corpora. However, this success comes with an important drawback: the models have a tendency to leak information about their training data. My research tackles this issue, and my goal is to find ways of creating models which preserve the privacy of the people in the training data.
You can read more about my research at my academic webpage.
A selection from Stockholm University publication database
Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations
2021. Thomas Vakili, Hercules Dalianis. Proceedings of the AAAI 2021 Fall Symposium on Human Partnership with Medical AI. Conference.
Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine if there is a correlation between the degree of privacy leaked and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.
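The abstract mentions generating sentences with top-k sampling and nucleus (top-p) sampling. As a minimal, illustrative sketch (not the paper's actual code), these two decoding strategies can be shown for a single next-token distribution, using made-up logit values:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logits vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def top_k_sample(logits, k, rng):
    """Sample a token id from only the k highest-probability tokens."""
    probs = softmax(logits)
    top = np.argsort(probs)[-k:]        # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()   # renormalize over the kept tokens
    return int(rng.choice(top, p=p))

def nucleus_sample(logits, p, rng):
    """Sample from the smallest token set whose cumulative probability >= p."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]     # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1
    kept = order[:cutoff]
    q = probs[kept] / probs[kept].sum() # renormalize over the nucleus
    return int(rng.choice(kept, p=q))

# Toy 5-token vocabulary with hypothetical logits
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
tok_k = top_k_sample(logits, k=2, rng=rng)     # drawn from tokens 0 or 1
tok_p = nucleus_sample(logits, p=0.9, rng=rng) # drawn from tokens 0, 1, or 2
```

Both methods truncate the tail of the distribution before sampling; top-k keeps a fixed number of candidates, while nucleus sampling keeps however many tokens are needed to cover probability mass p, which is why the two can trade off linguistic quality differently.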