Language models that don’t leak – new method protects your data

For large language models such as ChatGPT to work, they must be trained on enormous amounts of data. If these systems are hacked, there is always a risk that sensitive information will leak. Thomas Vakili has developed methods that protect privacy while still making use of the many advantages of language models.

Patient data are sensitive and need to be protected. Photo: Suriyo Munkaew/Mostphotos.


Almost every day, we hear about hacker attacks that have led to data breaches. This is especially serious when sensitive personal data is exposed – such as patient records. Thomas Vakili, a researcher in language technology at the Department of Computer and Systems Sciences (DSV), has devoted his doctoral work to reducing that risk.

“The fields of language technology and AI have exploded in recent years. When I started my PhD in 2020, people were not familiar with the topic. I could barely explain it to my parents – and they’re quite patient,” Thomas Vakili says with a smile.

Today, the technology is part of everyday life, as are terms like “prompt” and “chatbot”. Large language models – or LLMs for short – had their major breakthrough in the winter of 2022, when OpenAI released ChatGPT, based on GPT-3.5. Since then, we have become used to turning to chatbots for guidance in personal and professional matters.

“LLM is a technical term, but today everyone uses it,” Vakili notes.

Privacy has been overlooked

The rapid adoption of large language models, and their obvious benefits, has meant that privacy issues have sometimes been pushed aside. Just a few years ago, privacy was almost a non-issue, according to Vakili.

“The risk of personal data leaking was seen as a niche problem. People said, ‘Does that really happen?’ Today we see that it happens regularly. There’s no doubt that models risk exposing sensitive data.”

Early on in his PhD work, it became clear to Vakili that data leaks were a real problem. He therefore chose to focus on how the risks could be reduced. Together with colleagues at DSV, he has developed methods to protect personal data in large datasets. The basic idea is quite intuitive:

“The best way to avoid exposing sensitive data is to never put them in the model in the first place.”

When training a language model, vast amounts of data are fed into it. Even if confidentiality rules are in place, there is always a risk that the model will “memorise” data it should not retain. This means that sensitive information may leak – for example if the system is compromised.
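To make the risk concrete: so-called training-data extraction attacks probe a model with a prefix that may have occurred in its training data and check whether the model completes the rest verbatim. The sketch below is a toy version of such a probe, not a method from the thesis; it uses the small, publicly available GPT-2 model via the Hugging Face transformers library, and the prefix is invented for illustration.

```python
# A toy training-data extraction probe (illustrative only, not Vakili's
# method): prompt a model with a prefix that may have occurred in its
# training data and inspect whether it completes the rest verbatim.
# "gpt2" is used here only because it is small and publicly available.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Invented prefix; a real probe would use text known to be in the corpus.
prefix = "Patient John Smith, born on"
completions = generator(
    prefix, max_new_tokens=20, do_sample=True, num_return_sequences=3
)

for c in completions:
    print(c["generated_text"])
# If the model has memorised a record beginning with this prefix, one of
# the sampled completions may reproduce the rest of it word for word.
```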

“There’s no doubt that large language models risk exposing sensitive data,” says Thomas Vakili. Photo: Åse Karlén.

Two million patient records

Vakili works, among other things, with sensitive healthcare data. The material comes from two million patient records from Karolinska University Hospital. It includes information such as age, gender, admission and discharge dates, diagnoses, prescribed medication, and laboratory data.

The DSV methods involve de-identifying the data before it is used to train a language model, so that diseases and treatments cannot be linked to specific individuals. The researchers also experiment with pseudonymisation, where names and other personal details are replaced with fictitious ones.

“My results show that pseudonymisation works well. Replacing one first name with another makes the data look natural. The meaning and structure are preserved.”
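As a rough illustration of the principle – not the actual DSV pipeline – the sketch below swaps detected names for fictitious ones and masks personal identity numbers. The name list, the regular expression, and the example note are all invented; in a real system, names would be found by a trained named-entity recogniser and surrogates drawn from large name inventories.

```python
import random
import re

# Invented surrogate names; a real system would draw from large,
# demographically matched name inventories.
SURROGATE_NAMES = ["Alex", "Kim", "Robin", "Sam"]

# Toy pattern for a Swedish personal identity number (YYMMDD-XXXX).
PNR_PATTERN = re.compile(r"\b\d{6}-\d{4}\b")

def pseudonymise(note: str, detected_names: list[str]) -> str:
    """Replace detected names and ID numbers with fictitious stand-ins.

    In practice, detected_names would come from a trained named-entity
    recogniser; here they are supplied by hand to keep the sketch small.
    """
    for name in detected_names:
        note = note.replace(name, random.choice(SURROGATE_NAMES))
    return PNR_PATTERN.sub("000000-0000", note)

note = "Patient Erik, 450312-1234, admitted with chest pain."
print(pseudonymise(note, detected_names=["Erik"]))
# e.g. "Patient Kim, 000000-0000, admitted with chest pain."
```

Because one name is replaced by another name, the sentence still reads naturally – which is exactly why pseudonymised text remains usable for training.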

Privacy isn’t just about technology or maths

A hospital must of course store all patient information in its internal medical record system. But if patient data are to be processed and analysed using AI models, it is safest to de-identify them first. This makes it possible to draw conclusions about which symptoms indicate certain diseases, which treatments work best, and which medications cause the fewest side effects – without identifying individual patients.

“Previous research trying to solve privacy problems has often been very technical. You practically needed a PhD in mathematics to understand it.”

De-identification is a simpler approach in that respect, Vakili says – even a layperson can understand the intention behind the method.

“I think there’s real value in being able to explain how a method protects your data. Privacy isn’t just about technology or maths. It’s also cultural, and closely linked to trust and a sense of security.”

Synthetic data can train the model

Vakili also studies how synthetic (artificial) data can be used to train language models so that they can process information from very different sources. This could include doctors’ clinical notes, X-ray images, and blood test results – a tough challenge even for AI.

Normally, enormous datasets and vast computing power are required to train models to handle different types of data. With synthetic datasets, training becomes both safer and cheaper.

“Our studies show that the patterns that machine-learning algorithms need to detect are strong enough even when using synthetic data.”

“We’ve shown that it’s possible to create large datasets with limited resources, without reducing performance. ChatGPT requires massive data centres, but our language models could potentially run on a standard gaming computer. That has important implications for accessibility,” Vakili says.
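A minimal sketch of the idea, assuming a set of synthetic notes has already been generated (every note and label below is invented): the downstream model is trained purely on machine-generated text, so no real patient record ever enters it.

```python
# Sketch: train a small diagnosis classifier on synthetic notes only.
# All notes and labels are invented for illustration; in practice the
# synthetic corpus would be generated by a language model trained on
# de-identified clinical text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

synthetic_notes = [
    "patient reports chest pain radiating to the left arm",
    "persistent dry cough and mild fever for three days",
    "crushing substernal pressure and shortness of breath",
    "sore throat and nasal congestion, no fever",
]
labels = ["cardiac", "respiratory", "cardiac", "respiratory"]

# The classifier only ever sees synthetic data during training.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(synthetic_notes, labels)

# At deployment, it can be applied to new (de-identified) notes.
print(model.predict(["sudden chest pain and breathlessness"]))
```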

Tech giants dominate

Vakili is referring to the fact that global AI development today is concentrated in a small number of US-based companies, such as Google, Microsoft, Meta, and OpenAI. And privacy does not appear to be a top priority for them.

“From my perspective, it would be fairly easy for a company like OpenAI to de-identify data before training their language models. They employ plenty of very smart people. Instead, they focus on making the model behave ‘nicely’ – for example by not revealing personal data. Unfortunately, there are probably always ways to get around that.”

“The methods we develop ensure that sensitive data is never fed into the model. Then it simply isn’t possible to trick the model into revealing personal information,” says Thomas Vakili.

It would be fairly easy for a company like OpenAI to de-identify data

According to Vakili, the Swedish public sector is relatively advanced in this area. In the private sector, however, many organisations still need to take action.

“For companies, it’s a competitive advantage to show that you protect customer data and actively work to reduce the risk of incidents. When I talk about my research at conferences, many business leaders become interested. They’re worried about how to use large datasets safely,” Vakili says.

More about Thomas’s research

Thomas Vakili defended his doctoral thesis at the Department of Computer and Systems Sciences (DSV), Stockholm University, on January 13, 2026.

The thesis is titled “Preserving the Privacy of Language Models: Experiments in Clinical NLP”.

It consists of six scientific papers, with an additional ten listed as related publications.

The thesis can be downloaded from DiVA


Martin Krallinger of the Barcelona Supercomputing Center (BSC) in Spain was the external reviewer.


Hercules Dalianis, DSV, was the main supervisor, and Aron Henriksson, DSV, the supervisor.

Contact Thomas Vakili
Contact Hercules Dalianis
Contact Aron Henriksson


Read more about research and education at DSV

This article is also available in a Swedish version

Last updated: 2026-03-16

Source: Department of Computer and Systems Sciences