Research project DataLEASH: LEarning And SHaring under Privacy Constraints

With massive amounts of personal data being generated, privacy has become a great challenge. This project studies how machine learning can be used to share language models without the risk of disclosing information that may identify individuals.

Theme

Contact person at SU

Hercules Dalianis

Professor

Department of Computer and Systems Sciences

08-16 16 16

hercules@dsv.su.se

Overview

Project period

01-05-2019 - 30-06-2025

Responsible

Department of Computer and Systems Sciences

Research subjects

AI and Data Science, Language Technology

Status

Completed

Research groups

Natural Language Processing Research Group

The Natural Language Processing Research Group develops, applies and evaluates NLP methods, in particular involving large language models, across various domains. We focus on topics such as privacy, explainability, and domain adaptation.

More information

Partners

KTH

RISE

The recent confluence of digitalization, increasingly data-heavy technologies, advances in machine learning, and legal regulations has turned privacy into a great challenge.

While regulations such as the GDPR are a major step toward protecting society, there is a lack of guidelines and technical specifications stating what kind of privacy leakage is acceptable. Currently, this prevents data from being exploited and shared to the full extent possible.

To solve these problems, we need:
1. quantitative and legal privacy risk and utility assessments, and
2. mechanisms for data transformation and learning that improve the results of these assessments.

Thus, the DataLEASH project will develop and test the methods that will lead to more open data. Participants in this project are Stockholm University, KTH and RISE.

HB Deid is a tool that has been developed for de-identification of texts.
See how the HB Deid tool works

Former members of this project are Hanna Berg, Mila Grancharova and Tasos Lamproudis.

Recipients of this project are Charlotte Dingertz (City of Stockholm), Sven-Åke Lööv (Region Stockholm), Henrik Löf (Karolinska University Hospital), Marina Santini (RISE), and Peter Lundberg (Linköping University Hospital).

In the continuation project, DataLEASH in Action, the stakeholders are Region Halland and the National Library of Sweden. Region Halland wants to de-identify and pseudonymise patient records, both to make them available for research and to build clinical language models.

This project is financed by KTH’s 2019 digitalization initiative for IT and mobile communication (ICT TNG), through the government’s strategic research areas (SFO), which aim to create world-leading research.

Project description

In WP5, we will start with a problem formulation from existing projects where medical, municipal, and other data repositories have faced challenges with privacy, anonymization, pseudonymization and similar issues.

We will perform a series of experiments on existing very large data sets from, e.g., the Stockholm County Council (medical records), Elekta (medical imaging data) and the City of Stockholm (data from numerous systems that are linked to many areas of the city), to investigate the possibilities and challenges of the mechanisms developed within DataLEASH.

WP5 will also create demo applications that demonstrate the possibilities of DataLEASH mechanisms on different types of data. At Stockholm University, we have access to the research infrastructure Health Bank (the Swedish Health Record Research Bank), which contains over two million electronic patient records from Karolinska University Hospital from the years 2007–2014.

The records are stored in a relational database with over 80 tables, where experiments can be carried out to study when and where anonymity can be preserved, for example by using different privacy-preserving record linkage methods that go beyond regular pseudonymization. Experiments can be run, and data securely shared between partners, on the RISE ICE computer cluster.

Project members

Project managers

Hercules Dalianis

Professor

Department of Computer and Systems Sciences

08-16 16 16

hercules@dsv.su.se

Uno Fors

Researcher

Department of Computer and Systems Sciences

08-674 74 79

uno@dsv.su.se

Members

Thomas Vakili

PhD Student

Department of Computer and Systems Sciences

08-16 16 59

thomas.vakili@dsv.su.se

Martin Hansson

Research Assistant (amanuens)

Department of Computer and Systems Sciences

martin.hansson@dsv.su.se

Publications

Dunstan, J., Vakili, T., Miranda, L., Villena, F., Aracena, C., Quiroga, T., et al. (2024)

“A Pseudonymized Corpus of Occupational Health Narratives for Clinical Entity Recognition in Spanish”

Vakili, T., Henriksson, A. and Dalianis, H. (2024)

“End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models”

Aracena, C., Miranda, L., Vakili, T., et al. (2024)

“Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks”

Vakili, T., Hullmann, T., Henriksson, A. and Dalianis, H. (2024)

“When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification”

Ngo, P., Tejedor, M., Olsen Svenning, T., Chomutare, T., Budrionis, A. and Dalianis, H. (2024)

“Deidentifying a Norwegian clinical corpus - An effort to create a privacy-preserving Norwegian large clinical language model”

Lamproudis, A., Olsen Svenning, T., Torsvik, T., Chomutare, T., Budrionis, A., et al. (2023)

“De-identifying Norwegian Clinical Text using Resources from Swedish and Danish”

Vakili, T. and Dalianis, H. (2023)

“Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data”

Vakili, T. and Dalianis, H. (2022)

“Utility Preservation of Clinical Text After De-Identification”

Vakili, T., Lamproudis, A., Henriksson, A. and Dalianis, H. (2022)

“Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data”

Vakili, T. and Dalianis, H. (2021)

“Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations”

Lamproudis, A., Henriksson, A. and Dalianis, H. (2021)

“Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data”

Grancharova, M. and Dalianis, H. (2021)

“Applying and Sharing pre-trained BERT-models for Named Entity Recognition and Classification in Swedish Electronic Patient Records”

Dalianis, H. and Berg, H. (2021)

“HB Deid – HB De-identification tool demonstrator”

Berg, H., Henriksson, A., Fors, U. and Dalianis, H. (2021)

“De-identification of Clinical Text for Secondary Use: Research Issues”

Grancharova, M., Berg, H. and Dalianis, H. (2020)

“Improving Named Entity Recognition and Classification in Class Imbalanced Swedish Electronic Patient Records through Resampling”

Berg, H., Henriksson, A. and Dalianis, H. (2020)

“The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text”

Berg, H. and Dalianis, H. (2019)

“Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep learning”

Dalianis, H. (2019)

“Pseudonymisation of Swedish Electronic Patient Records Using a Rule-based Approach”