Research project DataLEASH: LEarning And SHaring under Privacy Constraints
With massive amounts of personal data being generated, privacy has become a major challenge. This project studies how machine learning can be used to share language models without the risk of disclosing information that may identify individuals.
The recent confluence of digitalization, increasingly data-heavy technologies, advances in machine learning, and legal regulations has turned privacy into a great challenge.
While regulations such as the GDPR are a major step toward protecting society, there is a lack of guidelines and technical specifications stating what kind of privacy leakage is acceptable. Currently, this prevents data from being exploited and shared to the full extent possible.
To solve these problems, we need:
1. quantitative and legal privacy risk and utility assessments, and
2. mechanisms for data transformation and learning that improve the results of these assessments.
Thus, the DataLEASH project will develop and test the methods that will lead to more open data. Participants in this project are Stockholm University, KTH and RISE.
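As an illustration of what a quantitative privacy risk assessment can look like, the sketch below computes k-anonymity, a well-known measure that is used here only as an example and is not stated to be the measure DataLEASH uses. The function name `k_anonymity`, the field names, and the records are all made up for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all quasi-identifier fields.

    A higher k means each individual hides among more look-alike
    records. The field names are hypothetical.
    """
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Made-up example records, already generalized into age bands
# and truncated postcodes.
records = [
    {"age_band": "30-39", "postcode": "114", "diagnosis": "I10"},
    {"age_band": "30-39", "postcode": "114", "diagnosis": "E11"},
    {"age_band": "40-49", "postcode": "118", "diagnosis": "J45"},
    {"age_band": "40-49", "postcode": "118", "diagnosis": "I10"},
]

k_coarse = k_anonymity(records, ["age_band", "postcode"])
k_fine = k_anonymity(records, ["age_band", "postcode", "diagnosis"])
```

Treating more fields as quasi-identifiers can only keep k the same or lower it, which is one simple way to see the trade-off between utility and privacy risk.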
HB Deid is a tool that has been developed for de-identification of texts.
See how the HB Deid tool works
Former members of this project are Hanna Berg and Mila Grancharova.
Recipients of this project are Charlotte Dingertz, City of Stockholm; Sven-Åke Lööv, Region Stockholm; Henrik Löf, Karolinska University Hospital; Marina Santini, RISE; and Peter Lundberg, Linköping University Hospital.
This project is financed within KTH’s 2019 digitalization initiative for IT and mobile communication (ICT TNG), through the government’s strategic research areas (SFO), to create world-leading research.
In WP5, we will start with a problem formulation from existing projects where medical, municipal, and other data repositories have faced challenges with privacy, anonymization, pseudonymization, and similar issues.
We will perform a series of experiments on very large existing data sets from, e.g., the Stockholm county council (medical records), Elekta (medical imaging data), and the City of Stockholm (data from numerous systems that are linked to many areas of the city), to investigate the possibilities and challenges of the mechanisms developed within DataLEASH.
WP5 will also create demo applications that demonstrate the possibilities of DataLEASH mechanisms on different types of data. At Stockholm University, we have access to the research infrastructure Health Bank (the Swedish Health Record Research Bank), which contains over two million electronic patient records from Karolinska University Hospital from the years 2007–2014.
They are stored in a relational database with over 80 tables, where experiments can be carried out to study when and where anonymity can be preserved, for example by using privacy-preserving record linkage methods that go beyond regular pseudonymization. Experiments can be run, and data securely shared between partners, on the RISE ICE computer cluster.
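To make the baseline concrete, here is a minimal sketch of regular pseudonymization with a keyed hash (HMAC-SHA256), followed by linkage of two tables on the resulting pseudonyms. It is an illustrative assumption, not the method used in Health Bank or DataLEASH; the function names, field names, key, and identifiers are all made up.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash.

    Without the secret key, the pseudonym cannot be recomputed from
    a guessed identifier, unlike a plain unkeyed hash.
    """
    normalized = identifier.strip().replace("-", "")
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def pseudonymize_table(rows, key):
    """Replace the hypothetical 'id' field of each row with a 'pid'."""
    return [
        {"pid": pseudonymize(row["id"], key),
         **{k: v for k, v in row.items() if k != "id"}}
        for row in rows
    ]


def link(table_a, table_b):
    """Join two pseudonymized tables on their shared 'pid' field."""
    by_pid = {row["pid"]: row for row in table_a}
    return [
        {**by_pid[row["pid"]], **row}
        for row in table_b
        if row["pid"] in by_pid
    ]


# Made-up key and identifiers, for illustration only.
key = b"shared-secret-held-by-a-trusted-party"
diagnoses = pseudonymize_table(
    [{"id": "19121212-1212", "diagnosis": "I10"}], key)
visits = pseudonymize_table(
    [{"id": "191212121212", "visits": 3},
     {"id": "200001010000", "visits": 1}], key)

linked = link(diagnoses, visits)
```

Exact-match linkage of this kind only works when identifiers are recorded consistently after normalization, and a linked record can still be re-identified through its remaining fields, which is why methods beyond regular pseudonymization are of interest.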
Vakili, T., Lamproudis, A., Henriksson, A. and Dalianis, H. (2022)
“Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data”
Vakili, T. and Dalianis, H. (2021)
“Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations”
Lamproudis, A., Henriksson, A. and Dalianis, H. (2021)
“Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data”
Grancharova, M. and Dalianis, H. (2021)
“Applying and Sharing pre-trained BERT-models for Named Entity Recognition and Classification in Swedish Electronic Patient Records”
Berg, H., Henriksson, A., Fors, U. and Dalianis, H. (2021)
“De-identification of Clinical Text for Secondary Use: Research Issues”
Grancharova, M., Berg, H. and Dalianis, H. (2020)
“Improving Named Entity Recognition and Classification in Class Imbalanced Swedish Electronic Patient Records through Resampling”
Berg, H., Henriksson, A. and Dalianis, H. (2020)
“The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text”
Berg, H. and Dalianis, H. (2020)
“A Semi-supervised Approach for De-identification of Swedish Clinical Text”
Berg, H., Chomutare, T. and Dalianis, H. (2019)
“Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text”
Berg, H. and Dalianis, H. (2019)
“Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep learning”