Aron Henriksson
Senior Lecturer, Docent (Associate Professor)
About me
I co-lead a research group in natural language processing, where we develop, apply, and evaluate NLP methods, primarily involving large language models. We focus on topics such as privacy, explainability, and domain adaptation, and we are interested in exploring new applications of large language models in areas such as healthcare and education.
Teaching
In addition to thesis supervision, I mainly teach courses in AI, natural language processing, information retrieval, and big data:
- BIGDATA: Big Data with NoSQL Databases (course coordinator)
- MAIO: Managing AI in the Organization (course coordinator)
- NLP: Natural Language Processing
- ISBI: Internet Search Techniques and Business Intelligence
Research projects
Publications
A selection from the Stockholm University publication database
-
Evaluating the Reliability of Self-Explanations in Large Language Models
2025. Korbinian Robert Randl (et al.). Discovery Science, 36-51
Conference. This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations -- extractive and counterfactual -- using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective).
Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning.
We show that this gap can be bridged: prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
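As a rough illustration of counterfactual self-explanation prompting in the sense used above (not the paper's actual prompts or models), the sketch below first asks an instruction-tuned LLM for a label and then asks it for a minimal edit that would flip that label. The model name, prompt wording, and example text are illustrative assumptions.

```python
# Hedged sketch: classify a text with an LLM, then prompt it for a counterfactual
# self-explanation (a minimal rewrite that would flip its own prediction).
from transformers import pipeline

# Any small instruct-tuned model; this particular checkpoint is just an example.
generator = pipeline("text-generation", model="Qwen/Qwen2-1.5B-Instruct")

def generate(prompt: str) -> str:
    out = generator(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

text = "The staff were friendly but the room was freezing all night."
label = generate(
    f"Classify the following review as 'positive' or 'negative'. "
    f"Answer with a single word.\n\nReview: {text}\nLabel:"
)

# Counterfactual self-explanation: which minimal change would flip the label?
counterfactual = generate(
    f"You previously labelled this review as '{label}'.\n"
    f"Review: {text}\n"
    f"Rewrite the review with a minimal change so that you would assign the opposite label."
)
print(label, counterfactual, sep="\n")
```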
-
SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing
2025. Thomas Vakili, Martin Hansson, Aron Henriksson. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 767-775
Conference. The lack of benchmarks in certain domains and for certain languages makes it difficult to track progress regarding the state-of-the-art of NLP in those areas, potentially impeding progress in important, specialized domains. Here, we introduce the first Swedish benchmark for clinical NLP: SweClinEval. The first iteration of the benchmark consists of six clinical NLP tasks, encompassing both document-level classification and named entity recognition tasks, with real clinical data. We evaluate nine different encoder models, both Swedish and multilingual. The results show that domain-adapted models outperform generic models on sequence-level classification tasks, while certain larger generic models outperform the clinical models on named entity recognition tasks. We describe how the benchmark can be managed despite limited possibilities to share sensitive clinical data, and discuss plans for extending the benchmark in future iterations.
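The benchmark data cannot be distributed, but a minimal sketch of the kind of evaluation it runs, fine-tuning an encoder on a document-level classification task and reporting accuracy, might look like the following. The model name and the toy Swedish examples are stand-ins, not benchmark data.

```python
# Hedged sketch: fine-tune and evaluate an encoder on a toy sequence classification task.
import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "KB/bert-base-swedish-cased"  # one of several possible candidate encoders
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in data; the real benchmark uses sensitive clinical documents.
data = Dataset.from_dict({
    "text": ["Patienten mår bra.", "Misstänkt infektion, ny kontroll imorgon."],
    "label": [0, 1],
}).map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=64))

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=data,
    eval_dataset=data,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())
```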
-
CICLe: Conformal In-Context Learning for Large-Scale Multi-Class Food Risk Classification
2024. Korbinian Robert Randl (et al.). Findings of the Association for Computational Linguistics, 7695-7715
Conference. Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a tf-idf representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting.
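A rough sketch of the LLM-in-the-loop idea (not the released CICLe code): a cheap tf-idf + Logistic Regression classifier is calibrated with split conformal prediction, and only examples whose prediction set contains more than one candidate class are passed to an LLM restricted to those candidates. The prompt_llm helper is a hypothetical placeholder, and the texts are invented examples.

```python
# Hedged sketch: conformal prediction gates which examples are escalated to an LLM.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts, train_y = ["salmonella found in chicken", "glass shards in jam"], [0, 1]
calib_texts, calib_y = ["listeria in soft cheese", "metal fragments in cereal"], [0, 1]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_y)

# Split conformal calibration: nonconformity = 1 - probability of the true class.
probs = clf.predict_proba(calib_texts)
scores = 1.0 - probs[np.arange(len(calib_y)), calib_y]
alpha, n = 0.1, len(scores)
qhat = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n), method="higher")

def prompt_llm(text, candidates):
    # Hypothetical placeholder for an LLM call that chooses among the candidate labels.
    return candidates[0]

def predict(text: str) -> int:
    p = clf.predict_proba([text])[0]
    candidates = [c for c, pc in zip(clf.classes_, p) if 1.0 - pc <= qhat]
    if not candidates:
        candidates = list(clf.classes_)
    if len(candidates) == 1:        # base classifier is confident: no LLM call needed
        return candidates[0]
    return prompt_llm(text, candidates)

print(predict("undeclared peanuts in chocolate bar"))
```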
-
End-to-End Pseudonymization of Fine-Tuned Clinical BERT Models: Privacy Preservation with Maintained Data Utility
2024. Thomas Vakili, Aron Henriksson, Hercules Dalianis. BMC Medical Informatics and Decision Making
Article. Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models contain very large numbers of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when the models are applied in the clinical domain, where data are highly sensitive.
One privacy-preserving technique that aims to mitigate these problems is training data pseudonymization. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks.
This study evaluates the effects of end-to-end pseudonymization of clinical BERT models on predictive performance across five clinical NLP tasks, compared to pre-training and fine-tuning on unaltered sensitive data. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also show no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
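A minimal illustration of training-data pseudonymization as described above (not the study's pipeline): detected sensitive entities are replaced with realistic surrogates of the same type before the text is used for pre-training or fine-tuning. The entity spans and surrogate lists are hard-coded here; in practice they would come from an automatic de-identification model and curated surrogate pools.

```python
# Hedged sketch: replace sensitive entity spans with surrogates of the same type.
import random

SURROGATES = {
    "NAME": ["Karin Ek", "Johan Berg", "Fatima Ali"],
    "DATE": ["2019-03-12", "2021-11-02"],
    "LOCATION": ["Uppsala", "Malmö"],
}

def pseudonymize(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a random surrogate of the same label."""
    out, last = [], 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])
        out.append(random.choice(SURROGATES[label]))
        last = end
    out.append(text[last:])
    return "".join(out)

note = "Pat. Anna Svensson inkom 2022-05-01 från Lund med feber."
spans = [(5, 18, "NAME"), (25, 35, "DATE"), (41, 45, "LOCATION")]  # toy, hand-annotated
print(pseudonymize(note, spans))
```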
-
Selecting from Multiple Strategies Improves the Foreseeable Reasoning of Tool-Augmented Large Language Models
2024. Yongchao Wu, Aron Henriksson. Joint European Conference on Machine Learning and Knowledge Discovery in Databases
Conference. Large language models (LLMs) can be augmented by interacting with external tools and knowledge bases, allowing them to overcome some of their known limitations, such as not having access to up-to-date information or struggling to solve math problems, thereby going beyond the knowledge and capabilities obtained during pre-training. Recent prompting techniques have enabled tool-augmented LLMs to combine reasoning and action to solve complex problems with the help of tools. This is essential for allowing LLMs to strategically determine the timing and nature of tool-calling actions in order to enhance their decision-making process and improve their outputs. However, the reliance of current prompting techniques on a single reasoning path or their limited ability to adjust plans within that path can adversely impact the performance of tool-augmented LLMs. In this paper, we introduce a novel prompting method, whereby an LLM agent selects and executes one among multiple candidate strategies. We assess the effectiveness of our method on three question answering datasets, on which it outperforms state-of-the-art methods like ReWOO, while also being a competitive and more cost-efficient alternative to ReAct. We also investigate the impact of selecting a reasoning trajectory from different strategy pool sizes, further highlighting the risks in only considering a single strategy.
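A very rough sketch of the selection step (the paper's actual agents, prompts, and tools are not reproduced here): several candidate tool-use strategies are drafted, an LLM is asked to pick the most promising one, and only the selected plan is executed. The llm helper, the tools, and the fixed strategy pool are hypothetical stand-ins; plan execution itself is elided.

```python
# Hedged sketch: choose one strategy from a pool before acting with tools.
from typing import Callable

def llm(prompt: str) -> str:
    # Hypothetical placeholder for a chat-completion call; returns a strategy number.
    return "2"

TOOLS: dict[str, Callable[[str], str]] = {
    "Search": lambda q: f"(top search result for: {q})",
    "Calculator": lambda expr: str(eval(expr)),  # toy calculator, not for untrusted input
}

def solve(question: str) -> str:
    # 1. Draft several candidate strategies (fixed here; LLM-generated in practice).
    strategies = [
        "1. Search the question directly and answer from the snippet.",
        "2. Search for the two quantities separately, then use Calculator to combine them.",
        "3. Answer from parametric knowledge without any tool calls.",
    ]
    # 2. Ask the LLM to select one strategy from the pool.
    choice = llm(
        f"Question: {question}\nCandidate strategies:\n" + "\n".join(strategies) +
        "\nReply with the number of the most promising strategy."
    )
    selected = strategies[int(choice) - 1]
    # 3. Execute only the selected strategy (execution elided in this sketch).
    return f"Executing: {selected}"

print(solve("How many years passed between the founding of Lund and Uppsala universities?"))
```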
-
Supporting Teaching-to-the-Curriculum by Linking Diagnostic Tests to Curriculum Goals: Using Textbook Content as Context for Retrieval-Augmented Generation with Large Language Models
2024. Xiu Li (et al.). Artificial Intelligence in Education, 118-132
Conference. Using AI to automatically link exercises to curriculum goals can support many educational use cases and facilitate teaching-to-the-curriculum by ensuring that exercises adequately reflect and encompass the curriculum goals, ultimately enabling curriculum-based assessment. Here, we introduce this novel task and create a manually labeled dataset in which two types of diagnostic tests are linked to curriculum goals for Biology G7-9 in Sweden. We cast the problem both as an information retrieval task and as a multi-class text classification task, and explore unsupervised approaches to both, as labeled data for such tasks is typically scarce. For the information retrieval task, we employ the state-of-the-art embedding model ADA-002 for semantic textual similarity (STS), while for classification we prompt a large language model, ChatGPT, to classify diagnostic tests into curriculum goals. For both task formulations, we investigate different ways of using textbook content as a pivot to provide additional context for linking diagnostic tests to curriculum goals. We show that a combination of the two approaches in a retrieval-augmented generation model, whereby STS is used to retrieve textbook content as context for ChatGPT, which then performs zero-shot classification, leads to the best classification accuracy (73.5%), outperforming both STS-based classification (67.5%) and LLM-based classification without context (71.5%). Finally, we showcase how the proposed method could be used in pedagogical practice.
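A sketch of the retrieval-augmented classification setup described above (not the authors' code): textbook passages are embedded, the passage most similar to a diagnostic test item is retrieved by cosine similarity, and a chat model is asked to assign a curriculum goal given that passage as context. Model names, prompt wording, and the toy goals and passages are illustrative assumptions.

```python
# Hedged sketch: STS retrieval of textbook context, then zero-shot classification.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

goals = ["Evolution and natural selection", "Cell structure and function", "Ecosystems and energy flow"]
textbook = ["Natural selection favours traits that improve survival...",
            "The mitochondrion supplies the cell with energy...",
            "Producers, consumers and decomposers form food webs..."]
item = "Explain why antibiotic resistance spreads in bacterial populations."

# STS retrieval: most similar textbook passage to the test item.
passage_vecs, item_vec = embed(textbook), embed([item])[0]
sims = passage_vecs @ item_vec / (np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(item_vec))
context = textbook[int(np.argmax(sims))]

# Zero-shot classification with the retrieved passage as additional context.
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content":
        f"Textbook context: {context}\n\nDiagnostic test item: {item}\n"
        f"Which curriculum goal does the item belong to? Options: {goals}. Answer with one option."}],
)
print(answer.choices[0].message.content)
```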
-
Multimodal fine-tuning of clinical language models for predicting COVID-19 outcomes
2023. Aron Henriksson (et al.). Artificial Intelligence in Medicine, 146
Article. Clinical prediction models tend to incorporate only structured healthcare data, ignoring information recorded in other data modalities, including free-text clinical notes. Here, we demonstrate how multimodal models that effectively leverage both structured and unstructured data can be developed for predicting COVID-19 outcomes. The models are trained end-to-end using a technique we refer to as multimodal fine-tuning, whereby a pre-trained language model is updated based on both structured and unstructured data. The multimodal models are trained and evaluated using a multicenter cohort of COVID-19 patients encompassing all encounters at the emergency departments of six hospitals. Experimental results show that multimodal models, leveraging the notion of multimodal fine-tuning and trained to predict (i) 30-day mortality, (ii) safe discharge and (iii) readmission, outperform unimodal models trained using only structured or unstructured healthcare data on all three outcomes. Sensitivity analyses are performed to better understand how well the multimodal models perform on different patient groups, while an ablation study is conducted to investigate the impact of different types of clinical notes on model performance. We argue that multimodal models that make effective use of routinely collected healthcare data to predict COVID-19 outcomes may facilitate patient management and contribute to the effective use of limited healthcare resources.
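A minimal sketch of multimodal fine-tuning in the sense used above (not the study's implementation): a pre-trained language model encodes the clinical notes, its [CLS] vector is concatenated with structured features, and the joint classifier is trained end-to-end so that gradients also update the language model. The model name, feature set, and labels are illustrative assumptions.

```python
# Hedged sketch: joint text + structured-data classifier trained end-to-end.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultimodalClassifier(nn.Module):
    def __init__(self, lm_name: str, n_structured: int, n_classes: int = 2):
        super().__init__()
        self.lm = AutoModel.from_pretrained(lm_name)  # also updated during training
        hidden = self.lm.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden + n_structured, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, input_ids, attention_mask, structured):
        cls = self.lm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(torch.cat([cls, structured], dim=-1))

tok = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = MultimodalClassifier("KB/bert-base-swedish-cased", n_structured=4)

notes = ["Pat. med andningssvårigheter och feber.", "Stabil, planeras för hemgång."]
enc = tok(notes, padding=True, truncation=True, return_tensors="pt")
vitals = torch.tensor([[38.9, 24.0, 91.0, 110.0], [36.8, 14.0, 98.0, 72.0]])  # temp, RR, SpO2, HR
labels = torch.tensor([1, 0])  # e.g. 30-day mortality

loss = nn.functional.cross_entropy(model(enc["input_ids"], enc["attention_mask"], vitals), labels)
loss.backward()  # one multimodal fine-tuning step (optimizer step omitted)
```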
-
Towards Improving the Reliability and Transparency of ChatGPT for Educational Question Answering
2023. Yongchao Wu (et al.). LNCS Springer Conference Proceedings
Conference. Large language models (LLMs), such as ChatGPT, have shown remarkable performance on various natural language processing (NLP) tasks, including educational question answering (EQA). However, LLMs generate text entirely based on knowledge obtained during pre-training, which means they struggle with recent information or domain-specific knowledge bases. Moreover, only providing answers to questions posed to LLMs, without any grounding materials, makes it difficult for students to judge their validity.
We therefore propose a method for integrating information retrieval systems with LLMs when developing EQA systems, which, in addition to improving EQA performance, grounds the answers in the educational context. Our experiments show that the proposed system outperforms vanilla ChatGPT by large margins of 110.9%, 67.8%, and 43.3% in BLEU, ROUGE, and METEOR scores, respectively. In addition, we argue that the use of the retrieved educational context enhances the transparency and reliability of the EQA process, making it easier to determine the correctness of the answers.
-
Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data
2022. Thomas Vakili (et al.). Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 4245-4252
Conference. Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
-
Evaluating Pretraining Strategies for Clinical BERT Models
2022. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 410-416
Conference. Research suggests that using generic language models in specialized domains may be sub-optimal due to significant domain differences. As a result, various strategies for developing domain-specific language models have been proposed, including techniques for adapting an existing generic language model to the target domain, e.g. through various forms of vocabulary modifications and continued domain-adaptive pretraining with in-domain data. Here, an empirical investigation is carried out in which various strategies for adapting a generic language model to the clinical domain are compared to pretraining a pure clinical language model. Three clinical language models for Swedish, pretrained for up to ten epochs, are fine-tuned and evaluated on several downstream tasks in the clinical domain. A comparison of the language models’ downstream performance over the training epochs is conducted. The results show that the domain-specific language models outperform a general-domain language model, although there is little difference in performance between the various clinical language models. However, compared to pretraining a pure clinical language model with only in-domain data, leveraging and adapting an existing general-domain language model requires fewer epochs of pretraining with in-domain data.
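A sketch of continued domain-adaptive pretraining (not the study's exact setup): a generic Swedish BERT is further pretrained with masked language modelling on in-domain text before being fine-tuned on downstream tasks. The two-sentence corpus below is a stand-in, since the real clinical corpus cannot be shared.

```python
# Hedged sketch: continued masked-language-model pretraining on in-domain text.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "KB/bert-base-swedish-cased"  # example generic Swedish model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

corpus = Dataset.from_dict({"text": [
    "Patienten erhöll antibiotika intravenöst under vårdtiden.",
    "EKG utan anmärkning, trop negativt, hemgång planeras.",
]}).map(lambda x: tokenizer(x["text"], truncation=True, max_length=128), remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-bert", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("clinical-bert")  # then fine-tune this checkpoint on downstream tasks
```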
-
Leveraging Clinical BERT in Multimodal Mortality Prediction Models for COVID-19
2022. Yash Pawar (et al.). Proceedings of the IEEE International Symposium on Computer-Based Medical Systems (CBMS 2022), 199-204
Conference. Clinical prediction models are often based solely on the use of structured data in electronic health records, e.g. vital parameters and laboratory results, effectively ignoring potentially valuable information recorded in other modalities, such as free-text clinical notes. Here, we report on the development of a multimodal model that combines structured and unstructured data. In particular, we study how best to make use of a clinical language model in a multimodal setup for predicting 30-day all-cause mortality upon hospital admission in patients with COVID-19. We evaluate three strategies for incorporating a domain-specific clinical BERT model in multimodal prediction systems: (i) without fine-tuning, (ii) with unimodal fine-tuning, and (iii) with multimodal fine-tuning. The best-performing model leverages multimodal fine-tuning, in which the clinical BERT model is updated based also on the structured data. This multimodal mortality prediction model is shown to outperform unimodal models that are based on using either only structured data or only unstructured data. The experimental results indicate that clinical prediction models can be improved by including data in other modalities and that multimodal fine-tuning of a clinical language model is an effective strategy for incorporating information from clinical notes in multimodal prediction systems.
-
Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data
2021. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 790-797
Conference. The use of pretrained language models, fine-tuned to perform a specific downstream task, has become widespread in NLP. Using a generic language model in specialized domains may, however, be sub-optimal due to differences in language use and vocabulary. In this paper, it is investigated whether an existing, generic language model for Swedish can be improved for the clinical domain through continued pretraining with clinical text.
The generic and domain-specific language models are fine-tuned and evaluated on three representative clinical NLP tasks: (i) identifying protected health information, (ii) assigning ICD-10 diagnosis codes to discharge summaries, and (iii) sentence-level uncertainty prediction. The results show that continued pretraining on in-domain data leads to improved performance on all three downstream tasks, indicating that there is a potential added value of domain-specific language models for clinical NLP.
Show all publications by Aron Henriksson at Stockholm University