Half-time seminar: Crina Tudor, PhD student at the Department of Linguistics, Stockholm University
Title: Adapting Large Language Models for Named Entity Recognition in Historical Sources.
Abstract
Named entity recognition (NER), the task of automatically identifying and labeling mentions of people, places, organizations, and other entities in text, is particularly difficult to apply to historical sources. Such texts often contain inconsistent spelling and errors introduced during digitization (OCR noise), and very little annotated data is available for training. Recent work has explored the use of large generative language models in a zero-shot setting (i.e. without task-specific training), but these approaches have generally produced unsatisfactory results for historical NER, especially when dealing with noisy and multilingual archival material.
An alternative strategy is domain-adaptive pre-training (DAPT), where models are further trained on domain-specific data before being fine-tuned for a specific task. While DAPT has shown clear benefits in some low-resource contexts, its effectiveness under realistic archival conditions, where data varies widely in language, time period, and quality, has not been thoroughly examined.
In this study, we investigate how well DAPT supports NER in historical texts by comparing five BERT-based models that differ in size, multilingual capacity, and prior domain exposure. Importantly, we use raw historical newspaper data without extensive filtering, in order to better reflect the kinds of material researchers encounter in archives. We evaluate performance across five languages included in the HIPE-2022 shared task, using F1 scores as the evaluation metric.
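For readers unfamiliar with the metric, the following is a minimal sketch of how an entity-level F1 score can be computed from BIO-tagged sequences. It assumes strict matching (a predicted entity counts only if both its span and its type agree with the gold annotation); the abstract does not say which matching regime or scorer the study uses, and the function names `extract_entities` and `f1_score` are hypothetical, chosen for illustration only.

```python
from typing import List, Set, Tuple

# An entity is (start_token, end_token_exclusive, entity_type).
Entity = Tuple[int, int, str]


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from a BIO tag sequence such as ["B-PER", "I-PER", "O"]."""
    entities: Set[Entity] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Strict entity-level F1: span boundaries and type must both match."""
    gold_entities = extract_entities(gold)
    pred_entities = extract_entities(pred)
    true_positives = len(gold_entities & pred_entities)
    precision = true_positives / len(pred_entities) if pred_entities else 0.0
    recall = true_positives / len(gold_entities) if gold_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative tags only, not data from the study.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
score = f1_score(gold, pred)
```

In this toy example one of the two predicted entities is correct and one of the two gold entities is found, so precision and recall are both 0.5. Shared-task scorers typically aggregate counts over a whole corpus and may also report a more lenient, overlap-based variant.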
We also explore how well these models transfer to new languages by testing them on a historical Romanian NER dataset that was not included during training. This allows us to examine whether familiarity with historical language or broader multilingual coverage plays a more important role in enabling models to generalize across languages.
Last updated: 2026-05-04
Source: Department of Linguistics