Stockholm University

Research seminar: Raphaela Heil

Seminar

Date: Tuesday 14 November 2023

Time: 12.00

Location: Department of Linguistics, Room C315

A talk on handwritten text recognition applied to Astrid Lindgren's stenographic manuscripts, with data provided by the Swedish Institute for Children's Books.

About the speaker

Raphaela Heil is a recent PhD graduate from Uppsala University. Her thesis focused on handwritten text recognition (HTR) applied to Astrid Lindgren's stenographic manuscripts, with data provided by the Swedish Institute for Children's Books. Raphaela is currently working on HTR for Swedish labour union documents at Folkrörelsearkivet in Uppsala, as part of the Labour's Memory project, but her talk will centre on her thesis project.

Abstract

Document image processing and handwritten text recognition have been applied to a variety of materials, scripts, and languages, both modern and historic. They are crucial building blocks in the ongoing digitisation efforts of archives, where they aid in preserving archival materials and foster knowledge sharing. The latter is especially facilitated by making document contents available to interested readers who have little or no practice in, for example, reading a specific script type, and who might therefore face challenges in accessing the material.

The first part of this dissertation focuses on reducing editorial artifacts, specifically in the form of struck-through words, in manuscripts. The main goal of this process is to identify struck-through words and remove as much of the strike-through artifacts as possible in order to regain access to the original word. This step can serve as preprocessing both for human annotators and readers and for computerised pipelines, such as handwritten text recognition. Two deep learning-based approaches, exploring paired and unpaired data settings, are examined and compared. Furthermore, an approach for generating synthetic strike-through data, for example for training and testing purposes, and three novel datasets are presented.

The second part of this dissertation is centered around applying handwritten text recognition to the stenographic manuscripts of Swedish children's book author Astrid Lindgren (1907–2002). Manually transliterating stenography, also known as shorthand, requires special domain knowledge of the script itself. Therefore, the main focus of this part is to reduce the required manual work, aiming to increase the accessibility of the material. In this regard, a baseline for handwritten text recognition of Swedish stenography is established. Two approaches for improving upon this baseline are examined. Firstly, a variety of data augmentation techniques commonly used in handwritten text recognition are studied. Secondly, different target sequence encoding methods, which aim to approximate diplomatic transcriptions, are investigated. The latter, in combination with a pre-training approach, significantly improves the recognition performance. In addition to the two presented studies, the novel LION dataset is published, consisting of excerpts from Astrid Lindgren's stenographic manuscripts.

Find out more about Raphaela's thesis project here