Beata Megyesi, Professor

About me

I am a professor of computational linguistics and work on the automatic analysis of texts. I am particularly interested in interdisciplinary research that enables quantitative studies using artificial intelligence in the humanities and social sciences. I currently work on historical cryptology, analyzing and solving secretly encoded documents, known as ciphers.

I have held various academic positions:

Chair of the Swedish Research Council's review panel for linguistics (HS-J), 2024-2025, and panel member since 2021;
Board member of the National Graduate School in Digital Philology (DigPhil), Sweden, 2023-;
Board member of the Centre for Digital Humanities, Uppsala University, Sweden, 2020-2023;
President of the Northern European Association for Language Technology (NEALT), 2020-2021, and Vice President 2018-2019;
Head of the Department of Linguistics and Philology, Uppsala University, 2009-2018;
Director of the English Park Campus, Uppsala University, 2017-2018.

You can find more information about my research and teaching below.

Teaching

I regularly teach at the undergraduate and advanced levels, mainly in computational linguistics. I am also responsible for the international master's program in AI and Language. I am currently the main supervisor of two doctoral students and co-supervisor of one.

Over the years I have taught at three universities: Stockholm University (SU), Uppsala University (UU) and KTH Royal Institute of Technology (KTH). I have been responsible for various courses in computational linguistics and general linguistics, from the introductory to the advanced level. I have also contributed to doctoral education.

Undergraduate courses:

Corpus Linguistics, 7.5 ECTS: 2023-2025 (SU)
Supervision of bachelor's theses, 15 ECTS: 2000- (SU, KTH, UU)
Language, Computers and Text Processing (in Swedish), 7.5 ECTS: 2012-2020 (UU)
Introduction to Language Technology, 7.5 ECTS: 2015 (UU)
Supervisor for language technology projects, 7.5 ECTS: 2011-2016 (UU)
Techniques for Large-Scale Analysis (parts): 2009 (UU)
Corpus Linguistics, 7.5 ECTS: 2005-2007 (UU)
Computational Grammar II, 7.5 ECTS: 2004 (UU)

Advanced-level courses:

Project Course in AI and Language, 15 ECTS: 2025 (SU)
The Structure of Language, 7.5 ECTS: 2024-2025 (SU)
Digital Philology, 7.5 ECTS: 2018-2024 (UU)
Corpus-Based Methods, 7.5 ECTS: 2023 (SU)
Supervision of master's theses, 30 ECTS: 2004-2023 (UU)
Research and Development, 15 ECTS: 2021 (UU)
Computer-Based Tools for Research in the Humanities, 7.5 ECTS: 2007-2013 (UU)
Degree Project in Language Technology, 30 ECTS: 2005-2007 (UU)
Advanced Course in Corpus Linguistics, 7.5 ECTS: 2005 (UU)
Supervisor for language technology projects, 7.5 ECTS: 2011-2016 (UU)

Doctoral education:

(Acting) Director of the National Graduate School in Digital Philology (DigPhil), 2025-2026
I am the main supervisor of Micaella Bruton (SU) and Crina Tudor (SU), and co-supervisor of Oreen Yousuf (UU)
I was assistant supervisor for Eva Petterson and Mojgan Seraji (UU)
Course in Digital Philology II, 7.5 ECTS: 2025 (SU/UU)
Course in Digital Philology I, 7.5 ECTS: 2024 (UU)
Course in Natural Language Processing, the National Graduate School of Language Technology (GSLT), 2008
Course in Infrastructural Tools for the Study of Linguistic Variation: PhD course at the University of Oslo, June 2009

Research

I have always been very interested in how human language works and how it can be processed and analyzed by computers, partly to help us understand human language and communication, and partly so that computers can be useful in our daily lives.

My research today focuses on the automatic analysis of historical handwritten documents on the one hand, and large-scale grammatical analysis of texts for research in the humanities and social sciences on the other. I collaborate both nationally and internationally with researchers in Sweden, Norway, Spain, Germany, Hungary and the USA. My research has received more than SEK 100 million in external funding over the past 10 years, and my scientific work has resulted in more than 100 scientific articles published in international venues.

Some projects that I lead or have led, and/or have taken part in:

DESCRYPT: Echoes from the Past: Analysis and Decryption of Historical Writings: PI, Riksbankens Jubileumsfond, 2025-2032
DECRYPT: Decryption of Historical Manuscripts: Principal investigator, Swedish Research Council, 2018-2024
DECODE: Automatic Decoding of Historical Manuscripts: Principal investigator, Swedish Research Council, 2015-2017
HistoCrypt: Research network for historical cryptology, 2018-
HistCorp: A collection of historical corpora for 17 European languages, 2015-
SWEGRAM: Automatic annotation and analysis of Swedish texts: Principal investigator; part of the Swe-CLARIN project, Swedish Research Council, 2014-2024
SWeLL: Research infrastructure for Swedish as a second language: Co-applicant, Riksbankens Jubileumsfond (RJ), 2017-2019
Multilingual Parallel Corpora: Member, Swedish Research Council, 2006-2010
Methods and tools for automatic grammar extraction: Member, Swedish Research Council, 2005-2007
An infrastructure for Swedish language technology: Member, Swedish Research Council, 2007-2008

I also regularly give interviews about my research in the media, for example:

Vetenskapsradion Historia: Interview with me by Urban Björstadius, 8 February 2025
Forskning och Framsteg 2024/9: AI löser historiska krypton, by Lina Wennersten-Gradert
Nobel is calling: Så knäcker vi koderna. Nobel Prize Museum, 7 October 2024.
Datorlingvisten som knäcker historiska koder, 2024. Article and film by Stockholm University's media department
New Scientist: How scientists are cracking historical codes to reveal lost secrets
C'T Magazine: Die Kryptografen des Papstes
Populär historia: Historiska chiffer ska knäckas med algoritmer
Curie: Tillsammans knäcker forskarna historisk kod
Sveriges Radio P4 Uppland: Kryptologer från hela världen träffas på historisk konferens
New York Times: How Revolutionary Tools Cracked a 1700s Code and other publications in the press about the Copiale cipher

More information about my research can be found under Publications.

I have also served on numerous committees for doctoral theses and half-time reviews, regularly review for conferences and workshops, and have carried out many expert assessments for academic appointments in Sweden and abroad. I have also evaluated projects for the Swedish Research Council and the Wallenberg Foundation.

Research projects

Publications

Beáta Megyesi's publications per year and per type.

A selection from Stockholm University's publication database

Decipherment of Historical Manuscripts with Unknown or Rare Writings: The DESCRYPT Project

2025. Beáta Megyesi (et al.). In Proceedings of the 8th International Conference on Historical Cryptology (HistoCrypt 2025)

Conference

We present a newly funded research program, DESCRYPT, aimed at deciphering and analyzing historical texts with rare or unknown scripts. The project leverages advancements in computational linguistics, artificial intelligence (AI), and image processing, alongside traditional philological methods, to develop innovative tools for transcription, recognition, and interpretation of historical writings with rare/unknown scripts, including ciphertexts. By integrating interdisciplinary expertise, DESCRYPT addresses the challenges posed by complex and undeciphered texts, preserving and unlocking the secrets of our shared cultural heritage.

Historical Cryptology

2024. Beáta Megyesi (et al.). Learning and Experiencing Cryptography with CrypTool and SageMath

Chapter

Historical cryptology studies (original) encrypted manuscripts, often handwritten sources, produced in our history. These historical sources can be found in archives, often hidden without any indexing and therefore hard to locate. Once found they need to be digitized and turned into a machine-readable text format before they can be deciphered with computational methods. The focus of historical cryptology is not primarily the development of sophisticated algorithms for decipherment, but rather the entire process of analysis of the encrypted source from collection and digitization to transcription and decryption. The process also includes the interpretation and contextualization of the message set in its historical context. There are many challenges on the way, such as mistakes made by the scribe, errors made by the transcriber, damaged pages, handwriting styles that are difficult to interpret, historical languages from various time periods, and hidden underlying language of the message. Ciphertexts vary greatly in terms of their code system and symbol sets used with more or less distinguishable symbols. Ciphertexts can be embedded in clearly written text, or shorter or longer sequences of cleartext can be embedded in the ciphertext. The ciphers used mostly in historical times are substitutions (simple, homophonic, or polyphonic), with or without nomenclatures, encoded as digits or symbol sequences, with or without spaces. So the circumstances are different from those in modern cryptography which focuses on methods (algorithms) and their strengths and assumes that the algorithm is applied correctly. For both historical and modern cryptology, attack vectors outside the algorithm are applied like implementation flaws and side-channel attacks. In this chapter, we give an introduction to the field of historical cryptology and present an overview of how researchers today process historical encrypted sources.
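The cipher types named in the abstract above can be illustrated with a minimal sketch of a homophonic substitution with a small nomenclature, encoded as digit sequences. The key, the codes, and the function names below are all hypothetical and chosen only for illustration; they are not taken from any historical key or from the chapter itself.

```python
import random

# Hypothetical toy key: each plaintext letter maps to one or more two-digit
# codes (homophones). Frequent letters get more codes, which flattens the
# frequency distribution an attacker could otherwise exploit.
KEY = {
    "a": ["11", "27"],
    "e": ["03", "45", "62"],
    "t": ["19", "50"],
    "c": ["08"],
    "k": ["33"],
}

# Hypothetical nomenclature: whole words replaced by a single code.
NOMENCLATURE = {"king": "901"}


def encrypt(words, rng):
    """Encode words as a list of codes, using the nomenclature where possible."""
    codes = []
    for word in words:
        if word in NOMENCLATURE:
            codes.append(NOMENCLATURE[word])
        else:
            # Pick one of the available homophones at random for each letter.
            codes.extend(rng.choice(KEY[letter]) for letter in word)
    return codes


def decrypt(codes):
    """Map each code back to its letter or nomenclature word."""
    reverse = {code: letter for letter, homophones in KEY.items() for code in homophones}
    reverse.update({code: word for word, code in NOMENCLATURE.items()})
    return [reverse[code] for code in codes]


ciphertext = encrypt(["attack", "king"], random.Random(0))
print(" ".join(ciphertext))
print(decrypt(ciphertext))
```

Because every code belongs to exactly one letter or word, decryption is unambiguous even though encryption is randomized. A polyphonic cipher would break this property by letting one code stand for several plaintext units.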

From Statistics to Neural Networks: Enhancing Ciphertext-Plaintext Alignment in Historical Substitution Ciphers for Automatic Key Extraction

2025. Micaella Bruton, Beáta Megyesi. Proceedings of the 8th International Conference on Historical Cryptology (HistoCrypt 2025)

Conference

Ciphertext manuscripts found in archival collections are often intermingled with plaintext manuscripts in various languages, making the manual analysis required to match the documents labour-intensive and complex. Automating the alignment of these texts to reconstruct corresponding cipher keys is therefore highly beneficial, particularly when handling large volumes of documents. This study introduces a novel approach using modern neural networks, specifically Long Short-Term Memory (LSTM) architectures, to develop an automated method for aligning homophonic substitution ciphertexts with plaintext. These neural models are compared to traditional statistical approaches, demonstrating that LSTMs achieve significant accuracy improvements, including perfect alignment for ciphertexts of 50 characters or less. Additionally, to facilitate practical application, a program has been developed to enable the upload of transcribed ciphertext and plaintext documents, using the optimized models to automatically align the texts and extract the substitution key.
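The final step mentioned in the abstract above, extracting a substitution key once ciphertext and plaintext are aligned, can be sketched as follows. This is not the paper's LSTM or statistical method: it assumes the one-to-one alignment is already given, and the function name `extract_key` and the sample codes are hypothetical.

```python
from collections import defaultdict


def extract_key(cipher_codes, plaintext):
    """Build a homophonic key from two sequences aligned one-to-one.

    Returns a dict mapping each plaintext letter to the set of cipher codes
    observed for it. Raises ValueError if the alignment is inconsistent.
    """
    if len(cipher_codes) != len(plaintext):
        raise ValueError("sequences must be aligned one-to-one")

    code_to_letter = {}
    for code, letter in zip(cipher_codes, plaintext):
        # In a homophonic cipher a code stands for exactly one letter,
        # so a second, different letter signals a bad alignment.
        previous = code_to_letter.setdefault(code, letter)
        if previous != letter:
            raise ValueError(f"code {code!r} maps to both {previous!r} and {letter!r}")

    key = defaultdict(set)
    for code, letter in code_to_letter.items():
        key[letter].add(code)
    return dict(key)


print(extract_key(["11", "19", "50", "27"], "atta"))
```

The consistency check is what makes the sketch useful in practice: a misaligned pair of documents tends to surface quickly as a code that appears to stand for two different letters.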

ICDAR 2024 Competition on Handwriting Recognition of Historical Ciphers

2024. Alicia Fornés (et al.). Document Analysis and Recognition - ICDAR 2024

Conference

Handwritten Text Recognition (HTR) in low-resource scenarios (i.e. when the amount of labeled data is scarce) is a challenging problem. This is particularly true for historical encrypted manuscripts, commonly known as ciphers, which contain secret messages and were typically used in military or diplomatic correspondence, records of secret societies, or private letters. To hide their contents, the sender and receiver created their own secret method of writing. The cipher alphabets often include digits, Latin or Greek letters, Zodiac and alchemical signs, combined with various diacritics, as well as invented ones. The first step in the decryption process is the transcription of these manuscripts, which is difficult due to the great variation in handwriting styles and cipher alphabets with a limited number of pages. Although different strategies can be considered to deal with the insufficient amount of training data (e.g., few-shot learning, self-supervised learning), the performance of available HTR models is not yet satisfactory. Thus, the proposed competition, which includes ciphers with a large number of symbol sets and scribes, aims to boost research in HTR in low-resource scenarios.

An STS analysis of a digital humanities collaboration: trading zones, boundary objects, and interactional expertise in the DECRYPT project

2024. Benedek Láng, Beáta Megyesi. Humanities and Social Sciences Communications 11 (1)

Article

A widely shared recognition over the past decade is that the methodology and the basic concepts of science and technology studies (STS) can be used to analyze collaborations in the cross-disciplinary field of digital humanities (DH). The concepts of trading zones (Galison, 2010), boundary objects (Star and Griesemer, 1989), and interactional expertise (Collins and Evans, 2007) are particularly fruitful for describing projects in which researchers from massively different epistemic cultures (Knorr Cetina, 1999) are trying to develop a common language. The literature, however, primarily concentrates on examples where only two parties, historians and IT experts, work together. More exciting perspectives open up for analysis when more than two, more nuanced and different epistemic cultures seek a common language and common research goals. In the DECRYPT project funded by the Swedish Research Council, computational linguists, historians, computer scientists and AI experts, cryptologists, computer vision specialists, historical linguists, archivists, and philologists collaborate with strikingly different methodologies, publication patterns, and approaches. They develop and use common resources (including a database and a large collection of European historical texts) and tools (among others a code-breaking software, a hand-written text recognition tool for transcription), researching partly overlapping topics (handwritten historical ciphers and keys) to reach common goals. In this article, we aim to show how the STS concepts are illuminating when describing the mechanisms of the DECRYPT collaboration and shed some light on the best practices and challenges of a truly cross-disciplinary DH project.

Prompting the Past: Exploring Zero-Shot Learning for Named Entity Recognition in Historical Texts Using Prompt-Answering LLMs

2025. Crina Madalina Tudor, Beáta Megyesi, Robert Östling. Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL2025), 216-226

Conference

This paper investigates the application of prompt-answering Large Language Models (LLMs) for the task of Named Entity Recognition (NER) in historical texts. Historical NER presents unique challenges due to language change through time, spelling variation, limited availability of digitized data (and, in particular, labeled data), and errors introduced by Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) processes. Leveraging the zero-shot capabilities of prompt-answering LLMs, we address these challenges by prompting the model to extract entities such as persons, locations, organizations, and dates from historical documents. We then conduct an extensive error analysis of the model output in order to identify and address potential weaknesses in the entity recognition process. The results show that, while such models display ability for extracting named entities, their overall performance is lackluster. Our analysis reveals that model performance is significantly affected by hallucinations in the model output, as well as by challenges imposed by the evaluation of NER output.

DECODE2LOD: Connecting the DECODE Database with the Linked Open Data Cloud

2025. Cosimo Palma, Beáta Megyesi. Proceedings of the 8th International Conference on Historical Cryptology (HistoCrypt 2025)

Conference

This paper presents a novel approach to enhancing the analytical power and interoperability of historical cryptology data by transforming the DECODE database into a Linked Open Data (LOD) resource. We introduce a methodology for modeling encrypted historical documents and cipher keys as a knowledge graph, encompassing ontology development, data transformation, and SPARQL-based querying. This integration enables complex queries across domains, encourages collaboration beyond cryptology, and aligns DECODE with broader efforts in digital humanities and open science. By bridging historical cryptology with LOD principles, we offer a scalable framework for enriching specialized research databases through semantic technologies.

Keys with nomenclatures in the early modern Europe

2024. Beáta Megyesi (et al.). Cryptologia 48 (2), 97-139

Article

We give an overview of the development of European historical cipher keys originating from early Modern times. We describe the nature and the structure of the keys with a special focus on the nomenclatures. We analyze what was encoded and how and take into account chronological and regional differences. The study is based on the analysis of over 1,600 cipher keys, collected from archives and libraries in 10 European countries. We show that historical cipher keys evolved over time and became more secure, shown by the symbol set used for encoding, the code length and the code types presented in the key, the size of the nomenclature, as well as the diversity and complexity of linguistic entities that are chosen to be encoded.

Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

2024.

Conference
Supporting Historical Cryptology: The Decrypt Pipeline

2024. Mihály Héder (et al.). Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

Conference

We present a set of resources and tools to support research and development in the field of historical cryptology. The tools aim to support transcription and decipherment of ciphertexts, developed to work together in a pipeline. It encompasses cataloging these documents into the Decode database, which houses ciphers dating from the 14th century to 1965, transcription using both manual and AI-assisted methods, cryptanalysis, and subsequent historical and linguistic analysis to contextualize decrypted content. The project encounters challenges with the accuracy of automated transcription technologies and the necessity for significant user involvement in the transcription and analysis processes. These insights highlight the critical balance between technological innovation and the indispensable input of domain expertise in advancing the field of historical cryptology.

A Typology for Cipher Key Instructions in Early Modern Times

2024. Beáta Megyesi (et al.). Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

Conference

We present an empirical study on instructions found in historical cipher keys dating back to early modern times in Europe. The study reveals that instructions in historical cipher keys are prevalent, covering a wide range of themes related to the practical application of ciphers. These include general information about the structure or usage of the cipher key, as well as specific instructions on their application. Being a hitherto neglected genre, these texts provide insight into the practice of cryptographic operations.

Exploring the Alignment of Transcriptions to Images of Encrypted Manuscripts

2024. García Goio (et al.). Proceedings of the 7th International Conference on Historical Cryptology (HistoCrypt 2024)

Conference

The automatic transcription of encrypted manuscripts is a challenge due to the different handwriting styles and the often invented symbol alphabets. Many transcription methods require annotated sources, including symbol locations. However, most existing transcriptions are provided at line or page level, making it necessary to find the bounding boxes of the transcribed symbols in the image, a process referred to as alignment. So, in this work, we develop several alignment methods, and discuss their performance on encrypted documents with various symbol sets.
