Stockholm University

Beata Megyesi, Professor

About me

I am Professor of Computational Linguistics. My main research areas are natural language processing and digital philology, and my work centers on cross-disciplinary research that facilitates quantitative studies using AI in the humanities and social sciences. Currently, I am working on historical cryptology, analyzing and breaking historical ciphers and codes.

Throughout the years, I have actively taken on a range of academic roles:

  • Chair of the Linguistics Review Panel at the Swedish Research Council 2024 and Member since 2021;
  • Member of the board of the National Research School in Digital Philology (DigPhil), Sweden, 2023-;
  • Member of the board at the Center for Digital Humanities, Uppsala University, Sweden, 2020-2023;
  • President of the Northern European Association for Language Technology (NEALT), 2020-2021 and vice president 2018-2019;
  • Head of Department of Linguistics and Philology, 2009-2018;
  • Director of the English Park Campus, Uppsala University, 2017-2018.

For additional insights into my research and teaching endeavors, please refer to the details provided below.

Teaching

I teach regularly at the undergraduate and advanced levels, primarily in computational linguistics. I am responsible for the international master's program in AI and Language. I am the main supervisor of two PhD students and co-supervisor of one PhD student.

Over the years, I have taught courses at three universities: the Dept. of Linguistics at Stockholm University (SU), the Dept. of Linguistics and Philology at Uppsala University (UU), and the Dept. of Speech, Music and Hearing at KTH. I have given various courses in computational linguistics (CL) and general linguistics (GL) from basic to advanced levels, as well as some PhD courses.

Basic level courses:
  • Corpus linguistics, 7.5 ECTS: 2023 (SU)
  • Computational grammar II, 7.5 ECTS: 2004 (UU)
  • Corpus linguistics, 7.5 ECTS: 2005, 2006, 2007 (UU)
  • Introduction to Language Technology: 2015 (UU)
  • Languages, computers, and text processing (in Swedish): 2016 (UU)
  • Techniques for large scale parsing (parts): 2009 (UU)
  • Advisor for Language Technology Project, 7.5 ECTS: 2011-2016 (UU)
  • BA thesis supervision (SU, KTH, UU)
Advanced level courses:
  • Corpus-based methods, 7.5 ECTS: 2023 (SU)
  • Research and development, 15 ECTS: 2021 (UU)
  • Digital philology, 7.5 ECTS: 2018-2019 (UU)
  • Computer-based tools for research in humanities, 7.5 ECTS: 2007-2013 (UU)
  • Thesis work in language technology, 30 ECTS: 2005, 2006, 2007 (UU)
  • Advanced course in corpus linguistics, 7.5 ECTS: 2005 (UU)
  • Advisor for Language Technology Project, 7.5 ECTS: 2011-2016 (UU)
  • Master thesis supervision (UU)
PhD education:
  • I am the main supervisor of Micaella Bruton (SU) and Crina Tudor (SU), and co-supervisor of Oreen Yousuf (UU)
  • I was co-supervisor of Eva Petterson and Mojgan Seraji
  • Natural Language Processing, GSLT, 2008
  • Infrastructural tools for the study of linguistic variation: PhD course at Oslo University, June 2009

Research

I have always been interested in how human language is processed by humans, and how it can be processed by machines. My research focuses on the automatic analysis of historical handwritten documents on the one hand, and large-scale text analysis for research within the humanities and social sciences on the other. I collaborate nationally and internationally, with partners in Sweden, Germany, Hungary, Norway, Spain, and the USA. Over the past 10 years, my research has received external funding exceeding 4 million euros, and my scientific work has resulted in over 100 scientific articles published in international fora.

Some projects that I have led or contributed to:

  • DECRYPT: Decryption of Historical Manuscripts: PI, Swedish Research Council, 2018-2024 
  • DECODE: Automatic Decoding of Historical Manuscripts: PI, Swedish Research Council, 2015-2017
  • HistoCrypt: A scientific forum for historical cryptology 2018-
  • HistCorp: A collection of historical texts for 17 European languages 2015-
  • SWEGRAM: Automatic Annotation and Analysis of Swedish texts, PI; part of the Swe-CLARIN project, Swedish Research Council, 2014-2024
  • SWeLL: Research Infrastructure for Swedish as a second language: co-applicant, RJ, 2017-2019
  • Multilingual Parallel Corpora, Swedish Research Council: member, 2006-2010
  • Methods and Tools for Automatic Grammar Extraction: Swedish Research Council: member, 2005-2007
  • An Infrastructure for Swedish language technology: member, Swedish Research Council, 2007-2008

My work has also been covered in the media.

You can find details about my research in my publications. 

I have also served on numerous committees for doctoral theses and mid-term evaluations, regularly act as a reviewer for conferences and workshops, and have undertaken numerous expert assignments for appointments in both Sweden and abroad. Additionally, I have served as an assessor for projects funded by the Swedish Research Council and the Wallenberg Foundation.

Publications

Beáta Megyesi's publications per year and per type.

A selection from Stockholm University publication database

  • Historical Cryptology

    2024. Beáta Megyesi (et al.). Learning and Experiencing Cryptography with CrypTool and SageMath

    Chapter

    Historical cryptology studies (original) encrypted manuscripts, often handwritten sources, produced in our history. These historical sources can be found in archives, often hidden without any indexing and therefore hard to locate. Once found they need to be digitized and turned into a machine-readable text format before they can be deciphered with computational methods. The focus of historical cryptology is not primarily the development of sophisticated algorithms for decipherment, but rather the entire process of analysis of the encrypted source from collection and digitization to transcription and decryption. The process also includes the interpretation and contextualization of the message set in its historical context. There are many challenges on the way, such as mistakes made by the scribe, errors made by the transcriber, damaged pages, handwriting styles that are difficult to interpret, historical languages from various time periods, and hidden underlying language of the message. Ciphertexts vary greatly in terms of their code system and symbol sets used with more or less distinguishable symbols. Ciphertexts can be embedded in clearly written text, or shorter or longer sequences of cleartext can be embedded in the ciphertext. The ciphers used mostly in historical times are substitutions (simple, homophonic, or polyphonic), with or without nomenclatures, encoded as digits or symbol sequences, with or without spaces. So the circumstances are different from those in modern cryptography which focuses on methods (algorithms) and their strengths and assumes that the algorithm is applied correctly. For both historical and modern cryptology, attack vectors outside the algorithm are applied like implementation flaws and side-channel attacks. In this chapter, we give an introduction to the field of historical cryptology and present an overview of how researchers today process historical encrypted sources.

  • The Swell Language Learner Corpus: From Design to Annotation

    2019. Elena Volodina (et al.). Northern European Journal of Language Technology (NEJLT) 6, 67-104

    Article

    The article presents a new language learner corpus for Swedish, SweLL, and the methodology from collection and pseudonymisation to protect personal information of learners to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose to ensure reliability and quality of the final corpus. In the article we discuss reasoning behind metadata selection, principles of gold corpus compilation and argue for separation of normalization from correction annotation.

  • Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish

    2018. Beáta Megyesi (et al.). Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning at SLTC 2018 (NLP4CALL 2018), 47-56

    Conference

    This paper reports on the status of learner corpus anonymization for the ongoing research infrastructure project SweLL. The main project aim is to deliver and make available for research a well-annotated corpus of essays written by second language (L2) learners of Swedish. As the practice shows, annotation of learner texts is a sensitive process demanding a lot of compromises between ethical and legal demands on the one hand, and research and technical demands on the other. Below is a concise description of the current status of pseudonymization of language learner data to ensure anonymity of the learners, with numerous examples of the above-mentioned compromises.

  • A Friend in Need? Research agenda for electronic Second Language infrastructure

    2016. Elena Volodina (et al.).

    Conference

    In this article, we describe the research and societal needs as well as ongoing efforts to shape Swedish as a Second Language (L2) infrastructure. Our aim is to develop an electronic research infrastructure that would stimulate empirical research into learners' language development by preparing data and developing language technology methods and algorithms that can successfully deal with deviations in the learner language.

  • EACL - Expansion of Abbreviations in CLinical text

    2014. Lisa Tengstrand (et al.). Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Population

    Conference

    In the medical domain, especially in clinical texts, non-standard abbreviations are prevalent, which impairs readability for patients. To ease the understanding of the physicians’ notes, abbreviations need to be identified and expanded to their original forms. We present a distributional semantic approach to find candidates of the original form of the abbreviation, and combine this with Levenshtein distance to choose the correct candidate among the semantically related words. We apply the method to radiology reports and medical journal texts, and compare the results to general Swedish. The results show that the correct expansion of the abbreviation can be found in 40% of the cases, an improvement by 24 percentage points compared to the baseline (0.16), and an increase by 22 percentage points compared to using word space models alone (0.18).

  • Professional language in Swedish clinical text: Linguistic characterization and comparative studies

    2014. Kelly Smith (et al.). Nordic Journal of Linguistics 37 (2), 297-323

    Article

    This study investigates the linguistic characteristics of Swedish clinical text in radiology reports and doctor's daily notes from electronic health records (EHRs) in comparison to general Swedish and biomedical journal text. We quantify linguistic features through a comparative register analysis to determine how the free text of EHRs differs from general and biomedical Swedish text in terms of lexical complexity, word and sentence composition, and common sentence structures. The linguistic features are extracted using state-of-the-art computational tools: a tokenizer, a part-of-speech tagger, and scripts for statistical analysis. Results show that technical terms and abbreviations are more frequent in clinical text, and lexical variance is low. Moreover, clinical text frequently omits subjects, verbs, and function words, resulting in shorter sentences. Clinical text not only differs from general Swedish, but also internally, across its sub-domains, e.g. sentences lacking verbs are significantly more frequent in radiology reports. These results provide a foundation for future development of automatic methods for EHR simplification or clarification.

