Research seminar: Eline Visser, Uppsala University
Seminar
Date: Thursday 15 May 2025
Time: 15.00 – 16.30
Location: C307, Södra huset
Leveraging LLMs for language documentation
Abstract
In this talk, I report on two studies that use Large Language Models (LLMs) for translation tasks based on the language documentation materials of a Papuan language, Kalamang. I then discuss the possibilities for leveraging LLMs in language documentation more broadly.
The first study, Machine Translation from One Book (MTOB), aims to teach language models to translate between Kalamang and English. Twelve models are prompted to translate sentences after being given a grammar of Kalamang (Visser 2022) and the 2500-word Kalamang dictionary, which includes 500 example sentences (Visser 2020). The results are compared to those of a human who “learned” Kalamang by reading the grammar and dictionary. We find that although the human outperforms the models, the latter show promising results, more often than not producing translations that are comprehensible and sensible.
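For illustration only, here is a minimal sketch of what such a prompt-based setup might look like. The file names, model choice and prompt wording are assumptions made for the sketch, not the study's actual materials or code:

```python
# Hypothetical sketch of an MTOB-style setup: place the whole grammar and
# dictionary in the prompt, then ask the model to translate one sentence.
# File names, model name and prompt wording are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

with open("kalamang_grammar.txt", encoding="utf-8") as f:
    grammar = f.read()      # the Kalamang grammar (Visser 2022)
with open("kalamang_dictionary.txt", encoding="utf-8") as f:
    dictionary = f.read()   # the Kalamang dictionary (Visser 2020)

model = genai.GenerativeModel("gemini-1.5-pro")

def translate(sentence: str) -> str:
    """Ask the model for a Kalamang-to-English translation."""
    prompt = (
        "Below are a grammar and a dictionary of Kalamang, a Papuan language.\n\n"
        f"GRAMMAR:\n{grammar}\n\n"
        f"DICTIONARY:\n{dictionary}\n\n"
        f"Translate this Kalamang sentence into English:\n{sentence}"
    )
    return model.generate_content(prompt).text

print(translate("<Kalamang sentence here>"))
```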
In the second study, Automatic Speech Recognition from One Book (ASROB), we test Kalamang speech-to-text transcription (ASR) and Kalamang speech to English text translation (S2TT). The database is the same as for MTOB, plus 15 hours of transcribed and translated recordings. We prompted different models from Google’s Gemini family. This time, the models performed better than the human baseline.
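Again purely as an illustration, a recording might be attached to such a prompt roughly as follows, assuming Google's Python SDK for Gemini. The file name, model choice and prompt wording are assumptions, not the study's actual setup:

```python
# Hypothetical sketch of an ASROB-style prompt: attach an audio recording and
# ask a Gemini model to transcribe it (ASR) or translate it into English (S2TT).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

audio = genai.upload_file("kalamang_clip.wav")  # a short Kalamang recording
model = genai.GenerativeModel("gemini-1.5-pro")

# ASR: Kalamang speech to Kalamang text
asr = model.generate_content([audio, "Transcribe this Kalamang recording."])

# S2TT: Kalamang speech directly to English text
s2tt = model.generate_content([audio, "Translate this Kalamang recording into English."])

print(asr.text)
print(s2tt.text)
```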
An interesting finding is that for S2TT, the models rely mostly on the dictionary: adding the grammar and recordings to the prompt lowers the test scores. This suggests that models still struggle to combine text and audio knowledge. Another interesting finding is that phrase-level ASR and S2TT come more easily to the models than discourse-level.
Some of the co-authors of these studies merely wanted to test the capabilities of current LLMs, while others would ultimately like to develop conversational agent apps for all the world’s languages. To me, the results of the two studies above suggest that LLMs can be leveraged to speed up language documentation tasks such as transcription, translation and perhaps also glossing. As LLMs become increasingly powerful, they should perform ever better on smaller data sets, so that linguists can annotate more data in language documentation projects with the same amount of resources. I will share my ideas on how to test the capabilities of LLMs on various language documentation tasks and how to improve existing tools like ELPIS (Foley et al. 2018). I also hope to spark a debate about the possibilities and risks involved, with special reference to the ethical question of how to involve language communities in such projects (Bird 2020).
References
Bird, S. (2020). Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 3504–3519). Barcelona, Spain (Online): International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.313
Foley, B., Arnold, J. T., Coto-Solano, R., Durantin, G., Ellison, T. M., van Esch, D., ... & Wiles, J. (2018). Building speech recognition systems for language documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS). In Proceedings of the Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) (pp. 205–209).
Eline Visser is a researcher in General Linguistics at the Department of Linguistics and Philology, Uppsala University.
Last updated: April 28, 2025
Source: Department of Linguistics