Henrik Liljegren, Professor

About me

As a linguist, I have a particular interest in the languages of the Hindu Kush-Karakoram region, i.e., the mountainous areas of northern Pakistan, northeastern Afghanistan and the disputed territory of Kashmir. Many of those languages are lesser-described, endangered and under-resourced. I resided for a period of 10 years in northern Pakistan and have carried out fieldwork in individual languages of the region as well as conducted areal-typological research by means of collaborative methods. 

Apart from research per se, I advise individuals and communities on their revitalization efforts (orthography and local resource development, mother-tongue-based education, etc.), mentor language activists and scholars from various communities in collecting and organizing data, and help build networks between local communities and organizations.

At Stockholm University, I am involved in supervising thesis work and in teaching general linguistics and language documentation.

My research currently focuses on building a language corpus and a lexical database for Gawarbati, and on describing the language, one of many sparsely documented and under-resourced languages spoken in the Hindu Kush region. During the period 2021-2024, the Swedish Research Council funded an extensive collection of video and audio data from Gawarbati as well as further processing of the material in the form of transcription, translation and glossing. All data collection was carried out in close collaboration with the local language community and with the language resource center Forum for Language Initiatives (based in Islamabad).

In a previous areal-typological project (2015-2020), I produced a linguistic profile of the Hindu Kush-Karakoram region, based on first-hand data collected from 59 language varieties within the project. One tangible outcome is the online database Hindu Kush Areal Typology: https://hindukush.clld.org/ 

  • Gawarbati

    Article
    2025. Henrik Liljegren, Abdullah Soan.

    Gawarbati (ISO 639-3: gwt; Glottocode: gawa2147) is an underdescribed Indo-Aryan language spoken along the Kunar River, in the southern part of Lower Chitral District of Pakistan’s Khyber Pakhtunkhwa Province as well as in adjacent areas across the border in Nari (Naray) and Ghaziabad Districts of Afghanistan’s Kunar Province (see Figure 1). As for the number of speakers, only rough estimates can be given. On the Pakistani side of the border, where credible information is somewhat easier to obtain, local residents estimate it to be 4,000 speakers (Fazal Akbar, pc in 2022), based on the number of known Gawarbati-speaking houses and an average number of household members. On the Afghan side of the border, the number appears to range between 15,000 and 20,000, based on recent cross-border contacts with local residents (Fazal Akbar, pc in 2022). This would amount to a total of 19,000–24,000 speakers of Gawarbati. A few small linguistic enclaves situated further down the Kunar Valley in Afghanistan are closely related to Gawarbati: Shumashti (Morgenstierne 1945: 241), Ningalami (Morgenstierne 1950: 58) and Grangali (Grjunberg 1971). Both Shumashti and Ningalami were on the verge of extinction already at the time of Morgenstierne’s field studies in the first half of the twentieth century, whereas Grangali is still spoken in three villages in the Digal Valley, according to a recent report (Robert Tegethoff and Sviatoslav Kaverin, pc in 2021).

  • Linguistic Areality in Northeastern Afghanistan

    Chapter
    2025. Henrik Liljegren.

    In this study, an attempt is made at producing an areal-typological profile of the languages of northeastern Afghanistan, based on recently collected data from a tight sample of 29 varieties. Traditional language classification is revisited and problematized in the light of linguistic areality and what appear to be effects of historical contact patterns, both in the region and beyond its confines.

  • Locative and existential predication contrasts in Gawarbati (Indo-Aryan) and the surrounding region

    Chapter
    2025. Anastasia Panova, Henrik Liljegren.

    This paper analyses the morphosyntactic variation in locative (loc) and locational-existential (loc-ex) clauses in Gawarbati, an underdescribed Indo-Aryan language which has no dedicated formal marking of the loc vs. loc-ex contrast. The analysis also compares figure-ground predications in geographically adjacent languages. The results show that there are three morphosyntactic parameters which reflect, to varying degrees, the loc vs. loc-ex status of the predication: word order, indefiniteness marking and the lexical identity of the predicate. The parameter reflecting the loc vs. loc-ex alternation most consistently is word order. However, as shown in the corpus study of Gawarbati, what is primarily encoded by word order variation is a range of available information-structural patterns, and they do not always easily match with the loc vs. loc-ex distinction.

  • På språkjakt i Pakistans Norrland

    Chapter
    2025. Henrik Liljegren.

    During a total of eight years spent in the mountains of northern Pakistan between 1998 and 2010, the linguist Henrik Liljegren was able to immerse himself in local languages such as Palula. In this scenic environment, high biodiversity coincides with cultural and linguistic diversity. For Sydasien, he writes about the adventures that brought him and his whole family close to the local population, who were involved in the work and contributed to the establishment of the language documentation center Forum for Language Initiatives (FLI). The account stretches from 1998 to the most recent visit, in April 2025.

  • The Indo-European Cognate Relationships dataset

    Article
    2025. Cormac Anderson, Matthew Scarborough, Henrik Liljegren, Russell D. Gray, Paul Heggarty.

    The Indo-European Cognate Relationships (IE-CoR) dataset is an open-access relational dataset showing how related, inherited words (‘cognates’) pattern across 160 languages of the Indo-European family. IE-CoR is intended as a benchmark dataset for computational research into the evolution of the Indo-European languages. It is structured around 170 reference meanings in core lexicon, and contains 25731 lexeme entries, analysed into 4981 cognate sets. Novel, dedicated structures are used to code all known cases of horizontal transfer. All 13 main documented clades of Indo-European, and their main subclades, are well represented. Time calibration data for each language are also included, as are relevant geographical and social metadata. Data collection was performed by an expert consortium of 89 linguists drawing on 355 cited sources. The dataset is extendable to further languages and meanings and follows the Cross-Linguistic Data Format (CLDF) protocols for linguistic data. It is designed to be interoperable with other cross-linguistic datasets and catalogues, and provides a reference framework for similar initiatives for other language families.
