Contributions to Shallow Discourse Parsing: To English and beyond

2022. Murathan Kurfalı.

Thesis (Doctoral)

Discourse is a coherent set of sentences where the sequential reading of the sentences yields a sense of accumulation and readers can easily follow why one sentence follows another. A text that lacks coherence will most certainly fail to communicate its intended message and leave the reader puzzled as to why the sentences are presented together. However, formally accounting for the differences between a coherent and a non-coherent text remains a challenge. Various theories propose that the semantic links that are inferred between sentences/clauses, known as discourse relations, are the building blocks of the discourse that can be connected to one another in various ways to form the discourse structure. This dissertation focuses on the former problem of discovering such discourse relations without aiming to arrive at any structure, a task known as shallow discourse parsing (SDP). Unfortunately, so far, SDP has been almost exclusively performed on the available gold annotations in English, leading to only limited insight into how the existing models would perform in a low-resource scenario potentially involving any non-English language. The main objective of the current dissertation is to address these shortcomings and help extend SDP to non-English territory. This aim is pursued through three different threads: (i) investigation of what kind of supervision is minimally required to perform SDP, (ii) construction of multilingual resources annotated at the discourse level, and (iii) extension of well-known means to (SDP-wise) low-resource languages. An additional aim is to explore the feasibility of SDP as a probing task to evaluate the discourse-level understanding abilities of modern language models.

The dissertation is based on six papers grouped in three themes. The first two papers perform different subtasks of SDP through relatively understudied means. Paper I presents a simplified method to perform explicit discourse relation labeling without any feature engineering, whereas Paper II shows how implicit discourse relation recognition benefits from large amounts of unlabeled text through a novel method for distant supervision. The third and fourth papers describe two novel multilingual discourse resources, TED-MDB (Paper III) and three bilingual discourse connective lexicons (Paper IV). Notably, TED-MDB is the first parallel corpus annotated for PDTB-style discourse relations covering six non-English languages. Finally, the last two studies directly deal with multilingual discourse parsing, where Paper V reports the first results in cross-lingual implicit discourse relation recognition and Paper VI proposes a multilingual benchmark including certain discourse-level tasks that have not been explored in this context before. Overall, the dissertation allows for a more detailed understanding of what is required to extend shallow discourse parsing beyond English. The conventional aspects of traditional supervised approaches are replaced in favor of less knowledge-intensive alternatives which, nevertheless, achieve state-of-the-art performance in their respective settings. Moreover, thanks to the introduction of TED-MDB, cross-lingual SDP is explored in a zero-shot setting for the first time. In sum, the proposed methodologies and the constructed resources are among the earliest steps towards building high-performance multilingual, or non-English monolingual, shallow discourse parsers.


Linking discourse-level information and the induction of bilingual discourse connective lexicons

2022. Sibel Özer (et al.). Semantic Web 13 (6), 1081-1102

Article

The single biggest obstacle in performing comprehensive cross-lingual discourse analysis is the scarcity of multilingual resources. The existing resources are overwhelmingly monolingual, compelling researchers to infer the discourse-level information in the target languages through error-prone automatic means. The current paper aims to provide a more direct insight into the cross-lingual variations in discourse structures by linking the annotated relations of the TED-Multilingual Discourse Bank, which consists of six independently annotated TED talks in seven different languages. It is shown that the linguistic labels over the relations annotated in the texts of these languages can be automatically linked with English with high accuracy, as verified against the relations of three diverse languages semi-automatically linked with relations over English texts. The resulting corpus has great potential to reveal the divergences in local discourse relations, as well as to lead to new resources, as exemplified by the induction of bilingual discourse connective lexicons.


A multi-country test of brief reappraisal interventions on emotions during the COVID-19 pandemic

2021. Ke Wang (et al.). Nature Human Behaviour 5 (8), 1089-1110

Article

The COVID-19 pandemic has increased negative emotions and decreased positive emotions globally. Left unchecked, these emotional changes might have a wide array of adverse impacts. To reduce negative emotions and increase positive emotions, we tested the effectiveness of reappraisal, an emotion-regulation strategy that modifies how one thinks about a situation. Participants from 87 countries and regions (n = 21,644) were randomly assigned to one of two brief reappraisal interventions (reconstrual or repurposing) or one of two control conditions (active or passive). Results revealed that both reappraisal interventions (versus both control conditions) consistently reduced negative emotions and increased positive emotions across different measures. Reconstrual and repurposing interventions had similar effects. Importantly, planned exploratory analyses indicated that reappraisal interventions did not reduce intentions to practice preventive health behaviours. The findings demonstrate the viability of creating scalable, low-cost interventions for use around the world.


Let’s be explicit about that: Distant supervision for implicit discourse relation classification via connective prediction

2021. Murathan Kurfali, Robert Östling.

Conference

In implicit discourse relation classification, we want to predict the relation between adjacent sentences in the absence of any overt discourse connectives. This is challenging even for humans, leading to a shortage of annotated data, a fact that makes the task even more difficult for supervised machine learning approaches. In the current study, we perform implicit discourse relation classification without relying on any labeled implicit relation. We sidestep the lack of data through explicitation of implicit relations to reduce the task to two sub-problems: language modeling and explicit discourse relation classification, a much easier problem. Our experimental results show that this method can even marginally outperform the state-of-the-art, in spite of being much simpler than alternative models of comparable performance. Moreover, we show that the achieved performance is robust across domains, as suggested by the zero-shot experiments on a completely different domain. This indicates that recent advances in language modeling have made language models sufficiently good at capturing inter-sentence relations without the help of explicit discourse markers.
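The two-step reduction described in this abstract (first make the relation explicit by predicting a connective, then classify the resulting explicit relation) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_score` is a hypothetical stand-in for a language model, `TOY_CUES` is invented illustration data, and `CONNECTIVE_TO_SENSE` is a hypothetical lookup that replaces a trained explicit-relation classifier. Only the four top-level PDTB sense names are taken from the annotation scheme.

```python
import re
from typing import Callable, Dict, Set, Tuple

# Hypothetical cue words per candidate connective (toy data, illustration only).
TOY_CUES: Dict[str, Set[str]] = {
    "because": {"raining", "cancelled", "tired"},
    "but": {"still", "never", "although"},
    "then": {"later", "afterwards", "finally"},
    "for example": {"instance", "such", "like"},
}

# Hypothetical lookup standing in for an explicit-relation classifier.
# Values are the four top-level PDTB sense classes.
CONNECTIVE_TO_SENSE: Dict[str, str] = {
    "because": "Contingency",
    "but": "Comparison",
    "then": "Temporal",
    "for example": "Expansion",
}

Scorer = Callable[[str, str, str], float]


def toy_score(arg1: str, connective: str, arg2: str) -> float:
    """Toy plausibility score: counts cue words for the connective.

    A real system would instead ask a language model how likely
    the sequence "arg1 connective arg2" is.
    """
    words = set(re.findall(r"[a-z]+", f"{arg1} {arg2}".lower()))
    return float(len(words & TOY_CUES[connective]))


def explicitate(arg1: str, arg2: str, score: Scorer = toy_score) -> str:
    """Step 1: pick the connective that best fits between the two arguments."""
    return max(CONNECTIVE_TO_SENSE, key=lambda c: score(arg1, c, arg2))


def classify_implicit(
    arg1: str, arg2: str, score: Scorer = toy_score
) -> Tuple[str, str]:
    """Step 2: label the now-explicit relation via the predicted connective."""
    connective = explicitate(arg1, arg2, score)
    return connective, CONNECTIVE_TO_SENSE[connective]


print(classify_implicit("The match was cancelled.", "It was raining heavily."))
```

The point of the sketch is the division of labour: no labeled implicit relation is needed, because the first step relies only on a scoring function and the second only on knowledge about explicit connectives.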


Probing Multilingual Language Models for Discourse

2021. Murathan Kurfali, Robert Östling.

Conference

Pre-trained multilingual language models have become an important building block in multilingual natural language processing. In the present paper, we investigate a range of such models to find out how well they transfer discourse-level knowledge across languages. This is done with a systematic evaluation on a broader set of discourse-level tasks than has previously been assembled. We find that the XLM-RoBERTa family of models consistently shows the best performance, by simultaneously being good monolingual models and degrading relatively little in a zero-shot setting. Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations, while language dissimilarity has at most a modest effect. We hope that our test suite, covering 5 tasks with a total of 22 languages in 10 distinct families, will serve as a useful evaluation platform for multilingual performance at and beyond the sentence level.


A sentiment-annotated dataset of English causal connectives

2020. Marta Andersson, Murathan Kurfali, Robert Östling. Proceedings of the 14th Linguistic Annotation Workshop, 24-33

Conference

This paper investigates the semantic prosody of three causal connectives, due to, owing to and because of, in seven varieties of the English language. While research in the domain of English causality exists, we are not aware of studies that would cover the domain of causal connectives in English. Our claim is that connectives such as because of link two arguments, (at least) one of which will include a phrase that contributes to the interpretation of the relation as positive or negative, and hence define the prosody of the connective used. As our results demonstrate, the majority of the prosodies identified are negative for all three connectives; the proportions are stable across the varieties of English studied, and contrary to our expectations, we find no significant differences between the functions of the connectives and discourse preferences. Further, we investigate whether automating the sentiment annotation procedure via a simple language-model-based classifier is possible. The initial results highlight the complexity of the task and the need for more complex systems, probably aided by other related datasets, to achieve reasonable performance.


Disambiguation of Potentially Idiomatic Expressions with Contextual Embeddings

2020. Murathan Kurfali, Robert Östling. Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 85-94

Conference


TED-MDB Lexicons: Tr-EnConnLex, Pt-EnConnLex

2020. Murathan Kurfali (et al.). The First Workshop on Computational Approaches to Discourse

Conference

In this work, we present two new bilingual discourse connective lexicons, namely, for Turkish-English and European Portuguese-English, created automatically using the existing discourse relation-aligned TED-MDB corpus. In their current form, the Pt-En lexicon includes 95 entries, whereas the Tr-En lexicon contains 133 entries. The lexicons constitute the first step of a larger project of developing a multilingual discourse connective lexicon.


TRAVIS at PARSEME Shared Task 2020: How good is (m)BERT at seeing the unseen?

2020. Murathan Kurfali. Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 136-141

Conference


Zero-shot cross-lingual identification of direct speech using distant supervision

2020. Murathan Kurfali, Mats Wirén. The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 105-111

Conference


Labeling Explicit Discourse Relations Using Pre-trained Language Models

2020. Murathan Kurfali. Text, Speech, and Dialogue, 79-86

Chapter

Labeling explicit discourse relations is one of the most challenging sub-tasks of shallow discourse parsing, where the goal is to identify the discourse connectives and the boundaries of their arguments. The state-of-the-art models achieve an F-score slightly above 45% by using hand-crafted features. The current paper investigates the efficacy of pre-trained language models in this task. We find that pre-trained language models, when fine-tuned, are powerful enough to replace the linguistic features. We evaluate our model on PDTB 2.0 and report state-of-the-art results in extraction of the full relation. This is the first time a model outperforms the knowledge-intensive models without employing any linguistic features.


A Multi-Word Expression Dataset for Swedish

2020. Murathan Kurfali (et al.). Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 4402-4409

Conference

We present a new set of 96 Swedish multi-word expressions annotated with degree of (non-)compositionality. In contrast to most previous compositionality datasets, we also consider syntactically complex constructions and publish a formal specification of each expression. This allows evaluation of computational models beyond word bigrams, which have so far been the norm. Finally, we use the annotations to evaluate a system for automatic compositionality estimation based on distributional semantics. Our analysis of the disagreements between human annotators and the distributional model reveals interesting questions related to the perception of compositionality, and should be informative to future work in the area.


Zero-shot transfer for implicit discourse relation classification

2019. Murathan Kurfali, Robert Östling. 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 226-231

Conference

Automatically classifying the relation between sentences in a discourse is a challenging task, in particular when there is no overt expression of the relation. It is made even more challenging by the fact that annotated training data exists only for a small number of languages, such as English and Chinese. We present a new system using zero-shot transfer learning for implicit discourse relation classification, where the only resource used for the target language is unannotated parallel text. This system is evaluated on the discourse-annotated TED-MDB parallel corpus, where it obtains good results for all seven languages using only English training data.


TED Multilingual Discourse Bank (TED-MDB)

2019. Deniz Zeyrek (et al.). Language Resources and Evaluation

Article

TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource where TED talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which have three main features: the linguistic characteristics of the languages involved, the interactive nature of TED talks (which led us to annotate Hypophora), and the decision to avoid projection. We report our annotation consistency and post-annotation alignment experiments, and provide a cross-lingual comparison based on corpus statistics.


Murathan Kurfali, Postdoc
