Stockholm University

Mattias Heldner
Professor, Head of Department

About me

Mattias Heldner [maˈtʰìːas ˈhɛ́ldnɛr] (born 1969) is Professor of Phonetics, Director of the Phonetics Laboratory, and Head of Department in the Department of Linguistics, Stockholm University. He received his PhD in Phonetics from Umeå University in 2002, did a postdoc at TeliaSonera Sweden in 2005, became Docent in Speech communication at KTH Royal Institute of Technology in 2007, and Professor in Phonetics at Stockholm University in 2011.

His main research interests are communicative behaviour relevant for turn-taking in conversation, and signalling of prosodic functions such as prominence and boundaries. He collaborates with colleagues worldwide, has held several Swedish research grants and published widely. He was one of the three Technical Program Chairs for the international conference Interspeech 2017 in Stockholm.

He has mostly taught courses in phonetics for Speech and Language Pathology students at Karolinska Institutet, University of Gothenburg and Umeå University. He currently supervises one PhD student and has supervised two PhD theses as main supervisor.

As Director of the Phonetics Laboratory, he has developed recording facilities for breathing movements and voice quality dynamics in conversation, as well as more general facilities for capturing acoustic and physiological signals with a very high level of experimental control.

Publications

A selection from Stockholm University publication database

  • Breathing in Conversation

    2020. Marcin Wlodarczak, Mattias Heldner. Frontiers in Psychology 11

    Article

    This work revisits the problem of breathing cues used for management of speaking turns in multiparty casual conversation. We propose a new categorization of turn-taking events which combines the criterion of speaker change with whether the original speaker inhales before producing the next talkspurt. We demonstrate that the latter criterion could be potentially used as a good proxy for pragmatic completeness of the previous utterance (and, by extension, of the interruptive character of the incoming speech). We also present evidence that breath holds are used in reaction to incoming talk rather than as a turn-holding cue. In addition to analysing dimensions which are routinely omitted in studies of interactional functions of breathing (exhalations, presence of overlapping speech, breath holds), the present study also looks at patterns of breath holds in silent breathing and shows that breath holds are sometimes produced toward the beginning (and toward the top) of silent exhalations, potentially indicating an abandoned intention to take the turn. We claim that the breathing signal can thus be successfully used for uncovering hidden turn-taking events, which are otherwise obscured by silence-based representations of interaction.

  • A Scalable Method for Quantifying the Role of Pitch in Conversational Turn-Taking

    2019. Kornel Laskowski, Marcin Wlodarczak, Mattias Heldner. 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 284-292

    Conference

    Pitch has long been held as an important signalling channel when planning and deploying speech in conversation, and myriad studies have been undertaken to determine the extent to which it actually plays this role. Unfortunately, these studies have required considerable human investment in data preparation and analysis, and have therefore often been limited to a handful of specific conversational contexts. The current article proposes a framework which addresses these limitations, by enabling a scalable, quantitative characterization of the role of pitch throughout an entire conversation, requiring only the raw signal and speech activity references. The framework is evaluated on the Switchboard dialogue corpus. Experiments indicate that pitch trajectories of both parties are predictive of their incipient speech activity; that pitch should be expressed on a logarithmic scale and Z-normalized, as well as accompanied by a binary voicing variable; and that only the most recent 400 ms of the pitch trajectory are useful in incipient speech activity prediction.

  • A multidimensional investigation of covert contrast in Swedish acquiring children's speech - a project description

    2019. Carla Wikse Barrow, Sofia Strömbergsson, Mattias Heldner. Proceedings from FONETIK 2019 Stockholm, June 10–12, 2019, 79-83

    Conference

    This paper provides a description of a current PhD project in phonetics at the Department of Linguistics at Stockholm University. A short background is provided, the intended experiments are presented and the potential contributions of the results are outlined.

  • Breath holds in spontaneous speech

    2019. Kätlin Aare, Marcin Włodarczak, Mattias Heldner. Eesti ja soome-ugri keeleteaduse ajakiri 10 (1), 13-34

    Article

    This article provides a first quantitative overview of the timing and volume-related properties of breath holds in spontaneous conversations. Firstly, we investigate breath holds based on their position within the coinciding respiratory interval amplitude. Secondly, we investigate breath holds based on their timing within the respiratory intervals and in relation to communicative activity following breath holds. We hypothesise that breath holds occur in different regions of the lung capacity range and at different times during the respiratory phase, depending on the conversational and physiological activity following breath holds. The results suggest there is not only considerable variation in both the time and lung capacity scales, but detectable differences are also present in breath holding characteristics involving laughter and speech preparation, while breath holds coinciding with swallowing are difficult to separate from the rest of the data based on temporal and volume information alone.

  • Breathing in conversation — what we’ve learned

    2019. Marcin Włodarczak, Mattias Heldner. 1st International Seminar on the Foundations of Speech: BREATHING, PAUSING, AND THE VOICE, 1st–3rd December 2019 in Sønderborg, Denmark, 13-15

    Conference

    In this paper, we provide an overview of selected findings on interactional aspects of breathing in multiparty conversation, accumulated largely over the course of a four-year research project Breathing in conversation, carried out at the Department of Linguistics, Stockholm University. In particular, we focus on results demonstrating the contribution of the respiratory signal to prediction of imminent speech activity, as well as on turn-holding and turn-yielding cues.

  • Does lung volume size affect respiratory rate and utterance duration?

    2019. Mattias Heldner, Denise Carlsson, Marcin Wlodarczak. Proceedings from Fonetik 2019, 97-102

    Conference

    This study explored whether lung volume size affects respiratory rate and utterance duration. The lung capacity of four women and four men was estimated with a digital spirometer. These subjects subsequently read a nonsense text aloud while their respiratory movements were registered with a Respiratory Inductance Plethysmography (RIP) system. Utterance durations were measured from the speech recordings, and respiratory cycle durations and respiratory rates were measured from the RIP recordings. This experiment did not show any relationship between lung volume size and respiratory rate or utterance duration.

  • The RespTrack system

    2019. Mattias Heldner (et al.). 1st International Seminar on the Foundations of Speech: BREATHING, PAUSING, AND THE VOICE, 1st–3rd December 2019 in Sønderborg, Denmark, 16-18

    Conference

    This paper describes the RespTrack system for measuring and real-time monitoring of respiratory movements. RespTrack was developed in the Phonetics Laboratory at Stockholm University and the authors have been using it extensively for research for the past five years. Here, we describe briefly the underlying techniques, calibration, digitization as well as recent developments of the system. The presentation at SEFOS 2019 will also include a live demonstration of the system.

  • Voice Quality as a Turn-Taking Cue

    2019. Mattias Heldner (et al.). Proceedings of Interspeech 2019, 4165-4169

    Conference

    This work revisits the idea that voice quality dynamics (VQ) contributes to conveying pragmatic distinctions, with two case studies to further test this idea. First, we explore VQ as a turn-taking cue, and then as a cue for distinguishing between different functions of affirmative cue words. We employ acoustic VQ measures claimed to be better suited for continuous speech than those in our own previous work. Both cases indicate that the degree of periodicity (as measured by CPPS) is indeed relevant in the production of the different pragmatic functions. In particular, turn-yielding is characterized by lower periodicity, sometimes accompanied by the presence of creaky voice. Periodicity also distinguishes between backchannels, agreements and acknowledgements.

  • Creak in the respiratory cycle

    2018. Kätlin Aare (et al.). Proceedings of Interspeech 2018, 1408-1412

    Conference

    Creakiness is a well-known turn-taking cue and has been observed to systematically accompany phrase and turn ends in several languages. In Estonian, creaky voice is frequently used by all speakers without any obvious evidence for its systematic use as a turn-taking cue. Rather, it signals a lack of prominence and is favored by lengthening and later timing in phrases. In this paper, we analyze the occurrence of creak with respect to properties of the respiratory cycle. We show that creak is more likely to accompany longer exhalations. Furthermore, the results suggest there is little difference in lung volume values regardless of the presence of creak, indicating that creaky voice might be employed to preserve air over the course of longer utterances. We discuss the results in connection to processes of speech planning in spontaneous speech.

  • Deep throat as a source of information

    2018. Mattias Heldner, Petra Wagner, Marcin Włodarczak. Proceedings Fonetik 2018, 33-38

    Conference

    In this pilot study we explore the signal from an accelerometer placed on the tracheal wall (below the glottis) for obtaining robust voice quality estimates. We investigate smoothed cepstral peak prominence, H1-H2 and alpha ratio for distinguishing between breathy, modal and pressed phonation across six (sustained) vowel qualities produced by four speakers and including a systematic variation of pitch. We show that throat signal spectra are unaffected by vocal tract resonances, F0 and speaker variation while retaining sensitivity to voice quality dynamics. We conclude that the throat signal is a promising tool for studying communicative functions of voice prosody in speech communication.

  • Exhalatory turn-taking cues

    2018. Marcin Włodarczak, Mattias Heldner. Proceedings 9th International Conference on Speech Prosody 2018, 334-338

    Conference

    The paper is a study of kinematic features of the exhalation which signal that the speaker is done speaking and wants to yield the turn. We demonstrate that the single most prominent feature is the presence of inhalation directly following the exhalation. However, several features of the exhalation itself are also found to significantly distinguish between turn holds and yields, such as slower exhalation rate and higher lung level at exhalation onset. The results complement the existing body of evidence on respiratory turn-taking cues, which has so far involved mainly inhalatory features. We also show that respiration allows discovering pause interruptions, thus allowing access to unrealised turn-taking intentions.

  • Acoustics and discourse function of two types of breathing signals

    2017. Aleksandra Ćwiek (et al.). Nordic Prosody, 83-91

    Conference

    Breathing is fundamental for living and speech, and it has been a subject of linguistic research for years. Recently, there has been a renewed interest in tackling the question of possible communicative functions of breathing (e.g. Rochet-Capellan & Fuchs, 2014; Aare, Włodarczak & Heldner, 2014; Włodarczak & Heldner, 2015; Włodarczak, Heldner, & Edlund, 2015). The present study set out to determine acoustic markedness and communicative functions of pauses accompanied and non-accompanied by breathing. We hypothesised that an articulatory reset occurring in breathing pauses and an articulatory freeze in non-breathing pauses differentiate between the two types. A production experiment was conducted and some evidence in favour of such a phenomenon was found. Namely, in the case of non-breathing pauses, we observed more coarticulation, evidenced by a more frequent omission of plosive releases. Our findings thus give some evidence in favour of the communicative function of breathing.

  • Capturing respiratory sounds with throat microphones

    2017. Marcin Włodarczak, Mattias Heldner. Nordic Prosody, 181-190

    Conference

    This paper presents the results of a pilot study using throat microphones for recording respiratory sounds. We demonstrate that inhalation noises are louder before longer stretches of speech than before shorter utterances (< 1 s) and in silent breathing. We thus replicate the results from our earlier study which used close-talking head-mounted microphones, without the associated data loss due to cross-talk. We also show that inhalations are louder within than before a speaking turn. Hence, the study provides another piece of evidence in favour of communicative functions of respiratory noises serving as potential turn-taking (for instance, turn-holding) cues. 

  • Coordination between f0, intensity and breathing signals

    2017. Juraj Šimko (et al.). Nordic Prosody, 147-156

    Conference

    The present paper presents preliminary results on temporal coordination of breathing, intensity and fundamental frequency signals using continuous wavelet transform. We have found tendencies towards phase-locking at time scales corresponding to several prosodic units such as vowel-to-vowel intervals and prosodic words. The proposed method should be applicable to a wide range of problems in which the goal is finding a stable phase relationship in a pair of hierarchically organised signals.

  • Improving Prediction of Speech Activity Using Multi-Participant Respiratory State

    2017. Marcin Włodarczak (et al.). Proceedings of Interspeech 2017, 1666-1670

    Conference

    One consequence of situated face-to-face conversation is the co-observability of participants’ respiratory movements and sounds. We explore whether this information can be exploited in predicting incipient speech activity. Using a methodology called stochastic turn-taking modeling, we compare the performance of a model trained on speech activity alone to one additionally trained on static and dynamic lung volume features. The methodology permits automatic discovery of temporal dependencies across participants and feature types. Our experiments show that respiratory information substantially lowers cross-entropy rates, and that this generalizes to unseen data.

  • Respiratory Constraints in Verbal and Non-verbal Communication

    2017. Marcin Wlodarczak, Mattias Heldner. Frontiers in Psychology 8

    Article

    In the present paper we address the old question of respiratory planning in speech production. We recast the problem in terms of speakers' communicative goals and propose that speakers try to minimize respiratory effort in line with the H&H theory. We analyze respiratory cycles coinciding with no speech (i.e., silence), short verbal feedback expressions (SFE's) as well as longer vocalizations in terms of parameters of the respiratory cycle and find little evidence for respiratory planning in feedback production. We also investigate timing of speech and SFEs in the exhalation and contrast it with nods. We find that while speech is strongly tied to the exhalation onset, SFEs are distributed much more uniformly throughout the exhalation and are often produced on residual air. Given that nods, which do not have any respiratory constraints, tend to be more frequent toward the end of an exhalation, we propose a mechanism whereby respiratory patterns are determined by the trade-off between speakers' communicative goals and respiratory constraints.

  • Is breathing silence?

    2016. Mattias Heldner, Marcin Włodarczak. Proceedings of Fonetik 2016, 35-38

    Conference

    This paper investigates whether inhalation noises are treated as silences in speech communication. A perception experiment revealed differences in pause detection thresholds for breathing pauses and silent pauses. This in turn indicates that breathing pauses are treated differently by the perceptual system, and could potentially carry a communicative function. 

  • Lexical Specification of Prosodic Information in Swedish

    2016. Hatice Zora (et al.). Frontiers in Neuroscience 10

    Article

    Like that of many other Germanic languages, the stress system of Swedish has mainly undergone phonological analysis. Recently, however, researchers have begun to recognize the central role of morphology in these systems. Similar to the lexical specification of tonal accent, the Swedish stress system is claimed to be morphologically determined and morphemes are thus categorized as prosodically specified and prosodically unspecified. Prosodically specified morphemes bear stress information as part of their lexical representations and are classified as tonic (i.e., lexically stressed), pretonic and posttonic, whereas prosodically unspecified morphemes receive stress through a phonological rule that is right-edge oriented, but is sensitive to prosodic specification at that edge. The presence of prosodic specification is inferred from vowel quality and vowel quantity; if stress moves elsewhere, vowel quality and quantity change radically in phonologically stressed morphemes, whereas traces of stress remain in lexically stressed morphemes. The present study is the first to investigate whether stress is a lexical property of Swedish morphemes by comparing mismatch negativity (MMN) responses to vowel quality and quantity changes in phonologically stressed and lexically stressed words. In a passive oddball paradigm, 15 native speakers of Swedish were presented with standards and deviants, which differed from the standards in formant frequency and duration. Given that vowel quality and quantity changes are associated with morphological derivations only in phonologically stressed words, MMN responses are expected to be greater in phonologically stressed words than in lexically stressed words that lack such an association. The results indicated that the processing differences between phonologically and lexically stressed words were reflected in the amplitude and topography of MMN responses. Confirming the expectation, MMN amplitude was greater for the phonologically stressed word than for the lexically stressed word and showed a more widespread topographic distribution. The brain did not only detect vowel quality and quantity changes but also used them to activate memory traces associated with derivations. The present study therefore implies that morphology is directly involved in the Swedish stress system and that changes in phonological shape due to stress shift cue upcoming stress and potential addition of a morpheme.

  • Perceptual correlates of Turkish word stress and their contribution to automatic lexical access

    2016. Hatice Zora, Mattias Heldner, Iris-Corinna Schwarz. Frontiers in Neuroscience 10

    Article

    Perceptual correlates of Turkish word stress and their contribution to lexical access were studied using the mismatch negativity (MMN) component in event-related potentials (ERPs). The MMN was expected to indicate if segmentally identical Turkish words were distinguished on the sole basis of prosodic features such as fundamental frequency (f0), spectral emphasis (SE) and duration. The salience of these features in lexical access was expected to be reflected in the amplitude of MMN responses. In a multi-deviant oddball paradigm, neural responses to changes in f0, SE, and duration individually, as well as to all three features combined, were recorded for words and pseudowords presented to 14 native speakers of Turkish. The word and pseudoword contrast was used to differentiate language-related effects from acoustic-change effects on the neural responses. First and in line with previous findings, the overall MMN was maximal over frontal and central scalp locations. Second, changes in prosodic features elicited neural responses both in words and pseudowords, confirming the brain’s automatic response to any change in auditory input. However, there were processing differences between the prosodic features, most significantly in f0: While f0 manipulation elicited a slightly right-lateralized frontally-maximal MMN in words, it elicited a frontal P3a in pseudowords. Considering that P3a is associated with involuntary allocation of attention to salient changes, the manipulations of f0 in the absence of lexical processing lead to an intentional evaluation of pitch change. f0 is therefore claimed to be lexically specified in Turkish. Rather than combined features, individual prosodic features differentiate language-related effects from acoustic-change effects. The present study confirms that segmentally identical words can be distinguished on the basis of prosodic information alone, and establishes the salience of f0 in lexical access.

  • Respiratory belts and whistles

    2016. Marcin Włodarczak, Mattias Heldner. Proceedings of Interspeech 2016, 510-514

    Conference

    This paper presents first results on using acoustic intensity of inhalations as a cue to speech initiation in spontaneous multiparty conversations. We demonstrate that inhalation intensity significantly differentiates between cycles coinciding with no speech activity, shorter (< 1 s) and longer stretches of speech. While the model fit is relatively weak, it is comparable to the fit of a model using kinematic features collected with Respiratory Inductance Plethysmography. We also show that incorporating both kinematic and acoustic features further improves the model. Given the ease of capturing breath acoustics, we consider the results to be a promising first step towards studying communicative functions of respiratory sounds. We discuss possible extensions to the data collection procedure with a view to improving predictive power of the model.

  • Respiratory turn-taking cues

    2016. Marcin Włodarczak, Mattias Heldner. Proceedings of Interspeech 2016, 1275-1279

    Conference

    This paper investigates to what extent breathing can be used as a cue to turn-taking behaviour. The paper improves on existing accounts by considering all possible transitions between speaker states (silent, speaking, backchanneling) and by not relying on global speaker models. Instead, all features (including breathing range and resting expiratory level) are estimated in an incremental fashion using the left-hand context. We identify several inhalatory features relevant to turn-management, and assess the fit of models with these features as predictors of turn-taking behaviour.

  • The Acoustics of Lexical Stress in Italian as a Function of Stress Level and Speaking Style

    2016. Anders Eriksson (et al.). Proceedings of Interspeech 2016, 1059-1063

    Conference

    The study is part of a series of studies describing the acoustics of lexical stress in a way that should be applicable to any language. The present database of recordings includes Brazilian Portuguese, English, Estonian, German, French, Italian and Swedish. The acoustic parameters examined are F0-level, F0-variation, Duration, and Spectral Emphasis. Values for these parameters, computed for all vowels (a little over 24000 vowels for Italian), are the data upon which the analyses are based. All parameters are examined with respect to their correlation with Stress (primary, secondary, unstressed), speaking Style (wordlist reading, phrase reading, spontaneous speech) and Sex of the speaker (female, male). For Italian, Duration was found to be the dominant factor by a wide margin, in agreement with previous studies. Spectral Emphasis was the second most important factor. Spectral Emphasis has not been studied previously for Italian but intensity, a related parameter, has been shown to correlate with stress. F0-level was also significantly correlated but not to the same degree. Speaker Sex turned out to be significant in many comparisons. The differences were, however, mainly a function of the degree to which a given parameter was used, not how it was used to signal lexical stress contrasts.

  • Breathing in Conversation

    2015. Marcin Włodarczak, Mattias Heldner, Jens Edlund. Proceedings of the 2nd European and the 5th Nordic Symposium on Multimodal Communication, 107-112

    Conference

    This paper attempts to draw the attention of the multimodal communication research community to what we consider a long overdue topic, namely respiratory activity in conversation. We submit that a turn towards spontaneous interaction is a natural extension of the recent interest in speech breathing, and is likely to offer valuable insights into mechanisms underlying organisation of interaction and collaborative human action in general, as well as to make advancement in existing speech technology applications. Particular focus is placed on the role of breathing as a perceptually and interactionally salient turn-taking cue. We also present the recording setup developed in the Phonetics Laboratory at Stockholm University with the aim of studying communicative functions of physiological and audio-visual breathing correlates in spontaneous multiparty interactions.

  • Communicative needs and respiratory constraints

    2015. Marcin Włodarczak, Mattias Heldner, Jens Edlund. 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), 3051-3055

    Conference

    This study investigates timing of communicative behaviour with respect to speaker’s respiratory cycle. The data is drawn from a corpus of multiparty conversations in Swedish. We find that while longer utterances (> 1 s) are tied, predictably, primarily to exhalation onset, shorter vocalisations are spread more uniformly across the respiratory cycle. In addition, nods, which are free from any respiratory constraints, are most frequently found around exhalation offsets, where respiratory requirements for even a short utterance are not satisfied. We interpret the results to reflect the economy principle in speech production, whereby respiratory effort, associated primarily with starting a new respiratory cycle, is minimised within the scope of speaker’s communicative goals.

  • Inhalation amplitude and turn-taking in spontaneous Estonian conversations

    2015. Kätlin Aare, Marcin Włodarczak, Mattias Heldner. Proceedings from Fonetik 2015 Lund, June 8-10, 2015, 1-5

    Conference

    This study explores the relationship between inhalation amplitude and turn management in four approximately 20 minute long spontaneous multiparty conversations in Estonian. The main focus of interest is whether inhalation amplitude is greater before turn onset than in the following inhalations within the same speaking turn. The results show that inhalations directly before turn onset are greater in amplitude than those later in the turn. The difference seems to be realized by ending the inhalation at a greater lung volume value, whereas the initial lung volume before inhalation onset remains roughly the same across a single turn. The findings suggest that the increased inhalation amplitude could function as a cue for claiming the conversational floor.

  • Neural correlates of lexical stress

    2015. Hatice Zora, Iris-Corinna Schwarz, Mattias Heldner. NeuroReport 26 (13), 791-796

    Article

    Neural correlates of lexical stress were studied using the mismatch negativity (MMN) component in event-related potentials. The MMN responses were expected to reveal the encoding of stress information into long-term memory and the contributions of prosodic features such as fundamental frequency (F0) and intensity toward lexical access. In a passive oddball paradigm, neural responses to changes in F0, intensity, and in both features together were recorded for words and pseudowords. The findings showed significant differences not only between words and pseudowords but also between prosodic features. Early processing of prosodic information in words was indexed by an intensity-related MMN and an F0-related P200. These effects were stable at right-anterior and mid-anterior regions. At a later latency, MMN responses were recorded for both words and pseudowords at the mid-anterior and posterior regions. The P200 effect observed for F0 at the early latency for words developed into an MMN response. Intensity elicited smaller MMN for pseudowords than for words. Moreover, a larger brain area was recruited for the processing of words than for the processing of pseudowords. These findings suggest earlier and higher sensitivity to prosodic changes in words than in pseudowords, reflecting a language-related process. The present study, therefore, not only establishes neural correlates of lexical stress but also confirms the presence of long-term memory traces for prosodic information in the brain.

  • Pitch Slope and End Point as Turn-Taking Cues in Swedish

    2015. Mattias Heldner, Marcin Włodarczak. Proceedings of the 18th International Congress of Phonetic Sciences

    Conference

    This paper examines the relevance of parameters related to slope and end-point of pitch segments for indicating turn-taking intentions in Swedish. Perceptually motivated stylization in Prosogram was used to characterize the last pitch segment in talkspurts involved in floor-keeping and turn-yielding events. The results suggest a limited contribution of pitch pattern direction and position of its endpoint in the speaker’s pitch range to signaling turn-taking intentions in Swedish.

  • Respiratory Properties of Backchannels in Spontaneous Multiparty Conversation

    2015. Marcin Włodarczak, Mattias Heldner. Proceedings of the 18th International Congress of Phonetic Sciences

    Conference

    In this paper we report on first results of a newly started project focussing on interactional functions of breathing in spontaneous multiparty conversation. Specifically, we investigate respiratory patterns associated with backchannels (short feedback expressions), and compare them with breathing cycles observed during longer stretches of speech or while listening to interlocutor’s speech. Overall, inhalations preceding backchannels were found to resemble those in quiet breathing to a large degree. The results are discussed in terms of temporal organisation and respiratory planning in these utterances. 

  • Temporal aspects of breathing and turn-taking in Swedish multiparty conversations

    2015. Jonna Hammarsten (et al.). Proceedings from Fonetik 2015, 47-50

    Conference

    Interlocutors use various signals to make conversations flow smoothly. Recent research has shown that respiration is one of the signals used to indicate the intention to start speaking. In this study, we investigate whether inhalation duration and speech onset delay within one’s own turn differ from when a new turn is initiated. Respiratory activity was recorded in two three-party conversations using Respiratory Inductance Plethysmography. Inhalations were categorised depending on whether they coincided with within-speaker silences or with between-speaker silences. Results showed that within-turn inhalation durations were shorter than inhalations preceding new turns. Similarly, speech onset delays were shorter within turns than before new turns. Both these results suggest that speakers ‘speed up’ preparation for speech inside turns, probably to indicate that they intend to continue.

  • The acoustics of word stress in English as a function of stress level and speaking style

    2015. Anders Eriksson, Mattias Heldner. 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), 41-45

    Conference

    This study of lexical stress in English is part of a series of studies, the goal of which is to describe the acoustics of lexical stress for a number of typologically different languages. When fully developed the methodology should be applicable to any language. The database of recordings so far includes Brazilian Portuguese, English (U.K.), Estonian, German, French, Italian and Swedish. The acoustic parameters examined are f0-level, f0-variation, Duration, and Spectral Emphasis. Values for these parameters, computed for all vowels, are the data upon which the analyses are based. All parameters are tested with respect to their correlation with stress level (primary, secondary, unstressed) and speaking style (wordlist reading, phrase reading, spontaneous speech). For the English data, the most robust results concerning stress level are found for Duration and Spectral Emphasis. f0-level is also significantly correlated but not quite to the same degree. The acoustic effect of phonological secondary stress was significantly different from primary stress only for Duration. In the statistical tests, speaker sex turned out as significant in most cases. Detailed examination showed, however, that the difference was mainly in the degree to which a given parameter was used, not how it was used to signal lexical stress contrasts. 

  • Backchannels and breathing

    2014. Kätlin Aare, Marcin Włodarczak, Mattias Heldner. Proceedings from FONETIK 2014, 47-52

    Conference

    The present study investigated the timing of backchannel onsets within speaker’s own and dialogue partner’s breathing cycle in two spontaneous conversations in Estonian. Results indicate that backchannels are mainly produced near the beginning, but also in the second half of the speaker’s exhalation phase. A similar tendency was observed in short non-backchannel utterances, indicating that timing of backchannels might be determined by their duration rather than their pragmatic function. By contrast, longer non-backchannel utterances were initiated almost exclusively right at the beginning of the exhalation. As expected, backchannels in the conversation partner’s breathing cycle occurred predominantly towards the end of the exhalation or at the beginning of the inhalation. 

  • Catching wind of multiparty conversation

    2014. Jens Edlund, Mattias Heldner, Marcin Włodarczak. Proceedings of Multimodal Corpora, 35-36

    Chapter

    The paper describes the design of a novel multimodal corpus of spontaneous multiparty conversations in Swedish. The corpus is collected with the primary goal of investigating the role of breathing and its perceptual cues for interactive control of interaction. Physiological correlates of breathing are captured by means of respiratory belts, which measure changes in cross sectional area of the rib cage and the abdomen. Additionally, auditory and visual correlates of breathing are recorded in parallel to the actual conversations. The corpus allows studying respiratory mechanisms underlying organisation of spontaneous conversation, especially in connection with turn management. As such, it is a valuable resource both for fundamental research and speech technology applications.

  • Is breathing prosody?

    2014. Jens Edlund, Mattias Heldner, Marcin Włodarczak. International Symposium on Prosody to Commemorate Gösta Bruce

    Conference

    Even though we may not be aware of it, much breathing in face-to-face conversation is both clearly audible and visible. Consequently, it has been suggested that respiratory activity is used in the joint coordination of conversational flow. For instance, it has been claimed that inhalation is an interactionally salient cue to speech initiation, that exhalation is a turn yielding device, and that breath holding is a marker of turn incompleteness (e.g. Local & Kelly, 1986; Schegloff, 1996). So far, however, few studies have addressed the interactional aspects of breathing (one notable exception is McFarland, 2001). In this poster, we will describe our ongoing efforts to fill this gap. We will present the design of a novel corpus of respiratory activity in spontaneous multiparty face-to-face conversations in Swedish. The corpus will contain physiological measurements relevant to breathing, high-quality audio, and video. Minimally, the corpus will be annotated with interactional events derived from voice activity detection and (semi-) automatically detected inhalation and exhalation events in the respiratory data. We will also present initial analyses of the material collected. The question is whether breathing is prosody and relevant to this symposium. What we do know is that the turn-taking phenomena that are of particular interest to us are closely (almost by definition) related to several prosodic phenomena, and in particular to those associated with prosodic phrasing, grouping and boundaries. Thus, we will learn more about respiratory activity in phrasing (and the like) through analyses of breathing in conversation. References: Local, John K., & Kelly, John. (1986). Projection and 'silences': Notes on phonetic and conversational structure. Human Studies, 9, 185-204. McFarland, David H. (2001). Respiratory markers of conversational interaction. Journal of Speech, Language, and Hearing Research, 44, 128-143. Schegloff, E. A. (1996). Turn organization: One intersection of grammar and interaction. In E. Ochs, E. A. Schegloff & S. A. Thompson (Eds.), Interaction and Grammar (pp. 52-133), Cambridge: Cambridge University Press.

  • Voices after midnight

    2014. Alexandra Berger (et al.). Proceedings from FONETIK 2014, 1-4

    Conference

    This study aimed to investigate how different parameters of the voice (jitter, shimmer, LTAS and mean pitch) are affected by a late night out. Three recordings were made: one early evening before the night out, one after midnight, and one on the next day. Each recording consisted of a one-minute reading and prolonged vowels. Five students took part in the experiment. Results varied among the participants, but some patterns were noticeable in all parameters. A trend towards increased mean pitch during the second recording was observed among four of the subjects. Somewhat unexpectedly, jitter and shimmer decreased between the first and second recordings and increased in the third one. Due to the lack of ethical testing, only a small number of participants were included. A larger sample is suggested for future research in order to generalize results.

  • Backchannel relevance spaces

    2013. Mattias Heldner, Anna Hjalmarsson, Jens Edlund. Nordic Prosody, 137-146

    Conference

    This contribution introduces backchannel relevance spaces – intervals where it is relevant for a listener in a conversation to produce a backchannel. By annotating and comparing actual visual and vocal backchannels with potential backchannels established using a group of subjects acting as third-party listeners, we show (i) that visual-only backchannels represent a substantial proportion of all backchannels; and (ii) that there are more opportunities for backchannels (i.e. potential backchannels or backchannel relevance spaces) than there are actual vocal and visual backchannels. These findings indicate that backchannel relevance spaces enable more accurate acoustic, prosodic, lexical (et cetera) descriptions of backchannel-inviting cues than descriptions based on the context of actual vocal backchannels only.

  • 3rd party observer gaze as a continuous measure of dialogue flow

    2012. Jens Edlund (et al.).

    Conference

    We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication. In addition, the results also suggest that there might be differences in the distribution of 3rd party observer gaze depending on how information-rich an utterance is. 

  • Conversational gaze in light and darkness

    2012. Elisabet Renklint (et al.). Proceedings from FONETIK 2012, 57-60

    Conference

    The way we use our gaze in face-to-face interaction is an important part of our social behavior. This exploratory study investigates the relationship between mutual gaze and joint silences and overlaps, where speaker changes and backchannels often occur. Seven dyadic conversations between two persons were recorded in a studio. Gaze patterns were annotated in ELAN to find instances of mutual gaze. Part of the study was conducted in total darkness as a way to observe what happens to our gaze-patterns when we cannot see our interlocutor, although the physical face-to-face condition is upheld. The results show a difference in the frequency of mutual gaze in conversation in light and darkness. 

  • On the dynamics of overlap in multi-party conversation

    2012. Kornel Laskowski, Mattias Heldner, Jens Edlund. INTERSPEECH 2012, 846-849

    Conference

    Overlap, although short in duration, occurs frequently in multi-party conversation. We show that its duration is approximately log-normal, and inversely proportional to the number of simultaneously speaking parties. Using a simple model, we demonstrate that simultaneous talk tends to end simultaneously less frequently than it begins simultaneously, leading to an arrow of time in chronograms constructed from speech activity alone. The asymmetry is significant and discriminative. It appears to be due to dialog acts which do not carry propositional content, and those which are not brought to completion.

  • On the effect of the acoustic environment on the accuracy of perception of speaker orientation from auditory cues alone

    2012. Jens Edlund, Mattias Heldner, Joakim Gustafson. INTERSPEECH 2012, 1482-1485

    Conference

    The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction and of embodied spoken dialogue systems, as sound source orientation of a speaker is connected to the head pose of the speaker, which is meaningful in a number of ways. The feature most often implicated for detection of sound source orientation is the inter-aural level difference - a feature which it is assumed is more easily exploited in anechoic chambers than in everyday surroundings. We expand here on our previous studies and compare detection of speaker orientation within and outside of the anechoic chamber. Our results show that listeners find the task easier, rather than harder, in everyday surroundings, which suggests that inter-aural level differences are not the only feature at play.

  • Who am I speaking at? Perceiving the head orientation of speakers from acoustic cues alone

    2012. Jens Edlund, Mattias Heldner, Joakim Gustafson. LREC Workshop on Multimodal Corpora for Machine Learning

    Chapter

    The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction and of embodied spoken dialogue systems, as sound source orientation of a speaker is connected to the head pose of the speaker, which is meaningful in a number of ways. We describe in passing some preliminary findings that led us onto this line of investigation, and in detail a study in which we extend an experiment design intended to measure perception of gaze direction to test instead for perception of sound source orientation. The results corroborate those of previous studies, and further show that people are very good at performing this skill outside of studio conditions as well. 

  • Detection thresholds for gaps, overlaps and no-gap-no-overlaps

    2011. Mattias Heldner. Journal of the Acoustical Society of America 130 (1), 508-513

    Article

    Detection thresholds for gaps and overlaps, that is, acoustic and perceived silences and stretches of overlapping speech in speaker changes, were determined. Subliminal gaps and overlaps were categorized as no-gap-no-overlaps. The established gap and overlap detection thresholds both corresponded to the duration of a long vowel, or about 120 ms. These detection thresholds are valuable for mapping the perceptual speaker change categories gaps, overlaps, and no-gap-no-overlaps into the acoustic domain. Furthermore, the detection thresholds allow generation and understanding of gaps, overlaps, and no-gap-no-overlaps in human-like spoken dialogue systems.

  • Pauses, gaps and overlaps in conversations

    2010. Mattias Heldner, Jens Edlund. Journal of Phonetics 38 (4), 555-568

    Article

    This paper explores durational aspects of pauses, gaps and overlaps in three different conversational corpora with a view to challenge claims about precision timing in turn-taking. Distributions of pause, gap and overlap durations in conversations are presented, and methodological issues regarding the statistical treatment of such distributions are discussed. The results are related to published minimal response times for spoken utterances and thresholds for detection of acoustic silences in speech. It is shown that turn-taking is generally less precise than is often claimed by researchers in the field of conversation analysis or interactional linguistics. These results are discussed in the light of their implications for models of timing in turn-taking and for interaction control models in speech technology. In particular, it is argued that the proportion of speaker changes that could potentially be triggered by information immediately preceding the speaker change is large enough for reactive interaction control models to be viable in speech technology.

  • Exploring the prosody of floor mechanisms in English using the fundamental frequency variation spectrum

    2009. Kornel Laskowski, Mattias Heldner, Jens Edlund. Proceedings of EUSIPCO 2009, 2539-2543

    Conference

    A basic requirement for participation in conversation is the ability to jointly manage interaction, and to recognize the attempts of interlocutors to do the same. Examples of management activity include efforts to acquire, re-acquire, hold, release, and acknowledge floor ownership, and they are often implemented using dedicated dialog act types. In this work, we explore the prosody of one class of such dialog acts, known as floor mechanisms, using a methodology based on a recently proposed representation of fundamental frequency variation. Models over the representation illustrate significant differences between floor mechanisms and other dialog act types, and lead to automatic detection accuracies in equal-prior test data of up to 75%. [...] description of floor mechanism prosody. We note that this work is also the first attempt to compute and model FFV spectra for multiparty rather than two-party conversation, as well as the first attempt to infer dialogue structure from non-anechoic-chamber recordings.

  • Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems

    2008. Kornel Laskowski (et al.). Proceedings of the 155th Meeting of the Acoustical Society of America, 5th EAA Forum Acusticum, and 9th SFA Congrès Français d'Acoustique (Acoustics2008), 3305-3310

    Conference

    Continuous modeling of intonation in natural speech has long been hampered by a focus on modeling pitch, of which several normative aspects are particularly problematic. The latter include, among others, the fact that pitch is undefined in unvoiced segments, that its absolute magnitude is speaker-specific, and that its robust estimation and modeling, at a particular point in time, rely on a patchwork of long-time stability heuristics. In the present work, we continue our analysis of the fundamental frequency variation (FFV) spectrum, a recently proposed instantaneous, continuous, vector-valued representation of pitch variation, which is obtained by comparing the harmonic structure of the frequency magnitude spectra of the left and right half of an analysis frame. We analyze the sensitivity of a task-specific error rate in a conversational spoken dialogue system to the specific definition of the left and right halves of a frame, resulting in operational recommendations regarding the framing policy and window shape.

  • Potential benefits of human-like dialogue behaviour in the call routing domain

    2008. Joakim Gustafson, Mattias Heldner, Jens Edlund. Perception in Multimodal Dialogue Systems, 240-251

    Chapter

    This paper presents a Wizard-of-Oz (Woz) experiment in the call routing domain that took place during the development of a call routing system for the TeliaSonera residential customer care in Sweden. A corpus of 42,000 calls was used as a basis for identifying problematic dialogues and the strategies used by operators to overcome the problems. A new Woz recording was made, implementing some of these strategies. The collected data is described and discussed with a view to explore the possible benefits of more human-like dialogue behaviour in call routing applications.

  • Towards human-like spoken dialogue systems

    2008. Jens Edlund (et al.). Speech Communication 50 (8-9), 630-645

    Article

    This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human-human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human-computer dialogue mimics or replicates some aspect of human-human dialogue, including human flaws and inconsistencies. Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems.

  • Exploring prosody in interaction control

    2005. Jens Edlund, Mattias Heldner. Phonetica 62 (2-4), 215-226

    Article

    This paper investigates prosodic aspects of turn-taking in conversation with a view to improving the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor. It examines the relationship between interaction control, the communicative function of which is to regulate the flow of information between interlocutors, and its phonetic manifestation. Specifically, the listener's perception of such interaction control phenomena is modelled. Algorithms for automatic online extraction of prosodic phenomena liable to be relevant for interaction control, such as silent pauses and intonation patterns, are presented and evaluated in experiments using Swedish map task data. We show that the automatically extracted prosodic features can be used to avoid many of the places where current dialogue systems run the risk of interrupting their users, as well as to identify suitable places to take the turn.

  • The Swedish NICE Corpus – Spoken dialogues between children and embodied characters in a computer game scenario

    2005. Linda Bell (et al.). Proceedings Interspeech 2005 - Eurospeech, 2765-2768

    Conference

    This article describes the collection and analysis of a Swedish database of spontaneous and unconstrained children-machine dialogues. The Swedish NICE corpus consists of spoken dialogues between children aged 8 to 15 and embodied fairytale characters in a computer game scenario. Compared to previously collected corpora of children's computer-directed speech, the Swedish NICE corpus contains extended interactions, including three-party conversation, in which the young users used spoken dialogue as the primary means of progression in the game.

  • On the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish

    2003. Mattias Heldner. Journal of Phonetics 31 (1), 39-62

    Article

    This study shows that increases in overall intensity and spectral emphasis are reliable acoustic correlates of focal accents in Swedish. They are both reliable in the sense that there are statistically significant differences between focally accented words and nonfocal ones for a variety of words, in any position of the phrase and for all speakers in the analyzed materials, and in the sense of their being useful for automatic detection of focal accents. Moreover, spectral emphasis turns out to be the more reliable correlate, as the influence on it of position in the phrase, word accent and vowel height was less pronounced and as it proved a better predictor of focal accents in general and for a majority of the speakers. Finally, the study has resulted in data for overall intensity and spectral emphasis that might prove important in modeling for speech synthesis.

  • Prosodic adaptation in human-computer interaction

    2003. Linda Bell, Joakim Gustafson, Mattias Heldner. Proceedings ICPhS 2003, 2453-2456

    Conference

    State-of-the-art speech recognizers are trained on predominantly normal speech and have difficulties handling either exceedingly slow and hyperarticulated or fast and sloppy speech. Explicitly instructing users on how to speak, however, can make the human–computer interaction stilted and unnatural. If it is possible to affect users’ speaking rate while maintaining the naturalness of the dialogue, this could prove useful in the development of future human–computer interfaces. Users could thus be subtly influenced to adapt their speech to better match the current capabilities of the system, so that errors can be reduced and the overall quality of the human–computer interaction is improved. At the same time, speakers are allowed to express themselves freely and naturally. In this article, we investigate whether people adapt their speech as they interact with an animated character in a simulated spoken dialogue system. A user experiment involving 16 subjects was performed to examine whether people who speak with a simulated dialogue system adapt their speaking rate to that of the system. The experiment confirmed that the users adapted to the speaking rate of the system, and no subjects afterwards seemed to be aware they had been affected in this way. Another finding was that speakers varied their speaking rate substantially in the course of the dialogue. In particular, problematic sequences where subjects had to repeat or rephrase the same utterance several times elicited slower speech.

  • Temporal effects of focus in Swedish

    2001. Mattias Heldner, Eva Strangert. Journal of Phonetics 29 (3), 329-361

    Article

    The four experiments reported concern the amount and domain of lengthening associated with focal accents in Swedish. Word, syllable and segment durations were measured in read sentences with focus in different positions. As expected, words with focal accents were longer than nonfocal words in general, but the amount of lengthening varied greatly, primarily due to speaker differences but also to position in the phrase and the word accent distinction. Most of the lengthening occurred within the stressed syllable. An analysis of the internal structure of stressed syllables showed that the phonologically long segments, whether vowels or consonants, were lengthened most, while the phonologically short vowels were hardly affected at all. Through this nonlinear lengthening, the contrast between long and short vowels in stressed syllables was sharpened in focus. Thus, the domain of focal accent lengthening includes at least the stressed syllable. Also, an unstressed syllable immediately to the right of the stressed one was lengthened in focus, while initial unstressed syllables, as well as unstressed syllables to the right of the first unstressed one, were not lengthened. Thus, we assume the domain of focal accent lengthening in Swedish to be restricted to the stressed syllable and the immediately following unstressed one.

  • F0 declination in spontaneous and read-aloud speech

    1996. Marc Swerts, Eva Strangert, Mattias Heldner. Fourth International Conference on Spoken Language, 1996. ICSLP 96. Proceedings., 1501-1504

    Conference

    The paper deals with a prosodic comparison of spontaneous and read-aloud speech. More specifically, the study reports data on F0 declination in these two speaking modes using Swedish materials. For both speaking styles, the analysis revealed negative slopes, a steepness-duration dependency (with declination being less steep in longer utterances than in shorter ones), and resetting at utterance boundaries. However, there was a difference in degree of declination between the two speaking styles, with read-aloud speech in general having steeper slopes, a more apparent time dependency and stronger resetting than spontaneous speech.

