Filled Pause
Research Center

Investigating 'um' and 'uh' and other hesitation phenomena

June 1st, 2020

Speech Prosody 2020, yes ... no ... YES!

When I heard that Speech Prosody would be held in Tokyo in 2020, I knew that I wanted to be sure to join in the fun. I had wanted to attend a Speech Prosody conference for some years, but for one reason or another, it never really fit my schedule or budget. Being in the same city, though, I had no excuse. So, I eagerly submitted a paper.

And then, the coronavirus hit in February and grew more and more serious. Speech Prosody was to be held in May, so early on, it seemed it would be fine. There was a lot of communication between the organizers and the participants, and talk of moving the conference to October, or to 2021, or online, or even cancelling it completely. Finally, the choice was made: move it online, but not in real time. Instead, presenters were asked to prepare videos of their presentations to be shared via YouTube. This was a first for me, but an enjoyable one.

[Image: Speech Prosody 2020 presentation still by R. Rose]

I'll talk about my own presentation below. First, let me mention a few others that I found really interesting. The only presentation that was explicitly and exclusively focused on filled pauses was one by Vered Silber-Varod (a DiSS acquaintance), Daphna Amit, and Anat Lerner. Their aim was to model how interlocutors' filled pause usage changes as their communication continues. This kind of 'accommodation' has been studied in several contexts where speakers tend to change certain speech patterns so that they more closely align with the patterns of their interlocutor.

In Silber-Varod et al.'s case, they looked specifically at whether the rate of filled pause use entrains between Hebrew-speaking interlocutors in their corpus. One twist is that these are not 'balanced' speakers: one is in the role of leader and the other, follower. This asymmetry often leads speakers to assert their roles by differentiating their speech, so there's a tension within the task. Nevertheless, they observed that speakers do indeed entrain toward each other. And not only that: even after the speakers switch roles, they continue to entrain rather than simply resetting and starting again. I think this has serious implications for the kind of conclusions we can draw from corpora of interactive speech. Speakers are not necessarily behaving in their default manner.
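To make the idea concrete, here's a rough sketch of how one might quantify this kind of entrainment. To be clear, this is my own illustration, not Silber-Varod et al.'s actual method, and the function names, window size, and rate units are all my own assumptions:

```python
# A minimal sketch (not Silber-Varod et al.'s method): compute each
# speaker's filled pause rate in fixed time windows, then track how far
# apart the two speakers' rates are. A shrinking gap over the course of
# the conversation is consistent with entrainment.

from typing import Dict, List, Tuple

Turn = Tuple[float, int, int]  # (start time in sec, word count, filled pause count)

def fp_rates_by_window(turns: List[Turn], window: float = 60.0) -> Dict[int, float]:
    """Filled pauses per 100 words, pooled into fixed-width time windows."""
    words: Dict[int, int] = {}
    fps: Dict[int, int] = {}
    for start, n_words, n_fps in turns:
        w = int(start // window)
        words[w] = words.get(w, 0) + n_words
        fps[w] = fps.get(w, 0) + n_fps
    return {w: 100.0 * fps[w] / words[w] for w in words if words[w] > 0}

def rate_gap_over_time(a: List[Turn], b: List[Turn],
                       window: float = 60.0) -> List[Tuple[int, float]]:
    """Per-window absolute difference between the two speakers' rates."""
    ra = fp_rates_by_window(a, window)
    rb = fp_rates_by_window(b, window)
    return [(w, abs(ra[w] - rb[w])) for w in sorted(set(ra) & set(rb))]
```

If the gap values trend downward across windows, and keep trending downward after the role switch, that would mirror their finding.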

Another interesting presentation was by Misaki Kato, Shigeto Kawahara, and Kaori Idemaru on speech rate normalization. They looked at the influence of speech rate on the perception of vowel length and voice onset time (known to differentiate voiced from voiceless stop consonants). They were examining a specific feature of these phenomena that I won't go into here, but they did show the well-established result that after longer (slower) speech, listeners are more likely to perceive a constant-length vowel as a short vowel, while after shorter (faster) speech, they perceive the same vowel as a long vowel.

This got me thinking about lengthening as a disfluency phenomenon. What can be seen as lengthening may be strongly conditioned by the rate of speech that comes just before it. I remember listening to a presentation once where the presenters identified lengthened segments simply as those that were significantly longer than average. But without taking local speech rate into account, this approach would yield many false negatives and positives.
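Here's a quick sketch of what I mean. This is just my own illustration under my own assumptions (segment labels and a fixed-size local context window), not anyone's published method: compare each segment's duration against the mean duration of the segments immediately before it, rather than against a global average.

```python
# A rough sketch of rate-normalized lengthening detection: a segment is
# flagged only if it is long relative to the local speech rate (the mean
# duration of the preceding segments), not the global average.

import math
from statistics import mean
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (label, start in sec, end in sec)

def local_mean_duration(segments: List[Segment], i: int,
                        n_context: int = 5) -> float:
    """Mean duration of up to n_context segments immediately before index i."""
    context = segments[max(0, i - n_context):i]
    if not context:
        return math.nan
    return mean(end - start for _, start, end in context)

def find_lengthened(segments: List[Segment], factor: float = 2.0,
                    n_context: int = 5) -> List[str]:
    """Labels of segments more than `factor` times their local mean duration."""
    flagged = []
    for i, (label, start, end) in enumerate(segments):
        base = local_mean_duration(segments, i, n_context)
        if not math.isnan(base) and (end - start) > factor * base:
            flagged.append(label)
    return flagged
```

Under this scheme, the same constant-duration segment would be flagged after fast speech but not after slow speech, which parallels the perceptual pattern Kato et al. described.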

One more presentation I have to write about was given by my Waseda colleagues, Mariko Kondo, Sylvain Detey, and their collaborators. They are working on a system to automatically compute fluency ratings. Their system is far more complex than what I'm working on because their goal is a higher-stakes algorithm that can be used for learner classification. I listened to their work with great interest and look forward to detailed conversations with them later (we've already had some, even before Speech Prosody).

As for me, I talked about some validation work I've done with my Fluidity application. I had been wanting to formally test its ability to detect various fluency parameters (silence, speech, pause rate, pause duration, filled pause rate). I did so by piping the L2 speech samples from the Crosslinguistic Corpus of Hesitation Phenomena (CCHP) through it and comparing its output to manual measurements. It turns out it is relatively accurate on most measures, which is great. However, the filled pause detection mechanism is not so good, with only a low correlation (about 0.2) between Fluidity and the manual measurements. So, there's room for improvement. Still, it was a lot of fun to create my first YouTube-based research presentation. I learned an awful lot.
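For the curious, the comparison itself is straightforward: one correlation per fluency parameter. Here's a sketch of the kind of check I mean; the variable names and numbers below are made up for illustration and are not Fluidity's actual output or real CCHP data:

```python
# A minimal sketch of validating automatic measurements against manual
# ones: one Pearson correlation per fluency parameter. The values are
# dummy data for illustration only.

from statistics import correlation  # Python 3.10+

manual = {
    "pause_duration_sec":    [0.62, 0.48, 0.75, 0.51, 0.66],
    "filled_pauses_per_min": [2.1, 3.4, 1.8, 4.0, 2.7],
}
automatic = {
    "pause_duration_sec":    [0.60, 0.50, 0.71, 0.55, 0.63],
    "filled_pauses_per_min": [1.2, 3.9, 2.5, 1.7, 3.1],
}

for measure in manual:
    r = correlation(manual[measure], automatic[measure])
    print(f"{measure}: r = {r:.2f}")
```

A high r on a measure means the automatic tool tracks the manual annotations well; the roughly 0.2 figure for filled pauses is what tells me that detector needs work.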

Here's a direct link to the paper for those who are interested: link.

[Note: This post was written in September, 2020. However, in order to preserve the chronology of the blog, it has been dated to reflect when the described events actually took place.]