Filled Pause
Research Center

Investigating 'um' and 'uh' and other hesitation phenomena

CCHP ReadMe.txt file

Crosslinguistic Corpus of Hesitation Phenomena (CCHP)
Last updated: 2018/06/22

Thank you for your interest in the CCHP. This ReadMe.txt file is
intended to give a technical overview of the corpus as well as
stand as a record of updates to the corpus. Although it is possible
to download only parts of the corpus, this file should accompany
all downloads.


This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported
License. For a copy of this license, see license.html.


The Crosslinguistic Corpus of Hesitation Phenomena (CCHP) is
designed for research into the first and second language use of
hesitation phenomena in various kinds of elicited speech.  In
particular, it is designed to allow a comparison across first
and second language speech by recording responses to parallel
elicitation tasks in both languages.  The recordings are
transcribed with special attention to the use of hesitation
phenomena but also with a view toward high transcription
accuracy and thus high usability by other researchers.  Since
the construction of the corpus is being funded by a Japanese
government research grant, it is being made publicly available
for the benefit of other researchers and learners.

Technical Description

Participants in the corpus are all university students and were
recruited through advertisements on university bulletin boards.
After signing a consent form which informed them of the public
distribution of the corpus, each participant was asked to make
three recordings of about 3-4 minutes each in each of their
first and second languages (in that order).  The elicitation
tasks for the three recordings were as follows (in this order):

- Reading aloud: Participants were given a printed text and were
  asked to read it aloud.  They were given no advance
  preparation time.

- Picture description: Participants were shown a picture or
  cartoon strip and asked to describe it.  This was repeated
  several times in order to fill the 3-4 minute target time.
  They were told they could take a few seconds to study each
  picture, but were asked to begin speaking as soon as possible.

- Topic narrative: Participants were given a topic to talk about
  freely (e.g., describe the sport of basketball).  They were
  asked to imagine that they were speaking to someone during
  this task.  If necessary, a second topic (e.g., table tennis)
  was given to fill the 3-4 minute target.

The participants were recorded in a sound-attenuated room using
an AKG C300 microphone channeled through an ART Dual Pre
microphone pre-amp to a Toshiba Dynabook R731 in mono 16-bit
48kHz quality.  The files were processed using the normalize
and noise reduction functions in Audacity (ver. 2.0.1).  The audio files are
provided in the CCHP archive as wav files for further analysis
and also as more portable mp3 files.
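
The recording format described above (mono, 16-bit, 48kHz) can be
checked with Python's standard wave module.  The following is only a
minimal sketch: the function names are illustrative and not part of
the corpus, and the silent in-memory wav stands in for a real CCHP
file so that the example runs without any corpus data.

```python
import io
import wave


def describe_wav(source):
    """Return basic format details for a wav file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),   # 1 = mono
            "bits": wav.getsampwidth() * 8,   # sample width is given in bytes
            "rate_hz": rate,
            "seconds": frames / rate,
        }


def matches_cchp_format(info):
    """True if the details match the mono 16-bit 48kHz format in this ReadMe."""
    return (info["channels"] == 1
            and info["bits"] == 16
            and info["rate_hz"] == 48000)


def make_silent_wav(seconds=1.0, rate=48000):
    """Build a silent mono 16-bit wav in memory (a stand-in for a corpus file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    info = describe_wav(make_silent_wav(0.5))
    print(info, matches_cchp_format(info))
```

To check a downloaded file, pass its path to describe_wav() in place
of the in-memory buffer.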

Each recording has been transcribed by two transcribers
independently.  The transcribers are native speakers of the
same native language as the participants and advanced speakers
of the second language the participants spoke in.  These two
transcriptions were checked by a third transcriber who focused
on resolving differences between the two transcriptions as well
as double-checking for errors.

The most detailed transcriptions are contained in the XML files.
For the most part, the annotations should be self-explanatory.
Following is an overview of the key elements.

  <TRANSCRIPT> represents one recording and attributes on this
  element indicate what language the participants spoke in and
  in response to which elicitation task.  Other attributes give
  some demographic details about the participant.

  <T>, which stands for "token", essentially represents standalone
  words or partial words (shown with a hash mark '#' at the
  cut-off point) as well as filled pauses.

  Filled pauses (typically uh/um in English, e-/e-to in
  Japanese) were marked as <T> elements like other words but
  have a FILLED-PAUSE='yes' attribute.

  <UTTERANCE> marks a complete utterance.  Utterance boundaries
  were determined by intonation primarily, though occasionally
  by the presence of long pauses followed by an utterance
  clearly intended as new.

  <PUNC> marks punctuation.  Though unspoken, of course, these
  are provided at the end of each utterance for processing
  purposes (e.g., for creating the minimal text transcriptions
  described below).

  <RP> demarcates repair sequences.  The reparandum is marked
  with an <O> tag (for "Original") and the repair is marked with
  an <E> (for "rEpair").  Editing terms like filled pauses or
  interjections were placed between <O> and <E> elements.  Also,
  when speakers made multiple attempts at repairs, these were
  marked as <E> elements.  Hence, the final <E> node under a
  <RP> node represents the repaired speech.

  <RT> denotes a repeat sequence.  The structure is similar to
  the <RP> sequence with <O> marking the original sequence of
  words and <E> marking the repetition, with multiple <E> tags
  showing iterated repetition.  In rare cases, there is a <T>
  element between the <O> and <E> elements indicating a filled
  pause.

  <FS> indicates a sequence of words which constitutes a false
  start.

  <OH> indicates an interjection of some sort (e.g., "Oh").

  <AHEM/> indicates throat-clearing (i.e., "ahem").

  <SIGH/> indicates a sigh.

  <ING/> indicates a sound made when sucking air in through
  closed teeth.

  <IA> is used to mark a sequence of words which transcribers
  found indeterminate.  In some cases, a guess has been provided
  within the <IA> element, but this was not always possible.

  <BREAK/> indicates the boundary between pictures or topics in
  the picture description and topic narrative elicitation tasks.

  <PAUSE/> indicates a silent pause.

  <C/> indicates a clause boundary. The type attribute indicates
  whether it is the start or end of a clause. The id attribute
  also indicates the type with a single final character (s or e).
  A start-end pair of boundaries will have an overlapping id
  attribute. Note, though, that because of repairs, repeats, and
  false starts, there is not a one-to-one correspondence between
  clause boundary start-end tags. For example, there are some
  starts with no ends and some ends with multiple starts.

  The durations of various intervals are shown using start and
  end attributes, which give the start time and the end time of
  the respective elements.  These times are measured
  from the start of the recording. In some cases, the intervals
  spread across multiple elements.
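
As a rough illustration of how the elements above might be processed,
the following Python sketch lists filled pauses with their durations
and builds a "fluent" word sequence by keeping only the final <E> of
each <RP> or <RT> sequence.  The sample fragment is hypothetical and
was written from the descriptions in this file, not copied from the
corpus.  It also makes several assumptions that should be checked
against the actual XML files: that the word itself is the text of the
<T> element, that the timing attributes are spelled "start" and "end"
in lowercase, and that times are in seconds.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment based on the element overview; real files will
# carry more attributes and may differ in detail.
SAMPLE = """\
<TRANSCRIPT>
  <UTTERANCE>
    <T start="0.00" end="0.31">the</T>
    <T FILLED-PAUSE="yes" start="0.31" end="0.74">uh</T>
    <RP>
      <O><T start="0.90" end="1.20">dog</T></O>
      <T FILLED-PAUSE="yes" start="1.20" end="1.55">um</T>
      <E><T start="1.60" end="1.88">cat</T></E>
    </RP>
    <T start="1.90" end="2.30">sleeps</T>
    <PUNC>.</PUNC>
  </UTTERANCE>
</TRANSCRIPT>
"""


def filled_pauses(root):
    """Return (text, duration) for every <T> with FILLED-PAUSE='yes'.

    Duration is None when the token has no start/end attributes.
    """
    result = []
    for token in root.iter("T"):
        if token.get("FILLED-PAUSE") != "yes":
            continue
        start, end = token.get("start"), token.get("end")
        duration = None
        if start is not None and end is not None:
            duration = float(end) - float(start)  # assumed to be seconds
        result.append(((token.text or "").strip(), duration))
    return result


def fluent_tokens(node):
    """Collect word tokens, resolving repairs and repeats.

    For <RP> and <RT>, only the final <E> child is kept.  Filled pauses
    and <FS> (false start) sequences are dropped; this is a design
    choice of the sketch, not a rule of the corpus.
    """
    words = []
    for child in node:
        if child.tag in ("RP", "RT"):
            attempts = child.findall("E")
            if attempts:
                words.extend(fluent_tokens(attempts[-1]))
        elif child.tag == "FS":
            continue
        elif child.tag == "T":
            if child.get("FILLED-PAUSE") != "yes":
                words.append((child.text or "").strip())
        else:
            words.extend(fluent_tokens(child))
    return words


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(filled_pauses(root))
    print(fluent_tokens(root))
```

For a real transcript, ET.parse(path).getroot() would replace the
ET.fromstring(SAMPLE) call.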

In addition to the detailed XML files, a plain text version of
the transcription is also available.  This is a simple formatted
text consisting of the <T> nodes (i.e., words and filled
pauses) plus silent pause marks ('_' = 250-1000 ms pause; '__' =
1000-5000ms pause; '___' = 5000+ ms pause).  This version is
probably not useful for detailed analysis, but may be useful
to get a quick overview of the speech.
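
The pause marks can be related to durations with a few lines of
Python.  This is a sketch under stated assumptions: a pause of exactly
1000 ms or 5000 ms is assigned to the longer class (the ranges above
do not say which side the boundary falls on), pauses under 250 ms get
no mark, and the marks are assumed to appear as whitespace-separated
items in the plain text files.

```python
from collections import Counter

PAUSE_MARKS = ("_", "__", "___")


def pause_mark(duration_ms):
    """Map a silent pause duration in milliseconds to its plain text mark."""
    if duration_ms >= 5000:
        return "___"
    if duration_ms >= 1000:   # boundary handling is an assumption
        return "__"
    if duration_ms >= 250:
        return "_"
    return ""                 # too short to be marked


def count_pause_marks(text):
    """Count each pause mark in a plain text transcription string."""
    counts = Counter(item for item in text.split() if item in PAUSE_MARKS)
    return {mark: counts.get(mark, 0) for mark in PAUSE_MARKS}


if __name__ == "__main__":
    print(pause_mark(1200))
    print(count_pause_marks("the _ uh dog __ sleeps _ ___"))
```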

Finally, TextGrid files are provided which give the duration
details of the transcription in the TextGrid format used by
Praat.  These files may be opened together with
the corresponding wav audio file in Praat for further analysis.
The durational information is equivalent to the interval
annotations in the xml files.
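
For users who want the interval data outside of Praat, the following
Python sketch pulls (start, end, label) triples out of a TextGrid with
a regular expression.  It assumes the files are saved in Praat's long
text format (with "intervals [n]:" blocks); the short text format
would need a different approach.  The sample content, including the
tier name "words", is hypothetical, and the sketch ignores tier
boundaries, returning the intervals of all tiers in file order.

```python
import re

# One interval block in Praat's long text format.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[-\d.eE+]+)\s*'
    r'xmax = (?P<xmax>[-\d.eE+]+)\s*'
    r'text = "(?P<text>(?:[^"]|"")*)"'
)

# Hypothetical TextGrid content for illustration only.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = "the"
        intervals [2]:
            xmin = 0.5
            xmax = 1.0
            text = "uh"
        intervals [3]:
            xmin = 1.0
            xmax = 1.5
            text = ""
'''


def read_intervals(textgrid_text):
    """Return a list of (xmin, xmax, label) for every interval found."""
    rows = []
    for match in INTERVAL.finditer(textgrid_text):
        # Praat writes a literal double quote inside a label as "".
        label = match.group("text").replace('""', '"')
        rows.append((float(match.group("xmin")),
                     float(match.group("xmax")),
                     label))
    return rows


if __name__ == "__main__":
    for xmin, xmax, label in read_intervals(SAMPLE):
        print(f"{xmin:.2f}-{xmax:.2f}  {label!r}")
```

A file from the archive could be read with
open(path, encoding="utf-8").read() and passed to read_intervals().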

All of the text-based files are encoded in UTF-8 and should be
readable in almost any text editor.

News and Updates

2012/09/01 - This is the initial release of CCHP materials.
  This release includes audio files and transcripts for six
  participants: p102-p104, p106-p108.  The transcription
  process is still ongoing.  Thus, transcripts in this release
  do not yet contain time markings and there are no Praat
  TextGrid files yet.

2012/09/19 - This release adds files for five more participants
  (p109-p114).  However, the collections have not been updated
  yet since a further release is expected soon with additional
  participants.  The collections will be updated in the next
  release.

2012/10/05 - This release adds data from four more participants
  (p115-p118).  This marks the halfway point (15 of 30) for
  transcribing the corpus.  The remaining half will be worked
  on during the coming months and all of the transcription
  files should be available on-line in early 2013.

2018/06/22 - This release catches up on a lot of changes to
  the corpus over the past few years. First, it now includes
  all the participants (p101-p135). Also, clause information is
  now annotated. In addition, there are Praat TextGrid files
  that show speech and pause interval information as well as
  filled pause intervals and their immediate word contexts.


The CCHP was compiled by and is maintained by Ralph Rose,
Center for English Language Education (CELESE) in the Waseda
University Faculty of Science and Engineering in Tokyo, Japan.

Other Research Staff (former and current)

Hiroaki Suzuki
Junichi Inagaki
Masayuki Motoori
Yukikatsu Fukuda
Tatsuhiro Nomaguchi
Aiko Oue
Hinako Masuda
Wataru Okuzumi
Yutaka Shirasugi
Richard Jayson Varela


The CCHP was created and developed under research grants-in-aid
from the Japan Society for the Promotion of Science (JSPS), as
follows:

“Hesitation Phenomena in Second Language Learning”
Project #24520661
Principal investigator: Ralph Rose

“Relationship between Silent and Filled Pauses and Syntactic
Structure in Second Language Use”
Project #15K02765
Principal investigator: Ralph Rose