CCHP ReadMe.txt file
================================================================
Crosslinguistic Corpus of Hesitation Phenomena (CCHP)
http://filledpause.com/chp/cchp
Last updated: 2018/06/22
================================================================
Thank you for your interest in the CCHP. This ReadMe.txt file is
intended to give a technical overview of the corpus as well as
stand as a record of updates to the corpus. Although it is possible
to download only parts of the corpus, this file should accompany
all downloads.
----------------------------------------------------------------
License
This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
https://creativecommons.org/licenses/by-nc-sa/3.0/
(see also license.html).
----------------------------------------------------------------
Overview
The Crosslinguistic Corpus of Hesitation Phenomena (CCHP) is
designed for research into the first and second language use of
hesitation phenomena in various kinds of elicited speech. In
particular, it is designed to allow a comparison across first
and second language speech by recording responses to parallel
elicitation tasks in both languages. The recordings are
transcribed with special attention to the use of hesitation
phenomena but also with a view toward high transcription
accuracy and thus high usability by other researchers. Since
the construction of the corpus is being funded by a Japanese
government research grant, it is being made publicly available
for the benefit of other researchers and learners.
----------------------------------------------------------------
Technical Description
Participants in the corpus are all university students and were
recruited through advertisements on university bulletin boards.
After signing a consent form which informed them of the public
distribution of the corpus, each participant was asked to make
three recordings of about 3-4 minutes each in each of their
first and second languages (in that order). The elicitation
tasks for the three recordings were as follows (in the order
performed).
- Reading aloud: Participants were given a printed text and were
asked to read it aloud. They were given no advance
preparation time.
- Picture description: Participants were shown a picture or
cartoon strip and asked to describe it. This was repeated
several times in order to fill the 3-4 minute target time.
They were told they could take a few seconds to study each
picture, but were asked to begin speaking as soon as possible.
- Topic narrative: Participants were given a topic to talk about
freely (e.g., describe the sport of basketball). They were
asked to imagine that they were speaking to someone during
this task. If necessary, a second topic (e.g., table tennis)
was given to fill the 3-4 minute target.
The participants were recorded in a sound-attenuated room using
an AKG C300 microphone channeled through an ART Dual Pre
microphone pre-amp to a Toshiba Dynabook R731 in mono 16-bit
48kHz quality. The files were processed using the normalize
and noise reduction functions in Audacity (ver. 2.0.1;
http://audacity.sourceforge.net/). The audio files are
provided in the CCHP archive as wav files for further analysis
and also as more portable mp3 files.
Each recording has been transcribed by two transcribers
independently. The transcribers are native speakers of the
participants' first language and advanced speakers of the
participants' second language. These two
transcriptions were checked by a third transcriber who focused
on resolving differences between the two transcriptions as well
as double-checking for errors.
The most detailed transcriptions are contained in the XML files.
For the most part, the annotations should be self-explanatory.
Following is an overview of the key elements.
<TRANSCRIPT> represents one recording. Attributes on this
element indicate the language the participant spoke and the
elicitation task the recording responds to. Other attributes
give some demographic details about the participant.
<T>, which stands for "token", essentially represents
standalone words or partial words (shown with a hash mark '#'
at the cut-off point) as well as filled pauses.
Filled pauses (typically uh/um in English, e-/e-to in
Japanese) were marked as <T> elements like other words but
have a FILLED-PAUSE='yes' attribute.
<UTTERANCE> marks a complete utterance. Utterance boundaries
were determined primarily by intonation, though occasionally
by the presence of a long pause followed by speech clearly
intended as a new utterance.
<PUNC> marks punctuation. Though unspoken, of course, these
are provided at the end of each utterance for processing
purposes (e.g., for creating the minimal text transcriptions
described below).
<RP> demarcates repair sequences. The reparandum is marked
with an <O> tag (for "Original") and the repair is marked with
an <E> (for "rEpair"). Editing terms like filled pauses or
interjections were placed between <O> and <E> elements. Also,
when speakers made multiple attempts at repairs, these were
marked as <E> elements. Hence, the final <E> node under a
<RP> node represents the repaired speech.
<RT> denotes a repeat sequence. The structure is similar to
the <RP> sequence with <O> marking the original sequence of
words and <E> marking the repetition, with multiple <E> tags
showing iterated repetition. In rare cases, there is a <T>
element between the <O> and <E> elements indicating a filled
pause.
<FS> indicates a sequence of words which constitutes a false
start.
<OH> indicates an interjection of some sort (e.g., "Oh",
"Ah").
<AHEM/> indicates throat-clearing (i.e., "ahem").
<SIGH/> indicates a sigh.
<ING/> indicates a sound made when sucking air in through
closed teeth.
<IA> is used to mark a sequence of words which transcribers
found indeterminate. In some cases, a guess has been provided
within the <IA> element, but this was not always possible.
<BREAK/> indicates the boundary between pictures or topics in
the picture description and topic narrative elicitation tasks.
<PAUSE/> indicates a silent pause.
<C/> indicates a clause boundary. The type attribute indicates
whether it is the start or end of a clause. The id attribute
also indicates the type with a single final character (s or e).
A start-end pair of boundaries will have an overlapping id
attribute. Note, though, that because of repairs, repeats, and
false starts, there is not a one-to-one correspondence between
clause boundary start-end tags. For example, there are some
starts with no ends and some ends with multiple starts.
The durations of various intervals are given by start and end
attributes on the respective elements, indicating the start
time and end time of each interval. These times are measured
from the start of the recording. In some cases, an interval
spans multiple elements.
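The element descriptions above can be queried with standard XML
tools. The following is a minimal sketch using the Python standard
library. The element and attribute names (<TRANSCRIPT>, <T>,
FILLED-PAUSE, <RP>, <O>, <E>) follow the descriptions above, but the
sample snippet itself is invented for illustration and is not taken
from the corpus.

```python
# Sketch: counting filled pauses and extracting repaired speech from
# a CCHP-style transcript. The sample XML below is hypothetical.
import xml.etree.ElementTree as ET

sample = """
<TRANSCRIPT LANG="en" TASK="topic-narrative">
  <UTTERANCE>
    <T>well</T>
    <T FILLED-PAUSE="yes">uh</T>
    <RP>
      <O><T>basketball</T></O>
      <E><T>baseball</T></E>
    </RP>
    <PUNC>.</PUNC>
  </UTTERANCE>
</TRANSCRIPT>
"""

root = ET.fromstring(sample)

# Filled pauses are <T> elements carrying FILLED-PAUSE='yes'.
filled_pauses = [t for t in root.iter("T") if t.get("FILLED-PAUSE") == "yes"]
print(len(filled_pauses))  # 1

# In a repair, the final <E> child holds the repaired speech.
for rp in root.iter("RP"):
    final_e = rp.findall("E")[-1]
    repaired = " ".join(t.text for t in final_e.iter("T"))
    print(repaired)  # baseball
```

The same pattern extends to the other elements (<RT>, <FS>, <C/>,
and so on) by changing the tag names passed to iter() and findall().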
In addition to the detailed XML files, a plain text version of
the transcription is also available. This is a simple formatted
text consisting of the <T> nodes (i.e., words and filled
pauses) plus silent pause marks ('_' = 250-1000 ms pause; '__' =
1000-5000 ms pause; '___' = 5000+ ms pause). This version is
probably not useful for detailed analysis, but may be useful
to get a quick overview of the speech.
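For a quick pass over the plain-text version, the pause marks can be
separated from words with a few lines of code. This sketch assumes
the pause conventions described above ('_' = 250-1000 ms, '__' =
1-5 s, '___' = 5+ s); the sample line is invented for illustration.

```python
# Sketch: tokenizing a plain-text transcript line into words and
# pause marks, using the pause conventions described in the ReadMe.
PAUSE_LABELS = {"_": "short", "__": "medium", "___": "long"}

def tokenize(line):
    """Split a transcript line into (kind, value) pairs."""
    tokens = []
    for item in line.split():
        if item in PAUSE_LABELS:
            tokens.append(("pause", PAUSE_LABELS[item]))
        else:
            tokens.append(("word", item))
    return tokens

print(tokenize("so _ I like __ um basketball"))
```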
Finally, TextGrid files are provided which give the duration
details of the transcription in the TextGrid format used by
Praat (praat.org). These files may be opened together with
the corresponding wav audio file in Praat for further analysis.
The durational information is equivalent to the interval
annotations in the XML files.
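Besides opening the TextGrid files in Praat itself, the intervals can
be read programmatically. The following is a minimal sketch that
pulls (start, end, label) triples out of a TextGrid with the Python
standard library alone, assuming the long ("verbose") TextGrid text
format with xmin/xmax/text lines; for serious work, Praat or a
dedicated TextGrid library is a better choice. The sample string is
invented for illustration.

```python
# Sketch: extracting interval triples from a long-format TextGrid
# with a regular expression. The sample TextGrid fragment is
# hypothetical, not taken from the corpus.
import re

sample = '''
        intervals [1]:
            xmin = 0
            xmax = 1.20
            text = "well"
        intervals [2]:
            xmin = 1.20
            xmax = 1.55
            text = "uh"
'''

pattern = re.compile(r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"')
intervals = [(float(a), float(b), t) for a, b, t in pattern.findall(sample)]
print(intervals)  # [(0.0, 1.2, 'well'), (1.2, 1.55, 'uh')]
```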
All of the text-based files are encoded in UTF-8 and should be
readable in almost any text editor.
----------------------------------------------------------------
News and Updates
2012/09/01 - This is the initial release of CCHP materials.
This release includes audio files and transcripts for six
participants: p102-p104, p106-p108. The transcription
process is still ongoing. Thus, transcripts in this release
do not yet contain time markings and there are no Praat
TextGrid files yet.
2012/09/19 - This release adds files for five more participants
(p109-p114). However, the collections have not been updated
yet since a further release is expected soon with additional
participants. The collections will be updated in the next
release.
2012/10/05 - This release adds data from four more participants
(p115-p118). This marks the halfway point (15 of 30) for
transcribing the corpus. The remaining half will be worked
on during the coming months and all of the transcription
files should be available on-line in early 2013.
2018/06/22 - This release catches up on a lot of changes to
the corpus over the past few years. First, it now includes
all the participants (p101-p135). Also, clause information is
now annotated. In addition, there are Praat TextGrid files
that show speech and pause interval information as well as
filled pause intervals and their immediate word contexts.
----------------------------------------------------------------
Credits
The CCHP was compiled and is maintained by Ralph Rose
<rose@waseda.jp>, Center for English Language Education (CELESE)
in Waseda University Faculty of Science and Engineering in
Tokyo, Japan.
Other Research Staff (former and current)
Hiroaki Suzuki
Junichi Inagaki
Masayuki Motoori
Yukikatsu Fukuda
Tatsuhiro Nomaguchi
Aiko Oue
Hinako Masuda
Wataru Okuzumi
Yutaka Shirasugi
Richard Jayson Varela
----------------------------------------------------------------
Sponsorship
The CCHP was created and developed under research grants-in-aid
from the Japan Society for the Promotion of Science (JSPS), as
follows.
“Hesitation Phenomena in Second Language Learning”
Project #24520661
Principal investigator: Ralph Rose
https://kaken.nii.ac.jp/en/grant/KAKENHI-PROJECT-24520661/
“Relationship between Silent and Filled Pauses and Syntactic
Structure in Second Language Use”
Project #15K02765
Principal investigator: Ralph Rose
https://kaken.nii.ac.jp/en/grant/KAKENHI-PROJECT-15K02765/