HKNSC: A dynamic bilingual narrative speech corpus from Hong Kong and its potential for linguistic research
Author(s)
Date Issued
2024
Citation
Yang, Y. (8 December 2024). HKNSC: A dynamic bilingual narrative speech corpus from Hong Kong and its potential for linguistic research. 2024 Annual Research Forum of the Linguistic Society of Hong Kong, Hong Kong Shue Yan University.
Type
Conference Paper
Abstract
Background: Until recently, language documentation has largely focused on collecting
data on endangered languages, and the linguistic profiles of bilingual speakers in
immigration settings have not been well documented, although there have been some attempts
to document the language development of bilingual children. The current project aims to
collect narrative speech data from late bilingual speakers, whose language acquisition
differs from that of simultaneous or early bilinguals because a full-fledged first
language (L1) is already in place when they start to learn their second language
(L2). HKNSC stands for ‘Hong Kong Narrative Speech Corpus’. At the initial stage, only
Mandarin-speaking immigrants were invited to contribute to the corpus. The team aims to
invite immigrants with a wider range of L1s and will continue to update the corpus.
Participants: Three groups of participants took part in this study. The target
bilingual group consisted of 73 Mandarin-speaking immigrants who were born and raised
in Northern China and arrived in Hong Kong after puberty. The other two groups were native
speakers of Cantonese (N=59) and Mandarin (N=40). The bilingual group and the Cantonese
speakers told the story in both Cantonese and Mandarin, while the monolingual Mandarin
speakers told the story in Mandarin only.
Materials: The wordless picture book Frog, where are you? (Mayer, 1969) was used to elicit
spontaneous speech data. Participants were asked to tell the whole story at a self-paced speed.
They were allowed to refer to the book and were encouraged to tell the story page by page in
order not to miss the key information. In total, 132 Cantonese stories (73 bilinguals + 59 natives)
and 155 Mandarin stories (73 bilinguals + 40 natives + 42 Cantonese speakers) were collected.
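The story totals above can be checked with a short tally. This is only a minimal sketch: the group sizes are the ones reported in the abstract, while the dictionary layout and variable names are illustrative assumptions rather than part of the corpus.

```python
# Number of stories contributed by each group, as reported in the abstract.
# The dictionary layout and names are illustrative, not part of HKNSC.
cantonese_stories = {"bilinguals": 73, "native Cantonese speakers": 59}
mandarin_stories = {
    "bilinguals": 73,
    "native Mandarin speakers": 40,
    "Cantonese speakers": 42,
}

def total(stories):
    """Sum the story counts across all contributing groups."""
    return sum(stories.values())

cantonese_total = total(cantonese_stories)
mandarin_total = total(mandarin_stories)
print(cantonese_total, mandarin_total)  # prints "132 155"
```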
Corpus Construction: The recordings were first transcribed into text using automatic speech-to-text tools and subsequently verified by trained linguists to ensure transcription accuracy.
Following word and sentence segmentation, the corpus was constructed using CLAN
(MacWhinney, 1996). After a final review, the corpus will be made publicly available to
researchers and students of linguistics and language acquisition in early 2025.
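As a rough illustration of how transcripts prepared for CLAN might be processed once released, the sketch below reads main-tier utterances from a CHAT-style file. It assumes the usual CHAT layout (header lines start with "@", speaker lines with "*", dependent tiers with "%"). The sample transcript, the speaker code PAR, and the function name main_tier_utterances are invented for illustration and are not taken from HKNSC; continuation lines and CHAT annotation codes are ignored for simplicity.

```python
import re

# Invented sample in CHAT layout; NOT taken from HKNSC.
SAMPLE = """@Begin
@Languages:\teng
@Participants:\tPAR Participant
*PAR:\tthe boy looks for the frog .
%com:\thypothetical comment tier
*PAR:\tthe dog runs away .
@End
"""

def main_tier_utterances(chat_text):
    """Return (speaker, tokens) pairs, one per main-tier line."""
    utterances = []
    for line in chat_text.splitlines():
        match = re.match(r"\*([A-Za-z0-9]+):\s+(.*)", line)
        if match is None:
            continue  # skip "@" headers and "%" dependent tiers
        speaker, text = match.groups()
        # Drop utterance terminators so that only words are counted.
        tokens = [t for t in text.split() if t not in {".", "?", "!"}]
        utterances.append((speaker, tokens))
    return utterances

utterances = main_tier_utterances(SAMPLE)
word_count = sum(len(tokens) for _, tokens in utterances)
print(len(utterances), word_count)  # prints "2 10"
```

Counts of this kind would be a first step toward the lexical measures a narrative corpus supports; a real analysis would more likely rely on CLAN's own commands or an established CHAT parser.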
Potential for Research: This corpus will be valuable for scholars in the fields of linguistics
and language acquisition. For instance, the data can be used to investigate the following topics
regarding L1 and L2 Cantonese: 1) the merging status of checked syllables; 2) the production
of lexical tones; 3) the use of sentence-final particles; 4) lexical/syntactic complexity; and 5)
the use of cohesive devices. Moreover, researchers may also use the corpus data to examine
the cross-linguistic influence of the L1 and L2 at various linguistic levels.