Heritage Language Variation and Change in Toronto
The HerLD Corpus

A primary goal of the Heritage Language Variation and Change (HLVC) project is to construct a unique corpus of conversational speech in six Heritage Languages spoken in the Greater Toronto Area. This corpus, the HerLD Corpus, contains recordings in the Heritage Languages of speakers representing three generations. Our goal is to record 40 speakers, balanced for age and sex, for each of the three generations (and 20 speakers for languages where only two generations exist in Toronto, i.e., Korean and Faetar). As of Summer 2013, we have recorded the following distribution:

HLVC Speakers recorded (Heritage (GTA) and homeland data). Notes accompanying the table:

  • 10 collected by our HLVC Team, plus Homeland Cantonese data to be contributed by May Chan.
  • See Nagy 1996.
  • Hikyoung Lee and her research team are collecting the Korean homeland data in Seoul.
  • From the Russian National Corpus.
  • Collection by our HLVC Team.

For each speaker recorded, there are several types of data. (See file-naming conventions.)

1. Primary content: a set of audio-recorded interviews with time-aligned transcriptions (IV)

  • Interviews average one hour in length. Normally the speakers include one Heritage Language participant and one or two Heritage Language-speaking interviewers. Sometimes members of the participant's family are present. All segments of the interview conducted in the Heritage Language are transcribed as fully as possible. Lengthy switches to English are not always transcribed. The interview modules are available in English and in each Heritage Language.

The audio interviews are in .wav and .mp3 formats.

Time-aligned transcriptions have been produced (or are being produced) for each interview.

  • Transcriptions are constructed using ELAN and are available as ELAN annotation (.eaf) files. (These are easily converted to many other formats.) All are Unicode-compliant.
  • Transcriptions are aligned at the phrase level.
  • Switches to English (and other languages) are marked.
  • Cantonese is transcribed phonetically, using the Jyutping system. We are adding colloquial Cantonese character transcription.
    • Jyutping is as in Matthews, Stephen and Virginia Yip. 1994. Cantonese: a comprehensive grammar. London, New York: Routledge. 13-22.
  • Faetar is transcribed in IPA (International Phonetic Alphabet) and loosely translated into English.
  • Korean is transcribed in Hangul.
  • Hungarian is transcribed orthographically.
  • Italian is transcribed orthographically.
  • Polish is transcribed orthographically.
  • Russian and Ukrainian are transcribed using Comrie & Corbett's transliteration system.
    • Comrie, Bernard & Greville Corbett. 2002. The Slavonic Languages. London & New York: Routledge. 827, 832-833.
    • Exceptions to this system: symbols with hachek (use zh, sh, ch, and shh instead), the 'hard sign' (#) and 'open e' (je).
    • These transcriptions can be automatically transliterated to Cyrillic at http://www.translit.ru.
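
Because the transcriptions are ELAN .eaf files (an XML format) aligned at the phrase level, the time-aligned phrases can be read with a general-purpose XML parser. The following is a minimal sketch, not the project's own tooling: the function name `read_aligned_annotations` is hypothetical, and it assumes the standard EAF layout in which `TIME_SLOT` elements hold millisecond offsets and each `ALIGNABLE_ANNOTATION` refers to two of them. The tier names used in the corpus are not stated here, so the sketch simply reports whatever tier IDs it finds.

```python
import xml.etree.ElementTree as ET


def read_aligned_annotations(eaf_source):
    """Return (tier_id, start_ms, end_ms, text) tuples from an ELAN .eaf file.

    eaf_source may be a file path or an open file object.
    """
    root = ET.parse(eaf_source).getroot()

    # Map each time-slot id to its offset in milliseconds.
    # Slots without a TIME_VALUE (unaligned slots) are skipped.
    times = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        if value is not None:
            times[slot.get("TIME_SLOT_ID")] = int(value)

    rows = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times.get(ann.get("TIME_SLOT_REF1"))
            end = times.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier_id, start, end, text))
    return rows
```

Calling `read_aligned_annotations("F1F29A_IV.eaf")` would then yield one tuple per transcribed phrase, with start and end times that can be used to locate the phrase in the matching .wav file.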

2. Participants respond orally to an Ethnic Orientation Questionnaire (EOQ)

    • Audio recordings of these are available (as _EOQ.wav and _EOQ.mp3 files). [See file-naming conventions]
    • Some participants' responses are transcribed (in _EOQ.eaf files). Many are not fully transcribed. Rather, the question number is indicated in the transcript if the interviewer asked the question verbatim.
    • Responses to each question are numerically coded in a spreadsheet for all participants of one Heritage Language: EOQ_data_LANGUAGE_date.xls.

3. Participants name a set of pictures, the First Words task (FW)

    • Participants are asked to describe a sequence of pictures from a children's story book, naming common objects and then describing scenes containing these items.
    • The book used is:

      Amery, Heather and Stephen Cartwright. 1987. The First Hundred Words. Tulsa: Educational Development Corporation.

    • Audio (.wav and .mp3) and .eaf transcription files are available for this task.

File naming conventions

File labels have three parts:

  1. Speaker code of the primary participant
  2. Abbreviation indicating the type of interaction (preceded by an underscore):
    1. _IV is the sociolinguistic interview, generally a lengthy, relaxed conversation
    2. _EOQ is the oral administration of the Ethnic Orientation Questionnaire
    3. _FW is a picture-description task, referred to as First Words
  3. An extension indicating the file type (preceded by a dot or period):
    1. .eaf is an ELAN annotation file, or transcript
    2. .wav is an uncompressed audio recording
    3. .mp3 is a compressed audio recording
    4. Additional file types include Praat TextGrids (.TextGrid) for acoustic analysis, and .xls and .xlsx for record-keeping of various types, e.g., the catalog (2_2_linguists.php#catalog).

Example: F1F29A_IV.eaf is the transcription of the sociolinguistic interview of speaker F1F29A.
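
The three-part label can be split mechanically, which is handy when sorting a folder of corpus files by task. Here is a minimal sketch under the conventions listed above; the function name `split_file_label` and the `TASKS` table are hypothetical and not part of the corpus tools.

```python
from pathlib import PurePath

# Task abbreviations as listed in the file-naming conventions.
TASKS = {
    "IV": "sociolinguistic interview",
    "EOQ": "Ethnic Orientation Questionnaire",
    "FW": "First Words picture task",
}


def split_file_label(name):
    """Split a label such as 'F1F29A_IV.eaf' into (speaker_code, task, extension)."""
    path = PurePath(name)
    extension = path.suffix  # includes the leading dot, e.g. ".eaf"
    # The task abbreviation follows the last underscore in the stem.
    speaker_code, separator, task = path.stem.rpartition("_")
    if not separator or task not in TASKS:
        raise ValueError(f"not a recognised corpus file label: {name!r}")
    return speaker_code, task, extension


print(split_file_label("F1F29A_IV.eaf"))
```

Labels that lack an underscore or use an unlisted task abbreviation raise `ValueError` rather than being guessed at.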

Speaker labelling conventions

Each speaker is identified by a speaker code. The speaker code consists of five parts:

  1. The first character identifies the heritage language of the speaker (C, F, I, K, R, U).
  2. The second character identifies the speaker's generation (1, 2 or 3).
  3. The third character identifies the speaker's sex (M or F).
  4. The fourth and fifth characters give the speaker's age.
  5. The final character (A, B, C, etc.) provides a unique identifier for otherwise identically-labeled speakers.

Example: F1F29A is a Faetar-speaking, first-generation, 29-year-old female. She is the first such speaker recorded.
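
The five parts of a speaker code map directly onto a regular expression. The sketch below is illustrative only: the names `Speaker`, `SPEAKER_CODE`, and `parse_speaker_code` are hypothetical, and it assumes that every code has a two-digit age and exactly one final uppercase letter, as in the example above.

```python
import re
from typing import NamedTuple


class Speaker(NamedTuple):
    language: str    # one of C, F, I, K, R, U
    generation: int  # 1, 2 or 3
    sex: str         # M or F
    age: int         # two-digit age
    suffix: str      # A, B, C, ... distinguishes otherwise identical codes


# Assumption: two-digit age and a single trailing uppercase letter.
SPEAKER_CODE = re.compile(r"([CFIKRU])([123])([MF])(\d{2})([A-Z])")


def parse_speaker_code(code):
    """Parse a speaker code such as 'F1F29A' into its five parts."""
    match = SPEAKER_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid speaker code: {code!r}")
    language, generation, sex, age, suffix = match.groups()
    return Speaker(language, int(generation), sex, int(age), suffix)


print(parse_speaker_code("F1F29A"))
```

Using `fullmatch` means that codes with extra or missing characters are rejected instead of being partially parsed.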