CABank Croatian Corpus

Jelena Kuvač Kraljević
Speech and Language Pathology
University of Zagreb
jelena.kuvac@erf.hr

Gordana Hržica
Speech and Language Pathology
University of Zagreb
gordana.hrzica@erf.hr

Participants:	617
Type of Study:	demographic
Location:	Croatia
Media type:	audio
DOI:	doi:10.21415/T5131S

Citation information

Kuvač Kraljević, Jelena; Hržica, Gordana (2016). Croatian Adult Spoken Language Corpus (HrAL). Fluminensia: Journal for philological research. 28, 2. Article can be downloaded from http://fluminensia.ffri.hr/en/clanak?id=1466.html

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Croatian Adult Spoken Language Corpus (HrAL) was built by sampling spontaneous conversations among 617 speakers from all Croatian counties. It comprises more than 250 000 tokens and more than 100 000 types. Data for the corpus were collected in three periods: from 2010 to 2012, from 2014 to 2015, and during 2016. Croatian speakers from different parts of Croatia who had access to groups of speakers (friends and families) were recruited and trained to collect samples of spoken language. These trained investigators were responsible for recruiting participants, in groups of 3-8 adult speakers who regularly engage in informal spoken interaction. Sampling was performed in informal situations, predominantly spontaneous conversations among friends, relatives or acquaintances during family meals, informal gatherings or socializing.
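For readers unfamiliar with the distinction, tokens are running word occurrences and types are distinct word forms. The short Python sketch below illustrates the difference on a made-up Croatian sentence that is not taken from the corpus. The whitespace split and lowercasing are assumptions made for illustration only; the tokenization actually used to obtain the HrAL counts is not described here.

```python
from collections import Counter


def count_tokens_and_types(words):
    """Return (token_count, type_count) for an iterable of word forms."""
    # Assumption: word forms are compared case-insensitively.
    counts = Counter(w.lower() for w in words)
    return sum(counts.values()), len(counts)


# Invented example sentence, split on whitespace (an assumption).
sample = "ja sam rekla da sam ja bila tamo".split()
tokens, types = count_tokens_and_types(sample)
print(tokens, types)
```

In this toy sentence "ja" and "sam" each occur twice, so the number of types is smaller than the number of tokens.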

Participants were adults who spoke Croatian as their mother tongue and first language. All speakers had typical language status. Of the original group of 636 speakers, 3% withdrew during sampling or analysis, leaving 617 participants and 165 language transcripts. Transcripts were annotated with the ages and genders of the speakers, as well as the location of the conversation. A separate spreadsheet lists the speakers' origin, where they have spent most of their life and their level of education. While age and gender data were available for all speakers (Table 1), information about origins was available for only 80% and information about education for only 60%. These data are more complete for samples collected from 2014 onwards.
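The attrition figure can be checked from the two speaker counts reported above. The following Python lines are only an arithmetic check; the variable names are illustrative.

```python
original_speakers = 636  # speakers initially recruited
final_speakers = 617     # speakers remaining in the corpus

withdrawn = original_speakers - final_speakers
withdrawal_rate = withdrawn / original_speakers * 100  # in percent

print(withdrawn, round(withdrawal_rate, 1))
```

The difference of 19 speakers corresponds to roughly 3% of the original group, consistent with the reported figure.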

An Excel file called 0participants.xlsx, which provides a full description of the demographics for each sample, is included with the transcripts.

The completion of HrAL would not have been possible without the participation and assistance of the many people who helped with data collection and transcription. Most of all, we would like to express our deep appreciation to Ana Matic and Marina Olujic. The list of all people who have helped to build HrAL is given here.

Acknowledgment: The work on this paper was supported by the Croatian Science Foundation (HRZZ), grant HRZZ-2421, for the project Adult language processing.