CABank English SBCSAE Corpus

John DuBois
Department of Linguistics
University of California, Santa Barbara


Robert Englebretson
Department of Linguistics
Rice University


Participants: 30
Type of Study: conversations
Location: California
Media type: audio
DOI: doi:10.21415/T5VG6X

Citation information

Some citation here.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse, under Director John W. Du Bois (UCSB) and Associate Editors Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB). Additional information can be found at this site.

Each speech file is accompanied by a transcript in which phrases are time-stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair lists the beginning and ending times of the filtered regions. Four .flt files are empty because the corresponding audio files contained nothing that needed to be filtered out. The filtering was done using a digital FIR low-pass filter with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of each region over a 1,000-sample span, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform. The audio data consist of 16 WAV-format speech files, recorded in two-channel PCM at 22,050 Hz.
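The filtering described above can be illustrated with a small sketch. It is not the code used to prepare the corpus: the filter length (NUM_TAPS), the Hamming window, the linear fade shape, and the choice to place the fades inside the region are all assumptions, since the description gives only the cut-off frequency, the fade length, and the sample rate. The function names are hypothetical.

```python
import math

SAMPLE_RATE = 22050   # Hz, from the corpus description
CUTOFF_HZ = 400.0     # low-pass cut-off, from the corpus description
FADE_SAMPLES = 1000   # fade length, from the corpus description
NUM_TAPS = 101        # assumption: the original filter length is not documented

# 1,000 samples at 22,050 Hz is about 45.35 ms, matching "roughly 45 milliseconds".
FADE_MS = 1000.0 * FADE_SAMPLES / SAMPLE_RATE


def lowpass_taps(num_taps=NUM_TAPS, cutoff_hz=CUTOFF_HZ, rate=SAMPLE_RATE):
    """Hamming-windowed sinc FIR low-pass, normalized to unity gain at DC."""
    fc = cutoff_hz / rate              # cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_same(signal, taps):
    """Centered FIR convolution (zero-padded) returning len(signal) samples."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


def mask_weight(i, start, end, fade=FADE_SAMPLES):
    """Share of the filtered signal at sample i: 0 outside [start, end),
    ramping linearly up to 1 over `fade` samples at each edge (assumed shape)."""
    if i < start or i >= end:
        return 0.0
    return min(1.0, min(i - start, end - 1 - i) / fade)


def filter_region(signal, start, end):
    """Blend the low-passed signal into samples [start, end)."""
    filtered = fir_same(signal, lowpass_taps())
    out = []
    for i, (dry, wet) in enumerate(zip(signal, filtered)):
        w = mask_weight(i, start, end)
        out.append((1.0 - w) * dry + w * wet)
    return out
```

In this sketch the start and end arguments are sample indices; times read from a .flt file would first need to be converted using the sample rate. Samples outside the region pass through unchanged, and content well above 400 Hz is strongly attenuated in the fully filtered middle of the region.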

The TalkBank version of the corpus was constructed by Nii Martey of the Linguistic Data Consortium with help from Jack DuBois for Part 1 and from Robert Englebretson, now at Rice University, for Parts 2, 3, and 4. In the case of a phone number, which was not adequately disguised by the filter, the signal was set to zero, except for the 45-millisecond boundary regions, which fade into and out of zero.
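The zeroing of a phone number with faded boundaries can be sketched as a gain envelope applied to the signal. This is an illustration only: the linear fade shape and the placement of the fades inside the region are assumptions, and zero_region is a hypothetical name rather than a tool distributed with the corpus.

```python
FADE_SAMPLES = 1000  # about 45 ms at 22,050 Hz, from the corpus description


def zero_region(signal, start, end, fade=FADE_SAMPLES):
    """Silence samples [start, end), fading the gain linearly from 1 to 0
    over the first `fade` samples and from 0 back to 1 over the last `fade`
    samples (assumed fade shape and placement)."""
    out = list(signal)
    for i in range(start, end):
        depth = min(i - start, end - 1 - i)   # distance to the nearest edge
        gain = max(0.0, 1.0 - depth / fade)   # 1 at the edge, 0 once `fade` samples in
        out[i] = signal[i] * gain
    return out
```

For example, zero_region(samples, 1000, 4000) leaves everything outside the region untouched, passes through half amplitude 500 samples into the region, and is exactly zero from sample 2000 up to sample 3000.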

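No.  | Name    | Sex | Age | City                | State | Orig | Edu       | Years of Edu | Occ          | Race/Eth
0001 | LENORE  | f   | 30  | Los Angeles         | CA    | CA   | BA        | 16           | student      | white
0002 | DORIS   | f   | 50  | Montana             | MT    | MT   | HS        | 12           | horse ranc   | white
0005 | JAMIE   | f   | 30  | Walnut Cre          | CA    | CA   | college   | 16           | dancer/da    | white
0007 | PETE    | m   | 36  | San Leandr          | CA    | CA   |           | 18           | grad student | white
0010 | CAROLYN | f   | 19  | Santa Fe            | NM    | CO   | HS        | 12           | student      | white
0011 | KATHY   | f   | 31  | Boston/Santa Fe     | A/NM  | CA   |           |              | grad student | white
0012 | SHARON  | f   | 24  | New Mexico          | NM    | TX   | college   |              | teacher      | white
0013 | SHANE   | m   | 23  | Corp Christi        | TX    | TX   | grad      |              | med student  | chicano
0016 | DARRYL  | m   | 33  | San Francisco       | CA    | CA   | BA        | 16           | comm./comp   | white
0017 | PAMELA  | f   | 38  | Southern California | CA    | CA   | BA        | 16           | actress/fi   | white
0018 | ALINA   | f   | 34  | Los Angeles         | CA    | CA   | BA        | 16           | housewife    | white
0019 | ALICE   | f   | 28  | Pryor               | MT    | MT   | 4 years   | 16           | student      | Crow Indian
0020 | MARY    | f   | 27  | Pryor               | MT    | MT   | college   | 3            | cook fire    | Crow Indian
0021 | RICKIE  |     |     | San Francisco       | CA    | CA   | HS        | 12           | clerk        | black
0022 | JUNE    | f   | 21  | Laguna Beach        | CA    | CA   | A MA      | 17           | grad student | white
0023 | REBECCA | f   | 31  | Saratoga            | A     | CA   | A J       | 22           | attorney     | white
0024 | ARNOLD  | m   |     | Saginaw             | MI    | CA   | HS        | 12           | S Army       | white
0027 | BRAD    | m   | 45  |                     |       |      | MA        | 18           | director o   | white
0030 | ANGELA  | f   | 90  | middle Wes          | MO    | AZ   | MS        | 18           | teacher J    | white
0032 | BEV     | f   | 20  | So California       | CA    | CA   | HS        | 15           | student      | white
0035 | GILBERT | m   | 22  | So California       | CA    | CA   | HS        |              | student      | hispanic
0036 | CAROLYN | f   | 18  | So California       | CA    | CA   | HS        | 12           | student      | white
0037 | LAURA   | f   | 23  | San Jose            | CA    | CA   | HS        |              | student      | japanese/
0038 | FRANK   | m   | 24  | So California       | CA    | CA   | BA        | 16           | business o   | white
0040 | RUBEN   | m   | 27  | So California       | CA    | CA   | 5 yrs     | 17           | teacher      | hispanic
0043 | KEN     | m   | 51  | midwest             | IN    | IN   | Phd M     | 23           | director o   | white
0046 | KEVIN   | m   | 26  | midwest             | IN    | IN   | S Cr      | 16           | missionary   | white
0047 | JIM     | m   | 41  | metro St.L.         | IL    | IL   | certified | 16           | banking      | white
0048 | FRED    | m   | 47  | Chrisman            | IL    | IL   | masters   | 18           | loan officer | white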


Franklin Chen reformatted this corpus to accord with current versions of CHAT.