CABank English SBCSAE Corpus

John W. Du Bois
Department of Linguistics
University of California, Santa Barbara
dubois@humanitas.ucsb.edu

Robert Englebretson
Department of Linguistics
Rice University
reng@rice.edu

Participants:	30
Type of Study:	conversations
Location:	California
Media type:	audio
DOI:	doi:10.21415/T5VG6X

Citation information

Some citation here.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. The corpus was collected by the Center for the Study of Discourse at the University of California, Santa Barbara, under Director John W. Du Bois (UCSB), with Associate Editors Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB). Additional information can be found at this site.

Each speech file is accompanied by a transcript in which phrases are time-stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair lists the beginning and ending times of the filtered regions. Four of the .flt files are empty because the corresponding audio files contained nothing that needed to be filtered out. The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of each region over a 1,000-sample span, roughly 45 milliseconds, to avoid abrupt transitions in the resulting waveform. The audio data consists of 16 wave-format speech files, recorded in two-channel PCM at 22,050 Hz.
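
The figures in this paragraph can be checked, and the masking procedure illustrated, with a short Python sketch. It is an illustration only and not the tool used to prepare the corpus. The source states only a FIR low-pass filter, a 400 Hz cut-off, a 1,000-sample fade, and a 22,050 Hz sample rate. The Hamming-windowed sinc design, the 101-tap length, the linear fade placed inside the region, and all function names are assumptions made for this example.

```python
import math

SAMPLE_RATE = 22050   # Hz, from the corpus description
CUTOFF_HZ = 400       # low-pass cut-off, from the corpus description
FADE_SAMPLES = 1000   # fade length in samples, from the corpus description

def fade_duration_ms(n_samples=FADE_SAMPLES, rate=SAMPLE_RATE):
    """Duration of the fade span in milliseconds."""
    return 1000.0 * n_samples / rate

def lowpass_taps(cutoff=CUTOFF_HZ, rate=SAMPLE_RATE, n_taps=101):
    """Hamming-windowed sinc low-pass taps (assumed design), scaled to unity DC gain."""
    fc = cutoff / rate                      # cut-off in cycles per sample
    mid = (n_taps - 1) / 2
    taps = []
    for i in range(n_taps):
        x = i - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(signal, taps):
    """Centered, same-length convolution; samples beyond the ends count as zero."""
    half = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = n + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out

def mix_weight(n, start, end, fade=FADE_SAMPLES):
    """Share of filtered signal at sample n: 0 outside [start, end), linear ramps inside."""
    if n < start or n >= end:
        return 0.0
    return min(1.0, (n - start + 1) / fade, (end - n) / fade)

def mask_region(signal, start, end, taps, fade=FADE_SAMPLES):
    """Cross-fade from the original into the low-passed signal over [start, end)."""
    filtered = fir_filter(signal, taps)
    out = []
    for n, (dry, wet) in enumerate(zip(signal, filtered)):
        w = mix_weight(n, start, end, fade)
        out.append((1.0 - w) * dry + w * wet)
    return out

print(round(fade_duration_ms(), 2))  # 45.35, matching "roughly 45 milliseconds"
```

In practice the `start` and `end` sample indices would come from the times listed in a .flt file; the layout of those files is not described here, so reading them is left out of the sketch.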

The TalkBank version of the corpus was constructed by Nii Martey of the Linguistic Data Consortium with help from Jack Du Bois for Part 1 and from Robert Englebretson, now at Rice University, for Parts 2, 3, and 4. In the case of a phone number, which was not adequately disguised by the low-pass filter described above, the signal was set to zero, except for the 45-millisecond boundary regions, which fade into and out of zero.
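
The zeroing of a phone-number region with faded boundaries can be pictured as a gain envelope. The following is a minimal sketch under stated assumptions: the linear ramp shape, the placement of the ramps inside the region, and the names `mute_gain` and `mute_region` are hypothetical. The source gives only the zeroed signal and the 45-millisecond (1,000-sample) boundary regions.

```python
FADE_SAMPLES = 1000  # about 45 ms at 22,050 Hz, from the corpus description

def mute_gain(n, start, end, fade=FADE_SAMPLES):
    """Gain at sample n: 1 outside [start, end), 0 in the core, linear ramps at both ends."""
    if n < start or n >= end:
        return 1.0
    depth = min(n - start + 1, end - n)   # samples from the nearer boundary
    return max(0.0, 1.0 - depth / fade)

def mute_region(signal, start, end, fade=FADE_SAMPLES):
    """Scale each sample by the gain envelope (assumed linear fade to zero)."""
    return [s * mute_gain(n, start, end, fade) for n, s in enumerate(signal)]
```

With a fade of 1,000 samples, a region shorter than 2,000 samples would reach zero only briefly or not at all in this sketch; how the original processing handled such cases is not stated.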

01 Actual Blacksmithing
02 Lambada
03 Conceptual Pesticides
04 Raging Bureaucracy
05 A Book About Death
06 Cuz
07 A Tree's Life
08 Tell the Jury That
09 Zero Equals Zero
10 Letter of Concerns
11 This Retirement Bit
12 American Democracy is Dying
13 Appease the Monster
14 Bank Products

No.	Name	Sex	Age	City	State	Orig	Edu	Years of Edu	Occ	Race/Eth
0001	LENORE	f	30	Los Angeles	CA	CA	BA	16	student	white
0002	DORIS	f	50	Montana	MT	MT	HS	12	horse ranc	white
0003	LYNNE	f	19	Montana	MT		HS	12	student/ho	white
0004	HAROLD
0005	JAMIE	f	30	Walnut Cre	CA	CA	college	16	dancer/da	white
0006	MILES	m				CA				black
0007	PETE	m	36	San Leandr	CA	CA		18	grad student	white
0008	ROY	m	34			CA			designer	white
0009	MARILYN	f	33			CA			writer	white
0010	CAROLYN	f	19	Santa Fe	NM	CO	HS	12	student	white
0011	KATHY	f	31	Boston/Santa Fe	A/NM	CA			grad student	white
0012	SHARON	f	24	New Mexico	NM	TX	college		teacher	white
0013	SHANE	m	23	Corp Christi	TX	TX	grad		med student	chicano
0014	PAM	f	43	Massachusetts	MA	NM			housewife	white
0015	WARREN	m	34	Wenham	MA	IL	DVM	23	veterinarian	white
0016	DARRYL	m	33	San Francisco	CA	CA	BA	16	comm./comp	white
0017	PAMELA	f	38	Southern California	CA	CA	BA	16	actress/fi	white
0018	ALINA	f	34	Los Angeles	CA	CA	BA	16	housewife	white
0019	ALICE	f	28	Pryor	MT	MT	4 years	16	student	Crow Indian
0020	MARY	f	27	Pryor	MT	MT	college	3	cook fire	Crow Indian
0021	RICKIE			San Francisco	CA	CA	HS	12	clerk	black
0022	JUNE	f	21	Laguna Beach	CA	CA	A MA	17	grad student	white
0023	REBECCA	f	31	Saratoga	A	CA	A J	22	attorney	white
0024	ARNOLD	m		Saginaw	MI	CA	HS	12	S Army	white
0025	KATHY	f	17	Mobile	AL	AL	HS	10	student	white
0026	NATHAN	m	19	Mobile	AL	AL	HS	12	student	white
0027	BRAD	m	45				MA	18	director o	white
0028	PHIL	m	30			NM	BA	16	designer	hispanic
0029	DORIS	f	83	Indianapolis	IN	AZ	MA	18	teacher	white
0030	ANGELA	f	90	middle Wes	MO	AZ	MS	18	teacher J	white
0031	SAM	f	72	Arcadia	IN	AZ	Nursing	15	retired	white
0032	BEV	f	20	So California	CA	CA	HS	15	student	white
0033	MONTOYO	m	51			CA	PhD		political	latino/chicano
0034	MARIA	f	26	Nicaragua		CA	HS	15	dispatcher	hispanic
0035	GILBERT	m	22	So California	CA	CA	HS		student	hispanic
0036	CAROLYN	f	18	So California	CA	CA	HS	12	student	white
0037	LAURA	f	23	San Jose	CA	CA	HS		student	japanese/
0038	FRANK	m	24	So California	CA	CA	BA	16	business o	white
0039	RAMON	m	19	MoreValley	CA	CA	HS	12	student	hispanic
0040	RUBEN	m	27	So California	CA	CA	5 yrs	17	teacher	hispanic
0042	KENDRA	f	25	midwest	IN	IN	BA	16	administrator	white
0043	KEN	m	51	midwest	IN	IN	Phd M	23	director o	white
0044	MARCI	f	50	midwest	IN	IN	MA	19	counselor	white
0045	WENDY	f	26	midwest	IN	IN	BS	16	missionary	white
0046	KEVIN	m	26	midwest	IN	IN	S Cr	16	missionary	white
0047	JIM	m	41	metro St.L.	IL	IL	certified	16	banking	white
0048	FRED	m	47	Chrisman	IL	IL	masters	18	loan officer	white
0049	JOE	m	45	Dupo	IL	IL		17	banking	white
0050	KURT	m	70	Millstad	IL	IL		12	retired-co	white
0051	VIVIAN	f	55	Shenandoah	A	IL	HS	13	banking	white
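
Because the speaker table above is tab-separated, it can be loaded with Python's standard `csv` module. The sketch below is a hedged example, not part of the corpus tooling: `SAMPLE` copies the header and three rows from the table, and the function names `load_speakers` and `ages` are hypothetical. Short rows (such as HAROLD's) and blank cells are kept as empty strings rather than guessed.

```python
import csv
import io

# Header and three rows copied from the speaker table (tab-separated).
SAMPLE = (
    "No.\tName\tSex\tAge\tCity\tState\tOrig\tEdu\tYears of Edu\tOcc\tRace/Eth\n"
    "0001\tLENORE\tf\t30\tLos Angeles\tCA\tCA\tBA\t16\tstudent\twhite\n"
    "0004\tHAROLD\n"
    "0006\tMILES\tm\t\t\t\tCA\t\t\t\tblack\n"
)

def load_speakers(text):
    """Parse the table into a list of dicts; missing trailing fields become ''."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", restval="", quoting=csv.QUOTE_NONE
    )
    return list(reader)

def ages(speakers):
    """Map speaker number to age, skipping rows where the age is blank."""
    return {s["No."]: int(s["Age"]) for s in speakers if s["Age"].strip().isdigit()}

speakers = load_speakers(SAMPLE)
print(ages(speakers))  # {'0001': 30}
```

Keying on the `No.` column rather than `Name` matters here, since several names (DORIS, KATHY, CAROLYN) appear more than once in the table.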

Acknowledgements

Franklin Chen reformatted this corpus to conform to current versions of CHAT.