Mandarin Chinese (PRC) Spontaneous Dialogue
About the Dataset
1082 hours
This audio dataset contains 1082 hours of Mandarin Chinese speech data in the Banking, Telecommunication, Insurance and Retail domains, recorded by native Mandarin Chinese speakers from the PRC.
Domain distribution:
- 220.5 hours of Banking
- 261.88 hours of Insurance
- 281.47 hours of Retail
- 318.67 hours of Telecommunication
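As a quick sanity check, the per-domain figures above can be summed and converted to shares. This is only an illustrative sketch: the variable names are made up for the example, and the hour values are copied from the list above. The total comes to 1082.52 hours, which is consistent with the 1082-hour headline figure.

```python
# Hours per domain, copied from the list above.
domain_hours = {
    "Banking": 220.5,
    "Insurance": 261.88,
    "Retail": 281.47,
    "Telecommunication": 318.67,
}

# Total duration across all four domains.
total_hours = sum(domain_hours.values())

# Share of each domain as a percentage of the total.
domain_share = {
    domain: hours / total_hours * 100
    for domain, hours in domain_hours.items()
}

print(f"Total: {total_hours:.2f} hours")
for domain, share in domain_share.items():
    print(f"{domain}: {share:.1f}%")
```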
Defined.ai creates scenarios for our crowd members, who study them beforehand. Two speakers then record a conversation: one plays the agent, while the other plays out the scenario with spontaneous content. The recording is made over telephony and saved as 8 kHz, 16-bit audio per channel. The recordings are then transcribed.
The dataset is covered by Defined.ai's standard license agreement. The license agreement is perpetual and allows for the commercialization of all models built on the data.
Other characteristics:
- Audio format: WAV
- Recording environment: noisy, silent
- Bits per sample: 16
- Communication band: broadband
- Sample rate: 8 kHz
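The audio characteristics listed above (WAV, 16 bits per sample, and the 8 kHz rate given in the recording description) can be checked on any file with Python's standard `wave` module. The following is a minimal sketch, not part of the dataset tooling: the function names and the file name `demo.wav` are hypothetical, and the demo file is synthetic silence written only so the example runs on its own. The channel count is reported rather than checked, since the page does not state it.

```python
import os
import tempfile
import wave

# Expected values taken from the dataset description.
EXPECTED_RATE_HZ = 8000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def describe_wav(path):
    """Read the WAV header and compare it with the expected format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": width * 8,
            "duration_s": wav.getnframes() / rate,
            "matches_spec": (
                rate == EXPECTED_RATE_HZ
                and width == EXPECTED_SAMPLE_WIDTH_BYTES
            ),
        }


def write_silent_wav(path, seconds=1, channels=2):
    """Write a synthetic silent file (hypothetical stand-in for a real recording)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(EXPECTED_SAMPLE_WIDTH_BYTES)
        wav.setframerate(EXPECTED_RATE_HZ)
        silence = b"\x00\x00" * EXPECTED_RATE_HZ * seconds * channels
        wav.writeframes(silence)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silent_wav(demo_path)
    info = describe_wav(demo_path)

print(info)
```

To check a real recording, call `describe_wav` with the path of a downloaded file instead of the synthetic one.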
Samples
A 5-minute audio sample is available, along with its transcription.
Tell us about yourself, and get access to a free 30-minute sample of the Mandarin Chinese dataset