Edinburgh International Accents of English Corpus (EdAcc)
- EdAcc is a freely available speech corpus featuring 40 hours of natural, video-call English conversations across diverse global accents.
- It provides comprehensive linguistic metadata and sociolinguistic profiles, enabling detailed analysis of accent impact on ASR performance.
- Benchmark evaluations indicate significant error rate variations across accents, underscoring the need for accent-adaptive ASR systems.
The Edinburgh International Accents of English Corpus (EdAcc) is a freely available speech resource designed to address deficits in current automatic speech recognition (ASR) evaluation, specifically regarding the global diversity of English accents. Unlike prior benchmarks, which predominantly comprise American or British English in read or telephony formats, EdAcc consists of nearly 40 hours of wide-band, dyadic, video-call conversations among native and non-native English speakers from every inhabited continent. The corpus is released with orthographic transcriptions, detailed linguistic metadata, and open evaluation tools under a Creative Commons Attribution-ShareAlike (CC-BY-SA) license (Sanabria et al., 2023).
1. Rationale and Scope
EdAcc was developed to democratize English ASR by exposing the limitations of current systems on the global spectrum of English accents. Standard databases, such as Switchboard, TIMIT, WSJ, and LibriSpeech, comprise hundreds of hours of mostly North American or British English in read speech or telephone format, which fails to reflect the challenges posed by spontaneous, conversational varieties found worldwide. Mozilla Common Voice and AESRC2020 do address accent diversity to an extent but rely on read prompts and only coarse accent labeling. In contrast, EdAcc specifically targets informal, real-world dialog and incorporates rich, self-reported sociolinguistic characterization of each speaker (Sanabria et al., 2023).
2. Data Collection Procedures
Recordings in EdAcc were collected via Zoom video calls between pairs of speakers who already knew each other, producing unstructured English dialog sessions of 20–60 minutes. This design mitigates the observer’s paradox, yielding more natural spontaneous speech while simulating contemporary teleconferencing conditions. Audio is sampled at 16 kHz (wide-band) and stored as single-channel WAV files, a constraint imposed by the Zoom platform. Speakers were recruited through academic networks and the micro-task platform Fiverr, with compensation set at £10 per 15 minutes. All participants gave informed consent and signed data protection agreements prior to release. Video recordings are planned for future versions (Sanabria et al., 2023).
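The audio format described above (16 kHz, single channel, WAV) can be sanity-checked before running any experiment. The following is a minimal sketch using only Python's standard `wave` module; the function names are illustrative and are not part of the EdAcc tooling.

```python
import wave


def check_audio_format(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file has the expected sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getnchannels() == expected_channels)


def duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` over all files in a local copy is one way to confirm that a download is close to the stated total of roughly 40 hours.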
3. Speaker Demographics and Linguistic Profiling
The initial EdAcc release comprises approximately 80 speakers and 61 conversations (∼40 hours of audio). Speakers span L1 (native) and L2 (second-language) varieties and report 51 distinct first languages; some acquired English before age 5 and others later. Nine well-represented L1 English accents are directly annotated: South African, Ghanaian, Irish, Scottish, U.S., Southern British, Indian, Jamaican, and Nigerian, with numerous additional self-described accents also present. Each participant completed a linguistic background survey covering:
- First language(s) acquired before age 5
- Other languages spoken
- Age and context of initial English exposure
- Language use domains (home, work, social)
- Residences over three years’ duration
- Acquaintance details with conversational partner
- Self-described accent
- Socio-demographics (age, gender, ethnicity, education)
This structure enables detailed analysis of accent distribution and sociolinguistic factors (Sanabria et al., 2023).
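To make the survey structure concrete, the sketch below models one speaker's answers as a Python dataclass and groups speakers by self-described accent. All field names are hypothetical and chosen for illustration; they do not reflect the column names of the released metadata files.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpeakerProfile:
    """One speaker's survey answers; all field names are hypothetical."""
    speaker_id: str
    first_languages: List[str]                  # acquired before age 5
    other_languages: List[str] = field(default_factory=list)
    english_onset_age: Optional[int] = None     # age of first English exposure
    use_domains: List[str] = field(default_factory=list)  # home, work, social
    residences: List[str] = field(default_factory=list)   # stays over three years
    self_described_accent: Optional[str] = None

    def is_l1_english(self) -> bool:
        # L1 follows the survey's definition: a language acquired before age 5.
        return "English" in self.first_languages


def group_by_accent(profiles: List[SpeakerProfile]) -> Dict[str, List[str]]:
    """Map each self-described accent to the IDs of speakers reporting it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for profile in profiles:
        key = profile.self_described_accent or "unspecified"
        groups[key].append(profile.speaker_id)
    return dict(groups)
```

Keeping the self-described accent as free text, rather than forcing it into a fixed label set, matches the corpus's emphasis on self-reported characterization.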
4. Annotation and Corpus Organization
Professional transcribers segmented each speaker turn and transcribed it orthographically, marking noise events, overlaps, laughter, and hesitation phenomena. Transcriptions were then post-processed to strip punctuation, normalize casing, standardize numerals and disfluencies, and conform to ASR evaluation conventions. Non-lexical tokens (e.g., “[laugh]”) are force-aligned using Kaldi and excluded from scoring. The metadata for each speaker encompasses age, gender, self-reported L1, L2 proficiency and age of acquisition, past residences, and socioeconomic markers.
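The post-processing steps can be illustrated with a short normalization routine. This is a minimal sketch under stated assumptions, not the corpus's own script: it assumes non-lexical events appear in square brackets, keeps apostrophes so contractions survive, and does not attempt numeral or disfluency standardization.

```python
import re
from typing import List

# Assumption: non-lexical events are written in square brackets, e.g. [laugh].
NON_LEXICAL = re.compile(r"\[[^\]]*\]")
# Remove anything that is not a word character, whitespace, or an apostrophe.
PUNCTUATION = re.compile(r"[^\w\s']")


def normalize(text: str) -> List[str]:
    """Lowercase a transcript line, drop bracketed events and punctuation, and tokenize."""
    text = NON_LEXICAL.sub(" ", text)
    text = text.lower()
    text = PUNCTUATION.sub(" ", text)
    return text.split()
```

For example, `normalize("Well, [laugh] I DON'T know!")` returns `["well", "i", "don't", "know"]`. Applying the same function to both reference and hypothesis keeps formatting differences from being counted as recognition errors.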
EdAcc is partitioned for ASR benchmarking into a development set (31 conversations) and a speaker-disjoint test set (30 conversations). Audio and plain-text transcripts are provided in parallel (Sanabria et al., 2023).
5. ASR System Benchmarking and Performance Analysis
Three ASR architectures were benchmarked using EdAcc, measured by token-level word error rate (WER):

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $S$ is the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the number of reference words. Systems evaluated:
- Wav2vec 2.0: Self-supervised pretraining (Libri-light, Common Voice, Switchboard) and fine-tuning (LibriSpeech, AMI, MGB, Switchboard) with a 4-gram LM.
- Whisper: OpenAI’s encoder-decoder trained on 680,000 hours of multilingual web audio.
- Commercial accent-tuned model (anonymized).
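The WER metric counts substitutions, deletions, and insertions relative to the number of reference words. A minimal sketch of how it can be computed with a word-level edit distance follows; this is a generic textbook implementation, not the evaluation script distributed with the corpus.

```python
from typing import List


def wer(reference: List[str], hypothesis: List[str]) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[m] / n
```

With the reference `"the cat sat on the mat".split()` and the hypothesis `"the cat sit on mat".split()`, there is one substitution and one deletion over six reference words, so the function returns 2/6. Corpus-level WER is normally obtained by summing errors and reference lengths over all utterances rather than averaging per-utterance rates.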
Summary of system performance:
| Model | EdAcc dev | EdAcc test | Libri test-clean | Libri test-other |
|---|---|---|---|---|
| Wav2vec 2.0 | 33.4 % | 36.1 % | 2.9 % | 5.6 % |
| Commercial | 17.9 % | 18.7 % | 3.8 % | 7.4 % |
| Whisper | 16.4 % | 19.7 % | 2.7 % | 5.6 % |
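Even for Whisper, the best-performing model on the development set, EdAcc yields a test WER of 19.7%, in stark contrast to 2.7% on the LibriSpeech test-clean set. The commercial system is slightly lower on the EdAcc test set (18.7%) but higher on LibriSpeech test-clean (3.8%) (Sanabria et al., 2023).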
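The size of the gap between benchmarks is easier to see as a ratio. The sketch below uses only the figures from the table above; the dictionary keys and the function name are illustrative.

```python
# WER values in percent, copied from the summary table.
wer_table = {
    "Wav2vec 2.0": {"edacc_test": 36.1, "libri_test_clean": 2.9},
    "Commercial": {"edacc_test": 18.7, "libri_test_clean": 3.8},
    "Whisper": {"edacc_test": 19.7, "libri_test_clean": 2.7},
}


def degradation_factor(row: dict) -> float:
    """How many times higher the EdAcc test WER is than the LibriSpeech test-clean WER."""
    return row["edacc_test"] / row["libri_test_clean"]


for model, row in wer_table.items():
    print(f"{model}: {degradation_factor(row):.1f}x")
```

By this measure Whisper's error rate on EdAcc is roughly seven times its LibriSpeech test-clean rate, and Wav2vec 2.0's is more than twelve times higher.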
Breakdown by L1 accent for accent-homogeneous conversations shows elevated error rates particularly for Indian, Jamaican, and Nigerian English, revealing performance gaps not seen in standard benchmarks:
| L1 Accent | Whisper | Commercial | Wav2vec 2.0 |
|---|---|---|---|
| South African | 12.5 % | 13.4 % | 24.5 % |
| Ghanaian | 12.9 % | 12.7 % | 23.4 % |
| Irish | 14.0 % | 16.5 % | 30.2 % |
| Scottish | 14.8 % | 17.1 % | 31.5 % |
| U.S. | 16.1 % | 17.3 % | 32.0 % |
| Southern British | 16.6 % | 18.9 % | 37.8 % |
| Indian | 19.7 % | 20.4 % | 39.7 % |
| Jamaican | 21.5 % | 26.0 % | 46.2 % |
| Nigerian | 22.9 % | 25.6 % | 54.0 % |
This suggests latent biases in training data and the need for accent-robust ASR modeling (Sanabria et al., 2023).
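One simple way to quantify the disparity is the spread between the best- and worst-recognized accent for each system. The following sketch works directly from the per-accent table; the helper name `spread` is hypothetical, and the gap it reports is a descriptive statistic rather than a measure used in the source.

```python
SYSTEMS = ("Whisper", "Commercial", "Wav2vec 2.0")

# WER in percent per L1 accent, in the column order given by SYSTEMS.
accent_wer = {
    "South African": (12.5, 13.4, 24.5),
    "Ghanaian": (12.9, 12.7, 23.4),
    "Irish": (14.0, 16.5, 30.2),
    "Scottish": (14.8, 17.1, 31.5),
    "U.S.": (16.1, 17.3, 32.0),
    "Southern British": (16.6, 18.9, 37.8),
    "Indian": (19.7, 20.4, 39.7),
    "Jamaican": (21.5, 26.0, 46.2),
    "Nigerian": (22.9, 25.6, 54.0),
}


def spread(system: str):
    """Return (best accent, worst accent, absolute WER gap in points) for one system."""
    col = SYSTEMS.index(system)
    best = min(accent_wer, key=lambda accent: accent_wer[accent][col])
    worst = max(accent_wer, key=lambda accent: accent_wer[accent][col])
    gap = accent_wer[worst][col] - accent_wer[best][col]
    return best, worst, gap


for system in SYSTEMS:
    best, worst, gap = spread(system)
    print(f"{system}: best={best}, worst={worst}, gap={gap:.1f} points")
```

For Whisper the gap between South African and Nigerian English is about 10 points, while for Wav2vec 2.0 the gap between Ghanaian and Nigerian English exceeds 30 points.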
6. Current and Prospective Applications
EdAcc’s conversational recordings, comprehensive accent representation, and metadata enable research in several domains:
- Development and evaluation of accent-adaptive ASR models, including domain/adversarial adaptation and multi-dialect training
- Quantitative analysis of bias and equity in ASR systems with respect to socio-linguistic variables
- Linguistic analysis involving accent variation, phonetically-informed modeling, and socio-phonetic phenomena
Future work on EdAcc includes (1) integrating automated or semi-automated accent and clustering labels to increase linguistic granularity, (2) releasing video streams with speaker consent, (3) expanding speaker demographic diversity, and (4) providing richer linguistic annotations, such as phonetic transcriptions, prosody labels, and diarization references (Sanabria et al., 2023).
7. Significance and Availability
EdAcc represents a significant step toward inclusive, accent-robust ASR evaluation and model development by providing an openly licensed, conversational corpus that reflects real-world diversity in English speech. All data, transcripts, metadata, and evaluation scripts are publicly available via https://groups.inf.ed.ac.uk/edacc/ under CC-BY-SA, with full licensing information at https://creativecommons.org/licenses/ (Sanabria et al., 2023).
This corpus is positioned to support both technological advancement and socio-linguistic research, highlighting critical domains for improvement in contemporary English ASR.