Bias in Speech Recognition Performance
Accent and dialect variation is only one of many factors that have been shown to influence speech recognition performance[4]. Speech-to-text systems have also been shown to exhibit systematic inaccuracies, or biases, against groups of speakers defined by age, gender, and other demographic factors[5-8]. While some of these variables affect our voices more than others, the algorithmic biases observed in speech-to-text are thought to reflect broader historical and societal biases and prejudices.
Artificial intelligence (AI) bias in speech-to-text not only affects the reliability of speech technologies in real-world applications, but can also perpetuate discrimination at a large scale. At Speechmatics, we strive to reduce bias as much as possible by utilizing rich representations of speech learnt from millions of hours of unlabeled audio with self-supervised learning. With the release of Ursa, we’ve scaled our machine learning models to create additional capacity to learn from our multilingual data and further reduce bias on diverse voice cohorts. This is how Ursa sets new standards for speech-to-text fairness with the best accuracy across the spectrum.
Accuracy Across Demographics
Based on a combination of the Casual Conversations[2] and CORAAL[3] datasets, we evaluated Ursa’s transcription performance against other speech-to-text providers across several demographic factors. We found that regardless of age, gender identity, skin tone, socio-economic status, and level of education, Ursa offers the best transcription accuracy, with a 10% lead over the nearest vendor (see Figures 3-6).
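To make this kind of per-demographic comparison concrete, here is a minimal sketch of how accuracy could be broken down by speaker group. It assumes accuracy is measured through word error rate (WER), which is a common convention but is not stated in this section, and it is not a description of our actual evaluation pipeline. The function names, the group labels, and the simple lowercase-and-split text normalization are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    # Assumption: naive normalization (lowercase + whitespace split).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pool errors per group; samples are (group, reference, hypothesis)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    # WER per group = total word errors / total reference words.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from any real dataset.
    toy = [
        ("group_a", "the cat sat", "the cat sat"),
        ("group_b", "the cat sat down", "the cat sat"),
    ]
    for group, wer in sorted(wer_by_group(toy).items()):
        print(f"{group}: WER = {wer:.2%}")
```

Pooling errors and reference words per group, instead of averaging per-utterance WER, keeps short utterances from dominating a group's score. A fair comparison across vendors would also need identical text normalization for every system.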
It is not news that a wide variety of AI applications have been reported to be biased against women. Numerous studies have shown consistent differences in algorithmic performance between women and men, with better performance for men in tasks from face recognition[9] to language translation[10]. These issues have been extensively covered by the media[11]. In the context of speech-to-text, Ursa stands out from the competition as the most accurate speech-to-text system across both female and male speakers (see Figure 3). Specifically, Ursa is almost 30% more accurate than Google on male speech.