Figure 1. Ursa’s enhanced model is 24% more accurate than the best of Amazon, Google, Microsoft, and OpenAI’s Whisper when transcribing English speakers from across the globe with a wide variety of accents. Word error rate (WER) calculated on the Common Voice dataset[1], 26 hours of speech from speakers with varied accents (lower is better; error bars show standard error).
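All of the figures report word error rate, which is the word-level edit distance (substitutions, insertions, and deletions) between the system’s transcript and the reference, divided by the number of reference words. A minimal sketch of the metric (the `wer` function name is illustrative, not a specific library API):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, using a rolling row.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                            # deletion
                d[j - 1] + 1,                        # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution (0 if match)
            )
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because WER counts every kind of word mistake against the reference, a lower value means a more accurate transcript, which is why the captions note “lower is better.”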
Figure 2. Ursa provides a 22% lead over the next best competitor on dialects that have historically been underrepresented in training data. Word error rate (WER) calculated on the CORAAL dataset[3], more than 100 hours of African American Vernacular English (lower is better; error bars show standard error).
Figure 3. Ursa is consistently the most accurate speech-to-text system across age and gender, with a 30% relative lead on male voices and a 25% relative lead on senior voices compared to Google.
Figure 4. Ursa is consistently 32% more accurate than Google across skin tones (results based on the Casual Conversations dataset[2]; skin tone rated on the Fitzpatrick scale from 1 to 6, where higher numbers indicate darker skin tones[2]).
Figure 5. Ursa is consistently the most accurate speech-to-text engine across socio-economic status and levels of education, with a lead of approximately 30% over Google for speakers from lower socio-economic backgrounds and with less formal education (results based on CORAAL[3]).