Speechmatics recently introduced Ursa, a new generation of models for our Automatic Speech Recognition (ASR) system that achieves market-leading accuracy in speech transcription. We’ve previously discussed Ursa’s overall performance and new features and demonstrated our ongoing commitment to understanding every voice, with outstanding results across a range of demographics. In addition, Ursa shows impressive performance in specialized domains.
Here, when we talk about domains, we mean specific contexts in which language is used. Each domain has its own style and vocabulary. For example, the style and vocabulary of language in a legal domain, such as a court hearing[1], will differ significantly from that in a medical domain, such as a physician-patient conversation[2]. Ursa can accurately transcribe speech across these different domains, an ability known as domain generalization.
Ursa’s impressive performance is driven by significant scaling up of both our self-supervised learning model and our neural language model. We increased our self-supervised learning model to 2 billion parameters, enabling us to better understand every voice. We also increased our language model to 30 times its previous size, greatly expanding our coverage of domain-specific vocabulary. The ability to improve domain generalization by boosting the language model in particular is one of the great benefits of maintaining our modular approach to ASR.
To test Ursa’s domain generalization, we identified utterances relating to five specific domains (Medical, Financial, Technical, Political, and Construction) within one of our internal datasets, and measured Ursa against Speechmatics' previous Enhanced model using the word error rate (WER) metric. While our previous model already boasted market-leading accuracy, Figure 1 shows that Ursa achieves relative WER improvements of up to 18.2%.
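For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR hypothesis, divided by the number of reference words; a relative improvement compares two WERs as a fraction of the baseline. The sketch below is a minimal, generic implementation for illustration only; the WER values in the example are made-up numbers, not Speechmatics' measured results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit-distance table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,           # deletion
                d[i][j - 1] + 1,           # insertion
                d[i - 1][j - 1] + sub_cost # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement of a new model over a baseline, as a fraction."""
    return (baseline_wer - new_wer) / baseline_wer

# One substituted word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))

# Hypothetical figures: a drop from 11% to 9% WER is ~18.2% relative
print(relative_wer_improvement(0.11, 0.09))
```

Note that a "small" absolute WER change can be a large relative improvement, which is why relative WER is the standard way to report gains between ASR models.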