Real-Time ASR Systems
Automatic Speech Recognition (ASR) systems have two common modes of operation, batch and real-time, which serve different use cases.
In batch ASR, audio is provided as complete files that are transcribed in their entirety, with a single transcript returned as the output. This gives the speech recognition system the full context of the speech, enabling higher levels of accuracy.
In contrast, real-time audio is provided as a stream of data, typically as it is created, and the ASR system returns short segments of the transcription at regular intervals. The time it takes for these segments to be returned after the words have been spoken is known as the latency of the system, and keeping it low is important. There is an inherent trade-off between latency and accuracy: returning transcripts sooner reduces the amount of context available to the model, and that context is what drives accuracy.
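The sketch below illustrates this streaming flow: audio is sent to a real-time ASR endpoint in small chunks, paced as a live microphone would produce them, and transcript segments are printed as they arrive. It is a minimal illustration only; the websocket URL, message names, and field names are placeholders rather than the exact API.

```python
import asyncio
import json

import websockets  # pip install websockets


async def stream_audio(path: str, url: str, chunk_ms: int = 100) -> None:
    """Stream a raw PCM file to a (hypothetical) real-time ASR endpoint."""
    bytes_per_chunk = 16000 * 2 * chunk_ms // 1000  # 16 kHz, 16-bit mono audio

    async with websockets.connect(url) as ws:
        # Hypothetical session-start message; real APIs differ.
        await ws.send(json.dumps({"message": "StartRecognition", "language": "en"}))

        async def send_chunks() -> None:
            with open(path, "rb") as audio:
                while chunk := audio.read(bytes_per_chunk):
                    await ws.send(chunk)                  # binary audio frame
                    await asyncio.sleep(chunk_ms / 1000)  # simulate microphone pacing
            await ws.send(json.dumps({"message": "EndOfStream"}))

        async def receive_transcripts() -> None:
            async for message in ws:
                event = json.loads(message)
                if event.get("message") == "AddTranscript":  # illustrative event name
                    print(event["transcript"])

        await asyncio.gather(send_chunks(), receive_transcripts())


# Example usage (placeholder file and URL):
# asyncio.run(stream_audio("meeting.pcm", "wss://example.com/realtime-asr"))
```

Pacing the chunks to match real time matters for a fair evaluation: it reproduces the conditions under which the service must commit to transcript segments before future audio is available.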
To illustrate how latency requirements differ between use cases, consider generating captions for a live spoken event. In this scenario, the user needs the text to be available with minimal delay after the speaker has uttered the words. For a live news broadcast, however, a longer delay is often acceptable, providing more time for the captioning process. The balance between timeliness and accuracy is determined by the user's specific requirements.
Evaluating Batch versus Real-Time ASR
To show the effectiveness of our real-time engine compared to batch transcription, we transcribed six internal test sets covering a wide range of real-time use cases, such as news reports for captioning and meetings for accessibility. We then compared the real-time results with the output of batch transcription and with the output of other real-time ASR systems. The files were streamed to each service to simulate microphone input. Each transcript was then normalised using the open-source OpenAI Whisper normaliser to give comparable output for word error rate (WER) scoring. WER is the standard metric for tracking the mistakes made in transcription (learn more about it here).
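The scoring step can be reproduced with off-the-shelf tools. The sketch below applies the Whisper English text normaliser and computes WER with the jiwer package; the file paths are placeholders, and this is an outline of the procedure rather than our exact evaluation harness.

```python
from jiwer import wer                                   # pip install jiwer
from whisper.normalizers import EnglishTextNormalizer   # pip install openai-whisper

normalizer = EnglishTextNormalizer()

# Placeholder paths for a reference transcript and an ASR hypothesis.
with open("reference.txt") as f:
    reference = normalizer(f.read())
with open("hypothesis.txt") as f:
    hypothesis = normalizer(f.read())

# WER = (substitutions + deletions + insertions) / words in the reference.
score = wer(reference, hypothesis)
print(f"WER: {score:.1%}")
```

Normalising both transcripts before scoring removes differences in casing, punctuation, and number formatting, so the WER reflects recognition mistakes rather than formatting choices.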
We previously reported the substantial accuracy improvements delivered by our latest release, Ursa. Whereas the results we presented before focused on batch mode, in this blog we show that, thanks to the increased scale of its neural networks, Ursa has also made significant improvements in real-time transcription across a range of latencies, offering near-batch accuracy as shown in Table 1. At the lowest latency setting, Ursa achieves a WER of 11.2%, only an 8.5% relative degradation compared to the batch WER of 10.25%. Linking back to the results in the Ursa release blog, our low-latency real-time ASR system is also significantly more accurate than OpenAI Whisper, which only supports batch processing.
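As a quick sanity check, the relative degradation can be derived directly from the WER figures. The short calculation below reproduces the 8.5% quoted above, assuming the gap is expressed relative to the real-time WER; that convention is our reading of the figures rather than a stated definition.

```python
wer_batch = 10.25     # batch WER (%)
wer_realtime = 11.2   # real-time WER at the lowest latency setting (%)

# Relative degradation: the WER gap expressed as a fraction of the real-time WER.
relative_degradation = (wer_realtime - wer_batch) / wer_realtime * 100
print(f"{relative_degradation:.1f}%")  # ~8.5%
```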
At Speechmatics, we prioritize real-time performance and base all our modelling decisions around it. This means we use the same models for both batch and real-time, allowing us to reach accuracy parity as the latency is increased to 10 seconds. By taking this approach, we deliver the most accurate results across the full range of latencies your application might require.
Table 1: Word Error Rate (WER) across multiple max_delay settings, measured in seconds, together with the relative degradation in WER as latency decreases compared to batch.
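In practice, the latency setting in Table 1 corresponds to a max_delay value supplied when a real-time session is started. The snippet below is a minimal sketch of how such a configuration might look; the field names other than max_delay are assumptions for illustration, not any specific SDK's API.

```python
# Illustrative real-time transcription configuration; field names other than
# max_delay are assumptions and may differ from any particular API.
transcription_config = {
    "language": "en",
    "enable_partials": True,  # return provisional words before they are finalised
    "max_delay": 2.0,         # seconds: the lowest-latency setting in Table 1
}

# A captioning workload that can tolerate more delay could use a larger
# max_delay, trading latency for accuracy:
broadcast_config = {**transcription_config, "max_delay": 10.0}
```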
Ursa Leads the Competition
As part of our focus on continuous improvement, we regularly compare our releases to other competitors. It can be difficult to compare settings between vendors, as each uses different methods to balance fast transcription against accurate results. Therefore, we compared our fastest real-time transcription (2s max delay) with each provider's most accurate settings, regardless of transcription speed, giving them the most favourable conditions when measuring WER on our test sets.
As shown in Table 2, Ursa achieves higher accuracy than the competition even in settings that favour competitor products. For example, for every five errors Amazon makes, Ursa removes a further two.
Table 2: Word Error Rate (WER) averaged across six test sets when running real-time ASR from different vendors. Speechmatics is run at its lowest latency, while competitors are run at their most accurate, highest-latency configurations.