Speechmatics’ latest speech-to-text system, Ursa, can transcribe difficult audio with incredible accuracy, regardless of demographics. We have found that Ursa’s high transcription accuracy is crucial for high-quality downstream performance. Try the world’s most accurate transcription for yourself with just one click here!
At the most basic level, a large language model (LLM) is trained to predict the next word given the sequence of words that have come before. They learn to use that past context to build representations of how we use language at scale, reflected in the vast number of parameters they possess, the computational power required for their training, and the immense amount of training data. As a result, these models can perform tasks such as summarization in a zero-shot manner, meaning that they can generalize and solve problems without the need for any specific training examples pertaining to the given task[1].
ChatGPT[2] has taken the world by storm, producing a flurry of rap battles, poetry, and fitness plans with seamless dialogue with the user. ChatGPT is an LLM that has been fine-tuned using Reinforcement Learning with Human Feedback (RLHF)[3] and packaged in a chatbot user interface. More recently, OpenAI has released GPT4, an LLM with increased performance and the ability to accept images as input. We discuss the technical details and significance of GPT4 here.
GPT4 and ChatGPT have shown remarkable performance in many areas of natural language processing; little analysis has been done on these models for tasks based on automatic speech recognition (ASR). We’ve found that they can gloss over some recognition errors and occasionally produce “better than input” answers to questions due to hallucinations based on the knowledge from training data, as in Figure 1.
Figure 1: ChatGPT hallucinates information about the lungs and attributes it to the speaker, even though this information is not contained in the input transcript. It was prompted with the transcript* from this video followed by “what does the speaker say about the lungs”. The relevant section begins at 4:12.
Generally, when a question is asked based on a specific transcript from ASR, you want the output to accurately reflect that input. We have found that Ursa’s high transcription accuracy is critical for consistent, high-quality downstream performance with no hallucinations. To demonstrate this, we use transcripts from Ursa and Google as input to GPT4 and ChatGPT and design prompts for five different tasks, as shown in Table 1.
Table 1: The five downstream tasks that are used to demonstrate the impact of ASR accuracy. The prompt design for each task is given where <Transcript> would be replaced with the output of the speech-to-text system and <question> replaced accordingly.
Summarization
Summarization is condensing a long piece of text into a shorter one while maintaining the key points. We compared the quality of summarizations based on transcriptions from a technical machine learning video, and you can immediately see the impact of ASR accuracy on summarisation quality (Figure 2). ChatGPT perpetuates the error that Google makes in the key term, replacing “Lorentz transformation” with “Lawrence transformation”.
Figure 2: Summarisation of a snippet from this technical video using ChatGPT on Ursa input (left) and Google input (right).