RESONIKS

Whisper — Robust Speech Recognition via Large-Scale Weak Supervision

Note: Generative AI services are used as assistants in this blog post!!


Pre-trained audio encoders, while able to learn high-quality representations of speech, have limitations due to their unsupervised nature. They lack an equally high-quality decoder, requiring a complex fine-tuning stage to be useful for tasks like speech recognition. Fine-tuning can improve performance but may also lead to brittle and spurious patterns that do not generalize well to different datasets. A model that performs exceptionally on a specific dataset may still make basic errors on another due to these dataset-specific quirks.

Unsupervised pre-training has enhanced the quality of audio encoders significantly, but the lack of an equivalent pre-trained decoder and the need for dataset-specific fine-tuning are key weaknesses. Ideally, a speech recognition system should function reliably across various environments without needing supervised fine-tuning for every deployment.

Speech recognition systems that are pre-trained in a supervised manner across multiple datasets exhibit higher robustness and better generalization. However, the available high-quality speech recognition datasets are limited in size. Newer efforts aim to create larger datasets for speech recognition, often trading off quality for quantity by using weakly supervised speech recognition. Recent work in computer vision has demonstrated that moving beyond gold-standard crowdsourced datasets such as ImageNet to much larger but weakly supervised datasets significantly improves the robustness and generalization of models.

This work introduces Whisper, which scales weakly supervised speech recognition to 680,000 hours of labeled audio data, removing the need for any dataset-specific fine-tuning. The approach is not limited to English: training is both multilingual and multitask, with 117,000 hours covering 96 other languages and 125,000 hours of translation data.

The study suggests that simple scaling of weakly supervised pre-training has been underappreciated for speech recognition. Whisper achieves these results without the self-supervision or self-training techniques that are often used in large-scale speech recognition work.

Data Processing

Instead of heavily standardizing the data, these models are trained to predict the raw text of transcripts, learning to map between spoken words and their transcribed form. This simplifies the speech recognition process by eliminating the need for a separate step to normalize the text.

The training data consists of audio paired with transcripts found on the internet, which provides a diverse range of audio from different environments, recording setups, speakers, and languages. While this diversity in audio helps the model become more robust, variability in transcript quality isn’t beneficial. An initial inspection of the raw dataset turned up a large number of low-quality transcripts, so several automated filtering methods were developed to improve transcript quality.

The filtering also addresses two specific problems in the training data: machine-generated transcripts and language mismatch. Many transcripts on the internet are themselves the output of existing Automatic Speech Recognition (ASR) systems. These often lack the complexity and stylistic aspects found in human-written transcripts, and training on them can impair system performance. Several heuristics were therefore developed to detect and remove them from the training dataset.
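The post does not list the heuristics themselves. As a rough illustration only, the sketch below flags transcripts that are entirely upper-case or entirely lower-case, a signal the paper mentions as very unlikely in human-written text. The missing-punctuation rule, the function name `looks_machine_generated`, and the punctuation set are assumptions made for this example, and the check is only meaningful for scripts that have letter case.

```python
def looks_machine_generated(transcript: str) -> bool:
    """Hypothetical filter: True if the transcript looks like raw ASR output."""
    letters = [c for c in transcript if c.isalpha()]
    if not letters:
        return False  # nothing to judge, so do not flag it

    all_upper = all(c.isupper() for c in letters)
    all_lower = all(c.islower() for c in letters)
    # Assumption for this sketch: human transcripts contain some punctuation.
    has_punctuation = any(c in ".,!?;:" for c in transcript)

    return all_upper or all_lower or not has_punctuation


samples = [
    "hello world this is a test",      # all lower-case, no punctuation
    "HELLO WORLD.",                    # all upper-case
    "Hello, world. This is a test.",   # mixed case with punctuation
]
flags = [looks_machine_generated(s) for s in samples]
print(flags)
```

A real pipeline would combine several such signals and tune them per language rather than rely on a single rule.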

They also use an audio language detector to confirm that the spoken language matches the transcript language. The detector was created by fine-tuning a prototype model on the VoxLingua107 dataset. If the two languages do not match, the pair is not used as a speech recognition training example, except when the transcript language is English, in which case the pair is used for speech translation training.
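The routing rule in this paragraph can be sketched as a small function. This is a minimal illustration, not the authors' code: the name `route_pair`, the two-letter language codes, and the returned labels are hypothetical, and the detected audio language is assumed to come from an external language detector.

```python
from typing import Optional


def route_pair(audio_lang: str, transcript_lang: str) -> Optional[str]:
    """Decide how an (audio, transcript) pair is used, following the rule in the text.

    Returns "transcribe" when the languages match, "translate" when the
    transcript is English but the audio is not, and None when the pair is dropped.
    """
    if audio_lang == transcript_lang:
        return "transcribe"
    if transcript_lang == "en":
        return "translate"  # non-English audio with an English transcript
    return None  # language mismatch: exclude from training


pairs = [("en", "en"), ("de", "en"), ("de", "fr")]
routes = [route_pair(audio, text) for audio, text in pairs]
print(routes)
```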

The team breaks audio files into 30-second segments, each paired with the portion of the transcript that falls within that time segment. They train on all audio, including segments without speech, and use those segments as training data for voice activity detection. After training an initial model, they review its error rate on each training data source. Sorting the sources by a combination of high error rate and size helps them identify and remove low-quality ones.
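A minimal sketch of the segmentation step, assuming captions carry start and end times in seconds. The `Caption` type and the `chunk_with_captions` helper are hypothetical names, and assigning each caption to the chunk in which it starts is a simplification chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

CHUNK_SECONDS = 30.0  # segment length stated in the text


@dataclass
class Caption:
    start: float  # seconds from the start of the file
    end: float
    text: str


def chunk_with_captions(
    duration: float, captions: List[Caption]
) -> List[Tuple[float, float, List[Caption]]]:
    """Split [0, duration) into 30-second chunks and attach captions by start time."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        inside = [c for c in captions if start <= c.start < end]
        chunks.append((start, end, inside))
        start += CHUNK_SECONDS
    return chunks


captions = [Caption(1.0, 4.0, "hi"), Caption(31.0, 35.0, "there")]
chunks = chunk_with_captions(70.0, captions)
for start, end, inside in chunks:
    # A chunk with no captions would serve as a no-speech example.
    print(start, end, [c.text for c in inside])
```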

The models were trained on 680,000 hours of audio and corresponding transcripts collected from the internet. About 65% of this data (438,000 hours) is English-language audio with matched English transcripts. Roughly 18% (126,000 hours) is non-English audio with English transcripts, and the remaining 17% (117,000 hours) is non-English audio with transcripts in the same language. This non-English data covers 98 languages.

As mentioned in the paper, transcription performance in a given language is directly correlated with the amount of training data available for that language.


They use an encoder-decoder Transformer architecture (Vaswani et al., 2017). The audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram is computed using 25-millisecond windows with a 10-millisecond stride. For feature normalization, the input is globally scaled to lie between -1 and 1, with approximately zero mean across the pre-training dataset.
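The numbers in this paragraph fix the shape of the model input. The arithmetic below is a sketch that assumes a 30-second input segment and one spectrogram frame per stride step (that is, padding at the edges so no frame is lost); the variable names are illustrative.

```python
SAMPLE_RATE = 16_000   # Hz, from the text
WINDOW_MS = 25         # analysis window length
STRIDE_MS = 10         # hop between successive windows
N_MELS = 80            # Mel channels
CHUNK_SECONDS = 30     # assumed segment length

window_samples = SAMPLE_RATE * WINDOW_MS // 1000   # samples per window
hop_samples = SAMPLE_RATE * STRIDE_MS // 1000      # samples per stride
chunk_samples = SAMPLE_RATE * CHUNK_SECONDS        # samples per segment
n_frames = chunk_samples // hop_samples            # frames per segment

spectrogram_shape = (N_MELS, n_frames)
print(window_samples, hop_samples, spectrogram_shape)
```

Under these assumptions each 30-second segment becomes an 80 x 3000 array before it reaches the encoder.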

The encoder begins processing this input through a ‘stem’, which consists of two convolution layers with a filter width of 3, utilizing the GELU activation function. The second convolution layer has a stride of two. After this, sinusoidal position embeddings are added to the output of the stem, followed by the application of the Transformer blocks in the encoder. The Transformer blocks use pre-activation residual blocks and a final layer normalization is applied to the encoder output.
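To make the stem concrete, the sketch below computes the sequence length after each convolution and builds one sinusoidal position embedding. It assumes a 3000-frame input (30 seconds at a 10-millisecond stride) and a padding of 1 on both convolutions, which the post does not state. The embedding follows the interleaved sine/cosine form from Vaswani et al.; an actual implementation may lay out the sine and cosine terms differently.

```python
import math
from typing import List


def conv1d_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def sinusoidal_embedding(position: int, width: int) -> List[float]:
    """Sine on even indices, cosine on odd indices, with geometric wavelengths."""
    emb: List[float] = []
    for i in range(0, width, 2):
        angle = position / (10000 ** (i / width))
        emb.append(math.sin(angle))
        emb.append(math.cos(angle))
    return emb[:width]


frames = 3000  # assumed: 30 s of audio at a 10 ms stride
after_conv1 = conv1d_out_len(frames, kernel=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel=3, stride=2, padding=1)
ms_per_position = 30_000 / after_conv2

print(after_conv1, after_conv2, ms_per_position)
print(sinusoidal_embedding(0, 4))
```

With these assumptions the stride-two layer halves the sequence, so each encoder position covers 20 milliseconds of audio.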

The decoder, in contrast, uses learned position embeddings and tied input-output token representations. The encoder and decoder have the same width and the same number of Transformer blocks.

Multitask Format

While predicting spoken words in an audio snippet is a critical aspect of the overall speech recognition problem and is widely studied, it isn’t the only component. A comprehensive speech recognition system can also include other components such as voice activity detection, speaker diarization, and inverse text normalization. Typically, these components are managed separately, leading to a relatively complex system surrounding the core speech recognition model.

They propose handling all of these tasks with a single model. In the paper's words:

We use a simple format to specify all tasks and conditioning information as a sequence of input tokens to the decoder. Since our decoder is an audio-conditional language model, we also train it to condition on the history of text of the transcript in the hope that it will learn to use longer-range text context to resolve ambiguous audio. Specifically, with some probability we add the transcript text preceding the current audio segment to the decoder’s context. We indicate the beginning of prediction with a <|startoftranscript|> token. First, we predict the language being spoken which is represented by a unique token for each language in our training set (99 total). These language targets are sourced from the aforementioned VoxLingua107 model. In the case where there is no speech in an audio segment, the model is trained to predict a <|nospeech|> token indicating this. The next token specifies the task (either transcription or translation) with an <|transcribe|> or <|translate|> token. After this, we specify whether to predict timestamps or not by including a <|notimestamps|> token for that case. At this point, the task and desired format is fully specified, and the output begins. For timestamp prediction, we predict time relative to the current audio segment, quantizing all times to the nearest 20 milliseconds which matches the native time resolution of Whisper models, and add additional tokens to our vocabulary for each of these. We interleave their prediction with the caption tokens: the start time token is predicted before each caption’s text, and the end time token is predicted after. When a final transcript segment is only partially included in the current 30-second audio chunk, we predict only its start time token for the segment when in timestamp mode, to indicate that the subsequent decoding should be performed on an audio window aligned with that time, otherwise we truncate the audio to not include the segment. Lastly, we add a <|endoftranscript|> token. We only mask out the training loss over the previous context text, and train the model to predict all other tokens.
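The token format described above can be sketched as a function that assembles the decoder sequence for one audio segment. This is an illustration under several assumptions: the helper names are hypothetical, language tokens are rendered as `<|en|>` and timestamp tokens as `<|1.24|>`, caption text is kept as a single string instead of real sub-word tokens, the no-speech case is assumed to end immediately, and the optional previous-context text is omitted.

```python
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, caption text)


def timestamp_token(seconds: float) -> str:
    """Quantize a time to the nearest 20 ms and render it as a token string."""
    quantized_ms = round(seconds * 1000 / 20) * 20
    return f"<|{quantized_ms / 1000:.2f}|>"


def build_sequence(
    language: Optional[str],
    task: str,
    segments: Sequence[Segment],
    timestamps: bool = True,
) -> List[str]:
    """Assemble the decoder tokens; language=None stands for a no-speech segment."""
    tokens = ["<|startoftranscript|>"]
    if language is None:
        return tokens + ["<|nospeech|>", "<|endoftranscript|>"]
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")

    tokens += [f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")

    for start, end, text in segments:
        if timestamps:
            # Start time before the caption text, end time after it.
            tokens += [timestamp_token(start), text, timestamp_token(end)]
        else:
            tokens.append(text)

    tokens.append("<|endoftranscript|>")
    return tokens


print(build_sequence("en", "transcribe", [(0.0, 2.5, "Hello.")]))
print(build_sequence("de", "translate", [(0.0, 1.0, "Hi")], timestamps=False))
```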

Pretty smart solution!!

Available models

The paper describes five model sizes: tiny, base, small, medium, and large.

You can install and run Whisper by following their GitHub repo:
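For orientation, here is a minimal usage sketch. It assumes the `openai-whisper` Python package is installed (for example with `pip install -U openai-whisper`) and that ffmpeg is available on the system. The wrapper `transcribe_file` and the file name in the usage comment are hypothetical; `whisper.load_model` and `model.transcribe` are the package's own calls.

```python
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")  # sizes named in the paper


def transcribe_file(path: str, size: str = "base") -> str:
    """Transcribe one audio file and return the recognized text."""
    if size not in MODEL_SIZES:
        raise ValueError(f"size must be one of {MODEL_SIZES}, got {size!r}")

    # Imported lazily so this module loads even when the package is absent.
    import whisper

    model = whisper.load_model(size)   # downloads the weights on first use
    result = model.transcribe(path)    # dict with "text", "segments", "language"
    return result["text"]


# Example (requires the package, ffmpeg, and a real audio file):
# print(transcribe_file("audio.mp3", size="base"))
```

Smaller sizes run faster and need less memory, while larger ones are generally more accurate.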

Performance and Limitation

The authors' studies show that, compared with many existing ASR systems, the models are more robust to accents, background noise, and technical language, and can perform zero-shot translation from multiple languages into English. Accuracy on speech recognition and translation is near the state of the art.

Because the models are trained in a weakly supervised manner on massive amounts of noisy data, their predictions may include text that was not actually spoken in the audio input, which is commonly known as "hallucination". The authors hypothesize that this happens because the models, drawing on their general knowledge of language, combine predicting the next word with transcribing the audio itself.

The models also perform unevenly across languages, with lower accuracy on low-resource or hard-to-find languages that have less training data. Accuracy also varies across accents and dialects of a given language, which may include higher word error rates for speakers of particular genders, races, ages, or other demographic groups. The paper provides a detailed analysis of the evaluation results.
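Word error rate, the metric mentioned above, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain implementation with whitespace tokenization and no text normalization, so its numbers are not comparable to the paper's reported results; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Computing this per language or per speaker group is one way to surface the kind of disparities described above.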



