Whisper
Whisper is an open-source automatic speech recognition system developed by OpenAI, trained on 680,000 hours of multilingual audio data and capable of transcription, translation, and language identification across nearly 100 languages.
Whisper is an automatic speech recognition (ASR) system developed by OpenAI and released as open-source software in September 2022. Trained on approximately 680,000 hours of multilingual and multitask audio data collected from the internet, Whisper is notable for its robustness across languages, accents, dialects, and acoustic conditions including background noise. It performs transcription, speech translation into English, language identification, and timestamp alignment within a single unified model.
Architecture
Whisper uses an encoder-decoder transformer architecture. Input audio is split into 30-second chunks, and each chunk is converted from a raw waveform into a log-Mel spectrogram, a compact time-frequency representation that captures how audio energy is distributed across frequency bands over time. Each spectrogram passes through a convolutional front-end followed by transformer encoder layers. The decoder then generates text tokens autoregressively, conditioned on the encoded audio representation and on task-conditioning tokens that specify the language and task type.
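The chunking step described above can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the 16 kHz sample rate is an assumption taken from common Whisper tooling, the function name split_into_chunks is hypothetical, and real implementations may advance their window differently (for example, based on predicted timestamps) rather than cutting at fixed offsets.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000                         # assumed sample rate in Hz
CHUNK_SECONDS = 30                           # fixed window length from the text
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(waveform: Sequence[float]) -> List[List[float]]:
    """Split a mono waveform into 30-second chunks, zero-padding the last one.

    An empty waveform yields a single all-zero chunk so that downstream
    code always receives at least one fixed-length input.
    """
    chunks: List[List[float]] = []
    for start in range(0, max(len(waveform), 1), CHUNK_SAMPLES):
        chunk = list(waveform[start:start + CHUNK_SAMPLES])
        # Pad short chunks with silence so every chunk has the same length.
        chunk.extend([0.0] * (CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks
```

Each fixed-length chunk would then be turned into a log-Mel spectrogram before reaching the encoder; that step is omitted here because it depends on a signal-processing library.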
This multitask conditioning mechanism allows Whisper to function as a single model for multiple audio-to-text operations. Special tokens at the start of the decoder prompt direct the model to transcribe speech in its original language or to translate it into English, and the model can predict the language token itself to identify the language being spoken. Timestamp tokens can also be requested, allowing Whisper to produce phrase-level timing information aligned to the source audio, which is useful for generating synchronised subtitles. Word-level timing is typically obtained through an additional alignment step rather than read directly from the timestamp tokens.
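To make the conditioning concrete, the sketch below assembles the sequence of special tokens that precedes the generated text. The token spellings follow the format described in the Whisper paper; the helper name build_decoder_prompt is hypothetical, and the tokens are kept as strings here, whereas a real tokenizer would map them to integer IDs.

```python
from typing import List


def build_decoder_prompt(language: str, task: str,
                         timestamps: bool = False) -> List[str]:
    """Return the special tokens that condition the decoder.

    language:   a language code such as "en" or "fr"
    task:       "transcribe" (same-language text) or "translate" (into English)
    timestamps: when False, a token is added that suppresses timestamp output
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


# French audio translated into English, with timestamps enabled.
french_to_english = build_decoder_prompt("fr", "translate", timestamps=True)
```

Because the task is selected entirely through these prompt tokens, switching between transcription and translation requires no change to the model weights.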
Model Sizes and Trade-offs
OpenAI released Whisper in five size variants: Tiny, Base, Small, Medium, and Large. Larger variants achieve lower word error rates but require more memory and computation. The Tiny and Base models can run efficiently on CPU hardware, making them suitable for edge deployment and low-latency applications. The Large variant, particularly the Large-v3 release of November 2023, provides state-of-the-art transcription quality on many benchmarks but generally needs a GPU for real-time operation.
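Word error rate (WER), the metric used to compare the variants, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch that assumes whitespace tokenisation and applies no text normalisation; published evaluations usually normalise case and punctuation first, so their figures are not directly comparable. The function name word_error_rate is illustrative.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (rolling-row Levenshtein).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion against six reference words, giving a WER of 2/6.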
The whisper-large-v3 model on Hugging Face is among the most widely downloaded ASR models, reflecting its broad adoption across research and industry.
Capabilities and Limitations
Whisper demonstrates strong performance across a wide range of languages and acoustic conditions. Its training corpus of 680,000 hours dwarfs earlier ASR datasets, and the diversity of internet-sourced audio means the model handles spontaneous conversational speech, technical vocabulary, regional accents, and non-native speakers more robustly than models trained on carefully curated studio recordings.
Despite these strengths, Whisper has documented failure modes. It occasionally generates plausible-sounding text that was never spoken, a behaviour usually called hallucination and sometimes confabulation, particularly during long silences or near the boundaries of audio segments. Some languages in its training corpus are under-represented, leading to higher error rates for those languages. Whisper also lacks native speaker diarisation (the ability to identify which speaker said which words), though diarisation can be added by pairing Whisper with a separate speaker-segmentation tool.
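One common way to pair the two tools is to label each transcribed segment with the speaker whose turn overlaps it most in time. The sketch below assumes both tools emit simple (start, end, ...) tuples in seconds; the tuple layouts and the name assign_speakers are hypothetical and do not correspond to any particular diarisation library.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start, end, text) from the ASR model
Turn = Tuple[float, float, str]     # (start, end, speaker) from the diariser


def assign_speakers(segments: List[Segment],
                    turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Label each segment with the speaker whose turn overlaps it the most.

    Segments that overlap no turn at all are labelled None.
    """
    labelled: List[Tuple[Optional[str], str]] = []
    for seg_start, seg_end, text in segments:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time interval shared by the segment and the turn.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, text))
    return labelled
```

Maximum overlap is a deliberately simple rule: a segment that spans a speaker change is assigned wholly to one speaker, so finer attribution would need word-level timing.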
Downstream Applications
Whisper has been integrated into a large ecosystem of applications. Real-time captioning systems for video conferencing and broadcast media use Whisper-derived models as their speech-to-text backend. Transcription services for interview recording, medical dictation, legal depositions, and academic research have adopted Whisper for its accuracy and language coverage. Voice assistants, podcast transcription tools, and subtitle-generation pipelines all draw on the model. OpenAI also exposes Whisper through its commercial API under the model name whisper-1, making it accessible to developers who prefer a hosted service to local deployment.
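For the hosted route, a call through the official OpenAI Python SDK might look like the sketch below. The wrapper transcribe_with_api is a hypothetical helper; it assumes a client object from a recent version of the SDK that exposes audio.transcriptions.create, and it takes the client as an argument so the function can be exercised without network access.

```python
from typing import Any, BinaryIO


def transcribe_with_api(client: Any, audio_file: BinaryIO,
                        model: str = "whisper-1") -> str:
    """Send an open binary audio file to the hosted model and return its text.

    `client` is assumed to behave like `openai.OpenAI()`; any object with the
    same `audio.transcriptions.create` method will work.
    """
    response = client.audio.transcriptions.create(model=model, file=audio_file)
    return response.text


# Example usage (requires the `openai` package and an API key in the
# environment; "meeting.mp3" is a placeholder file name):
#
#   from openai import OpenAI
#   with open("meeting.mp3", "rb") as fh:
#       print(transcribe_with_api(OpenAI(), fh))
```

Injecting the client keeps credentials and network configuration outside the helper, which also makes it straightforward to substitute a locally hosted model behind the same interface.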