Text-to-Speech
Text-to-speech is the technology that converts written text into synthesised spoken audio using rule-based, concatenative, or neural network methods.
Text-to-speech (TTS) is the process of generating intelligible, natural-sounding spoken audio from written text. Historically a niche assistive technology for visually impaired users, TTS has become a foundational component of voice assistants, navigation systems, audiobook production, accessibility tools, in-game dialogue generation, and contact-centre automation. The arrival of deep learning has transformed output quality from the recognisable robotic cadence of early systems to voices that human listeners often cannot distinguish from recorded speech.
Pipeline
A modern neural TTS pipeline contains three logical stages.
The text front end normalises the input — expanding numerals, abbreviations, and dates — performs grapheme-to-phoneme conversion when required, predicts prosody (pauses, stress, intonation), and produces a sequence of phonetic or linguistic features.
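The normalisation step can be illustrated with a deliberately small sketch. The abbreviation table, the function names, and the 0 to 999 number range are assumptions made for this example; production front ends rely on far larger lexicons and context-sensitive rules (for instance, deciding whether "St." means "Street" or "Saint").

```python
import re

# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "km": "kilometres"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def normalise(text: str) -> str:
    """Expand known abbreviations and short numerals into spoken words."""
    # Expand abbreviations token by token (whitespace-delimited).
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    text = " ".join(tokens)
    # Replace standalone runs of one to three digits with their spoken form.
    # Longer numbers are left untouched by this sketch.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(normalise("Dr. Lim lives 42 km from 7 Main St."))
```

Dates, currency, ordinals, and language-specific rules would each need their own handling; the point here is only the shape of the transformation from written to spoken form.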
The acoustic model maps those features to a compact intermediate acoustic representation, typically a sequence of mel-spectrogram frames. Architectures here include Tacotron 2 (attention-based encoder-decoder), FastSpeech and FastSpeech 2 (non-autoregressive with explicit duration prediction), Glow-TTS (flow-based), and Grad-TTS (diffusion-based).
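Explicit duration prediction, as used in the FastSpeech family, can be sketched as a length regulator: each phoneme-level feature vector is repeated for as many output frames as its predicted duration. The function name and the plain-list representation below are assumptions for illustration; real implementations operate on batched tensors.

```python
from typing import List, Sequence


def length_regulate(phoneme_features: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Expand phoneme-level features to frame level.

    phoneme_features: one feature vector per phoneme.
    durations: predicted number of spectrogram frames per phoneme.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for feature, n_frames in zip(phoneme_features, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Repeat the phoneme's vector once per frame it should occupy.
        frames.extend(list(feature) for _ in range(n_frames))
    return frames


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(features, durations=[2, 0, 3])
print(len(frames))  # total number of frames equals sum of durations
```

Because the output length is fixed by the durations before decoding starts, all frames can be generated in parallel, and scaling the durations gives direct control over speaking rate.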
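The vocoder synthesises the final waveform from the mel-spectrogram. WaveNet introduced fully neural vocoding at the cost of slow autoregressive sampling, in which each audio sample is generated conditioned on the previous ones. Parallel successors such as WaveGlow, HiFi-GAN, and BigVGAN generate all samples at once and can synthesise high-fidelity audio faster than real time on consumer hardware.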
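Vocoder speed is commonly summarised as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The following is a minimal measurement sketch; `silent_vocoder` is a placeholder standing in for a real model, and the hop length of 256 and sample rate of 22,050 Hz are assumed values, not properties of any particular system.

```python
import time
from typing import Callable, List, Sequence


def real_time_factor(elapsed_seconds: float, num_samples: int,
                     sample_rate: int) -> float:
    """RTF = synthesis time / duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    return elapsed_seconds / (num_samples / sample_rate)


def measure_rtf(vocoder: Callable[[Sequence[Sequence[float]]], List[float]],
                mel_frames: Sequence[Sequence[float]],
                sample_rate: int) -> float:
    """Time one vocoder call and convert the result to an RTF."""
    start = time.perf_counter()
    waveform = vocoder(mel_frames)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)


def silent_vocoder(mel_frames: Sequence[Sequence[float]],
                   hop_length: int = 256) -> List[float]:
    """Placeholder vocoder: emits hop_length zero samples per mel frame."""
    return [0.0] * (len(mel_frames) * hop_length)


mel = [[0.0] * 80 for _ in range(100)]  # 100 frames of 80 mel bins
print(measure_rtf(silent_vocoder, mel, sample_rate=22050))
```

A single timed call is noisy; in practice the measurement would be averaged over many utterances after a warm-up run.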
End-to-end and codec-based models have begun to collapse this pipeline. VITS generates waveforms directly from text, while systems such as Tortoise TTS, Bark, and VALL-E predict discrete audio tokens that a separate decoder turns into a waveform. Many of these token-based systems can be conditioned on a short reference recording to clone a target voice.
Voice cloning and zero-shot TTS
A defining capability of recent systems is zero-shot voice cloning: synthesising new utterances in a target speaker's voice from a few seconds of reference audio. Microsoft's VALL-E, ElevenLabs' instant voice cloning, and Meta's Voicebox show that a generative model trained on tens of thousands of hours of speech can capture timbre, accent, and prosodic style without per-speaker retraining. The underlying architectures differ: VALL-E is an audio codec language model, whereas Voicebox is a non-autoregressive flow-matching model. The same capability has driven a sharp rise in audio deepfake fraud and a corresponding wave of detection research.
Evaluation
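Subjective quality is measured by Mean Opinion Score (MOS), in which listeners rate utterances on a 5-point scale from 1 (bad) to 5 (excellent) and the ratings are averaged. The strongest neural systems report scores around 4.5 on their own test sets, approaching the MOS of recorded human speech; Shen et al. (2018), for example, report 4.53 for Tacotron 2 against 4.58 for ground-truth recordings. Because MOS depends on the listeners, the material, and the test design, scores from different studies are not directly comparable. Objective measures include mel-cepstral distortion (MCD), the word error rate (WER) of an ASR system run on the synthesised audio, and speaker similarity for cloning use cases.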
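The measures above are simple to compute once ratings, transcripts, or features are available. The sketch below uses common textbook definitions; the function names are illustrative, the confidence interval uses a normal approximation, MCD assumes the two frame sequences are already time-aligned, and speaker similarity is taken to be cosine similarity between speaker embeddings from some external encoder.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return mean(ratings), half_width


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mel_cepstral_distortion(ref_frames: Sequence[Sequence[float]],
                            syn_frames: Sequence[Sequence[float]]) -> float:
    """Mean MCD in dB over aligned frames, skipping coefficient 0 (energy)."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    return mean(
        scale * math.sqrt(sum((a - b) ** 2 for a, b in zip(r[1:], s[1:])))
        for r, s in zip(ref_frames, syn_frames)
    )


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / norm


score, ci = mos_with_ci([5, 4, 5, 4, 5, 4])
print(f"MOS = {score:.2f} +/- {ci:.2f}")
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Reporting the confidence interval alongside the mean matters because MOS differences between strong systems are often smaller than the rating noise.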
Multilingual and low-resource TTS
Producing high-quality TTS in languages with limited training data remains an active research area. Approaches include multilingual joint training (a single model serving many languages), cross-lingual transfer (reusing phoneme representations learned from high-resource languages to bootstrap low-resource ones), and self-supervised acoustic pre-training on untranscribed audio. For languages as spoken in Southeast Asia, such as Bahasa Melayu and the regional varieties of Tamil and Mandarin Chinese, these techniques are especially valuable because dedicated corpora for the local varieties are small.
Applications
TTS underpins screen readers (NVDA, JAWS, VoiceOver), automotive navigation, public-transport announcements, e-learning content, podcast and audiobook production, video dubbing and localisation, conversational AI assistants, and emergency public-address systems. In contact centres, TTS combined with speech recognition and large language models has replaced significant portions of pre-recorded IVR menus with natural dialogue.
Risks and policy
Voice cloning carries identity and consent risks. Several jurisdictions have begun to require disclosure when synthetic voices are used in advertising or political communication, and platforms have introduced watermarking and provenance metadata to allow downstream detection. Industry best practice now includes voice-consent verification before cloning and audit logs of generation activity.
References
- Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP.
- Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ICLR.
- Wang, C. et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E). Microsoft Research, arXiv:2301.02111.
- Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS.
- Dewan Bahasa dan Pustaka. (2024). Bahasa Melayu Language Technology Initiatives. DBP Publications.