Text-to-Speech

Text-to-speech is the technology that converts written text into synthesised spoken audio using rule-based, concatenative, or neural network methods.

5 min read · Last updated May 2026 · Applications

Text-to-speech (TTS) is the process of generating intelligible, natural-sounding spoken audio from written text. Historically a niche assistive technology for visually impaired users, TTS has become a foundational component of voice assistants, navigation systems, audiobook production, accessibility tools, in-game dialogue generation, and contact-centre automation. The arrival of deep learning has transformed output quality from the recognisable robotic cadence of early systems to voices that human listeners often cannot distinguish from recorded speech.

Pipeline

A modern neural TTS pipeline contains three logical stages.

The text front end normalises the input — expanding numerals, abbreviations, and dates — performs grapheme-to-phoneme conversion when required, predicts prosody (pauses, stress, intonation), and produces a sequence of phonetic or linguistic features.
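The normalisation step of the front end can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular TTS system: the abbreviation table and the `number_to_words` and `normalise` helpers are hypothetical names introduced here, and the number handling covers only stand-alone English integers from 0 to 999.

```python
import re

# Hypothetical abbreviation table; a production front end would use a much
# larger, language-specific lexicon and context-dependent rules.
ABBREVIATIONS = {"Dr.": "Doctor", "No.": "number", "km": "kilometres"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def normalise(text: str) -> str:
    """Expand known abbreviations and small stand-alone integers."""
    for abbr, expansion in ABBREVIATIONS.items():
        # Match the abbreviation only when it is not glued to other letters.
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, expansion, text)
    # Integers of one to three digits; longer numbers are left untouched.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(normalise("Dr. Lim lives 12 km from No. 45."))
# Doctor Lim lives twelve kilometres from number forty-five.
```

Decimals, thousands separators, dates, and currency would each need their own rules, and the result would then pass to grapheme-to-phoneme conversion and prosody prediction.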

The acoustic model maps those features to a low-dimensional acoustic representation, typically a sequence of mel-spectrogram frames. Architectures here include Tacotron 2 (attention-based encoder-decoder), FastSpeech and FastSpeech 2 (non-autoregressive with explicit duration prediction), and Glow-TTS or Grad-TTS (flow- and diffusion-based variants).
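Explicit duration prediction, as used in the FastSpeech family, can be illustrated with a toy length regulator: each phoneme-level hidden state is repeated for as many output frames as its predicted duration, so the sequence reaches mel-spectrogram length without attention. The sketch below uses plain Python lists; the function name and toy values are illustrative assumptions, and real models apply this to learned tensors.

```python
from typing import List, Sequence


def length_regulate(phoneme_states: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme state durations[i] times along the time axis.

    phoneme_states: one hidden vector per phoneme.
    durations:      one non-negative integer frame count per phoneme.
    Returns a list with sum(durations) frame-level vectors.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not share the same list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Toy example: three phonemes with 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
frames = length_regulate(states, [2, 1, 3])
print(len(frames))  # 6
```

Because every output frame is known up front, the decoder can generate all mel frames in parallel, which is the source of the speed advantage over autoregressive models such as Tacotron 2.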

The vocoder synthesises the final waveform from the mel-spectrogram. WaveNet introduced fully neural vocoding at the cost of slow autoregressive sampling; parallel successors such as WaveGlow, HiFi-GAN, and BigVGAN generate all samples at once and can run in real time or faster on consumer hardware while preserving high audio quality.
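Whether a vocoder runs in real time is usually expressed as a real-time factor (RTF): compute time divided by the duration of the audio produced. The sketch below assumes each mel frame corresponds to one hop of waveform samples; the default hop length of 256 and sample rate of 22,050 Hz are common settings assumed here for illustration, not values taken from this article, and the timing figure is hypothetical.

```python
def audio_seconds(num_frames: int, hop_length: int = 256,
                  sample_rate: int = 22050) -> float:
    """Duration of the waveform produced from num_frames mel frames."""
    return num_frames * hop_length / sample_rate


def real_time_factor(synthesis_seconds: float, num_frames: int,
                     hop_length: int = 256,
                     sample_rate: int = 22050) -> float:
    """Compute time over audio duration; below 1.0 is faster than real time."""
    duration = audio_seconds(num_frames, hop_length, sample_rate)
    if duration <= 0:
        raise ValueError("need at least one frame of audio")
    return synthesis_seconds / duration


# Hypothetical measurement: 861 mel frames (about 10 s) vocoded in 0.5 s.
rtf = real_time_factor(0.5, num_frames=861)
print(f"RTF = {rtf:.3f}")
```

An autoregressive vocoder that emits one sample per network step tends to have an RTF well above 1.0 on the same hardware, which is why parallel designs displaced it in interactive applications.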

End-to-end models have begun to collapse this pipeline. Systems such as VITS, Tortoise TTS, Bark, and VALL-E generate waveforms or audio codec tokens directly from text, often conditioned on a short reference recording to clone a target voice.

Voice cloning and zero-shot TTS

A defining capability of recent systems is zero-shot voice cloning: synthesising new utterances in a target speaker's voice from a few seconds of reference audio. Microsoft's VALL-E, an audio codec language model trained on about 60,000 hours of speech, clones a voice from a three-second sample without retraining. Meta's Voicebox, a flow-matching model, and ElevenLabs' instant voice cloning offer comparable capability. These systems show that models trained on very large speech corpora can capture timbre, accent, and prosodic style from a short reference. The same capability has driven a sharp rise in audio deepfake fraud and a corresponding wave of detection research.

Evaluation

Subjective quality is measured by Mean Opinion Score (MOS), in which listeners rate utterances on a 5-point scale from 1 (bad) to 5 (excellent). The strongest neural systems report scores around 4.5, close to the MOS of recorded human speech; Tacotron 2, for example, reported 4.53 against 4.58 for professional recordings (Shen et al., 2018). Objective measures include mel-cepstral distortion, the word error rate of a downstream ASR system run on the synthesised audio, and speaker similarity for cloning use cases.
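The metrics above are simple to compute once the inputs exist. The following sketch uses only the Python standard library and makes several assumptions: listener ratings are integers from 1 to 5, the confidence interval uses a normal approximation, the cepstral frames are already time-aligned and exclude the 0th (energy) coefficient, and the speaker embeddings come from some external speaker-verification model that is not shown. All function names are illustrative.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[int]) -> Tuple[float, float]:
    """Mean opinion score and half-width of an approximate 95% interval."""
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return mean(ratings), half_width


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mel_cepstral_distortion(ref_frames: Sequence[Sequence[float]],
                            syn_frames: Sequence[Sequence[float]]) -> float:
    """Mean MCD in dB over aligned frames: (10/ln 10) * sqrt(2 * sum(d^2))."""
    const = 10.0 / math.log(10) * math.sqrt(2.0)
    return mean(
        const * math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))
        for r, s in zip(ref_frames, syn_frames)
    )


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Speaker similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(mos_with_ci([5, 4, 5, 4, 4]))
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better for WER and MCD, higher is better for MOS and speaker similarity. In practice the two cepstral sequences are first aligned, for example with dynamic time warping, because synthesised and reference audio rarely have identical timing.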

Multilingual and low-resource TTS

Producing high-quality TTS in languages with limited training data remains an active research area. Approaches include multilingual joint training (a single model serving many languages), cross-lingual transfer (reusing phoneme representations learned from high-resource languages to bootstrap low-resource ones), and self-supervised acoustic pre-training on untranscribed audio. For languages spoken in Southeast Asia, including Bahasa Melayu, Tamil, and regional varieties of Mandarin Chinese, these techniques matter because dedicated single-language corpora are often small.

Applications

TTS underpins screen readers (NVDA, JAWS, VoiceOver), automotive navigation, public-transport announcements, e-learning content, podcast and audiobook production, video dubbing and localisation, conversational AI assistants, and emergency public-address systems. In contact centres, TTS combined with speech recognition and large language models has replaced significant portions of pre-recorded IVR menus with natural dialogue.

Risks and policy

Voice cloning carries identity and consent risks. Several jurisdictions have begun to require disclosure when synthetic voices are used in advertising or political communication, and platforms have introduced watermarking and provenance metadata to allow downstream detection. Industry best practice now includes voice-consent verification before cloning and audit logs of generation activity.
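Consent verification and audit logging can be combined in a small gate that every cloning request passes through. The sketch below is one hypothetical design rather than an industry standard: the `ConsentRecord` and `VoiceCloneGate` names, the purpose strings, and the log fields are all assumptions made for illustration. It records a hash of the requested text instead of the text itself, and it logs refused requests as well as allowed ones.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of what a speaker has agreed to."""
    speaker_id: str
    allowed_purposes: FrozenSet[str]
    expires_at: datetime  # timezone-aware


class VoiceCloneGate:
    """Checks consent before cloning and logs every request."""

    def __init__(self, consents: Dict[str, ConsentRecord]) -> None:
        self._consents = consents
        self.audit_log: List[str] = []  # one JSON line per request

    def request(self, speaker_id: str, purpose: str, text: str,
                now: Optional[datetime] = None) -> bool:
        now = now or datetime.now(timezone.utc)
        record = self._consents.get(speaker_id)
        allowed = (record is not None
                   and purpose in record.allowed_purposes
                   and now < record.expires_at)
        entry = {
            "timestamp": now.isoformat(),
            "speaker_id": speaker_id,
            "purpose": purpose,
            "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "allowed": allowed,
        }
        self.audit_log.append(json.dumps(entry, sort_keys=True))
        return allowed


gate = VoiceCloneGate({
    "spk-1": ConsentRecord("spk-1", frozenset({"audiobook"}),
                           datetime(2027, 1, 1, tzinfo=timezone.utc)),
})
when = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(gate.request("spk-1", "audiobook", "Selamat pagi.", now=when))    # True
print(gate.request("spk-1", "advertising", "Selamat pagi.", now=when))  # False
```

A real deployment would also need to verify that the person granting consent is the speaker in the reference audio, and to store the log somewhere tamper-evident.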

Malaysian Context — TTS for Bahasa Melayu and beyond

Malaysian deployment of TTS is shaped by the country's multilingual environment — Bahasa Melayu, English, Mandarin, Tamil, and several indigenous languages — and by the need to serve users across the urban–rural and digital-literacy divide.

The Dewan Bahasa dan Pustaka (DBP) and academic groups at Universiti Sains Malaysia, Universiti Kebangsaan Malaysia, and Universiti Putra Malaysia have developed Bahasa Melayu speech corpora used by both academic and commercial TTS systems. Telekom Malaysia (TM)'s research arm and MIMOS Berhad have contributed acoustic models and front-end normalisation rules for Malay.

Sector adoption is broad. Maybank, CIMB, and Public Bank use TTS in IVR and outbound call automation. Grab Malaysia and AirAsia integrate TTS into voice navigation, in-app announcements, and chat-to-voice features. The Ministry of Health (KKM) uses TTS for accessibility on government health portals such as MyHEALTH. Local startups offer TTS-powered training content delivered through HRD Corp-funded programmes.

Regulatory considerations include PDPA consent for voice data used in training and MCMC guidance on the disclosure of synthetic voices in advertising. Cybersecurity Malaysia and NACSA monitor voice deepfake fraud, which has emerged as a vector for impersonation scams targeting Malaysian banking customers.

References

Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP.
Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ICLR.
Wang, C. et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E). arXiv:2301.02111.
Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS.
Dewan Bahasa dan Pustaka. (2024). Bahasa Melayu Language Technology Initiatives. DBP Publications.

Tags: tts, speech-synthesis, voice-ai, multimodal

Also known as	TTS, speech synthesis
Inverse task	Speech recognition (ASR)
Modern approach	Neural acoustic models + neural vocoders
Notable systems	Tacotron 2, FastSpeech, VALL-E, ElevenLabs, Bark
Related	Speech recognition, multimodal AI, voice cloning