What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Speech Recognition

Speech recognition, or automatic speech recognition (ASR), is the technology that enables computers to identify and transcribe spoken language into text using acoustic models, language models, and deep learning architectures.

6 min read | Last updated May 2026 | Applications

Speech recognition, formally known as Automatic Speech Recognition (ASR) or Speech-to-Text (STT), is the technology that allows computing systems to identify spoken words and convert them into written text. It is a foundational component of voice assistants, real-time transcription services, accessibility tools, telephony systems, and multimodal AI interfaces. Modern ASR systems are built on deep learning architectures that jointly model acoustic patterns, language structure, and sometimes visual context, achieving near-human accuracy on standard benchmarks in controlled conditions. The global ASR market is projected to reach USD 73 billion by 2031, reflecting the technology's broad integration across consumer, enterprise, and public-sector applications.

Historical Development

Early speech recognition systems from the 1950s through the 1980s were highly constrained: they recognised isolated words, required extensive speaker training, and operated within limited vocabularies. Progress accelerated with the adoption of Hidden Markov Models (HMMs) in the 1970s and 1980s, which provided a principled probabilistic framework for modelling sequences of phonemes and words. HMM-based systems with Gaussian Mixture Model (GMM) acoustic models dominated commercial ASR through the early 2000s.

The deep learning revolution transformed ASR from approximately 2009 onwards. Researchers at the University of Toronto demonstrated that deep neural networks could replace the GMMs in HMM-based pipelines and substantially improve acoustic modelling. By 2014, end-to-end deep learning systems built on recurrent networks such as LSTMs were beginning to challenge modular HMM-based approaches. By the late 2010s, attention-based encoder-decoder models and Transformer architectures had set new performance benchmarks, and end-to-end models had supplanted the classic pipeline in most state-of-the-art systems.

Architecture

Modern ASR systems typically consist of three components.

Acoustic model: Processes raw audio waveforms or mel-spectrogram features and produces a sequence of frame-level representations. Deep learning architectures used here include LSTMs, Time-Delay Neural Networks (TDNNs), and Conformers — a hybrid of convolution and Transformer layers that is the dominant architecture in leading modern systems. The Conformer captures both local acoustic features (through convolution) and long-range dependencies (through self-attention).

Language model: Scores sequences of words for linguistic plausibility, helping disambiguate acoustically similar utterances. Neural language models, including Transformer-based LMs, improve transcription accuracy, particularly in noisy conditions and for rare or domain-specific words.

Decoder: Combines acoustic model outputs and language model scores to produce the final transcription, typically using beam search over possible word sequences.
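To make the interaction between the three components concrete, here is a minimal sketch of a beam-search decoder that adds a weighted language model score to each acoustic score. It is a toy under stated assumptions, not how any particular product works: the acoustic output is pre-segmented into word slots (real systems score audio frames), the language model is a hand-written bigram table, and every name and probability below (`ACOUSTIC_STEPS`, `BIGRAMS`, `lm_weight`) is invented for illustration.

```python
import math

# Hypothetical acoustic output: one dict per word slot, word -> probability.
# "their" and "there" sound alike, so the acoustic model barely separates them.
ACOUSTIC_STEPS = [
    {"their": 0.55, "there": 0.45},
    {"is": 0.90, "his": 0.10},
]

# Hypothetical bigram language model: (previous word, word) -> probability.
BIGRAMS = {
    ("<s>", "there"): 0.60,
    ("<s>", "their"): 0.40,
    ("there", "is"): 0.80,
    ("their", "is"): 0.05,
}

def bigram_logprob(previous, word):
    # Unseen bigrams get a small floor probability instead of zero.
    return math.log(BIGRAMS.get((previous, word), 1e-4))

def beam_search(acoustic_steps, lm_weight=1.0, beam_width=2):
    """Return (best word sequence, score) under acoustic + weighted LM scores."""
    beams = [(0.0, ["<s>"])]  # (total log score, words so far)
    for candidates in acoustic_steps:
        expanded = []
        for score, words in beams:
            for word, probability in candidates.items():
                combined = (score + math.log(probability)
                            + lm_weight * bigram_logprob(words[-1], word))
                expanded.append((combined, words + [word]))
        # Prune: keep only the highest-scoring partial hypotheses.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    best_score, best_words = beams[0]
    return best_words[1:], best_score

with_lm, _ = beam_search(ACOUSTIC_STEPS, lm_weight=1.0)
without_lm, _ = beam_search(ACOUSTIC_STEPS, lm_weight=0.0)
```

With the language model switched off (`lm_weight=0.0`) the decoder follows the slightly higher acoustic score and returns "their is"; with it switched on, the bigram scores tip the result to "there is". The weight is a tuning knob in this sketch, and a real system would tune it on held-out data.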

End-to-end systems such as OpenAI's Whisper, Meta's wav2vec 2.0, and Google's Universal Speech Model (USM) fold these components into a single neural network, simplifying the pipeline and substantially improving multilingual capability. Whisper is trained directly on audio-text pairs, whereas wav2vec 2.0 is first pre-trained on unlabelled audio and then fine-tuned on transcribed speech.

Whisper

OpenAI's Whisper, released in September 2022, became one of the most widely adopted open-weight ASR models. Trained on 680,000 hours of multilingual audio from the internet, Whisper supports transcription in 99 languages and translation into English. Its open-source availability made it the default choice for researchers and developers building ASR applications without per-request API costs. Whisper's strong multilingual performance, while uneven across languages relative to its English accuracy, provided a practical baseline for low-resource language ASR that had not previously existed.
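Accuracy comparisons across languages, such as the one above, are conventionally reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The sketch below is a plain edit-distance implementation for illustration; the function name and example sentences are assumptions, and published benchmarks usually normalise text (casing, punctuation) before scoring, which this sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in output
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)

# One deleted word ("the") and one substituted word ("light" -> "lights")
# against a five-word reference.
example_wer = word_error_rate("turn on the kitchen light", "turn on kitchen lights")
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.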

Key Challenges

Accent and dialect variation remains a persistent problem. Models trained predominantly on standard accents underperform on regional dialects, accented speech, or non-standard pronunciation. This disparity has been documented extensively and has implications for equitable access to voice AI.

Code-switching is particularly challenging: speakers in multilingual societies frequently alternate between two or more languages within a single utterance. Most commercial ASR systems are designed for monolingual input, and their accuracy degrades sharply on code-switched speech.
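As a rough illustration of what code-switching looks like to a system, the sketch below tags each word of an utterance with a language and counts how often the language changes. Everything here is an assumption made for the example: the two mini-lexicons are hand-written, the language codes `ms` and `en` are just labels, and a production system would rely on a trained language-identification model rather than word lists.

```python
# Hypothetical mini-lexicons for illustration only.
LEXICONS = {
    "ms": {"saya", "nak", "pergi", "lepas", "makan", "boleh"},
    "en": {"meeting", "lunch", "after", "office", "email"},
}

def tag_languages(utterance):
    """Return (word, language) pairs; words in no lexicon are tagged 'unk'."""
    tags = []
    for word in utterance.lower().split():
        language = next(
            (code for code, words in LEXICONS.items() if word in words), "unk"
        )
        tags.append((word, language))
    return tags

def count_switch_points(tags):
    """Count positions where the language differs from the previous known word."""
    known = [language for _, language in tags if language != "unk"]
    return sum(1 for previous, current in zip(known, known[1:]) if previous != current)

tags = tag_languages("Saya nak pergi meeting lepas lunch")
switches = count_switch_points(tags)
```

The six-word utterance changes language three times, so a recogniser that commits to a single language for the whole utterance has no consistent vocabulary or language model to fall back on.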

Robustness in noisy environments, including background noise, music, multi-speaker overlap, and channel distortion, remains an active research area. Signal enhancement preprocessing and multi-channel microphone array processing are commonly combined with acoustic modelling to address noise.
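One of the simplest forms of microphone array processing is delay-and-sum beamforming: shift each channel so the target speech lines up, then average, so that the speech adds coherently while uncorrelated noise partly cancels. The sketch below runs on synthetic data and assumes the per-microphone delays are known whole numbers of samples; in practice they must be estimated, and the signal, noise level, and delay values here are invented for the demonstration.

```python
import math
import random

def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average them."""
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    usable = len(channels[0]) - max(delays)
    return [
        sum(channel[n + delay] for channel, delay in zip(channels, delays))
        / len(channels)
        for n in range(usable)
    ]

def mean_squared_error(estimate, target):
    # zip stops at the shorter sequence, so target may be longer than estimate.
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(estimate)

# Synthetic demo: a sine wave reaches four microphones at different delays,
# and each microphone adds its own independent Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * n / 50) for n in range(2000)]
delays = [0, 3, 5, 8]
channels = [
    [(clean[n - d] if n >= d else 0.0) + rng.gauss(0.0, 0.5) for n in range(len(clean))]
    for d in delays
]

enhanced = delay_and_sum(channels, delays)
single_mic_error = mean_squared_error(channels[0], clean)
enhanced_error = mean_squared_error(enhanced, clean)
```

With four microphones and independent noise, averaging should cut the noise power to roughly a quarter of the single-microphone level in this setup; real rooms, with reverberation and correlated noise, see smaller gains.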

Low-resource languages present structural challenges: models require large quantities of aligned audio-text data to achieve high accuracy, and most of the world's languages lack such resources.

Applications

Speech recognition underpins a wide range of products and services: voice assistants such as Siri, Google Assistant, Alexa, and Cortana; real-time meeting transcription in platforms like Microsoft Teams, Zoom, and Otter.ai; call centre analytics for transcription and intent detection; accessibility tools for deaf and hard-of-hearing users; medical dictation systems for ambient clinical documentation; and voice-controlled interfaces for automotive, industrial, and smart home applications.

Malaysian Context — Bahasa Malaysia ASR and Local Voice AI

Malaysia's multilingual environment presents both a challenge and a significant opportunity for speech recognition technology. Bahasa Malaysia (BM) is the national language, spoken by over 32 million people, but Malaysian speech is characterised by pervasive code-switching: speakers mix BM with English (as in colloquial Manglish), as well as Mandarin, Hokkien, Tamil, and other languages, within a single conversation. Standard ASR systems trained on monolingual corpora perform poorly on this mixed-language speech, creating a clear gap for locally developed solutions.

Telekom Malaysia (TM) has developed and deployed speech recognition capabilities within its TM One enterprise services and the Unifi customer service platform, including voice bots for customer care that handle BM and English queries. Maxis and Celcom have similarly integrated voice AI into their customer service operations. AirAsia's conversational AI platform AVA (AirAsia Virtual Allstar) handles tens of millions of passenger interactions annually and incorporates speech recognition for voice-based queries across multiple languages.

MIMOS Berhad has conducted research into Bahasa Malaysia ASR, and the Language Technology Lab at Universiti Sains Malaysia (abbreviated USM, not to be confused with Google's Universal Speech Model) has developed BM speech corpora and acoustic models. The HIMPUN Bahasa Malaysia speech dataset and other locally collected corpora support academic and commercial ASR development for the national language and for Malaysian English.

The MyDIGITAL blueprint identifies voice AI and multilingual NLP as priority technology areas for Malaysia's digital economy. MDEC's Smart City and Digital Government initiatives include voice interfaces for government services, which makes accessible Bahasa Malaysia ASR particularly important for public-sector digital services. HRD Corp (Human Resource Development Corporation) has funded AI training programmes that include ASR and voice technology modules for Malaysian tech workers. As Speech Large Language Models (S-LLMs), which integrate speech encoders with LLM backbones, mature, genuinely fluent Malaysian multilingual voice AI is becoming increasingly viable.

References

Hinton, G., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6), 82-97.
Radford, A., et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). OpenAI / arXiv:2212.04356.
Baevski, A., et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. NeurIPS 2020.
Gulati, A., et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100.

Tags: speech-recognition, asr, nlp, audio-ai

Type	AI application technology
Also known as	Automatic Speech Recognition (ASR), Speech-to-Text (STT)
Key architectures	Conformer, Transformer, LSTM
Leading models	Whisper, wav2vec 2.0, Google USM
Key use	Transcription, voice interfaces, accessibility
Related	Natural language processing, Text-to-speech, Deep learning