MaLLaM (Malaysia Large Language Model)

MaLLaM is a family of large language models developed by Malaysian startup Mesolitica, pretrained from scratch on Malay-language data to understand Malaysian dialects, colloquialisms, and regional languages.

Last updated June 2026

MaLLaM (Malaysia Large Language Model) is a family of generative language models built by Mesolitica, a Malaysian artificial intelligence startup, to address the under-representation of the Malay language and Malaysian cultural context in mainstream large language models. Whereas globally dominant models are trained predominantly on English-language data and treat Malay as a minor language, MaLLaM was pretrained from scratch on a Malay-centric corpus, allowing it to capture local slang, colloquialisms, code-switching, and regional dialects with greater fidelity. The associated research paper, MaLLaM — Malaysia Large Language Model, was published in 2024.

Motivation

Malaysia is a multilingual society in which Bahasa Malaysia is the national language, used alongside English, Mandarin, Tamil, and numerous regional and indigenous languages. General-purpose models trained on web-scale English data frequently misinterpret Malaysian expressions, mishandle formal versus informal registers (bahasa baku versus bahasa pasar), and fail to reflect local cultural and institutional knowledge. MaLLaM was created to provide a foundation model that understands these nuances, supporting applications such as customer service assistants, content generation, and document analysis tailored to Malaysian users.

Architecture and Training

MaLLaM was released in three sizes, with approximately 1.1 billion, 3 billion, and 5 billion parameters, each following a decoder-only transformer architecture similar to other contemporary language models. The models were pretrained on a Malay-specific dataset reported at roughly 349 gigabytes of text, equivalent to about 90 billion tokens, drawn from a wide range of Malay sources. The training corpus was assembled and cleaned to emphasise Malaysian content rather than relying on translated or incidental Malay text found in multilingual datasets.
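As a rough sanity check on the reported corpus figures, the two numbers can be combined into an average number of text bytes per token. The short Python sketch below does only that arithmetic. The function name bytes_per_token is illustrative, and because the source does not say whether "gigabytes" means 10^9 or 2^30 bytes, both readings are computed as assumptions.

```python
def bytes_per_token(corpus_bytes: float, token_count: float) -> float:
    """Average number of raw text bytes represented by one token."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return corpus_bytes / token_count


CORPUS_GB = 349   # reported corpus size (approximate)
TOKENS = 90e9     # reported token count (approximate)

# Assumption: the source does not state which definition of "gigabyte" applies.
decimal = bytes_per_token(CORPUS_GB * 10**9, TOKENS)  # 1 GB = 10^9 bytes
binary = bytes_per_token(CORPUS_GB * 2**30, TOKENS)   # 1 GB = 2^30 bytes

print(f"{decimal:.2f} bytes/token (decimal gigabytes)")  # 3.88
print(f"{binary:.2f} bytes/token (binary gigabytes)")    # 4.16
```

Under either reading the figures imply roughly four bytes of text per token. Both inputs are rounded, so the result is an order-of-magnitude check rather than a measured property of the tokenizer.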

Mesolitica subsequently produced instruction-tuned variants of MaLLaM, fine-tuned to follow user instructions for chat and assistant use cases. In published comparisons on Malay-language evaluations, the instruction-tuned MaLLaM models captured Malaysian linguistic nuance better than both the general-purpose GPT-3.5 and a Malaysian-adapted Mistral baseline.

Infrastructure and Efficiency

A notable aspect of MaLLaM's development was its use of cost-efficient training infrastructure. Mesolitica trained and served the models on Amazon Web Services using AWS Trainium and AWS Inferentia accelerators rather than conventional GPUs. According to AWS, this approach cut compute costs by around 87 percent and increased training throughput several-fold compared with the startup's prior setup, illustrating how smaller organisations outside the largest AI labs can build foundation models economically.
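To make the reported percentage concrete, the Python sketch below converts a fractional cost reduction into the remaining cost and the equivalent "times cheaper" ratio. The baseline of 100 cost units is a hypothetical placeholder, not a figure from AWS or Mesolitica, and the function name is illustrative.

```python
def cost_after_reduction(baseline_cost: float, reduction: float) -> float:
    """Cost remaining after a fractional reduction (0.87 means 87 percent)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return baseline_cost * (1.0 - reduction)


REPORTED_REDUCTION = 0.87   # approximate figure attributed to AWS
baseline = 100.0            # hypothetical baseline, in arbitrary cost units

new_cost = cost_after_reduction(baseline, REPORTED_REDUCTION)
cost_ratio = baseline / new_cost

print(f"remaining cost: {new_cost:.1f} units")          # 13.0
print(f"baseline is {cost_ratio:.1f}x the new cost")    # 7.7
```

An 87 percent reduction therefore leaves about 13 percent of the original spend, which is the same as saying the earlier setup cost roughly 7.7 times as much for the same work.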

Capabilities and Applications

MaLLaM is designed for natural language understanding and generation in Malaysian contexts. Reported capabilities include comprehension of Malay nuances spanning multiple state dialects (such as those of Johor, Kedah, Kelantan, and Sarawak) and several regional languages, making it suitable for AI assistants in customer service, government communication, content creation, and data analysis. By providing a locally grounded base model, MaLLaM also serves as a foundation that Malaysian enterprises and developers can fine-tune for domain-specific tasks without starting from a model that lacks Malay fluency.

Significance

MaLLaM is widely cited as one of the first large language models pretrained specifically for the Malaysian context, and it features prominently in discussions of Malaysian model sovereignty. Its open documentation and accessibility through Mesolitica's Malaya ecosystem have made it a reference point for Bahasa Malaysia natural language processing and a practical demonstration that nations with smaller AI ecosystems can develop competitive local models.

References

  1. Zolkepli, H., Razak, A., et al. (2024). MaLLaM — Malaysia Large Language Model. arXiv:2401.14680.
  2. Amazon Web Services. (2024). Mesolitica Builds Malaysian Large Language Model for Generative AI Assistants on AWS. press.aboutamazon.com.
  3. CDO Magazine. (2024). Mesolitica Launches Malaysia's First Localized GenAI Model MaLLaM. cdomagazine.tech.
  4. TNGlobal. (2024). Malaysia's AI startup Mesolitica builds Malaysian LLM for generative AI assistants on AWS. technode.global.