AIWiki
Malaysia

SEA-LION

SEA-LION (Southeast Asian Languages In One Network) is an open-source family of large language models developed by AI Singapore to serve the languages and cultures of Southeast Asia.

5 min readLast updated June 2026Models

SEA-LION, short for Southeast Asian Languages In One Network, is an open-source family of large language models developed by AI Singapore (AISG) and released from 2023 onwards. The project is purpose-built to represent the languages, scripts, and cultural contexts of Southeast Asia, a region whose languages are under-represented in the training data of most globally dominant models. Rather than competing with general-purpose systems such as GPT-4, Claude, or Gemini on broad capability, SEA-LION occupies a gap that large international developers have limited incentive to fill and that most regional organisations lack the compute to address independently.

Languages and training data

SEA-LION centres its training on eleven Southeast Asian languages, including English, Chinese, Malay, Indonesian, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao. The pretraining corpus comprises roughly one trillion tokens, with Southeast Asian languages deliberately over-represented relative to their share in conventional web-scraped datasets. This design choice improves the model's fluency, cultural grounding, and handling of code-switching, a common feature of everyday communication across the region.

To support this work, AISG also produced the Southeast Asian Languages in One Network Data (SEALD) collection, a curated and cleaned multilingual dataset assembled in collaboration with regional partners. High-quality regional data is the principal constraint on building models of this kind, and dataset construction is therefore a core part of the SEA-LION programme rather than an afterthought.

Model versions and architecture

The SEA-LION family has progressed through several generations. Early versions used an in-house transformer architecture trained from scratch. From version 3 onwards, AISG adopted a continued-pretraining strategy built on strong open base models, including Meta's Llama 3 and Google's Gemma, which are further trained on the SEA-LION corpus to inject regional language competence. This approach allows the project to benefit from the engineering investment behind frontier open models while concentrating its own resources on regional adaptation.

The family spans multiple sizes, typically around 3 billion, 7 to 9 billion, and 70 billion parameters, and multiple variants including base models, instruction-tuned chat models, and configurations adapted for retrieval-augmented generation. By 2026, the flagship 70B model represented one of the most capable openly available foundations oriented specifically toward Southeast Asia.

Availability and use

SEA-LION models are distributed openly. Weights are published on Hugging Face and can also be accessed through the official sea-lion.ai application programming interface. The permissive licensing, generally MIT or Apache 2.0 depending on the base model, makes the models suitable for commercial deployment as well as research. Typical applications include multilingual chatbots, government and public-service tools, translation and summarisation for regional languages, and as a base for organisations that wish to fine-tune a model on their own local data.

Significance for the region

SEA-LION is frequently described as one of the first genuinely Southeast-Asia-oriented open large language model foundations. Its importance lies less in raw benchmark scores than in linguistic inclusion: by treating low-resource regional languages as first-class rather than as a long tail, the project provides a shared infrastructure that national initiatives, universities, and companies across ASEAN can build on. A related effort, SeaLLM, pursues similar goals, and the two are often discussed together as anchors of a regional model ecosystem.

References

  1. AI Singapore. (2024). SEA-LION: Southeast Asian Languages In One Network. https://sea-lion.ai/
  2. AI Singapore. (2024). SEA-LION GitHub Repository. https://github.com/aisingapore/sealion
  3. NVIDIA Developer Blog. (2024). Regional LLMs SEA-LION and SeaLLM Serve Languages and Cultures of Southeast Asia.
  4. Computer Weekly. (2024). Sea-Lion explained: Southeast Asia's first large language model.