What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

Embedding

An embedding is a dense numerical vector representation of data — such as text, images, or audio — that encodes semantic meaning in a continuous high-dimensional space, enabling machine learning models to measure similarity and relationships.

6 min read | Last updated May 2026 | Foundations

An embedding is a learned mapping from a discrete or high-dimensional object (a word, sentence, document, image, or structured record) into a continuous, dense vector of fixed dimensionality. The defining property of a well-trained embedding is that objects with similar meaning or function are located close together in the resulting vector space, while dissimilar objects are far apart. This geometric encoding of semantic relationships lets machine learning systems perform operations that keyword-based methods handle poorly: computing textual similarity, retrieving relevant documents by meaning rather than exact keyword match, and powering the retrieval layer in architectures such as retrieval-augmented generation (RAG).

Historical Background

The modern conception of embeddings in natural language processing originates with Yoshua Bengio and colleagues' neural probabilistic language model in 2003, which jointly learned word representations and a language model. The technique gained broad popularity with Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013. Word2Vec demonstrated that simple neural objectives — predicting a word from its context (CBOW) or a context from a word (skip-gram) — yield vectors with remarkable algebraic properties: the famous example that vector("king") − vector("man") + vector("woman") ≈ vector("queen") illustrated that relational meaning is geometrically encoded.
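The analogy arithmetic can be illustrated with a toy sketch. The vectors below are hand-made and hypothetical (three axes loosely standing for royalty, maleness, and femaleness); they are not taken from a trained Word2Vec model, whose vectors are learned and have hundreds of dimensions.

```python
import math

# Hypothetical hand-made vectors; axes loosely mean (royalty, male, female).
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates, as is
    conventional when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

With learned embeddings the match is only approximate, which is why the nearest neighbour of the target vector is taken rather than an exact equality.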

GloVe (Global Vectors for Word Representation), released by Stanford in 2014, trains on global word co-occurrence statistics aggregated over the whole corpus rather than on local context windows alone. These static word embeddings, which assign each word a single vector, were largely superseded by contextual embeddings from ELMo (built on bidirectional LSTMs), BERT, and subsequent transformer models, in which a word's representation changes with its surrounding context, capturing polysemy and nuance that static embeddings cannot represent.

How Embeddings Are Generated

Modern embedding models are typically pre-trained on large text corpora using a contrastive or self-supervised objective. A sentence transformer model, for example, encodes an input sentence through multiple transformer layers and pools the resulting token representations into a single fixed-size vector. During training, the model is rewarded for placing semantically similar pairs close together in the embedding space (measured by cosine similarity or dot product) and dissimilar pairs far apart — a framework known as contrastive learning.
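The pooling and contrastive steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code of any particular model: the function names are hypothetical, the loss is an InfoNCE-style in-batch objective (an assumption, since the text does not name a specific loss), and the temperature value is arbitrary.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss.

    anchors[i] should match positives[i]; every other positive in the
    batch is treated as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the true pair
    return total / len(anchors)

# Toy 2-dimensional "sentence vectors" (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)  # True: correctly paired batches get a lower loss
```

Minimising this loss pushes each anchor toward its own positive and away from the other items in the batch, which is the behaviour described above.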

For specialised domains — code, biomedical text, legal documents — domain-adapted embedding models fine-tuned on in-domain corpora generally outperform general-purpose models. Multi-modal embedding models such as CLIP (Contrastive Language–Image Pre-training) from OpenAI jointly embed images and text into a shared space, enabling cross-modal retrieval.

Dimensionality and Distance Metrics

Embedding vectors typically range from 384 to 4096 dimensions depending on the model. The choice of distance metric is important: cosine similarity, which measures the cosine of the angle between two vectors and therefore ignores their magnitudes, is the most common choice for semantic similarity tasks. Dot product and Euclidean distance are also sensitive to vector length, so they behave differently from cosine similarity when vectors have not been normalised; for unit-length vectors, all three metrics produce the same nearest-neighbour ranking. Dimensionality reduction techniques such as principal component analysis (PCA) and UMAP are used to project high-dimensional embeddings into two or three dimensions for visualisation.
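A small sketch may help show how the three metrics differ. The vectors are arbitrary two-dimensional examples chosen so the arithmetic is easy to follow; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # same length as a, different direction

print(cosine_similarity(a, b))   # 1.0: direction only, length is ignored
print(dot(a, b), dot(a, c))      # 50.0 24.0: dot product grows with length
print(euclidean_distance(a, b))  # 5.0: non-zero although directions match

# After unit-normalisation, squared Euclidean distance equals 2 - 2 * cosine,
# so both metrics rank neighbours identically.
lhs = euclidean_distance(unit(a), unit(c)) ** 2
rhs = 2 - 2 * cosine_similarity(a, c)
print(abs(lhs - rhs) < 1e-9)     # True
```

This is why many pipelines normalise embeddings once at indexing time and then use the cheaper dot product at query time.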

Types of Embeddings

Text embeddings can be generated at word, sentence, or document level. Word-level embeddings represent individual tokens; sentence embeddings capture the meaning of entire sentences, making them well suited to semantic search. Image embeddings, produced by convolutional or vision transformer encoders, represent visual content for tasks such as image retrieval and multi-modal search. Graph embeddings encode node relationships in knowledge graphs, enabling link prediction and entity resolution.

Applications

Semantic Search

Semantic search replaces traditional keyword matching with embedding-based nearest-neighbour retrieval. A query is embedded into the vector space, and the system retrieves documents whose embeddings are most similar, surfacing results that are conceptually related even when they share no exact keywords with the query.
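The retrieval step can be sketched as a brute-force nearest-neighbour search. The document and query vectors below are hypothetical stand-ins for the output of an embedding model, and the `search` helper is an illustrative name, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings; a real system would obtain these
# from an embedding model and store them in an index.
DOC_EMBEDDINGS = {
    "How to reset your online banking password": [0.9, 0.1, 0.0],
    "Branch opening hours during public holidays": [0.1, 0.9, 0.1],
    "Fixed deposit interest rates": [0.0, 0.2, 0.9],
}

def search(query_vector, doc_embeddings, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine(query_vector, vec), text) for text, vec in doc_embeddings.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Stand-in for the embedding of "I forgot my login credentials", a query
# that shares no keywords with the best-matching document.
query_vector = [0.8, 0.2, 0.1]
results = search(query_vector, DOC_EMBEDDINGS)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Brute-force scoring is fine for small collections; large collections typically rely on an approximate nearest-neighbour index instead.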

Retrieval-Augmented Generation

In RAG systems, a document corpus is split into chunks, pre-embedded, and stored in a vector database. At inference time, a user query is embedded and used to retrieve the most relevant document chunks, which are then passed as context to a large language model. The quality of the embedding model strongly influences the quality of the retrieved context and, consequently, the accuracy of the generated answer.
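The index, retrieve, and prompt-assembly steps can be sketched as follows. Everything here is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the prompt wording is arbitrary. The call to the language model itself is left out.

```python
import math
import zlib

DIM = 256  # size of the toy embedding space

def toy_embed(text):
    """Stand-in embedder: hashed bag-of-words, unit-normalised.

    A real system would call an embedding model here; this version only
    captures word overlap, not meaning.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip(".,?!:;")
        if word:
            vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(chunks):
    """Pre-embed every chunk once, ahead of query time."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query, index, k=2):
    """Return the k chunks whose embeddings best match the query."""
    q = toy_embed(query)
    ranked = sorted(
        index,
        key=lambda item: sum(a * b for a, b in zip(q, item[1])),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Annual leave must be requested two weeks in advance.",
    "Expense claims are reimbursed within thirty days.",
    "The office is closed on public holidays.",
]
index = build_index(chunks)
question = "How far in advance should annual leave be requested?"
prompt = build_prompt(question, retrieve(question, index, k=1))
print(prompt)
```

Swapping `toy_embed` for a trained embedding model is the only change needed to move from word overlap to semantic matching; the surrounding pipeline stays the same.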

Recommendation Systems

E-commerce and content platforms encode user behaviour histories and item attributes as embeddings. Recommendation is then framed as a nearest-neighbour search in the embedding space: items similar to those a user has interacted with are ranked highly.
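A minimal sketch of this framing, assuming a user profile built as the mean of the embeddings of items the user has interacted with. The item vectors are hypothetical, and production systems usually learn user and item embeddings jointly rather than averaging.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-dimensional item embeddings.
ITEMS = {
    "running shoes": [0.9, 0.1],
    "trail shoes": [0.8, 0.3],
    "yoga mat": [0.6, 0.6],
    "rice cooker": [0.1, 0.9],
    "blender": [0.2, 0.8],
}

def user_vector(history, items):
    """Represent a user as the mean of the items they interacted with."""
    vectors = [items[name] for name in history]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def recommend(history, items, k=2):
    """Rank unseen items by similarity to the user's profile vector."""
    profile = user_vector(history, items)
    candidates = [name for name in items if name not in history]
    candidates.sort(key=lambda name: cosine(profile, items[name]), reverse=True)
    return candidates[:k]

print(recommend(["running shoes"], ITEMS))  # ['trail shoes', 'yoga mat']
```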

Anomaly Detection and Classification

Embeddings can be used to detect outliers by identifying vectors that lie far from any cluster centre in the embedding space. Classification heads — simple linear layers trained on top of frozen embeddings — are a standard technique for adapting general-purpose embeddings to specific classification tasks with limited labelled data.
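Both ideas can be sketched on toy data. The embeddings, the distance threshold, and the training hyperparameters below are hypothetical values chosen for illustration; the classification head is a plain logistic-regression layer trained with stochastic gradient descent while the embeddings themselves stay fixed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Mean vector of a cluster."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def is_outlier(vector, centroids, threshold):
    """Flag a vector that lies far from every cluster centre."""
    return min(euclidean(vector, c) for c in centroids) > threshold

def train_linear_head(embeddings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on frozen embeddings (labels 0 or 1)."""
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(x, weights, bias):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Hypothetical frozen embeddings for two classes.
class_a = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
class_b = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]

centroids = [centroid(class_a), centroid(class_b)]
print(is_outlier([5.0, 5.0], centroids, threshold=0.5))    # True
print(is_outlier([0.88, 0.12], centroids, threshold=0.5))  # False

weights, bias = train_linear_head(class_a + class_b, [0, 0, 0, 1, 1, 1])
print(predict([0.9, 0.1], weights, bias), predict([0.1, 0.9], weights, bias))  # 0 1
```

Because only the small head is trained, a handful of labelled examples per class can be enough when the underlying embeddings already separate the classes well.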

Malaysian Context — Embeddings in Enterprise and Government AI

Embeddings are foundational to a range of AI deployments across Malaysia's enterprise and public sectors, particularly as organisations invest in retrieval-augmented generation (RAG) systems that combine large language models with proprietary knowledge bases. Malaysian financial institutions such as Maybank and RHB Bank have built internal document search and compliance tools using embedding-based retrieval, allowing analysts to query large repositories of regulatory circulars, policy documents, and research reports by semantic meaning rather than exact text match.

The deployment of Amazon Bedrock's embedding models in the Asia Pacific (Malaysia) region from 2025 has lowered the barrier for Malaysian enterprises to access enterprise-grade embedding APIs without requiring local GPU infrastructure. AWS Partner Axrail, which operates Malaysia and Southeast Asia's first Generative AI Laboratory, provides implementation support for embedding pipelines built on Bedrock, assisting organisations in deploying semantic search and RAG solutions using AWS-managed infrastructure.

In the public sector, Malaysia's MyDIGITAL blueprint emphasises AI-powered services that can deliver personalised information to citizens. Embedding-based search systems are directly applicable to government knowledge bases and public service chatbots, enabling queries in Bahasa Malaysia and English to retrieve accurate information across multiple domains. Malaysia's Personal Data Protection Act (PDPA) and national AI governance guidelines place transparency expectations on how AI systems process and retrieve personal information, and these considerations extend to embedding pipelines that index personal communications or user data.

Malaysian universities and research institutions are active contributors to embedding research. Groups at Universiti Malaya and the Institute of Electrical and Electronics Engineers (IEEE) Malaysia Section publish on multilingual embedding models for Bahasa Malaysia, reflecting the practical need for embeddings that accurately encode meaning in local languages. The development of Bahasa Malaysia embedding models remains an area of active research, as general-purpose English-dominant models show reduced performance on Malay text due to underrepresentation in pre-training corpora.

References

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019.
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP 2019.
Amazon Web Services. (2025). Amazon Bedrock now available in Asia Pacific (Malaysia). AWS Announcements.
IBM. (2025). What is vector embedding. IBM Think. https://www.ibm.com/think/topics/vector-embedding

Tags: embedding, vector representation, semantic search, NLP

Type	Data representation technique
Output	Dense floating-point vector (e.g. 768 or 1536 dimensions)
Key use	Semantic search, RAG, recommendation, classification
Common models	text-embedding-3 (OpenAI), E5, BGE, Cohere Embed
Related	Vector database, Retrieval-augmented generation, Semantic search