Sentence Transformers

Sentence Transformers are neural network models that encode sentences, paragraphs, or short documents into fixed-length dense vector embeddings optimised for semantic similarity comparison.

6 min read | Last updated June 2026 | Infrastructure

Sentence Transformers are a class of neural network models that transform variable-length text sequences — sentences, paragraphs, or short documents — into fixed-length dense vector embeddings that capture semantic meaning. These embeddings are positioned in a continuous vector space such that semantically similar texts are geometrically close, enabling efficient similarity comparison via cosine similarity or dot product. Sentence Transformers underpin many modern applications in semantic search, retrieval-augmented generation, clustering, duplicate detection, and cross-lingual information retrieval.
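As a minimal sketch of the comparison step, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors are hypothetical stand-ins chosen for illustration; real models emit embeddings with hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)   # related texts: close to 1
sim_far = cosine_similarity(cat, invoice)    # unrelated texts: close to 0
```

When embeddings are scaled to unit length beforehand, the dot product alone gives the same value, which is why many systems normalise once at indexing time.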

Background and Motivation

Standard BERT-based models, introduced by Devlin et al. in 2018, produce contextualised token embeddings but are not designed to produce a single vector representation of an entire sentence. Naively averaging token embeddings or using the [CLS] token embedding from BERT yields poor sentence representations, often performing worse than static, non-contextual baselines such as averaged GloVe embeddings on semantic textual similarity (STS) benchmarks.

Moreover, computing sentence similarity with vanilla BERT requires a cross-encoder forward pass over every candidate pair, so the number of inferences grows as O(n²) in the corpus size. For a corpus of 10,000 sentences, finding the most similar pair requires n(n - 1)/2 = 49,995,000 (roughly 50 million) BERT inferences, which Reimers and Gurevych estimate at approximately 65 hours on a V100 GPU. Sentence Transformers reduce this to a single embedding pass per sentence, typically milliseconds each, after which similarity is a simple vector dot product. Similarity search across ten thousand sentences then takes seconds rather than hours.
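The arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch: the throughput value is simply what the quoted 65 hours implies, not a measured benchmark.

```python
def pair_count(n):
    """Number of unordered sentence pairs in a corpus of n sentences."""
    return n * (n - 1) // 2

n = 10_000
pairs = pair_count(n)                    # cross-encoder forward passes needed
quoted_hours = 65                        # figure quoted in the text
implied_pairs_per_second = pairs / (quoted_hours * 3600)

# A bi-encoder needs only one encoder pass per sentence; the pairwise
# comparisons that remain are cheap vector operations, not model inferences.
encoder_passes = n
reduction_factor = pairs / encoder_passes
```

For 10,000 sentences this gives 49,995,000 pairs against 10,000 encoder passes, a reduction of roughly 5,000 times in the number of transformer inferences.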

Architecture

The foundational Sentence-BERT (SBERT) architecture, proposed by Reimers and Gurevych in their 2019 paper published at EMNLP, uses a siamese or triplet network structure built on top of BERT or RoBERTa.

In the siamese configuration, two identical encoder towers (sharing weights) each process one of two input sentences independently. Each encoder produces token embeddings, which are then aggregated by a pooling operation to yield a single fixed-size sentence embedding; mean pooling over all token embeddings is the default and performed best in the original experiments. The training objective depends on the data. For natural language inference labels (entailment, contradiction, neutral), the two embeddings u and v are concatenated with their element-wise difference |u - v| and passed to a softmax classifier trained with cross-entropy loss. For sentence similarity scores from STS datasets, the cosine similarity between the two embeddings is computed and trained with a mean-squared-error regression loss.
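The pooling step can be sketched without any framework. The example below averages token vectors while skipping padding positions; the attention mask is an assumption added here because batched inputs are usually padded to a common length, and the two-dimensional token vectors are made up for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens only)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]

# Hypothetical encoder output for one sentence: two real tokens, one padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)  # [2.0, 3.0]
```

Whatever the sentence length, the output has the same dimensionality as a single token vector, which is what makes the embeddings directly comparable.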

The resulting model produces sentence embeddings that can be pre-computed and indexed. At inference time, only the query sentence requires a new encoder pass; all corpus embeddings are retrieved from the index.
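A minimal sketch of this precompute-then-query pattern follows, using exhaustive search over a small in-memory list. The function names `build_index` and `search` and the toy vectors are hypothetical; larger deployments typically replace the linear scan with an approximate nearest-neighbour index.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def build_index(corpus_embeddings):
    """Done once, offline: store unit-length corpus embeddings."""
    return [normalize(v) for v in corpus_embeddings]

def search(index, query_embedding, top_k=2):
    """Done per query: score against every stored vector, return the best."""
    q = normalize(query_embedding)
    scored = [(sum(a * b for a, b in zip(q, doc)), i) for i, doc in enumerate(index)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_k]]

# Hypothetical corpus embeddings; in practice these come from the encoder.
corpus = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.3, 0.1]]
index = build_index(corpus)

# Only the query needs a fresh encoder pass; here it is a made-up vector.
hits = search(index, [1.0, 0.0, 0.0], top_k=2)
```

Here `hits` holds `(document_index, score)` pairs ordered from most to least similar.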

Training Objectives

Multiple training objectives have been developed for sentence transformer models depending on the intended downstream task.

Classification-based fine-tuning trains the model to predict the semantic relationship between a sentence pair, with cross-entropy loss over NLI labels. Regression fine-tuning optimises the cosine similarity of a pair against human-annotated similarity ratings, such as those in STS datasets. Triplet loss training uses anchor-positive-negative triplets, pulling the embedding of a positive (semantically similar) example closer to the anchor than the embedding of a negative (semantically dissimilar) example by at least a margin. Contrastive learning with in-batch negatives, used in models such as SimCSE and E5, treats all other examples in a training batch as negatives for each anchor, enabling efficient training on large-scale text pairs.
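Two of these objectives can be written down compactly. The sketch below is illustrative only: it assumes Euclidean distance with a margin of 1.0 for the triplet loss, and unit-length embeddings with a temperature of 0.05 for the in-batch loss. Those values are assumptions for the example, and real training code computes the same quantities on batched tensors with automatic differentiation.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Cross-entropy where positives[i] is the target for anchors[i] and every
    other positive in the batch acts as a negative. Assumes unit-length vectors."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical toy embeddings.
anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
loss_good = triplet_loss(anchor, positive, negative)   # ordering already correct
loss_bad = triplet_loss(anchor, negative, positive)    # roles swapped

batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(batch, batch)
misaligned = in_batch_contrastive_loss(batch, list(reversed(batch)))
```

Both losses are small when each anchor sits near its own positive and far from the alternatives, and grow when that ordering is violated.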

Model Families

The sentence-transformers library, created by Nils Reimers at the UKP Lab and now maintained by Hugging Face, gives access to hundreds of pre-trained models hosted on the Hugging Face Hub. Notable families include all-MiniLM-L6-v2, a compact 22M-parameter model offering strong performance with low latency; all-mpnet-base-v2, a higher-quality 109M-parameter model; multi-qa variants tuned for question-answer retrieval; and multilingual variants that support retrieval across multiple languages.
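For orientation, a minimal sketch of loading one of these models with the sentence-transformers package is shown below. It assumes the package is installed (`pip install sentence-transformers`) and that the model weights can be downloaded on first use. The wrapper function `embed_and_compare` is a hypothetical helper, not part of the library.

```python
def embed_and_compare(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode sentences and return their pairwise cosine-similarity matrix.

    Hypothetical helper for illustration. Requires the sentence-transformers
    package; model weights are fetched from the Hugging Face Hub on first use.
    """
    # Imported lazily so this file can be loaded without the dependency.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences, convert_to_tensor=True)
    return util.cos_sim(embeddings, embeddings)
```

Calling `embed_and_compare(["How do I reset my password?", "I forgot my login credentials."])` would return a 2 by 2 matrix whose off-diagonal entries give the similarity between the two sentences. Swapping `model_name` for a multilingual model is the usual way to handle mixed-language input.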

Several commercial embedding APIs provide sentence transformer functionality as a managed service. OpenAI's text-embedding-3 series, Cohere Embed, Google's Gecko and Gemini embeddings, and Voyage AI all expose embedding endpoints. These APIs offer convenience and high-quality multilingual coverage at the cost of data egress and per-token pricing.

Applications

Semantic search and hybrid search pipelines use sentence transformers as the dense retrieval component, encoding both queries and corpus documents into a shared embedding space and retrieving the most similar documents by approximate nearest-neighbour search in a vector database such as Pinecone, Weaviate, or Qdrant.

Retrieval-augmented generation (RAG) systems rely on sentence transformers to embed knowledge base chunks and retrieve the most relevant passages to include in the LLM prompt. The quality of the embedding model is a primary determinant of RAG system accuracy.

Duplicate detection and question deduplication in customer support or forum systems use sentence transformer embeddings to cluster similar queries, routing duplicates to existing answers. Cross-lingual semantic search uses multilingual sentence transformers to retrieve documents in a different language from the query language, useful in multilingual document collections.
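One simple way to group near-duplicate questions is greedy threshold matching, sketched below. The 0.9 threshold, the function name `group_duplicates`, and the toy vectors are assumptions for illustration. A production system would tune the threshold on labelled data and might use a proper clustering algorithm instead.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def group_duplicates(embeddings, threshold=0.9):
    """Greedy grouping: each item joins the first group whose representative
    (its first member) is at least `threshold` similar, else starts a new group."""
    groups = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine(emb, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Hypothetical embeddings of four support questions: 0 and 1 are paraphrases,
# as are 2 and 3.
questions = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.02],
    [0.00, 0.10, 0.90],
    [0.05, 0.10, 0.85],
]
groups = group_duplicates(questions)
```

Each resulting group can then be routed to the answer already attached to its first member.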

Evaluation

The primary evaluation benchmark for sentence transformers is the Semantic Textual Similarity (STS) suite, which measures Spearman correlation between model-predicted similarity scores and human annotations. The MTEB (Massive Text Embedding Benchmark) provides a more comprehensive evaluation across eight task types, including retrieval, clustering, classification, and STS, spanning 58 datasets and 112 languages.
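Spearman correlation is the Pearson correlation of the two rank orderings, so it rewards a model for ordering pairs the way annotators do regardless of the absolute score scale. A minimal stdlib sketch, with made-up scores for four sentence pairs:

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # undefined if either input is constant

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: human ratings on a 0-5 scale and model cosine scores.
human = [5.0, 3.2, 0.5, 4.1]
model = [0.92, 0.60, 0.10, 0.75]
rho = spearman(human, model)
```

Because the model orders the four pairs exactly as the annotators do, `rho` comes out at 1 even though the two scales differ.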

Malaysian Context — Multilingual Embeddings and Local Deployment

Sentence Transformers occupy a pivotal role in Malaysian AI infrastructure because they are the embedding backbone of semantic search and RAG systems. The multilingual dimension is particularly significant: Malaysia's linguistic environment requires embedding models that handle Bahasa Malaysia and English interchangeably, and ideally Mandarin and Tamil as well.

Most general-purpose sentence transformer models are trained predominantly on English, with multilingual models such as paraphrase-multilingual-mpnet-base-v2 and LaBSE (Language-Agnostic BERT Sentence Encoder) offering broader coverage. LaBSE, developed by Google Research, covers 109 languages including Malay and is openly available. Malaysian AI practitioners have used LaBSE in academic and industry projects requiring bilingual Malay-English search. However, benchmarks on Malay-specific STS tasks show that cross-lingual transfer performance remains below English-only performance, motivating ongoing research at Malaysian universities into domain-adapted Malay embedding models.

MIMOS Berhad, a government research entity under MOSTI, has historically led Malay language technology development in Malaysia. Researchers at Universiti Teknologi Malaysia (UTM) and Universiti Malaya (UM) have published work on Malay word and sentence embeddings. The Hugging Face community has seen contributions of Malay-specific models under the mesolitica and malaya-speech organisations, with mesolitica's malaya-electra and related models providing locally trained alternatives for Bahasa Malaysia.

For Malaysian enterprises deploying RAG over document repositories such as legal case files, corporate policies, bank circulars, and government regulations, the choice of embedding model directly affects retrieval accuracy. Maybank, CIMB, Petronas, and Telekom Malaysia, each managing large internal knowledge bases, have an interest in embedding infrastructure that can handle domain-specific vocabulary in multiple languages. Cloud providers operating in Malaysia, including AWS with its Amazon Bedrock embedding models and Microsoft Azure OpenAI Service, provide managed embedding APIs accessible from Malaysian data centres. This simplifies deployment but raises data residency considerations for organisations subject to Bank Negara Malaysia's data localisation guidance.

References

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of EMNLP 2019.
Feng, F., et al. (2022). Language-agnostic BERT Sentence Embedding. Proceedings of ACL 2022.
Muennighoff, N., et al. (2023). MTEB: Massive Text Embedding Benchmark. Proceedings of EACL 2023.
Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of EMNLP 2021.
Wang, L., et al. (2022). Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533.

Tags: embeddings, semantic-search, nlp, sbert, sentence-similarity

Type	Embedding Model Family
Original paper	Reimers & Gurevych, 2019 (SBERT)
Architecture	Siamese Transformer with pooling
Key use	Semantic search, clustering, RAG, duplicate detection
Hub	Hugging Face sentence-transformers library
Related	Embedding, BERT, Semantic Search, RAG, Vector Database