
Sentence Transformers

Sentence Transformers are neural network models that encode sentences, paragraphs, or short documents into fixed-length dense vector embeddings optimised for semantic similarity comparison.

6 min read | Last updated June 2026 | Infrastructure

Sentence Transformers are a class of neural network models that transform variable-length text sequences — sentences, paragraphs, or short documents — into fixed-length dense vector embeddings that capture semantic meaning. These embeddings are positioned in a continuous vector space such that semantically similar texts are geometrically close, enabling efficient similarity comparison via cosine similarity or dot product. Sentence Transformers underpin many modern applications in semantic search, retrieval-augmented generation, clustering, duplicate detection, and cross-lingual information retrieval.
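As a minimal sketch of the similarity comparison described above, the following computes cosine similarity with the standard library only. The three 4-dimensional vectors are made-up stand-ins for real embeddings (which typically have hundreds of dimensions), and the names `cat`, `kitten`, and `invoice` are purely illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical embeddings, invented for illustration only.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.8, 0.1]

# The related pair scores higher than the unrelated pair.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

When embeddings are normalised to unit length beforehand, the dot product alone gives the same value as cosine similarity, which is why both measures appear interchangeably in practice.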

Background and Motivation

Standard BERT-based models, introduced by Devlin et al. in 2018, produce contextualised token embeddings but are not designed to produce a single vector representation of an entire sentence. Naively averaging token embeddings or using the [CLS] token embedding from BERT yields sentence representations of poor quality, often performing worse than simple baselines such as averaged GloVe word embeddings on semantic textual similarity (STS) benchmarks.

Moreover, computing sentence similarity with vanilla BERT requires a cross-encoder forward pass over every candidate pair, which scales as O(n²) in the corpus size. For a corpus of 10,000 sentences, finding the most similar pair implies n(n-1)/2 = 49,995,000 (roughly 50 million) BERT inferences, which Reimers and Gurevych estimate at approximately 65 hours on a single GPU. Sentence Transformers reduce this to a single embedding pass per sentence, typically milliseconds each, after which similarity is a simple vector dot product. Similarity search across ten thousand sentences then takes seconds rather than hours.
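The arithmetic behind the pairwise figure can be checked directly. This is a small sketch; the function names `cross_encoder_pairs` and `bi_encoder_passes` are illustrative and not taken from any library.

```python
def cross_encoder_pairs(n):
    """Unordered sentence pairs a cross-encoder must score: n choose 2."""
    return n * (n - 1) // 2


def bi_encoder_passes(n):
    """Encoder passes needed when each sentence is embedded once."""
    return n


n = 10_000
print(cross_encoder_pairs(n))  # 49995000
print(bi_encoder_passes(n))    # 10000
```

The embedding approach still compares every pair, but each comparison is a cheap vector operation rather than a full transformer forward pass.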

Architecture

The foundational Sentence-BERT (SBERT) architecture, proposed by Reimers and Gurevych in their 2019 paper published at EMNLP, uses a siamese or triplet network structure built on top of BERT or RoBERTa.

In the siamese configuration, two identical encoder towers (sharing weights) each process one of two input sentences independently. Each encoder produces token embeddings, which are then aggregated by a pooling operation to yield a single sentence embedding vector; mean pooling over all token embeddings is the default and was the most effective strategy in the original experiments. The training objective then depends on the data. For natural language inference labels (entailment, contradiction, neutral), the two embeddings u and v are concatenated with their element-wise difference |u - v| and passed to a softmax classifier trained with cross-entropy loss. For sentence similarity scores from STS datasets, the cosine similarity between the two embeddings is computed and trained with a mean-squared-error regression loss.
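Mean pooling can be sketched in a few lines. This version assumes the encoder output is a list of per-token vectors plus an attention mask marking real tokens (1) versus padding (0). The mask is an assumption added here for illustration, although padded batches commonly need one so that padding does not dilute the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                summed[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [s / count for s in summed]


# Two real tokens and one padding token, with toy 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Whatever the sentence length, the output has the same dimensionality as a single token vector, which is what makes the embedding fixed-length.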

The resulting model produces sentence embeddings that can be pre-computed and indexed. At inference time, only the query sentence requires a new encoder pass; all corpus embeddings are retrieved from the index.
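The pre-compute-then-query pattern might look like the following sketch. It uses exhaustive (brute-force) scoring over a tiny in-memory list; production systems would typically swap in an approximate nearest-neighbour index. The helper names `build_index` and `search` and the 2-dimensional vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(corpus_embeddings):
    """Normalise once, offline, so query-time scoring is a plain dot product."""
    return [normalize(v) for v in corpus_embeddings]


def search(index, query_embedding, top_k=2):
    """Return (document position, score) for the top_k most similar entries."""
    q = normalize(query_embedding)
    scored = [(sum(a * b for a, b in zip(q, doc)), i) for i, doc in enumerate(index)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Made-up corpus embeddings; in practice these come from the encoder.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index = build_index(corpus)

# Only the query needs a fresh encoder pass; here it is a made-up vector.
hits = search(index, [2.0, 0.1], top_k=2)
print([i for i, _ in hits])  # [0, 2]
```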

Training Objectives

Multiple training objectives have been developed for sentence transformer models depending on the intended downstream task.

Classification-based fine-tuning trains the model to predict semantic relationships between sentence pairs, with cross-entropy loss over NLI classes. Regression fine-tuning optimises cosine similarity scores against human-annotated similarity ratings, such as those in STS datasets. Triplet loss training uses anchor-positive-negative triplets, pushing the embedding of a positive (semantically similar) example closer to the anchor than the embedding of a negative (semantically dissimilar) example by a margin. Contrastive learning with in-batch negatives, used in models such as SimCSE and E5, treats all other examples in a training batch as negatives for each anchor, enabling efficient training on large-scale text pairs.
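Two of these objectives can be written out as plain functions. This is a forward-pass sketch only (no gradients or optimiser), under stated assumptions: the triplet loss uses Euclidean distance with a margin of 1.0, and the in-batch contrastive loss assumes unit-length embeddings and a temperature of 0.05. Both defaults are illustrative choices, not values given in the text above.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy unit vectors: a correctly paired batch versus a mismatched one.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

A batch of size B yields B - 1 negatives per anchor at no extra encoding cost, which is the efficiency the in-batch formulation is known for.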

Model Families

The sentence-transformers library, originally developed by Nils Reimers at the UKP Lab (TU Darmstadt) and now maintained by Hugging Face, provides access to a large collection of pre-trained models hosted on the Hugging Face Hub. Notable models include all-MiniLM-L6-v2, a compact 22M-parameter model offering strong performance with low latency; all-mpnet-base-v2, a higher-quality 109M-parameter model; the multi-qa variants, which are tuned for question-answer retrieval; and multilingual variants that support retrieval across multiple languages.

Several commercial embedding APIs provide comparable text embedding functionality as a managed service. OpenAI's text-embedding-3 series, Cohere Embed, Google's Gecko and Gemini embeddings, and Voyage AI all expose embedding endpoints. These APIs offer convenience and high-quality multilingual coverage, at the cost of sending data to a third-party service and paying per-token pricing.

Applications

Semantic search and hybrid search pipelines use sentence transformers as the dense retrieval component, encoding both queries and corpus documents into a shared embedding space and retrieving the most similar documents by approximate nearest-neighbour search in a vector database such as Pinecone, Weaviate, or Qdrant.

Retrieval-augmented generation (RAG) systems rely on sentence transformers to embed knowledge base chunks and retrieve the most relevant passages to include in the LLM prompt. The quality of the embedding model is a primary determinant of RAG system accuracy.

Duplicate detection and question deduplication in customer support or forum systems use sentence transformer embeddings to cluster similar queries, routing duplicates to existing answers. Cross-lingual semantic search uses multilingual sentence transformers to retrieve documents in a different language from the query language, useful in multilingual document collections.
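Duplicate grouping can be sketched with a simple greedy pass over the embeddings. The threshold of 0.9 and the toy 2-dimensional vectors are assumptions for illustration; a real system would tune the threshold on labelled duplicates and would likely use a proper clustering algorithm.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def group_duplicates(embeddings, threshold=0.9):
    """Greedy grouping: each item joins the first group whose representative
    (the group's first member) is at least `threshold` similar to it."""
    groups = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine(emb, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([i])
    return groups


# Made-up query embeddings: items 0 and 1 are near-duplicates, as are 2 and 3.
queries = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.02, 0.99]]
print(group_duplicates(queries))  # [[0, 1], [2, 3]]
```

New queries that land in an existing group could then be routed to the answer already attached to that group's representative.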

Evaluation

The primary evaluation benchmark for sentence transformers is the Semantic Textual Similarity (STS) suite, which measures Spearman correlation between model-predicted similarity scores and human annotations. The MTEB (Massive Text Embedding Benchmark) provides a more comprehensive evaluation: in its original release it covers eight task types, including retrieval, clustering, classification, and STS, across 58 datasets and 112 languages.
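Spearman correlation is the Pearson correlation of the ranks, so it rewards a model for ordering pairs the same way humans do, regardless of the absolute score scale. The sketch below implements it with the standard library, assigning tied values their average rank. The `human` and `model` scores are invented for illustration and are not taken from any benchmark.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings (0 to 5) and model cosine scores for four pairs.
human = [4.8, 1.2, 3.5, 0.4]
model = [0.91, 0.30, 0.62, 0.35]
print(round(spearman(human, model), 3))  # 0.8
```

Here the model swaps the order of the two least similar pairs, which lowers the correlation from 1.0 to 0.8 even though its scores sit on a different scale from the human ratings.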

References

  1. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of EMNLP 2019.
  2. Feng, F., et al. (2022). Language-agnostic BERT Sentence Embedding. Proceedings of ACL 2022.
  3. Muennighoff, N., et al. (2023). MTEB: Massive Text Embedding Benchmark. Proceedings of EACL 2023.
  4. Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of EMNLP 2021.
  5. Wang, L., et al. (2024). Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533.