Embedding
An embedding is a dense numerical vector representation of data — such as text, images, or audio — that encodes semantic meaning in a continuous high-dimensional space, enabling machine learning models to measure similarity and relationships.
An embedding is a learned mapping from a discrete or high-dimensional object (a word, sentence, document, image, or structured record) into a continuous, dense vector of fixed dimensionality. The defining property of a well-trained embedding is that objects with similar meaning or function lie close together in the resulting vector space, while dissimilar objects lie far apart. This geometric encoding of semantic relationships lets machine learning systems perform operations that are difficult or impossible with exact keyword matching: computing textual similarity, retrieving relevant documents by meaning, and powering the retrieval layer in architectures such as retrieval-augmented generation (RAG).
Historical Background
The modern conception of embeddings in natural language processing originates with Yoshua Bengio and colleagues' neural probabilistic language model in 2003, which jointly learned word representations and a language model. The technique gained broad popularity with Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013. Word2Vec demonstrated that simple neural objectives — predicting a word from its context (CBOW) or a context from a word (skip-gram) — yield vectors with remarkable algebraic properties: the famous example that vector("king") − vector("man") + vector("woman") ≈ vector("queen") illustrated that relational meaning is geometrically encoded.
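The analogy arithmetic can be illustrated with a toy sketch. The three-dimensional vectors below are hand-built for illustration only (the axis meanings and values are assumptions, not output from Word2Vec), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hypothetical toy vectors with assumed axes [royalty, maleness, femaleness].
# Real Word2Vec vectors are learned from text and have hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c),
    excluding the three input words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the input words from the candidate set is a common convention when evaluating analogies, since the nearest vector to the target is otherwise often one of the inputs.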
GloVe (Global Vectors for Word Representation), released by Stanford in 2014, incorporated global co-occurrence statistics and offered strengths complementary to Word2Vec's local-context objectives. These static word embeddings have since been largely superseded by contextual embeddings from ELMo, BERT, and subsequent transformer models, in which a word's representation changes with its surrounding context, capturing polysemy and nuance that a single static vector per word cannot represent.
How Embeddings Are Generated
Modern embedding models are typically pre-trained on large text corpora using a contrastive or self-supervised objective. A sentence transformer model, for example, encodes an input sentence through multiple transformer layers and pools the resulting token representations into a single fixed-size vector. During training, the model is rewarded for placing semantically similar pairs close together in the embedding space (measured by cosine similarity or dot product) and dissimilar pairs far apart — a framework known as contrastive learning.
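One common way to formalise this training signal is an InfoNCE-style loss with in-batch negatives. The sketch below is a minimal illustration under that assumption, not the objective of any particular model; the function names and the `temperature` value are hypothetical choices.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    positives[i] is the matching pair for anchors[i]; every other
    positive in the batch serves as a negative for anchor i.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the true pair as the target class.
        total += log_denominator - logits[i]
    return total / len(a)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive points the same way as its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)  # True
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another item in the batch, which is the pressure that pulls similar pairs together and pushes dissimilar pairs apart.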
For specialised domains — code, biomedical text, legal documents — domain-adapted embedding models fine-tuned on in-domain corpora generally outperform general-purpose models. Multi-modal embedding models such as CLIP (Contrastive Language–Image Pre-training) from OpenAI jointly embed images and text into a shared space, enabling cross-modal retrieval.
Dimensionality and Distance Metrics
Embedding vectors typically range from 384 to 4096 dimensions depending on the model. The choice of distance metric matters. Cosine similarity, the cosine of the angle between two vectors, ignores vector magnitude and is the most common choice for semantic similarity tasks. For vectors normalised to unit length, cosine similarity equals the dot product and produces the same ranking as Euclidean distance; when vectors have not been normalised, the dot product and Euclidean distance also reflect magnitude, so the three metrics can rank results differently. Dimensionality reduction techniques such as principal component analysis (PCA) and UMAP are used to project high-dimensional embeddings into two or three dimensions for visualisation.
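The relationship between the three metrics can be checked numerically. The following is a small sketch with made-up two-dimensional vectors; for unit vectors u and v, the squared Euclidean distance equals 2 − 2·cos(u, v), which is why the metrics agree on ranking after normalisation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(a):
    """Return a copy of the vector scaled to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length
c = [4.0, 3.0]

# Cosine similarity ignores magnitude: a and b point the same way.
cos_ab = cosine_similarity(a, b)
# Dot product and Euclidean distance do not ignore magnitude.
dot_ab = dot(a, b)
dist_ab = euclidean_distance(a, b)

# After normalisation, dot product equals cosine similarity,
# and squared Euclidean distance equals 2 - 2 * cosine.
ua, uc = unit(a), unit(c)
cos_ac = cosine_similarity(a, c)
dot_unit = dot(ua, uc)
sq_dist_unit = euclidean_distance(ua, uc) ** 2

print(cos_ab, dot_ab, dist_ab)
print(cos_ac, dot_unit, sq_dist_unit)
```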
Types of Embeddings
Text embeddings can be generated at word, sentence, or document level. Word-level embeddings represent individual tokens; sentence embeddings capture the meaning of entire sentences, making them well suited to semantic search. Image embeddings, produced by convolutional or vision transformer encoders, represent visual content for tasks such as image retrieval and multi-modal search. Graph embeddings encode node relationships in knowledge graphs, enabling link prediction and entity resolution.
Applications
Semantic Search
Semantic search replaces traditional keyword matching with embedding-based nearest-neighbour retrieval. A query is embedded into the vector space, and the system retrieves documents whose embeddings are most similar, surfacing results that are conceptually related even when they share no exact keywords with the query.
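A brute-force version of this retrieval step might look like the sketch below. The document and query embeddings are hypothetical hand-written values standing in for the output of a real embedding model, and `search` is an illustrative helper rather than any library's API.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-computed document embeddings (3 dimensions for readability).
corpus = {
    "How to reset a forgotten password": [0.9, 0.1, 0.0],
    "Recovering access to a locked account": [0.8, 0.3, 0.1],
    "Quarterly revenue report": [0.0, 0.2, 0.9],
}

def search(query_embedding, corpus, k=2):
    """Return the k document texts whose embeddings are most similar to the query."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Assumed embedding of the query "I can't log in", which shares
# no keywords with any of the documents above.
query_embedding = [0.85, 0.2, 0.05]
results = search(query_embedding, corpus, k=2)
print(results)
```

An exhaustive scan like this is linear in the corpus size; larger collections usually rely on an approximate nearest-neighbour index instead.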
Retrieval-Augmented Generation
In RAG systems, a document corpus is pre-embedded and stored in a vector database. At inference time, a user query is embedded and used to retrieve the most relevant document chunks, which are then passed as context to a large language model. The quality of the embedding model strongly influences the quality of the retrieved context and, in turn, the accuracy of the generated answer.
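The index, retrieve, and prompt-assembly steps can be sketched as follows. Everything here is illustrative: `toy_embed` is a word-count stand-in for a real embedding model, the in-memory list stands in for a vector database, the prompt wording is an assumption, and the final call to a language model is omitted.

```python
import math

# Tiny assumed vocabulary for the stand-in embedding function.
VOCAB = ["refund", "shipping", "days", "password", "reset"]

def toy_embed(text):
    """Stand-in for a real embedding model: counts of a few vocabulary words."""
    words = text.lower().replace(".", " ").replace("?", " ").split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def build_index(chunks, embed):
    """Pre-embed every chunk once (the role a vector database plays)."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, index, embed, k=1):
    """Embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    """Assemble retrieved chunks and the question into a prompt string."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refund requests are processed within 5 days.",
    "Standard shipping takes 3 to 7 days.",
    "Use the reset link to choose a new password.",
]
index = build_index(chunks, toy_embed)
query = "How do I reset my password?"
top = retrieve(query, index, toy_embed, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Passing `embed` as a parameter keeps the retrieval logic independent of the embedding model, so the stand-in can be swapped for a real model without changing the other functions.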
Recommendation Systems
E-commerce and content platforms encode user behaviour histories and item attributes as embeddings. Recommendation is then framed as a nearest-neighbour search in the embedding space: items similar to those a user has interacted with are ranked highly.
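One simple way to realise this framing, shown here as a sketch rather than a production design, is to represent a user as the mean of the embeddings of items they have interacted with and rank unseen items by similarity to that profile. The item names and two-dimensional embeddings are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical item embeddings (2 dimensions for readability).
item_embeddings = {
    "sci-fi novel":      [0.9, 0.1],
    "space documentary": [0.8, 0.2],
    "fantasy novel":     [0.7, 0.3],
    "cookbook":          [0.1, 0.9],
    "baking show":       [0.2, 0.8],
}

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def recommend(history, item_embeddings, k=2):
    """Rank items the user has not seen by similarity to their profile vector."""
    profile = mean_vector([item_embeddings[item] for item in history])
    candidates = [item for item in item_embeddings if item not in history]
    candidates.sort(key=lambda item: cosine(profile, item_embeddings[item]), reverse=True)
    return candidates[:k]

recs = recommend(["sci-fi novel", "space documentary"], item_embeddings, k=1)
print(recs)  # ['fantasy novel']
```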
Anomaly Detection and Classification
Embeddings can be used to detect outliers by identifying vectors that lie far from any cluster centre in the embedding space. Classification heads — simple linear layers trained on top of frozen embeddings — are a standard technique for adapting general-purpose embeddings to specific classification tasks with limited labelled data.
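The outlier-detection idea can be sketched as a distance-to-nearest-centroid check. The centroids, embeddings, and threshold below are assumed values chosen for illustration; in practice the centroids would come from a clustering step and the threshold from validation data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_outliers(embeddings, centroids, threshold):
    """Return indices of embeddings farther than `threshold` from every centroid."""
    outliers = []
    for i, embedding in enumerate(embeddings):
        nearest = min(euclidean(embedding, c) for c in centroids)
        if nearest > threshold:
            outliers.append(i)
    return outliers

# Assumed cluster centres, e.g. produced earlier by k-means.
centroids = [[0.0, 0.0], [5.0, 5.0]]
embeddings = [
    [0.1, -0.2],   # near the first centroid
    [4.8, 5.1],    # near the second centroid
    [0.3, 0.1],    # near the first centroid
    [10.0, -3.0],  # far from both
]

outliers = find_outliers(embeddings, centroids, threshold=1.0)
print(outliers)  # [3]
```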
References
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019.
- Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP 2019.
- Amazon Web Services. (2025). Amazon Bedrock now available in Asia Pacific (Malaysia). AWS Announcements.
- IBM. (2025). What is vector embedding. IBM Think. https://www.ibm.com/think/topics/vector-embedding