Word2Vec

A neural network-based algorithm developed by Google in 2013 that learns dense vector representations of words from large text corpora, capturing semantic and syntactic relationships through distributional similarity.

7 min read · Last updated June 2026 · Foundations

Word2Vec is a neural network-based algorithm for learning dense vector representations of words from large text corpora, introduced by Tomas Mikolov and colleagues at Google in 2013. The algorithm learns to represent each word as a point in a continuous high-dimensional vector space such that words that appear in similar contexts are positioned close together. These learned representations, called word embeddings or word vectors, encode semantic and syntactic relationships that can be exploited in downstream natural language processing tasks.

Word2Vec represented a significant advance over earlier sparse, count-based representations of words such as one-hot encoding and term-frequency matrices. Its learned vectors exhibit remarkable geometric properties: vector arithmetic can capture analogical relationships, as in the famous example where the vector for "king" minus "man" plus "woman" yields a vector close to "queen". This finding demonstrated that meaningful structure had been captured implicitly from patterns of word co-occurrence in text.

Motivation: The Problem with Sparse Representations

Before word embeddings, the dominant approach in NLP was to represent words as sparse, high-dimensional vectors. In one-hot encoding, each word is a vector as long as the vocabulary (often more than 50,000 entries) that is zero everywhere except for a single 1 at the word's own index. Such representations treat all words as equidistant: there is no notion that "cat" and "dog" are more similar to each other than to "democracy". Bag-of-words and TF-IDF representations capture document-level word statistics but similarly say nothing about how the meanings of individual words relate.
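The equidistance problem can be checked directly. The following is a minimal sketch using NumPy on a three-word toy vocabulary; the vocabulary and the helper names `cosine` and `one_hot_similarity` are made up for illustration.

```python
import numpy as np

# Toy vocabulary (hypothetical); real vocabularies are far larger.
vocab = ["cat", "dog", "democracy"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def one_hot_similarity(a, b):
    """Cosine similarity between the one-hot vectors of two words."""
    return cosine(one_hot[vocab.index(a)], one_hot[vocab.index(b)])

cat_dog = one_hot_similarity("cat", "dog")
cat_democracy = one_hot_similarity("cat", "democracy")
print(cat_dog, cat_democracy)  # 0.0 0.0
```

Every pair of distinct words gets the same similarity of zero, so the representation carries no information about which words are related.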

Word2Vec addressed this by leveraging the distributional hypothesis: the observation that words which occur in similar linguistic contexts tend to have similar meanings. By training a neural network to predict words from their context (or vice versa), Word2Vec implicitly learns to organise words in a vector space according to their distributional similarity.

Architectures

Word2Vec provides two neural network architectures for learning word embeddings:

Continuous Bag of Words (CBOW)

In the CBOW architecture, the model is trained to predict a target word given a window of surrounding context words. The vectors of the context words are averaged into a single representation, which is then used to predict the central word. CBOW is faster to train and tends to perform better on frequent words, making it well-suited to larger corpora.
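A CBOW forward pass might be sketched as follows. This is an illustrative NumPy toy with untrained random weights and a full softmax over a five-word vocabulary, not the reference implementation; the names `W_in`, `W_out`, and `cbow_probs` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 4  # toy embedding size

# Two weight matrices, randomly initialised (untrained) for illustration.
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # target-word vectors

def cbow_probs(context_words):
    """Return a probability distribution over the vocabulary for the centre word."""
    ids = [word_to_id[w] for w in context_words]
    h = W_in[ids].mean(axis=0)   # average the context vectors
    scores = W_out @ h           # one score per vocabulary word
    scores -= scores.max()       # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()       # softmax

# Context around the centre word "sat" in "the cat sat on mat".
probs = cbow_probs(["the", "cat", "on", "mat"])
print(dict(zip(vocab, probs.round(3))))
```

Training would adjust both matrices so that the probability assigned to the true centre word increases.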

Skip-Gram

The Skip-Gram architecture inverts the CBOW task: given a single target word, the model predicts the surrounding context words within a fixed window. Skip-Gram is slower to train than CBOW but produces higher-quality embeddings for rare words, making it preferred when the vocabulary contains many specialised terms.
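The training examples Skip-Gram sees can be illustrated by enumerating (target, context) pairs with a sliding window. This is a simplified sketch: the function name `skipgram_pairs` is hypothetical, and it ignores refinements such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every position in a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair becomes one prediction task: given the target word, score the context word highly.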

Both architectures are trained on raw text with a sliding window. The neural network is shallow: a single hidden (projection) layer with no non-linearity. The size of the training corpus compensates for this architectural simplicity, and after training the input-to-hidden weight matrix, which has one row per vocabulary word, is used as the learned word vectors.
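Why the hidden-layer weights double as word vectors can be seen from a small NumPy check: multiplying a one-hot input by the weight matrix simply selects one row. The matrix name `W_in` and the toy sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
W_in = rng.normal(size=(vocab_size, dim))  # input-to-hidden weights

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

hidden = one_hot @ W_in   # hidden-layer activation for this word
lookup = W_in[word_id]    # the same values, read directly as a row

print(np.allclose(hidden, lookup))  # True
```

In practice implementations skip the matrix product and index the row directly, which is why the matrix is often called an embedding table.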

Training Efficiency: Negative Sampling

Naive training of Word2Vec requires a softmax over the entire vocabulary at each step, so the cost of every update grows with vocabulary size. Negative sampling addresses this by replacing the full softmax with a binary classification task: for each positive (word, context) pair observed in the corpus, a small number k of randomly sampled "negative" words are paired with the same word, and the model learns to distinguish real co-occurrence pairs from random ones. The computation per training step then depends on k rather than on vocabulary size, making Word2Vec practical on billion-word corpora.
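One negative-sampling update might look like the sketch below. It is a simplified NumPy illustration, not the reference implementation: negatives are drawn uniformly here for brevity, and the names `sgns_loss` and `sgns_step`, the learning rate, and the toy sizes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 16, 5  # toy sizes; k = negatives per positive pair

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(target, context, negatives):
    """Binary-classification loss: the real pair is labelled 1, sampled pairs 0."""
    v = W_in[target]
    pos = np.log(sigmoid(W_out[context] @ v))
    neg = np.log(sigmoid(-(W_out[negatives] @ v))).sum()
    return float(-(pos + neg))

def sgns_step(target, context, negatives, lr=0.05):
    """One SGD update that touches only k + 2 embedding rows."""
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                    # only the first pair is real
    v = W_in[target].copy()
    u = W_out[ids]                     # (k + 1, dim) copy of the rows involved
    err = sigmoid(u @ v) - labels      # gradient of the loss w.r.t. each score
    W_in[target] -= lr * (err @ u)
    # subtract.at accumulates correctly if an index appears more than once
    np.subtract.at(W_out, ids, lr * np.outer(err, v))

target, context = 3, 7
negatives = rng.integers(0, vocab_size, size=k)  # uniform sampling: a simplification

loss_before = sgns_loss(target, context, negatives)
sgns_step(target, context, negatives)
loss_after = sgns_loss(target, context, negatives)
print(loss_before, loss_after)
```

The update reads and writes k + 1 rows of `W_out` and one row of `W_in`, regardless of how large `vocab_size` is.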

Properties of Word2Vec Embeddings

Trained Word2Vec vectors exhibit several useful properties:

  • Semantic similarity: Words with similar meanings cluster together. Animals group near each other, countries near each other, and so on.
  • Analogical reasoning: Linear vector arithmetic captures analogical relationships: vector("Paris") - vector("France") + vector("Germany") approximates vector("Berlin").
  • Syntactic structure: Relationships such as tense, pluralisation, and comparative forms are encoded in consistent vector directions.
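The analogy property can be illustrated with a nearest-neighbour search over cosine similarity. The vectors below are hand-made toy values rather than trained embeddings, and the function names are hypothetical; as is common practice, the query words themselves are excluded from the candidates.

```python
import numpy as np

# Hand-made toy vectors (NOT trained embeddings); axes: [royalty, male, female].
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```

With real embeddings the same search runs over the full vocabulary, and the result is only approximately the expected word.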

Typical embedding dimensions range from 100 to 300, with 300-dimensional embeddings often used for general-purpose NLP tasks.

Limitations and Successors

Word2Vec has several limitations that later models addressed:

  • Context insensitivity: A single vector is assigned to each word regardless of context, so polysemous words (words with multiple meanings, such as "bank") receive one vector that averages across all meanings.
  • Out-of-vocabulary words: Words not seen during training have no representation.
  • No subword information: Word2Vec treats each word as an atomic unit, ignoring morphological structure. Related models such as FastText extend Word2Vec by representing words as bags of character n-grams, enabling representation of unseen words and improving performance on morphologically rich languages.
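The character n-gram idea behind FastText can be sketched in a few lines. This is an illustration of the decomposition only, with a hypothetical function name `char_ngrams`; the boundary markers and the default n-gram range of 3 to 6 follow the description in the FastText paper, and the hashing and training steps are omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Split a word into character n-grams, with < and > marking word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

grams = char_ngrams("where", n_min=3, n_max=3)
print(grams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with words seen during training.
shared = set(grams) & set(char_ngrams("wherever", n_min=3, n_max=3))
print(sorted(shared))
```

Because a word's vector is built from its n-gram vectors, an out-of-vocabulary word such as "wherever" can still receive a representation from the pieces it shares with known words.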

Contextualised word representations, in which the vector for a word depends on its sentence context, were popularised by ELMo (2018), which was in turn superseded by BERT (2018) and the transformer-based language models that followed. These models produce a different embedding for each occurrence of a word depending on its surrounding context, resolving the polysemy limitation. Despite being superseded for many high-performance NLP tasks, Word2Vec embeddings remain widely used in production systems where computational efficiency is important and pre-trained transformer models would be too large to deploy.

Legacy and Impact

Word2Vec catalysed the field of representation learning and established word embeddings as a fundamental concept in NLP. It demonstrated that unsupervised pre-training on large text corpora could produce general-purpose linguistic representations that transfer across tasks — a principle that became the foundation for the subsequent generation of pre-trained language models. GloVe (Global Vectors for Word Representation), developed at Stanford in 2014, extended the distributional approach using global co-occurrence statistics rather than local window context. Together, Word2Vec and GloVe defined the embedding paradigm that transformer models later generalised and scaled.

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
  2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.
  3. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of EMNLP 2014.
  4. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
  5. Pathmind. (2023). Word2Vec and neural word embeddings. Pathmind AI Wiki.