Word2Vec

A neural network-based algorithm developed by Google in 2013 that learns dense vector representations of words from large text corpora, capturing semantic and syntactic relationships through distributional similarity.

7 min read | Last updated June 2026 | Foundations

Word2Vec is a neural network-based algorithm for learning dense vector representations of words from large text corpora, introduced by Tomas Mikolov and colleagues at Google in 2013. The algorithm learns to represent each word as a point in a continuous high-dimensional vector space such that words that appear in similar contexts are positioned close together. These learned representations, called word embeddings or word vectors, encode semantic and syntactic relationships that can be exploited in downstream natural language processing tasks.

Word2Vec represented a significant advance over earlier sparse, count-based representations of words such as one-hot encoding and term-frequency matrices. Its learned vectors exhibit remarkable geometric properties: vector arithmetic can capture analogical relationships, as in the famous example where the vector for "king" minus "man" plus "woman" yields a vector close to "queen". This finding demonstrated that meaningful structure had been captured implicitly from patterns of word co-occurrence in text.

Motivation: The Problem with Sparse Representations

Before word embeddings, the dominant approach to representing words in NLP was sparse, high-dimensional representations. In one-hot encoding, each word is a vector of zeros with a single 1 at that word's index, so the vector length equals the vocabulary size, which often exceeds 50,000. Such representations treat all words as equidistant: there is no notion that "cat" and "dog" are more similar to each other than to "democracy". Bag-of-words and TF-IDF representations capture document-level word statistics but similarly lack semantic richness.
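The equidistance problem can be checked directly. The sketch below is a minimal illustration that uses a three-word toy vocabulary (an assumption made for brevity; real vocabularies are far larger) and shows that every pair of distinct one-hot vectors has the same Euclidean distance and zero cosine similarity.

```python
import math

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "democracy"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Every pair of distinct words is equally far apart and has zero similarity.
for a, b in [("cat", "dog"), ("cat", "democracy"), ("dog", "democracy")]:
    dist = euclidean(vectors[a], vectors[b])
    sim = cosine(vectors[a], vectors[b])
    print(f"{a:>4} vs {b:<9} distance={dist:.3f} cosine={sim:.1f}")
```

Because the distance is the same for every pair, nothing in the representation tells a model that "cat" is closer in meaning to "dog" than to "democracy".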

Word2Vec addressed this by leveraging the distributional hypothesis: the observation that words which occur in similar linguistic contexts tend to have similar meanings. By training a neural network to predict words from their context (or vice versa), Word2Vec implicitly learns to organise words in a vector space according to their distributional similarity.

Architectures

Word2Vec provides two neural network architectures for learning word embeddings:

Continuous Bag of Words (CBOW)

In the CBOW architecture, the model is trained to predict a target word given a window of surrounding context words. The context words are averaged into a single representation, which is fed into the neural network to predict the central word. CBOW is faster to train and tends to perform better on frequent words, making it well-suited to larger corpora.
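The CBOW forward pass described above can be sketched in a few lines. This is an illustrative toy, not the reference implementation: the vocabulary, the embedding size `dim`, the random initial weights, and the names `W_in`, `W_out`, and `cbow_forward` are all assumptions, and a full softmax is used for clarity.

```python
import math
import random

random.seed(0)

# Toy vocabulary and embedding size (illustrative assumptions).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 4

# Input (context) and output (target) weight matrices, randomly initialised.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def cbow_forward(context_words):
    """Average the context vectors, then return softmax probabilities over the vocabulary."""
    ids = [word_to_id[w] for w in context_words]
    # The "bag" step: context vectors are averaged, so word order is ignored.
    hidden = [sum(W_in[i][d] for i in ids) / len(ids) for d in range(dim)]
    # Score every vocabulary word against the averaged context vector.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Context for the centre word "sat" in "the cat sat on mat".
probs = cbow_forward(["the", "cat", "on", "mat"])
for word, p in zip(vocab, probs):
    print(f"{word:>4}: {p:.3f}")
```

With untrained weights the probabilities are close to uniform; training would adjust `W_in` and `W_out` so that the true centre word receives a higher probability.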

Skip-Gram

The Skip-Gram architecture inverts the CBOW task: given a single target word, the model predicts the surrounding context words within a fixed window. Skip-Gram is slower to train than CBOW but produces higher-quality embeddings for rare and infrequent words, making it preferred when the vocabulary contains many specialised terms.

Both architectures are trained on raw text with a sliding window. The neural network is shallow, with a single hidden (projection) layer and no non-linear activation, but the size of the training corpus compensates for architectural simplicity. After training, the rows of the input-to-hidden weight matrix are used as the learned word vectors.
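The difference between the two architectures is easiest to see in the training examples each one extracts from the same sliding window. The sketch below assumes a fixed window size and a whitespace-tokenised toy sentence; the function names `skipgram_pairs` and `cbow_examples` are hypothetical helpers written for this illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs: each target predicts one neighbour at a time."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_list, target) examples: the whole window predicts the centre word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the cat sat on the mat".split()

print(skipgram_pairs(tokens, window=1)[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]

print(cbow_examples(tokens, window=1)[:3])
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat', 'on'], 'sat')]
```

Skip-Gram produces one training example per (target, neighbour) pair, while CBOW produces one example per position. This is consistent with Skip-Gram being slower to train but giving rare words more individual updates.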

Training Efficiency: Negative Sampling

Naive training of Word2Vec requires a softmax over the entire vocabulary at each step, which is computationally expensive for large vocabularies. Negative sampling addresses this by replacing the full softmax with a binary classification task: for each positive (word, context) pair, a small number of randomly sampled "negative" word pairs are added, and the model learns to distinguish real co-occurrence pairs from random ones. This reduces the computation per training step to a constant independent of vocabulary size, making Word2Vec practical on billion-word corpora.
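The binary classification objective can be sketched for a single training pair as follows. The vectors here are made-up numbers and the helper names are assumptions; the sampling step itself is omitted (Mikolov et al. draw negatives from the unigram distribution raised to the 3/4 power). The loss rewards a high dot product for the real pair and a low dot product for each sampled negative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one positive (word, context) pair plus k sampled negatives."""
    # Positive pair: push sigmoid(center . context) towards 1.
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    # Negative pairs: push sigmoid(center . negative) towards 0.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

# Made-up 3-dimensional vectors, for illustration only.
center = [0.5, 0.1, -0.3]
context = [0.4, 0.2, -0.1]
negatives = [[-0.2, 0.3, 0.5], [0.1, -0.4, 0.2]]

loss = negative_sampling_loss(center, context, negatives)
print(f"loss with k={len(negatives)} negatives: {loss:.4f}")
```

The cost of one step depends on the number of negatives k and the embedding dimension, not on the vocabulary size, which is the source of the efficiency gain described above.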

Properties of Word2Vec Embeddings

Trained Word2Vec vectors exhibit several useful properties:

Semantic similarity: Words with similar meanings cluster together. Animals group near each other, countries near each other, and so on.
Analogical reasoning: Linear vector arithmetic captures analogical relationships: vector("Paris") - vector("France") + vector("Germany") approximates vector("Berlin").
Syntactic structure: Relationships such as tense, pluralisation, and comparative forms are encoded in consistent vector directions.
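The analogy property can be illustrated with vector arithmetic followed by a nearest-neighbour search under cosine similarity. The vectors below are hand-crafted three-dimensional values chosen to make the structure visible; they are not trained Word2Vec embeddings, and the `analogy` helper is a hypothetical name used for this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-crafted toy vectors (NOT trained): roughly [royalty, gender, other].
vectors = {
    "king":   [0.9,  0.9, 0.1],
    "queen":  [0.9, -0.9, 0.1],
    "man":    [0.1,  0.9, 0.0],
    "woman":  [0.1, -0.9, 0.0],
    "prince": [0.7,  0.9, 0.4],
    "apple":  [0.0,  0.0, 1.0],
}

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded, as is common practice.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

With real embeddings the result is only approximately equal to the expected word, so the answer is taken to be the nearest neighbour of the computed vector rather than an exact match.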

Typical embedding dimensions range from 100 to 300, with 300-dimensional embeddings often used for general-purpose NLP tasks.

Limitations and Successors

Word2Vec has several limitations that later models addressed:

Context insensitivity: A single vector is assigned to each word regardless of context, so polysemous words (words with multiple meanings, such as "bank") receive one vector that averages across all meanings.
Out-of-vocabulary words: Words not seen during training have no representation.
No subword information: Word2Vec treats each word as an atomic unit, ignoring morphological structure. Related models such as FastText extend Word2Vec by representing words as bags of character n-grams, enabling representation of unseen words and improving performance on morphologically rich languages.
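The character n-gram idea behind FastText can be sketched as follows. The boundary markers `<` and `>` and the default n-gram range of 3 to 6 follow the FastText paper (Bojanowski et al., 2017); the function name `char_ngrams` is a hypothetical helper, and the paper's additional whole-word token is left out for brevity.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Trigrams only, matching the worked example in the FastText paper.
print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# A word never seen in training still decomposes into n-grams,
# so a vector can be built from the n-gram vectors it shares with known words.
print(len(char_ngrams("wherever")))
```

Since a word vector is composed from its n-gram vectors, an unseen word receives a representation as long as some of its n-grams occurred in training.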

Contextualised word representations, in which the vector for a word depends on its sentence context, were popularised by ELMo (2018) and then superseded by BERT (2018) and the transformer-based language models that followed. These models produce a different embedding for each occurrence of a word depending on its surrounding context, resolving the polysemy limitation. Despite being superseded for many high-performance NLP tasks, Word2Vec embeddings remain widely used in production systems where computational efficiency is important and pre-trained transformer models would be too large to deploy.

Legacy and Impact

Word2Vec catalysed the field of representation learning and established word embeddings as a fundamental concept in NLP. It demonstrated that unsupervised pre-training on large text corpora could produce general-purpose linguistic representations that transfer across tasks — a principle that became the foundation for the subsequent generation of pre-trained language models. GloVe (Global Vectors for Word Representation), developed at Stanford in 2014, extended the distributional approach using global co-occurrence statistics rather than local window context. Together, Word2Vec and GloVe defined the embedding paradigm that transformer models later generalised and scaled.

Malaysian Context — NLP for Bahasa Malaysia and Regional Languages

Word2Vec and its principles are foundational to natural language processing research targeting Bahasa Malaysia, as well as other languages spoken in Malaysia, including Mandarin, Tamil, and the indigenous languages of Sabah and Sarawak.

Research groups at Universiti Malaya, Universiti Teknologi Malaysia, and Universiti Sains Malaysia have trained Word2Vec models on Malay-language corpora drawn from news archives, government documents, social media, and web crawls. These embeddings have been used as baselines and pre-training inputs for Malay NLP tasks including sentiment analysis of Malay social media, named entity recognition in Malay news, and machine translation between Malay and English.

The Malay language presents specific challenges for Word2Vec: Malay is agglutinative, with prefixes and suffixes (imbuhan) that significantly alter word form and meaning. For example, the word "jalan" (walk or road) can become "berjalan", "perjalanan", "menjalankan", and other forms. FastText's subword approach tends to outperform plain Word2Vec on Malay text for this reason, and Malaysian NLP practitioners increasingly prefer FastText or transformer-based models for production tasks.
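The effect of affixation can be seen by comparing character n-grams across the forms of "jalan" listed above. The sketch below assumes FastText-style n-grams of length 3 to 6 with `<` and `>` boundary markers; the helper name `char_ngram_set` is hypothetical. Plain Word2Vec would treat each form as an unrelated token, whereas the subword view exposes the shared root.

```python
def char_ngram_set(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

root = "jalan"
derived = ["berjalan", "perjalanan", "menjalankan"]

root_grams = char_ngram_set(root)
for word in derived:
    shared = root_grams & char_ngram_set(word)
    print(f"{word:<12} shares {len(shared)} n-grams with '{root}': {sorted(shared)}")
```

Every derived form shares the n-gram "jalan" with the root, so a subword model can relate the forms to one another even when some of them are rare or absent in the training corpus.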

Malaysian technology companies including Maxis, Telekom Malaysia, and various fintech startups have deployed word embedding-based systems for customer support intent classification, document routing, and fraud signal extraction from free-text transaction descriptions. These production deployments often use Word2Vec or FastText embeddings as the feature layer feeding into classical classifiers, chosen for their speed advantage over full transformer inference.

The Social Wellbeing Research Centre at Universiti Malaya and research units within CIMB Bank and Maybank have applied word embedding methods to analyse Bahasa Malaysia financial news and social media sentiment as part of economic monitoring and consumer insight programmes. MDEC has highlighted Malay NLP capabilities as a strategic gap under Malaysia's AI Roadmap, and the development of high-quality Malay word embeddings and language models is an identified national priority.

References

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of EMNLP 2014.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
Pathmind. (2023). Word2Vec and neural word embeddings. Pathmind AI Wiki.

Tags: word embeddings, NLP, natural language processing, Google, semantic similarity

Developed by	Google (Tomas Mikolov et al.)
Published	2013
Type	Word embedding algorithm
Architectures	CBOW and Skip-Gram
Output	Dense vector representations of words
Related	Embedding, NLP, Transformer architecture, GloVe