What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Named Entity Recognition

Named entity recognition (NER) is a natural language processing task that identifies and classifies named entities in text — such as people, organisations, locations, and dates — into predefined categories.

6 min read | Last updated May 2026 | Applications

Named entity recognition (NER), also referred to as entity identification or entity chunking, is a natural language processing (NLP) task that locates named entities mentioned in unstructured text and classifies them into predefined semantic categories. Standard entity categories include persons (PER), organisations (ORG), geographic locations (LOC), dates and times, monetary values, percentages, and miscellaneous named items. For example, in the sentence "Maybank announced a partnership with Microsoft in Kuala Lumpur on Monday", an NER system would identify Maybank as an organisation, Microsoft as an organisation, Kuala Lumpur as a location, and Monday as a date.

NER is typically framed as a sequence labelling task. Each token in a sentence receives a label indicating whether it is part of a named entity and, if so, the entity type. A common labelling scheme is the BIO notation: B marks the beginning of an entity, I marks tokens inside a continuing entity, and O marks tokens that are not part of any entity. The B and I tags carry the entity type as a suffix, as in B-ORG or I-LOC.
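The BIO scheme can be illustrated with the example sentence from the introduction. The sketch below is a minimal illustration, not a reference implementation: the function name `bio_to_spans` and the label strings (such as `B-ORG`) are choices made for this example, and the labels are written by hand rather than predicted by a model.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Maybank", "announced", "a", "partnership", "with", "Microsoft",
          "in", "Kuala", "Lumpur", "on", "Monday"]
labels = ["B-ORG", "O", "O", "O", "O", "B-ORG",
          "O", "B-LOC", "I-LOC", "O", "B-DATE"]

entities = bio_to_spans(tokens, labels)
print(entities)
```

The two-token location "Kuala Lumpur" shows why the B and I tags are distinguished: the B tag opens the span and the I tag continues it, so adjacent entities of the same type can still be separated.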

Historical Development

Early NER systems were built using hand-crafted rules and lexical resources such as gazetteers — lists of known named entities such as city names, company registries, and person name dictionaries. These rule-based systems were precise for well-defined domains but required extensive manual effort and did not generalise well across domains or languages.
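A gazetteer lookup of the kind described above can be sketched in a few lines. The entries in `GAZETTEER` and the function name `gazetteer_tag` are invented for this illustration; a real system would load far larger lists and combine them with contextual rules.

```python
import re
from typing import List, Tuple

# A tiny hand-built gazetteer (illustrative entries only).
GAZETTEER = {
    "kuala lumpur": "LOC",
    "george town": "LOC",
    "maybank": "ORG",
    "microsoft": "ORG",
}


def gazetteer_tag(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, type) for every gazetteer entry found in text."""
    matches = []
    for name, entity_type in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.group(0), entity_type))
    # Sort by position so results follow the order of the text.
    return [(surface, entity_type) for _, surface, entity_type in sorted(matches)]


found = gazetteer_tag("Maybank opened a branch in George Town, not Kuala Lumpur.")
missed = gazetteer_tag("Petronas is based in KLCC.")
print(found)
print(missed)
```

The second call returns nothing because neither name is in the list, which is the generalisation problem noted above: a gazetteer recognises only what has already been catalogued.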

Statistical sequence labelling models, particularly conditional random fields (CRFs), dominated the field through the 2000s and into the 2010s. CRFs model the conditional probability of a label sequence given an input sequence, taking into account both the input features and the dependencies between adjacent labels. They outperformed earlier generative models such as Hidden Markov Models by conditioning on rich, overlapping input features.
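In the usual textbook notation (the symbols below are conventional and are not taken from a specific system described here), a linear-chain CRF defines the probability of a label sequence $\mathbf{y} = (y_1, \dots, y_T)$ given an input sequence $\mathbf{x}$ as:

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Each feature function $f_k$ may inspect the whole input $\mathbf{x}$ together with the adjacent label pair $(y_{t-1}, y_t)$, and $\lambda_k$ is its learned weight. Because the model normalises over label sequences only and never models the distribution of $\mathbf{x}$ itself, features can overlap freely, which is the advantage over Hidden Markov Models mentioned above.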

Bidirectional LSTM-CRF models, which combine a bidirectional LSTM for encoding context with a CRF layer for structured output prediction, became the dominant neural NER architecture from approximately 2016. These models learn character-level and word-level representations, capturing morphological patterns useful for recognising entities in unseen forms.
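The role of the CRF layer can be sketched with Viterbi decoding over hand-written scores. This is an illustrative toy, not a trained model: the emission scores stand in for the output of a BiLSTM encoder, and the transition scores stand in for the learned CRF transition parameters. All numbers are invented.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total score.

    Transition pairs missing from `transitions` are scored as 0.0.
    """
    best = [{lab: emissions[0][lab] for lab in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                           + emissions[t][cur])
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-LOC", "I-LOC"]
# Toy per-token scores for "in Kuala Lumpur" (hypothetical encoder output).
emissions = [
    {"O": 2.0, "B-LOC": 0.1, "I-LOC": 0.0},
    {"O": 0.5, "B-LOC": 1.0, "I-LOC": 1.2},
    {"O": 0.3, "B-LOC": 0.4, "I-LOC": 1.5},
]
# Toy transition scores: O -> I-LOC is heavily penalised, B-LOC -> I-LOC rewarded.
transitions = {("O", "I-LOC"): -10.0, ("B-LOC", "I-LOC"): 1.0}

greedy = [max(e, key=e.get) for e in emissions]
decoded = viterbi(emissions, transitions, labels)
print(greedy)
print(decoded)
```

Choosing the best label for each token independently yields an I-LOC tag directly after O, which is not a valid BIO sequence. Decoding with transition scores repairs this by selecting the sequence that scores best overall.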

Transformer-Based NER

The introduction of pre-trained transformer models, particularly BERT (Bidirectional Encoder Representations from Transformers) in 2018, substantially advanced NER performance. When BERT was released, fine-tuning it on labelled NER datasets produced results at or near the state of the art on standard benchmarks such as CoNLL-2003. The model's bidirectional context representation captures long-range dependencies that LSTM-based models handle less effectively.

Subsequent models including RoBERTa, ALBERT, and domain-specific variants such as BioBERT (for biomedical text) and FinBERT (for financial text) have been fine-tuned for NER in specialised domains. In multilingual settings, XLM-RoBERTa, which was pre-trained on text in about 100 languages, enables NER across those languages from a single model. This is particularly valuable for low-resource languages where sufficient training data for language-specific models is unavailable.

Large language models can perform NER through prompting — presenting the text and asking the model to identify and classify entities in its output — but fine-tuned smaller models typically achieve higher precision on well-defined entity taxonomies in production settings.
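Prompt-based NER can be sketched without committing to any particular model or API. In the example below, the prompt wording, the JSON output format, the `ALLOWED_TYPES` taxonomy, and both function names are assumptions made for illustration. The model reply is a hard-coded string standing in for a real response.

```python
import json
from typing import List, Tuple

ALLOWED_TYPES = {"PER", "ORG", "LOC", "DATE"}  # example taxonomy


def build_ner_prompt(text: str) -> str:
    """Build a prompt asking a language model for entities as JSON."""
    return (
        "Extract the named entities from the text below.\n"
        f"Allowed types: {', '.join(sorted(ALLOWED_TYPES))}.\n"
        'Answer with a JSON list of objects with keys "text" and "type".\n\n'
        f"Text: {text}"
    )


def parse_ner_response(response: str, source_text: str) -> List[Tuple[str, str]]:
    """Keep only entities that parse, use an allowed type, and occur in the source."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        surface, entity_type = item.get("text"), item.get("type")
        if (isinstance(surface, str) and isinstance(entity_type, str)
                and entity_type in ALLOWED_TYPES and surface in source_text):
            entities.append((surface, entity_type))
    return entities


source = "Maybank announced a partnership with Microsoft in Kuala Lumpur on Monday"
prompt = build_ner_prompt(source)
# Stand-in for a model reply: one entity is absent from the source text,
# and one uses a type outside the taxonomy.
mock_reply = (
    '[{"text": "Maybank", "type": "ORG"}, {"text": "Kuala Lumpur", "type": "LOC"}, '
    '{"text": "Google", "type": "ORG"}, {"text": "Monday", "type": "WEEKDAY"}]'
)
validated = parse_ner_response(mock_reply, source)
print(validated)
```

Validating the reply against the source text and the fixed taxonomy is one simple way to guard against free-form output, which is part of why prompted extraction can be less precise than a fine-tuned tagger with a closed label set.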

Entity Linking and Knowledge Graphs

NER is often the first step in a broader information extraction pipeline. Entity linking (EL), also called entity disambiguation, takes the entity spans identified by an NER system and maps them to canonical entries in a knowledge base such as Wikidata, DBpedia, or a domain-specific knowledge graph. This transforms ambiguous surface mentions (Apple could refer to the technology company, the fruit, or a person's surname) into unambiguous entity identifiers.
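The disambiguation step can be sketched as choosing the candidate whose description overlaps most with the words around the mention. The knowledge base below is a toy: the identifiers (`KB:apple_inc`, `KB:apple_fruit`) and keyword sets are invented and are not real Wikidata or DBpedia entries. Production linkers use learned representations rather than keyword overlap.

```python
import re
from typing import Optional

# Toy knowledge base: mention -> candidate entries (all identifiers hypothetical).
KNOWLEDGE_BASE = {
    "apple": [
        {"id": "KB:apple_inc",
         "keywords": {"iphone", "company", "shares", "technology"}},
        {"id": "KB:apple_fruit",
         "keywords": {"fruit", "orchard", "juice", "eat"}},
    ],
}


def link_entity(mention: str, context: str) -> Optional[str]:
    """Pick the candidate whose keywords overlap most with the context words."""
    candidates = KNOWLEDGE_BASE.get(mention.lower(), [])
    if not candidates:
        return None
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    best = max(candidates, key=lambda c: len(c["keywords"] & context_words))
    # With no supporting context, decline to link rather than guess.
    if not best["keywords"] & context_words:
        return None
    return best["id"]


company = link_entity("Apple", "Apple shares rose after the iPhone launch")
fruit = link_entity("Apple", "She picked an apple from the orchard")
unknown = link_entity("Apple", "Apple was mentioned")
print(company, fruit, unknown)
```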

The combination of NER and entity linking enables the construction and population of knowledge graphs from unstructured text, where entities become nodes and the relationships between co-occurring entities become edges. Search engines, question answering systems, and document intelligence platforms rely on this pipeline to extract structured information from large text corpora.
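As a rough sketch of how such a graph might be populated, the snippet below counts how often pairs of linked entities appear in the same sentence and treats each pair as a weighted edge. The function name and the sample data are invented for illustration. Co-occurrence is only a crude proxy for a relationship, and a fuller pipeline would add a relation extraction step to type the edges.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence_edges(sentences: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count, for each unordered entity pair, the sentences mentioning both."""
    edges: Counter = Counter()
    for entities in sentences:
        # Sort so (a, b) and (b, a) map to the same edge key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Entities already extracted (and linked) per sentence; sample data only.
sentences = [
    ["Maybank", "Microsoft", "Kuala Lumpur"],
    ["Maybank", "Kuala Lumpur"],
    ["Microsoft"],
]
edges = cooccurrence_edges(sentences)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b} (weight {weight})")
```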

Applications

Document intelligence and information extraction systems use NER to automatically process and structure large volumes of unstructured documents. Legal contract analysis tools extract parties, dates, obligations, and governing jurisdictions. Financial document processing identifies company names, monetary figures, and reporting periods in earnings filings and analyst reports. Medical record analysis extracts patient information, diagnoses, medications, and dosages from clinical notes.

In regulatory compliance, NER enables automated screening of communications, transactions, and documents for mentions of sanctioned entities, politically exposed persons (PEPs), and geographies subject to trade restrictions. News and media monitoring services use NER to track coverage of specific companies, individuals, and topics across large article corpora.
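Once entity names have been extracted, the screening step can be sketched as fuzzy matching against a watchlist. The watchlist entries below are fictional, the 0.85 threshold is an arbitrary choice for this example, and `difflib.SequenceMatcher` from the Python standard library stands in for the more specialised name-matching methods that compliance systems typically use.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Fictional watchlist entries, for illustration only.
WATCHLIST = ["Example Trading Sdn Bhd", "Fictional Shipping Ltd"]


def normalise(name: str) -> str:
    """Lower-case, drop full stops, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def screen(entity: str, threshold: float = 0.85) -> List[Tuple[str, float]]:
    """Return watchlist entries whose similarity to entity meets the threshold."""
    hits = []
    for listed in WATCHLIST:
        score = SequenceMatcher(None, normalise(entity), normalise(listed)).ratio()
        if score >= threshold:
            hits.append((listed, round(score, 2)))
    return hits


exact = screen("EXAMPLE TRADING SDN. BHD.")
typo = screen("Example Tradng Sdn Bhd")
clean = screen("Maybank")
print(exact, typo, clean)
```

The threshold trades missed matches against false alerts: a lower value catches more spelling variants but sends more benign names to manual review.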

Cybersecurity applications apply NER to threat intelligence feeds and security reports to extract indicators of compromise (IoCs) such as IP addresses, domain names, and malware family names.
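Some indicators, such as IP addresses and domain names, have a regular surface form and are often extracted with patterns alongside a learned model. The sketch below uses simplified regular expressions written for this example. They do not cover IPv6, defanged indicators, or internationalised domains, and free-form entities such as malware family names would still require a trained NER model.

```python
import re
from typing import Dict, List

# Simplified patterns for illustration; real feeds need broader coverage.
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
DOMAIN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)


def extract_iocs(text: str) -> Dict[str, List[str]]:
    """Return IPv4 addresses and domain names found in text."""
    return {"ipv4": IPV4.findall(text), "domain": DOMAIN.findall(text)}


# The address and domain below are reserved for documentation use.
report = "The loader contacted 203.0.113.7 and resolved update.example.com before exiting."
iocs = extract_iocs(report)
print(iocs)
```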

Malaysian Context — NER for Bahasa Malaysia and Local Entity Extraction

NER in Malaysia presents both linguistic complexity and commercial importance. The country's multilingual environment means that named entities in Malaysian text may appear in Bahasa Malaysia, English, Mandarin Chinese, Tamil, or code-mixed combinations. A company like Petronas might be referred to as Petroliam Nasional in formal Bahasa Malaysia text, PETRONAS in English, or in abbreviated or colloquial forms in informal writing. Robust Malaysian NER systems must handle these variations across languages and registers.

Researchers at Universiti Malaya, Universiti Sains Malaysia (USM), and Universiti Teknologi MARA (UiTM) have developed annotated corpora for Malay-language NER, targeting the news, social media, and government document domains. The lack of large, high-quality Bahasa Malaysia NER training datasets has historically constrained the quality of local systems, but multilingual pre-trained models such as XLM-RoBERTa have narrowed this gap by enabling cross-lingual transfer: a model fine-tuned on labelled data in a high-resource language can be applied to Malay text with little or no Malay training data.

Malaysian financial institutions use NER in compliance and anti-money laundering (AML) workflows. CIMB, Maybank, and RHB operate large compliance teams that screen customer data, transaction narratives, and communications against sanctions lists and PEP databases. NER systems that accurately extract entity names from varied text formats and map them to canonical identifiers reduce the manual review burden and improve detection accuracy.

In the legal technology sector, Malaysian law firms and courts have explored NER for automated case law analysis, extracting case citations, party names, and judicial entities from large collections of Malaysian legal documents. The Legal Affairs Division of the Prime Minister's Department has invested in digitisation of legal records, creating potential data infrastructure for NER-powered legal analytics.

MDEC's AI in Government initiative has promoted the use of NLP tools including NER for processing government documents, ministerial statements, and regulatory filings. Automating entity extraction from these sources supports policy monitoring, regulatory intelligence, and public sector service delivery improvement.

References

Lample, G. et al. (2016). Neural Architectures for Named Entity Recognition. Proceedings of NAACL-HLT 2016.
Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019.
Conneau, A. et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of ACL 2020.
Abdullah, M.T. et al. (2023). A Survey of Named Entity Recognition for Bahasa Malaysia. Proceedings of the International Conference on Asian Language Processing (IALP 2023).
Finkel, J.R., Grenager, T., and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of ACL 2005.

Tags: NER, named entity recognition, NLP, information extraction, text mining

Abbreviation	NER
Type	NLP information extraction task
Entity types	Person, organisation, location, date, quantity, misc.
Key approaches	CRF, BiLSTM-CRF, BERT-based models
Applications	Information extraction, knowledge graphs, search, compliance