What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Data Labelling

Data labelling is the process of attaching meaningful tags, classes, or annotations to raw data so that supervised machine learning models can learn to predict those labels on unseen examples.

5 min read · Last updated June 2026 · Infrastructure

Data labelling, also called data annotation, is the process by which raw inputs are tagged with the correct outputs, classes, regions, or other structured metadata that a supervised machine learning model is expected to predict. Labelled datasets are the foundation of nearly every production AI system: a classifier learns from images paired with class names, a speech recogniser from audio paired with transcripts, an object detector from images paired with bounding boxes, and an instruction-tuned LLM from prompts paired with high-quality responses.

Why labelling matters

Although unsupervised, self-supervised, and synthetic-data techniques have reduced the absolute volume of human-labelled data needed for some tasks, almost every high-stakes deployment still depends on accurately labelled examples for training, fine-tuning, and evaluation. Industry estimates put the global data labelling market at roughly USD 4.9 billion in 2025 and expect it to grow several-fold this decade, driven by demand for high-quality data for foundation models and for vertical applications in healthcare, autonomous mobility, manufacturing, and finance.

Labelling tasks by modality

| Modality | Common annotation tasks |
|---|---|
| Image | Classification, bounding boxes, polygons, key points, segmentation masks |
| Video | Per-frame classification, object tracking, temporal segmentation |
| Text | Classification, named entity recognition, span tagging, relation extraction, preference ranking |
| Audio | Transcription, speaker diarisation, sound-event detection |
| 3D / LiDAR | Cuboids, semantic segmentation of point clouds |
| Sensor / time series | Event detection, fault labelling |
| Generative AI | Pairwise preference, rubric-based scoring, red-team labels |
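To make two of these task types concrete, the sketch below shows one possible way to represent an image bounding box and a text span as records. The class names, field names, and coordinate conventions are illustrative assumptions, not the schema of any particular labelling platform.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical schema)."""
    label: str
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height


@dataclass
class TextSpan:
    """Character-offset span for NER-style span tagging (hypothetical schema)."""
    label: str
    start: int     # inclusive character offset
    end: int       # exclusive character offset

    def text(self, source: str) -> str:
        return source[self.start:self.end]


# Illustrative data only.
box = BoundingBox(label="car", x=10.0, y=20.0, width=50.0, height=40.0)

sentence = "Petronas is headquartered in Kuala Lumpur."
spans = [TextSpan("ORG", 0, 8), TextSpan("LOC", 29, 41)]

for span in spans:
    print(span.label, "->", span.text(sentence))
```

Storing spans as character offsets rather than copied strings keeps the annotation tied to an exact position in the source text, which matters when the same word appears more than once.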

Workforce and tooling

Labelling work is typically organised through one of four models: a small in-house team for sensitive or specialist data; a managed labelling vendor (Scale AI, iMerit, Sama, Appen, Surge AI) that provides trained annotators with quality controls; a crowdsourcing marketplace (Amazon Mechanical Turk, Toloka) for simple high-volume tasks; or AI-assisted labelling, where a model produces initial labels that humans review and correct. Active learning, in which the model selects the most informative examples for labelling, can reduce annotation cost substantially compared with random sampling.
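As a rough illustration of active learning, the sketch below uses least-confidence sampling, one common selection strategy among several. The function names, item IDs, and probabilities are made up for the example; in practice the probabilities would come from the current model's predictions on the unlabelled pool.

```python
def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the top class."""
    return 1.0 - max(probs)


def select_for_labelling(predictions, budget):
    """Return the IDs of the `budget` most uncertain items.

    predictions: dict mapping item ID -> list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Illustrative model outputs on an unlabelled pool (hypothetical values).
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.55, 0.44, 0.01],
    "img_004": [0.90, 0.05, 0.05],
}

queue = select_for_labelling(predictions, budget=2)
print(queue)
```

Items the model is already sure about (here `img_001` and `img_004`) are left out of the queue, so annotator time goes to the examples most likely to change the model.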

Platforms in regular use include Amazon SageMaker Ground Truth, Label Studio, V7, Encord, Kili Technology, CVAT, Roboflow, Labelbox, Snorkel, Argilla, and Prodigy. Modern platforms combine project management, an annotation UI, model-assisted pre-labelling, quality scoring, audit trails, and APIs into a single workflow.
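Model-assisted pre-labelling is often paired with confidence-based routing. The sketch below is a minimal, platform-independent illustration: the function name, record fields, and the 0.9 threshold are assumptions chosen for the example, and each platform exposes this differently through its own API.

```python
def route_prelabels(prelabels, spot_check_threshold=0.9):
    """Split model pre-labels into two human queues by model confidence.

    Every item still reaches a human reviewer; high-confidence items
    go to a lighter spot-check queue, the rest to full review.
    """
    spot_check, full_review = [], []
    for item in prelabels:
        if item["confidence"] >= spot_check_threshold:
            spot_check.append(item["id"])
        else:
            full_review.append(item["id"])
    return spot_check, full_review


# Hypothetical pre-labels produced by a document classifier.
prelabels = [
    {"id": "doc_01", "label": "invoice", "confidence": 0.97},
    {"id": "doc_02", "label": "receipt", "confidence": 0.62},
    {"id": "doc_03", "label": "invoice", "confidence": 0.91},
    {"id": "doc_04", "label": "contract", "confidence": 0.48},
]

spot_check, full_review = route_prelabels(prelabels)
print("spot check:", spot_check)
print("full review:", full_review)
```

The threshold is a tuning choice: a suitable value depends on how well calibrated the model's confidence scores are, which can be checked against gold-standard items.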

Quality control

Annotation quality is typically governed by clear labelling guidelines, multi-annotator consensus on a subset of items, gold-standard examples interleaved with the queue, periodic auditing, inter-annotator agreement statistics (Cohen's kappa, Krippendorff's alpha), and feedback loops to clarify ambiguous cases. Where the underlying judgement is inherently subjective, such as toxicity, helpfulness, or aesthetic quality, teams use rubrics, calibration sessions, and population-based annotation panels rather than relying on a single "ground truth".
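Two of these controls are simple enough to sketch directly. Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The code below is a minimal standard-library sketch with made-up labels; the function names are illustrative, and established statistics libraries provide tested implementations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's per-class frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (observed - expected) / (1.0 - expected)


def majority_vote(votes):
    """Return (label, share of votes); label is None when the top two tie."""
    counts = Counter(votes).most_common()
    share = counts[0][1] / len(votes)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, share
    return counts[0][0], share


# Illustrative labels from two annotators on ten items.
annotator_a = ["toxic", "ok", "ok", "toxic", "ok", "ok", "ok", "toxic", "ok", "ok"]
annotator_b = ["toxic", "ok", "toxic", "toxic", "ok", "ok", "ok", "ok", "ok", "ok"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
print(majority_vote(["cat", "cat", "dog"]))
```

In this example the annotators agree on 8 of 10 items, but because "ok" dominates both label sets the chance agreement is 0.58, so kappa is noticeably lower than the raw 80% agreement. Items where `majority_vote` returns `None` are natural candidates for adjudication or a guideline clarification.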

Ethics and labour considerations

Data labelling at scale has been criticised for low pay, opaque task conditions, and psychological harm to annotators handling sensitive or graphic material. Responsible practice involves fair compensation, mental-health support for moderators of harmful content, transparent task descriptions, and informed consent for any data captured from labellers themselves.

Malaysian Context — Annotation Workforce and Regional Vendors

Malaysia has emerged as one of South-East Asia's growing hubs for data annotation, both as a delivery centre for global vendors and as a domestic supplier for local AI initiatives. Multilingual capability (English, Bahasa Malaysia, Mandarin, Tamil, Hokkien, Cantonese, and the closely related Bahasa Indonesia) is a particular strength for text, speech, and conversational AI annotation across the ASEAN market.

Global vendors operate delivery centres or partner networks in Kuala Lumpur, Cyberjaya, Johor Bahru, and Penang, while domestic firms such as Innov8tif, Securemetric, and several MDEC Premier Digital Tech ecosystem partners offer specialised annotation services for eKYC, document processing, and language data. The HRD Corp levy reimbursement scheme covers annotation workforce upskilling under digital skill categories, and MDEC has supported gig-economy and microwork pathways for data annotation through programmes such as eRezeki and Global Online Workforce (GLOW).

Sector-specific labelling work in Malaysia includes medical imaging annotation for partner hospitals and research institutes (KPJ, IHH/Pantai Hospitals, KKM-linked university hospitals, IMU), palm oil and agronomy imagery under guidance from the Malaysian Palm Oil Board (MPOB) and FELDA, autonomous-driving and ADAS labelling for regional OEMs and Proton's R&D affiliates, and financial document annotation for banks and capital-market participants regulated by BNM and the Securities Commission.

Labelling work that involves personal data is subject to the Personal Data Protection Act 2010 (PDPA), and emerging Malaysian AI Governance Framework principles coordinated by MOSTI and the National AI Office are expected to address data provenance, annotator welfare, and dataset documentation requirements such as datasheets and model cards. The roles of CyberSecurity Malaysia and NACSA in protecting critical national information infrastructure also shape annotation practice for sensitive datasets.

References

Snow, R. et al. (2008). Cheap and fast — but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP.
Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. Sage.
Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM.
Roboflow. (2026). 5 Best Data Labeling Tools.
Personal Data Protection Department Malaysia. PDPA 2010 and 2024 Amendments.

Tags: supervised-learning, annotation, ground-truth, training-data

Type	Data preparation discipline
Also called	Data annotation, tagging
Output	Labelled training datasets
Common modalities	Text, image, video, audio, 3D, sensor
Workforce models	In-house, managed, crowdsourced, AI-assisted
Related	Supervised learning, active learning, synthetic data
