Data Labelling
Data labelling, also called data annotation, is the process of attaching the correct outputs, classes, regions, or other structured metadata to raw data so that a supervised machine learning model can learn to predict those labels on unseen examples. Labelled datasets are the foundation of nearly every production AI system: a classifier learns from images paired with class names, a speech recogniser from audio paired with transcripts, an object detector from images paired with bounding boxes, and an instruction-tuned large language model (LLM) from prompts paired with high-quality responses.
Why labelling matters
Although unsupervised, self-supervised, and synthetic-data techniques have reduced the absolute volume of human-labelled data needed for some tasks, almost every high-stakes deployment still depends on accurately labelled examples for training, fine-tuning, and evaluation. Industry estimates put the global data labelling market at roughly USD 4.9 billion in 2025 and expect it to grow several-fold this decade, driven by demand for high-quality data for foundation models and for vertical applications in healthcare, autonomous mobility, manufacturing, and finance.
Labelling tasks by modality
| Modality | Common annotation tasks |
|---|---|
| Image | Classification, bounding boxes, polygons, key points, segmentation masks |
| Video | Per-frame classification, object tracking, temporal segmentation |
| Text | Classification, named entity recognition, span tagging, relation extraction, preference ranking |
| Audio | Transcription, speaker diarisation, sound-event detection |
| 3D / LiDAR | Cuboids, semantic segmentation of point clouds |
| Sensor / time series | Event detection, fault labelling |
| Generative AI | Pairwise preference, rubric-based scoring, red-team labels |
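To make two of the annotation types above concrete, the following is a minimal sketch of how an image bounding box and a text span label might be represented in code. The class names, field names, and coordinate conventions (pixel corners, character offsets with an exclusive end) are assumptions chosen for illustration and do not correspond to any particular platform's export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical image annotation: a class label plus pixel corner coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp to zero so a malformed (inverted) box does not yield a negative area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


@dataclass(frozen=True)
class TextSpan:
    """Hypothetical text annotation: a label over a character range."""
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

    def text(self, document: str) -> str:
        return document[self.start:self.end]


# Illustrative values only.
box = BoundingBox("car", 10, 20, 110, 70)
sentence = "Ada Lovelace was born in London."
span = TextSpan("PERSON", 0, 12)
```

Here `box.area()` gives the box size in square pixels, and `span.text(sentence)` recovers the tagged substring, which is a convenient sanity check when reviewing span offsets.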
Workforce and tooling
Labelling work is typically organised through one of four models: a small in-house team for sensitive or specialist data; a managed labelling vendor (Scale AI, iMerit, Sama, Appen, Surge AI) that provides trained annotators with quality controls; a crowdsourcing marketplace (Amazon Mechanical Turk, Toloka) for simple high-volume tasks; or AI-assisted labelling, where a model produces initial labels that humans review and correct. Active learning, in which the model selects the most informative examples for labelling, can reduce annotation cost substantially compared with random sampling.
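One common active learning strategy is uncertainty sampling, in which the items whose predicted class distribution has the highest entropy are sent to annotators first. The sketch below illustrates that idea under stated assumptions: the function names, item ids, and probability values are hypothetical, and a real pipeline would obtain the probabilities from a trained model.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labelling(predictions, budget):
    """Return the ids of the `budget` items with the most uncertain predictions.

    `predictions` maps an item id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,  # highest entropy (least confident) first
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabelled items.
unlabelled = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],
}
queue = select_for_labelling(unlabelled, budget=2)
```

With these values the near-uniform predictions (`img_004`, then `img_002`) are queued ahead of the confident ones. Entropy is only one possible uncertainty measure; margin-based or committee-based criteria are alternatives.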
Platforms in regular use include Amazon SageMaker Ground Truth, Label Studio, V7, Encord, Kili Technology, CVAT, Roboflow, Labelbox, Snorkel, Argilla, and Prodigy. Modern platforms combine project management, an annotation UI, model-assisted pre-labelling, quality scoring, audit trails, and APIs into a single workflow.
Quality control
Annotation quality is typically governed by clear labelling guidelines, multi-annotator consensus on a subset of items, gold-standard examples interleaved with the queue, periodic auditing, inter-annotator agreement statistics (Cohen's kappa, Krippendorff's alpha), and feedback loops to clarify ambiguous cases. Where the underlying judgement is inherently subjective, such as toxicity, helpfulness, or aesthetic quality, teams use rubrics, calibration sessions, and population-based annotation panels rather than relying on a single "ground truth".
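Two of the checks above can be sketched in a few lines: Cohen's kappa for agreement between two annotators, and a majority vote for multi-annotator consensus. This is a minimal illustration, not a reference implementation: the function names and sample labels are hypothetical, and returning `None` on a tied vote is an assumed policy (in practice a tie might be escalated to an expert reviewer).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def majority_label(votes):
    """Return the most common label, or None when first place is tied."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical labels from two annotators over the same eight items.
annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
annotator_b = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "spam"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this example the annotators agree on 6 of 8 items (0.75 observed agreement) while 0.5 agreement would be expected by chance, so kappa is 0.5. Raw percentage agreement alone would overstate reliability because it ignores the chance component.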
Ethics and labour considerations
Data labelling at scale has been criticised for low pay, opaque task conditions, and psychological harm to annotators handling sensitive or graphic material. Responsible practice involves fair compensation, mental-health support for moderators of harmful content, transparent task descriptions, and informed consent for any data captured from labellers themselves.
References
- Snow, R. et al. (2008). Cheap and fast — but is it good? Evaluating non-expert annotations for natural language tasks. EMNLP.
- Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. Sage.
- Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM.
- Roboflow. (2026). 5 Best Data Labeling Tools.
- Personal Data Protection Department Malaysia. PDPA 2010 and 2024 Amendments.