Pose Estimation
Pose estimation is the computer vision task of localising keypoints, typically anatomical landmarks such as joints, on a body, hand, or object in images or video, and inferring a structured representation of its position and orientation. Outputs may be two-dimensional pixel coordinates, three-dimensional positions in camera or world coordinates, or higher-level skeletons and meshes. The task underlies applications ranging from sports analytics to extended reality avatars and physiotherapy.
Problem variants
Pose estimation systems are categorised along several axes. Two-dimensional pose estimation predicts pixel-space keypoint locations from a single image, while three-dimensional pose estimation recovers 3D joint coordinates, often by lifting 2D detections with the help of camera calibration, multi-view input, or learned priors. Single-person systems assume one subject in the frame, while multi-person systems handle crowded scenes through either a top-down pipeline (detect each person, then estimate a pose per detection) or a bottom-up pipeline (predict all keypoints in the image, then group them into individuals).
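To illustrate how camera calibration enters the step from 2D to 3D, the following minimal sketch back-projects a pixel-space keypoint into camera coordinates under an ideal pinhole model. It assumes the depth of the keypoint is already known (for example from a depth sensor or a learned estimate). The function names and the intrinsic values are hypothetical and chosen only for this example.

```python
def backproject_keypoint(u, v, depth, fx, fy, cx, cy):
    """Map a pixel (u, v) with known depth to camera-space (X, Y, Z).

    Assumes an ideal pinhole camera without lens distortion.
    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project_point(x, y, z, fx, fy, cx, cy):
    """Inverse operation: project a camera-space point to pixel coordinates."""
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 camera (illustrative values only).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# A 2D wrist detection at pixel (420, 190), assumed to lie 2 m from the camera.
wrist_3d = backproject_keypoint(420.0, 190.0, 2.0, FX, FY, CX, CY)
wrist_2d = project_point(*wrist_3d, FX, FY, CX, CY)
```

Without a depth value the same pixel corresponds to a whole ray of 3D points, which is why lifting methods rely on additional views or learned priors.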
Specialised variants include hand pose estimation, facial landmark detection, animal pose estimation, and six-degree-of-freedom object pose estimation for robotics. Dense models such as DensePose produce per-pixel correspondences to a canonical body surface, supporting full-body reconstruction.
Architectures
Modern pose estimators are built on convolutional and transformer backbones. Influential systems include OpenPose, which introduced part affinity fields for multi-person bottom-up pose estimation; HRNet, which maintains high-resolution feature maps throughout the network for precise keypoint localisation; PoseNet and MoveNet, optimised for mobile and edge deployment; AlphaPose, a top-down multi-person system designed to remain accurate when person detections are imprecise; and YOLOv8-Pose, which extends real-time object detection to keypoint regression. Google's MediaPipe provides production-grade pose, hand, and face solutions tuned for on-device inference. Transformer-based estimators such as ViTPose match or surpass convolutional baselines on COCO Keypoints.
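The idea behind part affinity fields can be sketched as follows: a network predicts a 2D vector field per limb type, and a candidate connection between two detected keypoints is scored by how well the field along the connecting segment agrees with the segment's direction. The code below is a simplified illustration using NumPy, not OpenPose's implementation; the function name, the nearest-pixel sampling, and the number of samples are assumptions made for this example.

```python
import numpy as np


def paf_limb_score(paf_x, paf_y, p_a, p_b, num_samples=10):
    """Score a candidate limb from keypoint p_a to keypoint p_b.

    paf_x, paf_y: (H, W) arrays holding the x and y components of the field.
    p_a, p_b:     (x, y) pixel coordinates of the two candidate keypoints.
    Returns the mean dot product between the field and the unit vector
    pointing from p_a to p_b, sampled along the segment.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    direction = p_b - p_a
    length = np.linalg.norm(direction)
    if length < 1e-8:
        return 0.0  # degenerate segment: no direction to compare against
    unit = direction / length

    # Sample points evenly along the segment and read the nearest pixel.
    ts = np.linspace(0.0, 1.0, num_samples)
    points = p_a[None, :] + ts[:, None] * direction[None, :]
    xs = np.clip(np.rint(points[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.rint(points[:, 1]).astype(int), 0, paf_x.shape[0] - 1)

    dots = paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]
    return float(dots.mean())


# Toy field pointing in the +x direction everywhere.
field_x = np.ones((20, 20))
field_y = np.zeros((20, 20))
aligned = paf_limb_score(field_x, field_y, (2, 5), (12, 5))
reversed_score = paf_limb_score(field_x, field_y, (12, 5), (2, 5))
perpendicular = paf_limb_score(field_x, field_y, (5, 2), (5, 12))
```

In a bottom-up pipeline, scores of this kind are computed for all candidate keypoint pairs of a limb type, and a matching step then selects the pairs that form each person's skeleton.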
Datasets and metrics
COCO Keypoints is the dominant benchmark for 2D multi-person pose estimation, with 17 annotated keypoints per person. MPII Human Pose covers a wide range of everyday activities, and CrowdPose stress-tests models on crowded scenes. Human3.6M, 3DPW, and AMASS support 3D pose research. Average Precision (AP) based on Object Keypoint Similarity (OKS) is the standard metric for COCO; for 3D pose, Mean Per-Joint Position Error (MPJPE) is widely reported. Procrustes-aligned MPJPE (PA-MPJPE) first aligns the prediction to the ground truth, removing global rotation and translation (and, in the usual formulation, scale) to isolate intrinsic skeletal accuracy.
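The metrics above can be sketched in a few lines of NumPy. The OKS function follows the usual form, a Gaussian falloff of each keypoint's squared error normalised by object area and a per-keypoint constant, averaged over labelled keypoints. The Procrustes step here solves for rotation, translation, and uniform scale, which is a common choice but an assumption of this sketch. All function names are illustrative, and the constants used in the example are hypothetical rather than the official COCO values.

```python
import numpy as np


def object_keypoint_similarity(pred, gt, visible, area, k):
    """OKS between predicted and ground-truth 2D keypoints of one person.

    pred, gt: (N, 2) pixel coordinates.
    visible:  (N,) boolean mask of labelled ground-truth keypoints.
    area:     object area in squared pixels (the scale term s**2).
    k:        (N,) per-keypoint constants controlling the falloff.
    """
    pred, gt, k = (np.asarray(a, dtype=float) for a in (pred, gt, k))
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    d2 = ((pred - gt) ** 2).sum(axis=1)
    similarity = np.exp(-d2 / (2.0 * area * k ** 2))
    return float(similarity[visible].mean())


def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding 3D joints."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())


def procrustes_align(pred, gt):
    """Align pred to gt with the best rotation, translation, and scale."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation.
    d = -1.0 if np.linalg.det(vt.T @ u.T) < 0 else 1.0
    signs = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(signs) @ u.T
    scale = (s * signs).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)


# Example: a skeleton that is rotated, scaled, and shifted but otherwise exact.
gt_3d = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]])
c, s_ = np.cos(0.3), np.sin(0.3)
rot_z = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred_3d = 1.5 * gt_3d @ rot_z.T + np.array([0.1, -0.2, 0.3])
raw_error = mpjpe(pred_3d, gt_3d)         # large: global misalignment
aligned_error = pa_mpjpe(pred_3d, gt_3d)  # close to zero after alignment
```

The example shows why the two 3D metrics are reported together: MPJPE penalises the global misalignment, while the Procrustes-aligned variant reports an error near zero because the skeleton's internal structure is correct.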
Common challenges
Occlusion, motion blur, unusual viewpoints, scale variation, clothing variability, and crowded scenes remain difficult. 3D pose estimation from a single monocular view is inherently ambiguous, because many different 3D configurations project to the same 2D image, so it benefits from multi-view input or temporal models. Generalisation across body types, age groups, and ethnicities is a recognised fairness concern, with several datasets and audits flagging performance gaps for under-represented demographics.
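The monocular ambiguity, and the way a second view resolves it, can be shown with a small NumPy sketch. Two 3D points on the same viewing ray project to the same pixel in one camera, while linear (DLT) triangulation from two calibrated views recovers the original point. The camera matrices, the one-unit baseline, and the function names are hypothetical values chosen for this example.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X (3,) to pixels with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate(projections, pixels):
    """Linear (DLT) triangulation of one keypoint from two or more views."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the 3D point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


# Hypothetical intrinsics and two cameras separated by one unit along x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

joint = np.array([0.5, 0.2, 4.0])
nearer = 0.5 * joint  # lies on the same ray through the first camera

same_pixel_view1 = np.allclose(project(P1, joint), project(P1, nearer))
same_pixel_view2 = np.allclose(project(P2, joint), project(P2, nearer))
recovered = triangulate([P1, P2], [project(P1, joint), project(P2, joint)])
```

Here the first camera cannot distinguish the two points, the second camera can, and triangulation recovers the joint position. With noisy real detections the linear solution is typically only a starting point for further refinement.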
Applications
Pose estimation powers fitness coaching applications, physical therapy and rehabilitation tools, sports performance analytics, motion capture for animation and gaming, augmented reality avatars and try-on, surveillance and crowd analytics, ergonomic assessment in workplaces, gesture-based interfaces, sign language recognition, and human-robot interaction. Industrial uses include pose-based safety monitoring in manufacturing and construction.
References
- Cao, Z. et al. (2017). Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. CVPR.
- Sun, K. et al. (2019). Deep High-Resolution Representation Learning for Human Pose Estimation. CVPR.
- Lin, T.-Y. et al. (2014). Microsoft COCO: Common Objects in Context. ECCV.
- Google. (2023). MediaPipe Pose Landmarker Guide. developers.google.com/mediapipe.