Computer Vision
Computer vision is the interdisciplinary field of artificial intelligence that enables machines to extract meaning from, and act upon, visual information such as images, video, and depth data. It combines techniques from image processing, pattern recognition, and deep learning to perform tasks that require visual understanding, from reading handwritten digits to detecting tumours in X-rays.
Core Tasks
- Image classification — assigning a single label to an image (e.g., "cat", "dog")
- Object detection — identifying and localising multiple objects with bounding boxes (YOLO, Faster R-CNN)
- Semantic segmentation — classifying every pixel in an image
- Instance segmentation — segmenting each individual object instance
- Pose estimation — inferring the spatial arrangement of body joints
- Optical character recognition (OCR) — recognising text in images
- Face recognition — identifying individuals from facial geometry
- Depth estimation — inferring 3D structure from 2D images
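As a rough illustration of how the outputs of these tasks differ, the sketch below uses random scores in place of a real model. The class count, image size, and all variable names are assumptions made for the example, not part of any particular library or system.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 3        # hypothetical label set, e.g. background / cat / dog
height, width = 4, 6   # toy image size chosen for the example

# Image classification: one score per class for the whole image,
# reduced to a single label.
class_scores = rng.random(num_classes)
image_label = int(np.argmax(class_scores))

# Semantic segmentation: one score per class at every pixel,
# reduced to a label map with the same height and width as the image.
pixel_scores = rng.random((num_classes, height, width))
label_map = np.argmax(pixel_scores, axis=0)

# Object detection: a variable-length list of boxes, each with a label.
# Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates (made-up values).
detections = [
    {"label": 1, "box": (0, 0, 3, 2)},
    {"label": 2, "box": (2, 1, 6, 4)},
]

print(image_label)        # a single integer class index
print(label_map.shape)    # (4, 6): one class index per pixel
print(len(detections))    # number of detected objects
```

Instance segmentation would extend the detection output by attaching a separate pixel mask to each detected object, rather than one shared label map.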
Architectures
Convolutional Neural Networks (CNNs) dominated computer vision from 2012 (AlexNet's ImageNet win) through the early 2020s. Landmark CNN architectures include VGG, ResNet, EfficientNet, and MobileNet.
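The operation that gives CNNs their name can be sketched in a few lines. The following is a minimal, unoptimised NumPy illustration of sliding a small filter over a single-channel image. The function name `conv2d_valid`, the toy image, and the hand-picked edge filter are assumptions for the example; real CNNs learn their filter values and use optimised library routines.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the products.

    Like most deep learning layers, this computes cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half (a vertical edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Hand-picked 3x3 filter that responds to left-to-right brightness increases.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d_valid(image, kernel)
print(response.shape)  # (3, 4)
print(response[0])     # [0. 3. 3. 0.]
```

The response is non-zero only where the window straddles the edge, which is the sense in which a convolutional filter acts as a local pattern detector.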
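Vision Transformers (ViT) split an image into fixed-size patches, embed each patch as a token, and apply the transformer's self-attention mechanism across those tokens. When pre-trained on large datasets, they match or exceed CNNs on many image recognition benchmarks. Because images become token sequences, ViTs combine naturally with text encoders, as in CLIP; multimodal models such as Flamingo and GPT-4V pursue the same image-text integration.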
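The patch step can be sketched with plain NumPy. This is a minimal illustration of turning an image into a sequence of flattened patches, assuming a 224x224 RGB input and 16x16 patches (the patch size from the ViT paper title). The function name `image_to_patches` is hypothetical, and the learned linear projection, position embeddings, and attention layers of a full ViT are omitted.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Separate the patch grid axes from the within-patch axes.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed input size for the example: 224x224 RGB.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows would then be linearly projected to the model dimension and treated like a word token in a text transformer.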
Applications
| Sector | Application |
|--------|-------------|
| Healthcare | Radiology AI, pathology slide analysis, diabetic retinopathy screening |
| Autonomous vehicles | Road scene understanding, obstacle detection, lane tracking |
| Manufacturing | Defect detection, quality control, assembly verification |
| Security | Surveillance, access control, crowd analytics |
| Agriculture | Crop disease detection, yield estimation, drone-based field monitoring |
| Retail | Self-checkout, inventory management, customer analytics |
References
- Krizhevsky, A. et al. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS 2012.
- Dosovitskiy, A. et al. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021.
- Aerodyne Group (2024). Annual Report 2024. Aerodyne Group Sdn Bhd.