Optical Character Recognition
A computer vision technology that converts images of typed, handwritten, or printed text into machine-readable digital text, increasingly powered by deep learning and transformer-based vision models.
Optical Character Recognition, abbreviated OCR, is the technology that converts images containing text — whether scanned documents, photographed receipts, screenshots, or PDFs — into machine-readable digital text. OCR has been an active area of research since the 1950s and has progressed through three distinct technological generations: rule-based pattern matching, statistical machine learning, and modern deep learning. Contemporary OCR systems, built on convolutional neural networks and vision transformers, regularly achieve accuracy rates above 96% on a broad mix of document types.
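Accuracy figures of this kind are often reported at character level, as one minus the character error rate (CER): the edit distance between the recognised text and a ground-truth transcription, divided by the length of the ground truth. The source does not say which metric the figure above uses, so the following is only a minimal sketch of that one common definition; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Character-level accuracy, defined here as 1 - CER, floored at 0."""
    if not reference:
        raise ValueError("reference must be non-empty")
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    truth = "Invoice total: 42.50"
    ocr_output = "lnvoice tota1: 42.50"  # two typical confusions: I/l and l/1
    print(character_accuracy(truth, ocr_output))
```

In the usage example the OCR output differs from the 20-character ground truth by two substitutions, which corresponds to a character accuracy of 90%. Word-level accuracy computed on the same pair would be much lower, which is one reason headline figures are hard to compare across systems.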
How modern OCR works
A complete OCR pipeline performs several steps. The system first preprocesses the input image to correct skew, remove noise, and binarise where appropriate. It then performs layout analysis to identify regions of interest such as paragraphs, tables, figures, and form fields. Text detection localises individual lines or words, and text recognition transcribes the localised regions into a sequence of characters. A post-processing stage applies language models, dictionaries, and structured-output decoding to correct errors and produce semantically meaningful output.
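As an illustration of the dictionary part of post-processing, the sketch below snaps each recognised token to its closest entry in a small vocabulary using the standard-library `difflib` module. The function name, the vocabulary, and the similarity cutoff of 0.75 are assumptions chosen for this example; production systems generally rely on language models and confusion-aware scoring rather than plain string similarity.

```python
import difflib


def correct_tokens(tokens, vocabulary, cutoff=0.75):
    """Replace each token with its closest vocabulary entry, if one is
    similar enough; otherwise keep the token unchanged."""
    known = set(vocabulary)
    corrected = []
    for token in tokens:
        if token in known:
            corrected.append(token)  # already a valid word
            continue
        # get_close_matches ranks candidates by SequenceMatcher ratio
        # and drops any whose ratio is below the cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


if __name__ == "__main__":
    vocab = ["invoice", "total", "amount", "date"]
    raw = ["lnvoice", "tota1", "42.50"]
    print(correct_tokens(raw, vocab))
```

Tokens with no sufficiently similar vocabulary entry, such as the amount in the example, pass through untouched. That matters because numbers and identifiers should not be "corrected" toward dictionary words.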
Modern systems use deep learning at every stage. The detection stage typically uses architectures derived from DBNet, EAST, or DETR. The recognition stage relies on CRNN, TrOCR (a transformer-based OCR model), or end-to-end multimodal vision-language models that handle detection and recognition jointly. Vision transformers and document-understanding models such as LayoutLM, Donut, Pix2Struct, and Nougat have largely replaced the older two-stage pipelines for many enterprise workloads.
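CRNN-style recognisers are typically trained with connectionist temporal classification (CTC), in which the network emits one score vector per horizontal time step and a decoder turns that sequence into text. The sketch below shows greedy CTC decoding only: take the best class at each step, collapse consecutive repeats, then drop the blank symbol. The score values and the convention that index 0 is the blank are assumptions for illustration; real systems often use beam search with a language model instead.

```python
def ctc_greedy_decode(scores, alphabet, blank=0):
    """Greedy CTC decoding.

    scores   -- one list of per-class scores for each time step
    alphabet -- string where alphabet[i] is the character for class i
                (the entry at the blank index is a placeholder)
    blank    -- index of the CTC blank class
    """
    # 1. Best class per time step (argmax).
    best_path = [max(range(len(step)), key=step.__getitem__) for step in scores]

    # 2. Collapse consecutive repeats, then 3. remove blanks.
    chars = []
    previous = None
    for idx in best_path:
        if idx != previous and idx != blank:
            chars.append(alphabet[idx])
        previous = idx
    return "".join(chars)


if __name__ == "__main__":
    alphabet = "-ab"  # class 0 is the blank, shown as "-"
    scores = [
        [0.1, 0.8, 0.1],    # a
        [0.2, 0.7, 0.1],    # a (repeat, collapsed)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.6, 0.3],    # a (new character, because a blank intervened)
        [0.1, 0.2, 0.7],    # b
        [0.1, 0.1, 0.8],    # b (repeat, collapsed)
        [0.9, 0.05, 0.05],  # blank
    ]
    print(ctc_greedy_decode(scores, alphabet))
```

The blank between the second and third "a" steps is what allows a genuinely doubled letter to survive the collapse, so the example decodes to "aab" rather than "ab".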
Capabilities and limitations
A modern OCR system can handle printed text in dozens of scripts, cursive handwriting with reduced accuracy, structured forms, multi-column layouts, mixed-language documents, low-resolution photographs, and scene text from natural images. The technology remains imperfect on heavily degraded documents, unusual fonts, complex mathematical notation, and tightly handwritten free-form text. Tables with merged cells, footnotes, and forms with overlapping fields continue to challenge even leading systems.
Document AI and structured extraction
OCR is often a component of a broader document-AI pipeline rather than the end product. Enterprise systems increasingly combine OCR with named entity recognition, key-value extraction, and large language model post-processing to convert documents into structured records — invoices into rows in an accounting system, identity documents into customer profiles, contracts into negotiable clauses.
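The key-value extraction step can be sketched with plain regular expressions over OCR output. The field names, the patterns, and the sample invoice text below are all hypothetical and stand in for whatever schema a real system targets; commercial document-AI services generally use learned, layout-aware extractors rather than hand-written patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceRecord:
    invoice_number: Optional[str]
    invoice_date: Optional[str]
    total: Optional[float]


# Hypothetical patterns for one simple invoice layout.
_NUMBER = re.compile(r"\binvoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE)
_DATE = re.compile(r"\bdate\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)
# \b keeps "Subtotal" from being mistaken for "Total".
_TOTAL = re.compile(r"\btotal\s*[:\-]?\s*([0-9][0-9,]*\.\d{2})", re.IGNORECASE)


def extract_invoice(ocr_text: str) -> InvoiceRecord:
    """Map free-form OCR text to a structured record; fields that are
    not found are left as None instead of being guessed."""
    number = _NUMBER.search(ocr_text)
    date = _DATE.search(ocr_text)
    total = _TOTAL.search(ocr_text)
    return InvoiceRecord(
        invoice_number=number.group(1) if number else None,
        invoice_date=date.group(1) if date else None,
        # Strip thousands separators before converting to a number.
        total=float(total.group(1).replace(",", "")) if total else None,
    )


if __name__ == "__main__":
    sample = "Invoice No: INV-1042\nDate: 2024-03-15\nSubtotal: 1,000.00\nTotal: 1,250.00"
    print(extract_invoice(sample))
```

Leaving missing fields as `None` keeps extraction failures visible to downstream systems, which can then route the document to manual review.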
| Provider | Notable OCR offering |
|----------|----------------------|
| Google | Document AI, Cloud Vision OCR |
| AWS | Textract |
| Microsoft | Azure AI Document Intelligence |
| ABBYY | FineReader, Vantage |
| Open-source | Tesseract, PaddleOCR, TrOCR |
In December 2025, Mistral AI released Mistral OCR 3, a smaller open-weight OCR model designed for structured document understanding at scale. The OCR market is projected to exceed forty-three billion US dollars by 2032.
Open-source landscape
Tesseract, originally developed at Hewlett-Packard and open-sourced in 2005, was developed under Google's sponsorship from 2006 and is now maintained by a community of contributors; it remains the most widely used open-source OCR engine for printed text. PaddleOCR, developed by Baidu, has overtaken Tesseract in many production deployments thanks to stronger Asian-language support and better handwriting handling. EasyOCR, docTR, and Surya provide lighter-weight Python-native alternatives. Transformer-based models such as TrOCR and Donut are increasingly preferred for difficult or structured documents.
References
- Smith, R. (2007). An Overview of the Tesseract OCR Engine. ICDAR 2007.
- Li, M., et al. (2023). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. AAAI 2023. (Preprint: arXiv, 2021.)
- Mistral AI. (2025). Mistral OCR 3 Model Card. Mistral AI.
- Inland Revenue Board of Malaysia. (2024). MyInvois e-Invoicing Implementation Guidelines. LHDN.
- Bank Negara Malaysia. (2023). Electronic Know-Your-Customer (e-KYC) Policy Document. BNM.