TensorFlow Lite
TensorFlow Lite is an open-source deep learning framework from Google for running optimised machine learning models on mobile phones, microcontrollers, and other edge devices.
TensorFlow Lite is the on-device runtime and toolchain in the TensorFlow ecosystem, designed to execute machine learning models efficiently on resource-constrained hardware such as Android and iOS smartphones, single-board computers, microcontrollers, and other embedded systems. It was first released by Google in 2017 as a successor to TensorFlow Mobile and was rebranded as LiteRT in 2024 to reflect its expanded support for models authored in PyTorch, JAX, and Keras in addition to TensorFlow.
Purpose and design goals
TensorFlow Lite is built for three operational requirements that distinguish edge inference from cloud serving: low latency, small binary footprint, and offline operation. The runtime is a fraction of the size of full TensorFlow and is small enough to be bundled as a library inside a mobile app. By executing models locally, applications avoid the round-trip cost of cloud inference, continue to function without connectivity, and keep raw user data such as photos, voice, and biometrics on the device.
Workflow
The standard workflow has three stages. First, a model is trained in TensorFlow, Keras, JAX, or (via the AI Edge Torch converter) PyTorch. Second, the trained model is converted to the FlatBuffers-based .tflite format; the converter can apply optimisations such as post-training integer quantisation and operator fusion, while weight pruning, if used, is applied earlier, during training or fine-tuning. Third, the .tflite file is bundled with the application and loaded by the LiteRT interpreter, which executes the model on the CPU by default or on a configured accelerator backend.
The interpreter supports a delegate mechanism that offloads computation to hardware accelerators: the GPU delegate uses OpenGL ES, OpenCL, or Metal; the NNAPI delegate routes through Android's Neural Networks API; the Core ML delegate targets the Apple Neural Engine on iOS; and the Hexagon delegate uses Qualcomm DSPs. Vendor-specific delegates also exist for the Edge TPU, MediaTek APUs, and other silicon.
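The delegate idea can be illustrated with a small conceptual model. The sketch below is not the LiteRT API: the `partition` function, the linear op list, and the `CUSTOM_NMS` op are hypothetical, and a real model graph is a DAG rather than a list. It assumes the commonly documented behaviour that operations a delegate does not support remain on the CPU, which splits the graph into alternating segments.

```python
# Conceptual model only: hypothetical names, not the LiteRT delegate API.

def partition(ops, delegate_supported):
    """Group consecutive ops into (backend, [ops]) segments.

    Ops the delegate supports go to "delegate"; everything else
    falls back to "cpu".
    """
    segments = []
    for op in ops:
        backend = "delegate" if op in delegate_supported else "cpu"
        if segments and segments[-1][0] == backend:
            # Same backend as the previous op: extend the current segment.
            segments[-1][1].append(op)
        else:
            # Backend changed: start a new segment.
            segments.append((backend, [op]))
    return segments


# A toy model, written as a linear sequence of op names.
graph = ["CONV_2D", "RELU", "CUSTOM_NMS", "FULLY_CONNECTED", "SOFTMAX"]

# Assumed set of ops that a hypothetical GPU delegate can run.
gpu_ops = {"CONV_2D", "RELU", "FULLY_CONNECTED", "SOFTMAX"}

plan = partition(graph, gpu_ops)
for backend, segment in plan:
    print(backend, segment)
```

In this toy plan the single unsupported op splits the model into three segments, so tensors would cross the CPU/accelerator boundary twice. Each crossing costs time, which is one reason a model made up entirely of delegate-supported ops tends to benefit most from acceleration.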
Optimisation techniques
| Technique | Effect |
|---|---|
| Post-training quantisation | Convert float32 weights to int8 or float16; 2-4x size reduction, modest accuracy loss |
| Quantisation-aware training | Train with simulated quantisation to preserve accuracy |
| Pruning | Zero out small-magnitude weights; combine with sparse kernels for speedup |
| Operator fusion | Merge adjacent ops (e.g., conv + bias + ReLU) at conversion time |
| Selective build | Strip unused operators from the binary |
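The arithmetic behind int8 post-training quantisation can be sketched in plain Python. This is a minimal illustration of affine (scale and zero-point) quantisation in the style described by Jacob et al. (2018), not the converter's actual implementation. The function names and sample weights are made up for the example, and a real converter may use per-channel scales or symmetric ranges instead.

```python
import struct


def quantize_int8(values):
    """Affine-quantise floats to int8. Returns (q_values, scale, zero_point)."""
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # 255 steps span the int8 range [-128, 127]; guard against a zero range.
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = max(-128, min(127, round(-128 - lo / scale)))
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q_values, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [scale * (q - zero_point) for q in q_values]


# Hypothetical weights for illustration.
weights = [-1.2, -0.4, 0.0, 0.3, 0.75, 2.1]

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 value.
float_bytes = len(struct.pack(f"<{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"<{len(q)}b", *q))

# Worst-case round-trip error across the sample weights.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(float_bytes, int8_bytes)  # 24 6
print(max_error <= scale)
```

The 4x storage saving comes purely from the narrower data type. The accuracy cost is the rounding error, which for these sample weights stays within one quantisation step (`scale`).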
TensorFlow Lite for Microcontrollers
A subset of the runtime, TensorFlow Lite for Microcontrollers (TFLite Micro), targets devices with only a few kilobytes of RAM. It dispenses with dynamic memory allocation, supports a curated operator set, and underpins the TinyML movement, where keyword spotting, gesture recognition, and anomaly detection run on Arm Cortex-M and ESP32-class chips drawing milliwatts of power.
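Avoiding dynamic allocation usually means carving every tensor out of one fixed, caller-supplied buffer, which TFLite Micro calls the tensor arena. The sketch below models that idea in Python purely for readability; the real runtime is written in C++, and the `TensorArena` class, its 16-byte alignment, and the buffer sizes are illustrative assumptions. It also omits the buffer reuse that a real memory planner performs.

```python
class TensorArena:
    """Bump allocator over a fixed buffer.

    All memory is reserved once, up front. Allocation only advances
    an offset, so there is no heap use and no fragmentation.
    """

    def __init__(self, size_bytes, alignment=16):
        self.buffer = bytearray(size_bytes)  # the only allocation ever made
        self.alignment = alignment
        self.offset = 0

    def allocate(self, nbytes):
        # Round the current offset up to the next aligned address.
        start = -(-self.offset // self.alignment) * self.alignment
        end = start + nbytes
        if end > len(self.buffer):
            # On a microcontroller this would be a hard setup-time failure.
            raise MemoryError(
                f"arena exhausted: need {end} bytes, have {len(self.buffer)}"
            )
        self.offset = end
        # Return a view into the arena rather than a copy.
        return memoryview(self.buffer)[start:end]


# Hypothetical sizes: a 1 KiB arena holding an input and an output tensor.
arena = TensorArena(1024)
input_tensor = arena.allocate(490)
scores = arena.allocate(4)

print(arena.offset)  # 500: the second tensor starts at the aligned offset 496
```

Because the arena size is fixed when the application is built, a model that does not fit fails immediately during setup rather than at some unpredictable point during inference.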
Use cases
Common deployments include real-time computer vision in mobile camera apps, on-device speech recognition and keyword spotting, gesture and pose detection in fitness apps, OCR and document scanning, on-device translation, predictive text and smart reply, and physical-world inference in wearables, drones, and industrial sensors.
References
- Google. LiteRT (formerly TensorFlow Lite) Documentation.
- David, R. et al. (2021). TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. MLSys.
- Google Developers Blog. (2024). AI Edge Torch: High Performance Inference of PyTorch Models on Mobile Devices.
- Jacob, B. et al. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. CVPR.
- Lee, J. et al. (2019). On-device neural net inference with mobile GPUs. arXiv:1907.01989.