AIWiki
Malaysia

Groq

Groq is an American AI inference company that developed the Language Processing Unit (LPU), a custom silicon architecture optimised for high-throughput, low-latency inference of large language models using on-chip SRAM rather than external DRAM.

5 min readLast updated June 2026Companies & Tools

Groq is an American artificial intelligence company founded in 2016 that designs and manufactures the Language Processing Unit (LPU), a custom silicon chip architecture built specifically for performing inference on large language models at high speed and low latency. Groq also operates GroqCloud, a public inference API that allows developers to run open-weight LLMs — including Meta's Llama series and Mistral models — at speeds substantially faster than those achievable on conventional GPU hardware.

The Language Processing Unit

The LPU is the core technology distinguishing Groq from conventional AI hardware providers. Where graphics processing units (GPUs) are designed for parallel floating-point computation across thousands of cores and rely on high-bandwidth DRAM for memory, the LPU architecture uses on-chip SRAM for its entire working set, eliminating the memory bandwidth bottleneck that constrains GPU inference throughput.

Traditional GPU-based inference suffers from a fundamental memory-wall problem: the time spent loading model weights from DRAM into compute units often exceeds the actual computation time, meaning utilisation of the compute silicon is low. Groq's LPU addresses this by sizing the on-chip SRAM to hold the entire model weight matrix for the models it targets, allowing the silicon to perform matrix multiplications against weights that are already resident on chip rather than streaming them from external memory.

Additionally, the LPU employs a deterministic execution model rather than the dynamic scheduling used by GPUs. Because the execution schedule of every operation is fixed at compile time, there is no runtime overhead from task scheduling, and the chip can produce tokens at a predictable, consistent rate. This determinism is particularly valuable for latency-sensitive applications such as voice interfaces, real-time coding assistants, and interactive agents.

Performance Characteristics

Groq has publicly demonstrated token generation rates exceeding 800 tokens per second for smaller LLMs and competitive rates for 70-billion-parameter-class models — figures that compare favourably to GPU-based serving on high-end NVIDIA hardware. The low latency of LPU inference — often returning the first token in under 100 milliseconds — is practically significant for user-facing applications where perceived responsiveness matters.

On an energy-efficiency basis, Groq claims the LPU can perform inference at up to ten times the energy efficiency of equivalent GPU deployments, which has implications for total cost of ownership in large-scale deployment scenarios where power consumption is a significant operating cost.

GroqCloud

GroqCloud is Groq's public inference API, providing access to open-weight models including Llama 3, Mixtral, Gemma, and Whisper. The service uses a REST API compatible with the OpenAI API format, making it straightforward for developers to switch from other providers with minimal code changes. GroqCloud offers a free tier with rate limits and paid tiers for higher throughput, serving a developer community that spans individual researchers, startups, and enterprise teams evaluating LLM latency requirements.

Industry Developments

The commercial and technical significance of Groq's LPU architecture was underscored in 2026 when NVIDIA reached a licensing agreement for the technology. Groq continues to operate as an independent inference cloud. The architecture may also appear in future NVIDIA hardware products under the licence terms, representing a notable validation of Groq's approach to AI inference silicon.

Comparison with GPU-Based Inference

The GPU remains the dominant hardware platform for AI training and many inference workloads because of its programming flexibility, software ecosystem maturity (particularly CUDA), and support for a wide range of model architectures and sizes. The LPU's advantages are most pronounced for auto-regressive text generation with models whose weight matrices fit within on-chip SRAM — a category that covers most current LLMs in the 7B to 70B parameter range. For very large models that do not fit on-chip, or for training workloads that require frequent gradient computations, GPU clusters remain the preferred infrastructure.

References

  1. Groq. (2024). The Groq LPU Explained. groq.com.
  2. Groq. (2024). Inside the LPU: Deconstructing Groq's Speed. groq.com.
  3. Voiceflow. (2026). Groq AI in 2026: Nvidia Deal, LPU Architecture, GroqCloud, and What It Means for Builders. voiceflow.com.
  4. Introl. (2025). Groq LPU Infrastructure: Ultra-Low Latency AI Inference. introl.com.
  5. NVIDIA. (2026). NVIDIA Groq 3 LPX: Inference Accelerator for Agentic AI. NVIDIA Technical Blog.