Search Results
6 results for “latency”
Edge AI
Edge AI is the deployment of artificial intelligence algorithms and inference workloads directly on local devices or edge computing nodes rather than in centralised cloud data centres, enabling low-latency, privacy-preserving, and bandwidth-efficient AI applications.
Inference (Machine Learning)
Inference is the phase in which a trained machine learning model is used to generate predictions or outputs from new input data, distinct from the earlier training phase.
Model Pruning
A model compression technique that removes redundant or low-importance parameters from a neural network to reduce size, memory footprint, and inference latency while preserving accuracy.
Model Serving
Model serving is the discipline of deploying trained machine learning models behind APIs or runtimes so that production applications can request predictions at scale with predictable latency, throughput, and reliability.
Neural Architecture Search
Neural architecture search is the automated design of neural network architectures using search algorithms, reinforcement learning, or gradient-based methods to discover models that meet target accuracy, latency, and size constraints.
Pinecone
Pinecone is a managed, cloud-native vector database designed for storing high-dimensional embeddings and serving low-latency similarity search for retrieval-augmented AI applications.