Search Results
11 results for “memory”
AI Memory
AI memory refers to the mechanisms that allow artificial intelligence agents to retain, retrieve, and use information across interactions, extending capability beyond a single context window.
Autonomous Agents
Autonomous AI agents are software systems that use large language models as a reasoning core, enabling them to plan multi-step tasks, use external tools, maintain memory, and take actions to achieve goals with minimal human intervention.
FlashAttention
FlashAttention is an IO-aware exact attention algorithm that restructures the standard attention computation into memory-efficient tiled blocks, dramatically reducing GPU memory usage and wall-clock time for transformer models on long sequences.
KV Cache
A KV cache (key-value cache) is an optimisation used in transformer inference that keeps previously computed key and value tensors from the attention mechanism in memory, avoiding recomputation for earlier tokens when generating tokens sequentially.
LangChain
LangChain is an open-source framework for building applications powered by large language models, providing composable abstractions for chaining LLM calls with tools, memory, and data retrieval in Python and JavaScript.
Long Short-Term Memory (LSTM)
Long Short-Term Memory is a recurrent neural network architecture designed to learn long-range dependencies in sequential data by using gating mechanisms to control information flow.
Model Compression
Model compression is a set of techniques that reduce the size, memory footprint, and computational cost of machine learning models while largely preserving predictive accuracy, enabling deployment on resource-constrained hardware.
Model Pruning
Model pruning is a model compression technique that removes redundant or low-importance parameters from a neural network to reduce size, memory footprint, and inference latency while preserving accuracy.
Parameter-Efficient Fine-Tuning
Parameter-efficient fine-tuning is a family of techniques that adapt a pretrained language or vision model to a downstream task by training only a small fraction of its parameters, dramatically reducing compute, memory, and storage requirements compared to full fine-tuning.
Quantisation
Quantisation is a model compression technique that reduces the numerical precision of a neural network's weights and activations from high-bit floating-point formats to lower-bit representations, decreasing memory usage and accelerating inference with minimal accuracy loss.
TinyML
TinyML is a field of machine learning focused on running models on microcontrollers and other resource-constrained edge devices that typically operate with milliwatts of power and kilobytes of memory.