AIWiki

Search Results

11 results for "memory"

Applications

AI Memory

AI memory refers to the mechanisms that allow artificial intelligence agents to retain, retrieve, and use information across interactions, extending capability beyond a single context window.

5 min read · Updated June 2026
Applications

Autonomous Agents

Autonomous AI agents are software systems that use large language models as a reasoning core, enabling them to plan multi-step tasks, use external tools, maintain memory, and take actions to achieve goals with minimal human intervention.

6 min read · Updated May 2026
Foundations

FlashAttention

FlashAttention is an IO-aware exact attention algorithm that restructures the standard attention computation into memory-efficient tiled blocks, dramatically reducing GPU memory usage and wall-clock time for transformer models on long sequences.

6 min read · Updated June 2026
Infrastructure

KV Cache

A KV cache (key-value cache) is a memory optimisation used in transformer inference that stores the key and value tensors already computed by the attention mechanism for earlier tokens, so they are not recomputed at each step when generating tokens sequentially.

6 min read · Updated June 2026
Infrastructure

LangChain

LangChain is an open-source framework for building applications powered by large language models, providing composable abstractions for chaining LLM calls with tools, memory, and data retrieval in Python and JavaScript.

6 min read · Updated May 2026
Foundations

Long Short-Term Memory (LSTM)

Long Short-Term Memory is a recurrent neural network architecture designed to learn long-range dependencies in sequential data by using gating mechanisms to control information flow.

5 min read · Updated May 2026
Infrastructure

Model Compression

Model compression is a set of techniques that reduce the size, memory footprint, and computational cost of machine learning models while preserving predictive accuracy, enabling deployment on resource-constrained hardware.

6 min read · Updated June 2026
Infrastructure

Model Pruning

Model pruning is a model compression technique that removes redundant or low-importance parameters from a neural network to reduce size, memory footprint, and inference latency while preserving accuracy.

6 min read · Updated June 2026
Infrastructure

Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning is a family of techniques that adapt a pretrained language or vision model to a downstream task by training only a small fraction of its parameters, dramatically reducing compute, memory, and storage requirements compared to full fine-tuning.

5 min read · Updated May 2026
Infrastructure

Quantisation

Quantisation is a model compression technique that reduces the numerical precision of a neural network's weights and activations from high-bit floating-point formats to lower-bit representations, decreasing memory usage and accelerating inference with minimal accuracy loss.

7 min read · Updated May 2026
Foundations

TinyML

TinyML is a field of machine learning focused on running models on microcontrollers and other resource-constrained edge devices that typically operate with milliwatts of power and kilobytes of memory.

6 min read · Updated May 2026