Search Results
4 results for “optimization”
Direct Preference Optimization
Direct Preference Optimization (DPO) is a stable, computationally efficient algorithm for aligning large language models with human preferences by directly optimising a policy from comparison data, without training a separate reward model or using reinforcement learning.
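As a rough illustration of what "directly optimising a policy from comparison data" can look like, the sketch below computes the commonly stated DPO loss for a single preference pair. The function and argument names are hypothetical, the inputs are assumed to be summed log-probabilities of each full response, and `beta=0.1` is an illustrative default rather than a recommended value.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is assumed to be the summed log-probability of a full
    response under the trainable policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy shifts probability towards the chosen response relative to the reference.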
Flash Attention
FlashAttention is an IO-aware exact attention algorithm that computes standard attention in tiles small enough to stay in fast on-chip GPU memory, so the full attention matrix is never written to high-bandwidth memory. This reduces attention memory from quadratic to linear in sequence length and lowers wall-clock time for transformer models on long sequences.
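The tiling idea depends on an "online softmax" that lets attention be accumulated one block of keys at a time while still giving the exact result. The sketch below shows only that rescaling trick for a single query vector in plain Python; it says nothing about GPU memory behaviour, and all names and the `block_size` parameter are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(query, keys, values):
    """Reference: softmax over all scores at once."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * dot(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(query, keys, values, block_size=2):
    """Same result, processing keys/values one block at a time."""
    scale = 1.0 / math.sqrt(len(query))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalised weighted sum of values
    for start in range(0, len(keys), block_size):
        key_block = keys[start:start + block_size]
        value_block = values[start:start + block_size]
        scores = [scale * dot(query, k) for k in key_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, value_block))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because only one block of scores exists at any time, the per-query working memory is independent of sequence length, which is the property the tiled GPU kernels exploit.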
KV Cache
A KV cache (key-value cache) is an optimisation used in transformer inference that trades memory for compute: it stores the key and value tensors already computed by the attention mechanism for earlier tokens, eliminating redundant recomputation when generating tokens sequentially.
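To make the saving concrete, the toy sketch below decodes a short sequence twice, once recomputing every key and value at each step and once appending to a cache, and counts the projections performed. `project_kv` is a hypothetical stand-in for a model's learned key and value projections, and each token's embedding doubles as its query for simplicity.

```python
import math


def project_kv(embedding):
    """Toy stand-in for the key and value projections of one token."""
    key = [2.0 * x for x in embedding]
    value = [x + 1.0 for x in embedding]
    return key, value


def attend(query, keys, values):
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def decode_without_cache(embeddings):
    """Recompute keys/values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(embeddings) + 1):
        pairs = [project_kv(e) for e in embeddings[:step]]
        projections += len(pairs)
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(embeddings[step - 1], keys, values))
    return outputs, projections


def decode_with_cache(embeddings):
    """Project each token once and reuse cached keys/values afterwards."""
    cached_keys, cached_values = [], []
    outputs, projections = [], 0
    for embedding in embeddings:
        key, value = project_kv(embedding)
        projections += 1
        cached_keys.append(key)
        cached_values.append(value)
        outputs.append(attend(embedding, cached_keys, cached_values))
    return outputs, projections
```

For a sequence of n tokens the uncached loop performs n(n+1)/2 projections and the cached loop performs n, at the cost of keeping all n keys and values in memory.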
Speculative Decoding
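Speculative decoding is an inference acceleration technique in which a small draft model proposes several candidate tokens that a larger target model then verifies in a single parallel pass. Accepted tokens follow the target model's own output distribution, so generation speeds up (gains of 2-4x are cited, depending on the models and workload) without changing output quality.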
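The propose-then-verify loop can be sketched for the simplified greedy case, where a drafted token is accepted only if it equals the target model's own greedy choice. This is not the full stochastic acceptance rule used with sampling. Both "models" here are hypothetical toy functions from a token list to the next token, and the verification loop stands in for what would be one batched forward pass of the target model.

```python
def target_next(context):
    """Hypothetical 'large' model: deterministic next-token rule."""
    return (31 * sum(context) + len(context)) % 50


def draft_next(context):
    """Hypothetical 'small' model: agrees with the target most of the time."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 3 == 0 else guess


def greedy_decode(next_token, prompt, num_tokens):
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_greedy_decode(target, draft, prompt, num_tokens, draft_len=4):
    tokens = list(prompt)
    goal = len(prompt) + num_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes draft_len tokens autoregressively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position (one parallel pass
        #    in a real system; a plain loop in this sketch).
        accepted = []
        for i in range(draft_len):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)      # a match, or the target's correction
            if proposal[i] != expected:
                break                      # discard the rest of the draft
        else:
            # Every draft token matched, so the target's pass also
            # yields one extra token for free.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal]
```

Every emitted token is the target model's own greedy choice, so the output is identical to ordinary greedy decoding with the target; the draft model only changes how many tokens are confirmed per verification pass.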