1 result for “memory-efficiency”
FlashAttention is an IO-aware exact attention algorithm that restructures the standard attention computation into memory-efficient tiled blocks, dramatically reducing GPU memory usage and wall-clock time for transformer models on long sequences.