An attention variant that partitions the query heads into groups, with each group sharing a single key head and value head. This shrinks the key-value cache and the memory bandwidth needed to read it during inference, while retaining most of the quality of full multi-head attention.
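
The grouping and its effect on memory can be illustrated with a small sketch. This is not taken from any particular library: the function names, the consecutive assignment of query heads to groups, and the model dimensions below are all assumptions chosen for illustration. It shows which key/value head each query head would read from, and how the key-value cache size scales with the number of key/value heads.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head shared by this query head's group.

    Assumes consecutive query heads form a group (a common convention,
    but an assumption here).
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Eight query heads split into two groups: heads 0-3 share KV head 0, heads 4-7 share KV head 1.
mapping = [kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096 tokens, 16-bit values.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(mapping)
print(full_mha // 2**20, "MiB with one KV head per query head")
print(grouped // 2**20, "MiB with 8 shared KV heads")
```

Under these assumed dimensions the cache shrinks in direct proportion to the number of key/value heads, which is where the bandwidth saving during decoding comes from: each generated token has to read the whole cache. Setting `num_kv_heads` equal to `num_q_heads` recovers ordinary multi-head attention, and setting it to 1 gives a single key/value head shared by all query heads.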