PagedAttention is an attention algorithm inspired by virtual memory and paging in operating systems: it partitions each request's key-value (KV) cache into fixed-size blocks that need not be stored contiguously in memory, enabling high-throughput serving of large language models (LLMs).
The creators also built an LLM serving system called vLLM that achieves near-zero waste in KV cache memory and allows flexible sharing of that memory within and across requests. As a result, vLLM improves the throughput of popular LLMs by 2-4x at the same latency compared to state-of-the-art systems such as FasterTransformer and Orca, while reducing memory usage.
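To make the paging analogy concrete, here is a minimal, hypothetical sketch of a block-table KV cache manager in the spirit of PagedAttention. This is not vLLM's actual implementation or API; the class, method names, and block size are all illustrative. It shows the core ideas: logical blocks mapped to physical blocks through a per-request table, on-demand allocation, and reference-counted sharing across requests.

```python
# Illustrative sketch only, not vLLM's real code. A per-request block table
# maps a sequence's logical KV blocks to physical blocks, so the cache need
# not be contiguous and blocks can be shared across requests.

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)


class PagedKVCacheManager:
    def __init__(self, num_physical_blocks: int):
        # Pool of free physical block IDs, handed out on demand.
        self.free_blocks = list(range(num_physical_blocks))
        # request ID -> ordered list of physical block IDs (the block table).
        self.block_tables: dict[str, list[int]] = {}
        # request ID -> number of tokens cached so far.
        self.token_counts: dict[str, int] = {}
        # Reference counts let requests share blocks (e.g., a common
        # prompt prefix) with copy-on-write semantics.
        self.ref_counts: dict[int, int] = {}

    def append_token(self, request_id: str) -> int:
        """Reserve space for one new token's key/value vectors; a new
        physical block is allocated only when the last one is full."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.token_counts.get(request_id, 0)
        if count % BLOCK_SIZE == 0:  # last block full, or no blocks yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a request")
            block = self.free_blocks.pop()
            self.ref_counts[block] = 1
            table.append(block)
        self.token_counts[request_id] = count + 1
        return table[-1]

    def fork(self, parent_id: str, child_id: str) -> None:
        """Share the parent's blocks with a child request (e.g., parallel
        sampling); memory would be copied only on a later write."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.token_counts[child_id] = self.token_counts[parent_id]
        for block in self.block_tables[child_id]:
            self.ref_counts[block] += 1

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool once unreferenced."""
        for block in self.block_tables.pop(request_id, []):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                self.free_blocks.append(block)
        self.token_counts.pop(request_id, None)
```

Because space is reserved one small block at a time rather than as one contiguous region sized for the maximum possible output, internal and external fragmentation of the KV cache stay close to zero, which is what frees the memory headroom for larger batches and higher throughput.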