Emma

Summary:

  • PagedAttention is a new attention algorithm, inspired by virtual memory and paging in operating systems, designed for high-throughput serving of large language models (LLMs).
  • The authors also built vLLM, an LLM serving system on top of PagedAttention that minimizes waste in key-value (KV) cache memory and allows flexible sharing of that memory within and across requests (see the sketch after this summary); this improves the throughput of popular LLMs by 2-4x while reducing memory usage.
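
A minimal sketch of the paging idea, in Python. The block size, class names, and allocator behavior below are illustrative assumptions for explanation, not vLLM's actual API: each request's KV cache is split into fixed-size blocks, and a per-request block table maps logical blocks to physical ones that are allocated only on demand, much like virtual-memory pages.

    # Illustrative toy only: a block table mapping a request's logical KV-cache
    # blocks to physical blocks, mimicking OS-style paging. All names and sizes
    # here are assumptions, not vLLM's real implementation.
    from dataclasses import dataclass, field

    BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


    @dataclass
    class BlockTable:
        """Per-request mapping from logical block index to physical block id."""
        physical_blocks: list[int] = field(default_factory=list)
        num_tokens: int = 0


    class PagedKVCacheAllocator:
        """Hands out fixed-size physical blocks on demand, so a request only
        occupies memory for tokens it has actually generated."""

        def __init__(self, num_physical_blocks: int) -> None:
            self.free_blocks = list(range(num_physical_blocks))
            self.tables: dict[int, BlockTable] = {}

        def append_token(self, seq_id: int) -> None:
            table = self.tables.setdefault(seq_id, BlockTable())
            # Allocate a new physical block only when the current one is full,
            # which limits per-request KV-cache waste to one partial block.
            if table.num_tokens % BLOCK_SIZE == 0:
                if not self.free_blocks:
                    raise MemoryError("no free KV-cache blocks")
                table.physical_blocks.append(self.free_blocks.pop())
            table.num_tokens += 1

        def free_sequence(self, seq_id: int) -> None:
            # Return all physical blocks of a finished request to the free pool.
            table = self.tables.pop(seq_id)
            self.free_blocks.extend(table.physical_blocks)


    if __name__ == "__main__":
        allocator = PagedKVCacheAllocator(num_physical_blocks=8)
        for _ in range(40):                     # generate 40 tokens for request 0
            allocator.append_token(0)
        print(allocator.tables[0].physical_blocks)  # 3 blocks cover 40 tokens
        allocator.free_sequence(0)
        print(len(allocator.free_blocks))           # all 8 blocks free again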

Tags:

Research