SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

J Wu, Z Wang, L Zhang, Y Lai, Y He, D Zhou - arXiv preprint arXiv:…, 2024 - arxiv.org
The Key-Value (KV) cache has become a bottleneck for LLMs in long-context generation.
Despite numerous efforts in this area, optimization for the decoding phase is …

Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models

H Wang, K Shu - researchgate.net
Foundation models, such as large language models (LLMs) and large vision-language
models (LVLMs), have gained significant attention for their remarkable performance across …