OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical anchors, compatible with FlashAttention.
OjaKV compresses LLM key-value caches by selectively preserving first and recent tokens while adaptively compressing intermediate tokens using online PCA.
- KV cache memory often exceeds model weights; Llama-3.1-8B at 32K tokens needs ~16 GB.
- Static, offline-learned compression subspaces degrade when the input distribution shifts.
- OjaKV keeps the first and most recent tokens uncompressed as high-fidelity attention anchors.
- Intermediate tokens undergo low-rank projection with a basis updated incrementally via Oja's algorithm.
- The subspace adapts comprehensively during prompt prefilling and lightly during decoding.
- The framework integrates with FlashAttention without model retraining.
- It maintains or improves zero-shot accuracy at high compression ratios on long-context tasks.
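The online-PCA step above can be sketched with Oja's rule. This is not the paper's code, only a minimal NumPy illustration: the learning rate, the QR re-orthonormalization after every step, and the function names are assumptions.

```python
import numpy as np

def oja_update(U, x, lr=1e-2):
    """One step of Oja's rule: nudge the d x r orthonormal basis U
    toward the top-r subspace of the streaming vectors x (shape (d,))."""
    y = U.T @ x                           # coordinates of x in the current basis
    U = U + lr * np.outer(x - U @ y, y)   # Oja update: U += lr * (x - U y) y^T
    U, _ = np.linalg.qr(U)                # re-orthonormalize so U^T U = I
    return U

def compress(U, k):
    """Project a key/value vector into the rank-r subspace (store r numbers, not d)."""
    return U.T @ k

def reconstruct(U, c):
    """Approximate the original vector from its compressed coordinates."""
    return U @ c
```

Per the summary, a real implementation would run this per layer and head, update the basis densely during prefill and only periodically during decode, and skip the anchor tokens entirely; those scheduling details are not shown here.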
Astrobobo tool mapping
- Focus Brief: Summarize OjaKV's three core ideas (hybrid storage, online PCA, FlashAttention compatibility) in a one-page brief for your inference team to evaluate fit against your current KV cache bottleneck.
- Knowledge Capture: Log the specific memory requirement (16 GB for Llama-3.1-8B at 32K tokens, batch size 4) as a concrete benchmark. Capture the Oja's-algorithm update schedule (comprehensive during prefill, periodic during decode) for later reference.
- Reading Queue: Queue the full arXiv paper and any follow-up work on online subspace adaptation for LLMs. Flag sections on hyperparameter sensitivity and latency overhead for deeper review.
Frequently asked
- Why does the KV cache become a memory bottleneck? During autoregressive generation, the model must store the key and value vectors for every token in the context to compute attention, so this storage scales linearly with context length and batch size. For Llama-3.1-8B serving a 32K-token prompt at batch size 4, the KV cache alone requires ~16 GB, exceeding the model's parameter memory and limiting deployment on resource-constrained devices.
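The ~16 GB figure can be reproduced from Llama-3.1-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128, fp16 storage):

```python
# Back-of-envelope check of the ~16 GB KV cache claim.
# Config values are from the public Llama-3.1-8B model card.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                      # fp16
tokens, batch = 32 * 1024, 4

# K and V each store kv_heads * head_dim values per token per layer
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_bytes = per_token * tokens * batch
print(f"{total_bytes / 2**30:.0f} GiB")  # prints "16 GiB"
```

Note that grouped-query attention already shrinks the cache 4x versus full multi-head attention (8 KV heads instead of 32); the cache still reaches 16 GiB, which is why further compression of the stored vectors matters.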
Cite
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen. (2026, April 20). OjaKV: Online Low-Rank Compression for LLM Key-Value Caches. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen. "OjaKV: Online Low-Rank Compression for LLM Key-Value Caches." Astrobobo Content Engine, 20 Apr 2026, https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd. Based on "arxiv/cs.AI", https://arxiv.org/abs/2509.21623.
@misc{astrobobo_ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd_2026,
author = {Zhu, Yuxuan and Yang, David H. and Amiri, Mohammad Mohammadi and Murugesan, Keerthiram and Pedapati, Tejaswini and Chen, Pin-Yu},
title = {OjaKV: Online Low-Rank Compression for LLM Key-Value Caches},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2509.21623},
}