
OjaKV: Online Low-Rank Compression for LLM Key-Value Caches

A hybrid storage scheme with an adaptive low-rank subspace reduces KV cache memory by compressing intermediate tokens while preserving critical attention anchors, all while remaining compatible with FlashAttention.

Source: arxiv/cs.AI · Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen

OjaKV compresses LLM key-value caches by selectively preserving the first and most recent tokens while adaptively compressing intermediate tokens using online PCA.

  • KV cache memory often exceeds the model weights; Llama-3.1-8B at 32K tokens with batch size 4 needs ~16GB.
  • Static, offline-learned compression subspaces degrade when the input distribution shifts.
  • OjaKV keeps the first and most recent tokens uncompressed as high-fidelity attention anchors.
  • Intermediate tokens undergo low-rank projection onto a basis that is incrementally updated via Oja's algorithm (see the sketch after this list).
  • The subspace adapts comprehensively during prompt prefilling and lightly during decoding.
  • The framework integrates with FlashAttention without model retraining.
  • It maintains or improves zero-shot accuracy at high compression ratios on long-context tasks.
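A minimal sketch of the two mechanics above, hybrid storage plus an Oja's-rule subspace update, assuming per-head keys of shape (num_tokens, d). Function names, the anchor/window sizes, and the learning-rate handling are illustrative assumptions, not the paper's actual API.

import torch

def oja_update(U: torch.Tensor, x: torch.Tensor, lr: float = 1e-3) -> torch.Tensor:
    """One Oja's-rule step: nudge the rank-r basis U (d, r) toward the
    principal subspace of the streaming key vector x (d,), then
    re-orthonormalize. Per the paper's schedule, such steps run
    comprehensively over the prompt during prefill and only
    periodically during decoding."""
    U = U + lr * torch.outer(x, x @ U)   # gradient step on captured variance
    Q, _ = torch.linalg.qr(U)            # restore orthonormal columns
    return Q

def compress_middle(K: torch.Tensor, U: torch.Tensor,
                    n_first: int = 4, n_recent: int = 128):
    """Hybrid storage: keep the first and most recent tokens exact
    (attention anchors); store only rank-r coefficients for the middle."""
    assert K.shape[0] > n_first + n_recent
    first, middle, recent = K[:n_first], K[n_first:-n_recent], K[-n_recent:]
    coeffs = middle @ U                  # (m, r) low-rank codes, r << d
    return first, coeffs, recent

def reconstruct(first, coeffs, U, recent) -> torch.Tensor:
    """Rebuild an approximate key matrix in the original head dimension,
    so a standard kernel such as FlashAttention can consume it unchanged."""
    return torch.cat([first, coeffs @ U.T, recent], dim=0)

For example, with head dimension d = 128 and rank r = 32, the middle tokens' storage shrinks 4x while the anchors stay lossless; reconstructing back to the original head dimension before the attention call is what keeps the approach kernel-agnostic.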

Astrobobo tool mapping

  • Focus Brief: Summarize OjaKV's three core ideas (hybrid storage, online PCA, FlashAttention compatibility) in a one-page brief for your inference team to evaluate fit against your current KV cache bottleneck.
  • Knowledge Capture: Log the specific memory requirement (16GB for Llama-3.1-8B at 32K tokens, batch 4) as a concrete benchmark. Capture the Oja's algorithm update schedule (comprehensive during prefill, periodic during decode) for later reference.
  • Reading Queue: Queue the full arXiv paper and any follow-up work on online subspace adaptation for LLMs. Flag sections on hyperparameter sensitivity and latency overhead for deeper review.

Frequently asked

  • Why does the KV cache become a memory bottleneck? During autoregressive generation, the model must store the key and value vectors for every token in the context to compute attention, so this storage scales linearly with both context length and batch size. For Llama-3.1-8B serving a 32K-token context at batch size 4, the KV cache alone requires ~16GB, exceeding the model's parameter memory (see the back-of-the-envelope check below). This memory bottleneck limits deployment on resource-constrained devices.
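A quick sanity check of that ~16GB figure, assuming Llama-3.1-8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 storage:

layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
tokens, batch = 32_768, 4
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V, all layers
print(batch * tokens * per_token / 2**30)  # -> 16.0 GiB

That works out to 128 KiB of cache per token before any compression.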
Cite
APA
Zhu, Y., Yang, D. H., Amiri, M. M., Murugesan, K., Pedapati, T., & Chen, P.-Y. (2026, April 20). OjaKV: Online low-rank compression for LLM key-value caches. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd
MLA
Zhu, Yuxuan, et al. "OjaKV: Online Low-Rank Compression for LLM Key-Value Caches." Astrobobo Content Engine, 20 Apr. 2026, https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd. Based on "arxiv/cs.AI", https://arxiv.org/abs/2509.21623.
BibTeX
@misc{astrobobo_ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd_2026,
  author       = {Zhu, Yuxuan and Yang, David H. and Amiri, Mohammad Mohammadi and Murugesan, Keerthiram and Pedapati, Tejaswini and Chen, Pin-Yu},
  title        = {OjaKV: Online Low-Rank Compression for LLM Key-Value Caches},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd},
  note         = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2509.21623},
}
