OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical anchors, compatible with FlashAttention.
OjaKV compresses LLM key-value caches by selectively preserving first and recent tokens while adaptively compressing intermediate tokens using online PCA.
- KV cache memory often exceeds model weights; Llama-3.1-8B at 32K tokens needs ~16 GB.
- Static, offline-learned compression subspaces degrade when the input distribution shifts.
- OjaKV keeps the first and most recent tokens uncompressed as high-fidelity attention anchors.
- Intermediate tokens undergo low-rank projection with a basis updated incrementally via Oja's algorithm.
- The subspace adapts comprehensively during prompt prefilling and lightly during decoding.
- The framework integrates with FlashAttention without model retraining.
- It maintains or improves zero-shot accuracy at high compression ratios on long-context tasks.
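The online-PCA step above can be sketched with Oja's rule. This is not the paper's code, only a minimal NumPy illustration: the learning rate, the QR re-orthonormalization after every step, and the function names are assumptions.

```python
import numpy as np

def oja_update(U, x, lr=1e-2):
    """One step of Oja's rule: nudge the d x r orthonormal basis U
    toward the top-r subspace of the streaming vectors x (shape (d,))."""
    y = U.T @ x                           # coordinates of x in the current basis
    U = U + lr * np.outer(x - U @ y, y)   # Oja update: U += lr * (x - U y) y^T
    U, _ = np.linalg.qr(U)                # re-orthonormalize so U^T U = I
    return U

def compress(U, k):
    """Project a key/value vector into the rank-r subspace (store r numbers, not d)."""
    return U.T @ k

def reconstruct(U, c):
    """Approximate the original vector from its compressed coordinates."""
    return U @ c
```

Per the summary, a real implementation would run this per layer and head, update the basis densely during prefill and only periodically during decode, and skip the anchor tokens entirely; those scheduling details are not shown here.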
Astrobobo tool mapping
- Focus Brief: Summarize OjaKV's three core ideas (hybrid storage, online PCA, FlashAttention compatibility) in a one-page brief for your inference team to evaluate fit against your current KV cache bottleneck.
- Knowledge Capture: Log the specific memory requirement (16 GB for Llama-3.1-8B at 32K tokens, batch size 4) as a concrete benchmark. Capture the Oja's-algorithm update schedule (comprehensive during prefill, periodic during decode) for later reference.
- Reading Queue: Queue the full arXiv paper and any follow-up work on online subspace adaptation for LLMs. Flag sections on hyperparameter sensitivity and latency overhead for deeper review.
Frequently asked
- Why does the KV cache become a memory bottleneck? During autoregressive generation, the model must store the key and value vectors for every token in the context to compute attention, so this storage scales linearly with context length and batch size. For Llama-3.1-8B serving a 32K-token prompt at batch size 4, the KV cache alone requires ~16 GB, exceeding the model's parameter memory and limiting deployment on resource-constrained devices.
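The ~16 GB figure can be reproduced from Llama-3.1-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128, fp16 storage):

```python
# Back-of-envelope check of the ~16 GB KV cache claim.
# Config values are from the public Llama-3.1-8B model card.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2                      # fp16
tokens, batch = 32 * 1024, 4

# K and V each store kv_heads * head_dim values per token per layer
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_bytes = per_token * tokens * batch
print(f"{total_bytes / 2**30:.0f} GiB")  # prints "16 GiB"
```

Note that grouped-query attention already shrinks the cache 4x versus full multi-head attention (8 KV heads instead of 32); the cache still reaches 16 GiB, which is why further compression of the stored vectors matters.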
Cite
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen. (2026, April 20). OjaKV: Online Low-Rank Compression for LLM Key-Value Caches. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen. "OjaKV: Online Low-Rank Compression for LLM Key-Value Caches." Astrobobo Content Engine, 20 Apr 2026, https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd. Based on "arxiv/cs.AI", https://arxiv.org/abs/2509.21623.
@misc{astrobobo_ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd_2026,
author = {Zhu, Yuxuan and Yang, David H. and Amiri, Mohammad Mohammadi and Murugesan, Keerthiram and Pedapati, Tejaswini and Chen, Pin-Yu},
title = {OjaKV: Online Low-Rank Compression for LLM Key-Value Caches},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/ojakv-online-low-rank-compression-for-llm-key-value-caches-63fcfd},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2509.21623},
}