Tag: #inference · 6 insights
- engineering · arxiv/cs.LG · 4 min
Kernel-Level LLM Safety via Logit Inspection
ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, matching the accuracy of learned classifiers at 2.5x the speed.
Apr 21, 2026
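The mechanism, as the summary describes it, amounts to a cheap check on probabilities the forward pass has already produced. Below is a minimal user-space sketch of logit-inspection gating, not the paper's ProbeLogits implementation; the flagged token ids and the 0.5 threshold are illustrative assumptions, and the OS-level enforcement is not modeled here.

```python
import torch

UNSAFE_TOKEN_IDS = [42, 1337]   # hypothetical policy-flagged token ids
THRESHOLD = 0.5                 # hypothetical probability cutoff

def gate_next_token(logits: torch.Tensor) -> bool:
    """Inspect next-token probabilities before sampling; True = proceed."""
    probs = torch.softmax(logits, dim=-1)
    # Block this step if too much mass sits on policy-flagged tokens.
    return probs[UNSAFE_TOKEN_IDS].sum().item() < THRESHOLD

logits = torch.randn(50_000)    # stand-in for a model's next-token logits
if gate_next_token(logits):
    pass                        # safe to sample and emit this token
```

Because the check reuses logits the model already computed, it avoids a second classifier forward pass, which is presumably where the 2.5x speedup comes from.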
- ai · arxiv/cs.AI · 6 min
OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical anchor tokens, and remains compatible with FlashAttention.
Apr 20, 2026
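A rough sketch of the hybrid layout as the summary describes it: anchor tokens (the first few and the most recent) keep full-rank keys, while the middle of the cache is projected onto a low-rank subspace maintained online with an Oja-style streaming-PCA update. The rank, step size, and anchor counts here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

d, r, eta = 128, 16, 0.01
W = np.linalg.qr(np.random.randn(d, r))[0]       # orthonormal d x r basis

def oja_update(W, k):
    """One streaming-PCA step pulling the basis toward key vector k."""
    W = W + eta * np.outer(k, k @ W)
    return np.linalg.qr(W)[0]                    # re-orthonormalize

def compress_cache(keys, n_sink=4, n_recent=64):
    """Keep anchors exact; store middle keys as rank-r coefficients."""
    mid = keys[n_sink:len(keys) - n_recent]
    return keys[:n_sink], mid @ W, keys[-n_recent:]

keys = np.random.randn(512, d)                   # toy cache of 512 keys
for k in keys:                                   # adapt the subspace online
    W = oja_update(W, k)
sink, coeffs, recent = compress_cache(keys)
```

Decompression is just `coeffs @ W.T`, which yields ordinary dense key matrices; that is one way such a scheme could stay compatible with attention kernels like FlashAttention.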
- engineering · hackernoon · 7 min
LLMesh routes local LLM requests across machines via one endpoint
A distributed inference broker lets teams share GPU hardware without changing application code between dev, staging, and production.
Apr 18, 2026
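The broker pattern the summary describes fits in a few lines; this is an illustration of the idea, not LLMesh's code or API, and the backend hosts are hypothetical.

```python
import itertools
import requests

# Hypothetical GPU hosts hidden behind the single shared endpoint.
BACKENDS = itertools.cycle([
    "http://gpu-box-1:8000/v1/completions",
    "http://gpu-box-2:8000/v1/completions",
])

def route(payload: dict) -> dict:
    """Forward one request to the next backend, round-robin."""
    return requests.post(next(BACKENDS), json=payload, timeout=120).json()
```

Because applications only ever see the broker's endpoint, the same client configuration works in dev, staging, and production; only the broker's backend list differs.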
- ai · arxiv/cs.AI · 8 min
Small Models Match Large Ones via Inference Scaffolding
McClendon et al. show that role-based prompt structuring at inference time doubles small-model performance on complex tasks without retraining.
Apr 17, 2026
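The summary doesn't spell out the exact scaffold, so the roles and wording below are an illustration of the general role-based pattern rather than the prompts from McClendon et al.

```python
ROLES = ("planner", "solver", "critic")   # hypothetical role sequence

def scaffold(task: str, call_model) -> str:
    """Call the same small model once per role, chaining outputs forward."""
    context = task
    for role in ROLES:
        prompt = (f"You are the {role}. Given:\n{context}\n"
                  f"Produce the {role}'s output.")
        context = call_model(prompt)      # any text-completion callable
    return context

# e.g. scaffold("Plan a database migration", my_small_model)
```

Nothing is retrained; the reported gains come entirely from structuring the inference-time calls.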
- ai · arxiv/cs.LG · 3 min
Framework uses AI outputs as features, not proxies, for labeled data
Generative Augmented Inference treats LLM predictions as informative signals rather than direct substitutes for labels, reducing human labeling needs by 75–90% across operations tasks.
Apr 17, 2026
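A minimal sketch of the features-not-proxies idea: the LLM's guess becomes one input column of a small supervised model fit on the remaining human labels, instead of standing in for the label itself. The model choice and shapes here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_llm_feature(X, llm_preds, y):
    """X: (n, f) task features; llm_preds: (n,) LLM guesses; y: human labels."""
    X_aug = np.column_stack([X, llm_preds])    # LLM output is a feature...
    return LogisticRegression().fit(X_aug, y)  # ...not a label substitute
```

The small labeled set lets the downstream model learn when the LLM signal is trustworthy, which is plausibly how the 75–90% labeling reduction arises.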
- ai · arxiv/cs.LG · 8 min
Quantum kernel inference cuts query cost by removing data-size dependence
New algorithm reduces quantum machine learning inference complexity from O(N) to O(1) in data size, achieving query-optimal bounds via amplitude estimation.
Apr 17, 2026
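A schematic reading of the complexity claim, not the paper's derivation: kernel inference is a weighted sum over the training set, and if that sum can be expressed as a single state overlap, amplitude estimation removes the dependence on N.

```latex
% Schematic, assuming the standard kernel-inference form.
f(x) = \sum_{i=1}^{N} \alpha_i \, k(x_i, x)
% Naive evaluation: N kernel estimations per query, i.e. O(N) in data size.
% If the weighted training set is prepared as one state,
|\psi\rangle \propto \sum_{i=1}^{N} \alpha_i \, |\phi(x_i)\rangle,
\qquad
f(x) \propto \langle \psi | \phi(x) \rangle,
% then amplitude estimation recovers this overlap to precision
% \varepsilon in O(1/\varepsilon) queries, independent of N.
```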