Tag: #inference · 6 insights
- engineering · arxiv/cs.LG · 4 min
Kernel-Level LLM Safety via Logit Inspection
ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, matching the accuracy of learned classifiers at 2.5x the speed.
Apr 21, 2026
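The mechanism, as the summary describes it, amounts to a cheap check on probabilities the forward pass has already produced. Below is a minimal user-space sketch of logit-inspection gating, not the paper's ProbeLogits implementation; the flagged token ids and the 0.5 threshold are illustrative assumptions, and the OS-level enforcement is not modeled here.

```python
import torch

UNSAFE_TOKEN_IDS = [42, 1337]   # hypothetical policy-flagged token ids
THRESHOLD = 0.5                 # hypothetical probability cutoff

def gate_next_token(logits: torch.Tensor) -> bool:
    """Inspect next-token probabilities before sampling; True = proceed."""
    probs = torch.softmax(logits, dim=-1)
    # Block this step if too much mass sits on policy-flagged tokens.
    return probs[UNSAFE_TOKEN_IDS].sum().item() < THRESHOLD

logits = torch.randn(50_000)    # stand-in for a model's next-token logits
if gate_next_token(logits):
    pass                        # safe to sample and emit this token
```

Because the check reuses logits the model already computed, it avoids a second classifier forward pass, which is presumably where the 2.5x speedup comes from.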
- ai · arxiv/cs.AI · 6 min
OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical anchor tokens, and remains compatible with FlashAttention.
Apr 20, 2026
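A rough sketch of the hybrid layout as the summary describes it: anchor tokens (the first few and the most recent) keep full-rank keys, while the middle of the cache is projected onto a low-rank subspace maintained online with an Oja-style streaming-PCA update. The rank, step size, and anchor counts here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

d, r, eta = 128, 16, 0.01
W = np.linalg.qr(np.random.randn(d, r))[0]       # orthonormal d x r basis

def oja_update(W, k):
    """One streaming-PCA step pulling the basis toward key vector k."""
    W = W + eta * np.outer(k, k @ W)
    return np.linalg.qr(W)[0]                    # re-orthonormalize

def compress_cache(keys, n_sink=4, n_recent=64):
    """Keep anchors exact; store middle keys as rank-r coefficients."""
    mid = keys[n_sink:len(keys) - n_recent]
    return keys[:n_sink], mid @ W, keys[-n_recent:]

keys = np.random.randn(512, d)                   # toy cache of 512 keys
for k in keys:                                   # adapt the subspace online
    W = oja_update(W, k)
sink, coeffs, recent = compress_cache(keys)
```

Decompression is just `coeffs @ W.T`, which yields ordinary dense key matrices; that is one way such a scheme could stay compatible with attention kernels like FlashAttention.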
- engineering · hackernoon · 7 min
LLMesh routes local LLM requests across machines via one endpoint
A distributed inference broker lets teams share GPU hardware without changing application code between dev, staging, and production.
Apr 18, 2026
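The broker pattern the summary describes fits in a few lines; this is an illustration of the idea, not LLMesh's code or API, and the backend hosts are hypothetical.

```python
import itertools
import requests

# Hypothetical GPU hosts hidden behind the single shared endpoint.
BACKENDS = itertools.cycle([
    "http://gpu-box-1:8000/v1/completions",
    "http://gpu-box-2:8000/v1/completions",
])

def route(payload: dict) -> dict:
    """Forward one request to the next backend, round-robin."""
    return requests.post(next(BACKENDS), json=payload, timeout=120).json()
```

Because applications only ever see the broker's endpoint, the same client configuration works in dev, staging, and production; only the broker's backend list differs.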
- ai · arxiv/cs.AI · 8 min
Small Models Match Large Ones via Inference Scaffolding
McClendon et al. show that role-based prompt structuring at inference time doubles small-model performance on complex tasks without retraining.
Apr 17, 2026
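The summary doesn't spell out the exact scaffold, so the roles and wording below are an illustration of the general role-based pattern rather than the prompts from McClendon et al.

```python
ROLES = ("planner", "solver", "critic")   # hypothetical role sequence

def scaffold(task: str, call_model) -> str:
    """Call the same small model once per role, chaining outputs forward."""
    context = task
    for role in ROLES:
        prompt = (f"You are the {role}. Given:\n{context}\n"
                  f"Produce the {role}'s output.")
        context = call_model(prompt)      # any text-completion callable
    return context

# e.g. scaffold("Plan a database migration", my_small_model)
```

Nothing is retrained; the reported gains come entirely from structuring the inference-time calls.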
- ai · arxiv/cs.LG · 3 min
Framework uses AI outputs as features, not proxies, for labeled data
Generative Augmented Inference treats LLM predictions as informative signals rather than direct substitutes for labels, reducing human labeling needs by 75–90% across operations tasks.
Apr 17, 2026
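A minimal sketch of the features-not-proxies idea: the LLM's guess becomes one input column of a small supervised model fit on the remaining human labels, instead of standing in for the label itself. The model choice and shapes here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_llm_feature(X, llm_preds, y):
    """X: (n, f) task features; llm_preds: (n,) LLM guesses; y: human labels."""
    X_aug = np.column_stack([X, llm_preds])    # LLM output is a feature...
    return LogisticRegression().fit(X_aug, y)  # ...not a label substitute
```

The small labeled set lets the downstream model learn when the LLM signal is trustworthy, which is plausibly how the 75–90% labeling reduction arises.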
- ai · arxiv/cs.LG · 8 min
Quantum kernel inference cuts query cost by removing data-size dependence
New algorithm reduces quantum machine learning inference complexity from O(N) to O(1) in data size, achieving query-optimal bounds via amplitude estimation.
Apr 17, 2026
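A schematic reading of the complexity claim, not the paper's derivation: kernel inference is a weighted sum over the training set, and if that sum can be expressed as a single state overlap, amplitude estimation removes the dependence on N.

```latex
% Schematic, assuming the standard kernel-inference form.
f(x) = \sum_{i=1}^{N} \alpha_i \, k(x_i, x)
% Naive evaluation: N kernel estimations per query, i.e. O(N) in data size.
% If the weighted training set is prepared as one state,
|\psi\rangle \propto \sum_{i=1}^{N} \alpha_i \, |\phi(x_i)\rangle,
\qquad
f(x) \propto \langle \psi | \phi(x) \rangle,
% then amplitude estimation recovers this overlap to precision
% \varepsilon in O(1/\varepsilon) queries, independent of N.
```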