Kernel-Level LLM Safety via Logit Inspection
ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, achieving parity with learned classifiers at 2.5x speed.
A kernel primitive that inspects LLM logits before token generation to classify and block unsafe outputs without learned parameters.
- ProbeLogits performs one forward pass and reads specific token logits to detect unsafe agent actions.
- Achieves 97–99% block rate on HarmBench and F1=0.812 on ToxicChat, matching or exceeding Llama Guard 3.
- Runs 2.5x faster than token-generation classifiers; bare-metal latency is 65 ms.
- Uses calibration strength alpha as a deployment-time policy knob instead of learned weights.
- Implemented in Anima OS (86k lines of Rust); operates below the WASM sandbox, making it harder to circumvent.
- Contextual calibration corrects verbalizer bias asymmetry across model and prompt pairs.
- Tested on Qwen 2.5-7B, Llama 3 8B, and Mistral 7B with three external benchmarks.
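The bullets above can be condensed into a short decision rule: one forward pass produces a logit vector, only the two verbalizer token positions are read, an alpha-scaled calibration divides out the verbalizer's prior bias, and the result is thresholded. The sketch below is an illustrative reconstruction under stated assumptions, not the Anima OS implementation: the token ids, the exact calibration form, and the default values are all hypothetical.

```python
import math

# Placeholder vocabulary ids for the "Safe"/"Dangerous" verbalizer tokens
# (real ids depend on the model's tokenizer).
SAFE_ID, DANGEROUS_ID = 9109, 20465

def probe_logits(logits, prior_unsafe=0.5, alpha=1.0, threshold=0.5):
    """Classify an action from two raw logits, no learned parameters.

    alpha scales the calibration correction: 0 disables it, 1 applies
    the full prior division (a deployment-time strictness knob).
    Returns ("BLOCK"|"ALLOW", calibrated unsafe probability).
    """
    z_safe, z_unsafe = logits[SAFE_ID], logits[DANGEROUS_ID]
    # Softmax restricted to the two verbalizer tokens.
    p_unsafe = 1.0 / (1.0 + math.exp(z_safe - z_unsafe))
    # Contextual calibration: divide by the content-free prior,
    # interpolated by alpha, then renormalize.
    cal_unsafe = p_unsafe / (prior_unsafe ** alpha)
    cal_safe = (1.0 - p_unsafe) / ((1.0 - prior_unsafe) ** alpha)
    p = cal_unsafe / (cal_unsafe + cal_safe)
    return ("BLOCK" if p >= threshold else "ALLOW", p)
```

Because the rule only indexes two positions of an existing logit vector, it adds no generation step, which is consistent with the reported 2.5x speedup over classifiers that must decode tokens.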
Astrobobo tool mapping
- Knowledge Capture: Record the three verbalizer types (Safe/Dangerous, Yes/No, etc.) and their F1 scores on your domain. Note which model-verbalizer pair meets your safety bar.
- Focus Brief: Summarize the alpha calibration concept: a single scalar knob that tunes safety strictness at deployment time without retraining. Clarify how your team would set this in production.
- Reading Queue: Queue the Anima OS architecture paper and the Llama Guard 3 technical report to understand the kernel-level enforcement model and the baseline classifier you are comparing against.
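To make the Focus Brief concrete: alpha can live in a deployment config object, so changing strictness is a config change rather than a retrain. The sketch below is a hypothetical policy wrapper (not an Anima OS API); the prior and probability values are illustrative. Note how raising alpha from 0 to 1 corrects a verbalizer biased toward "Dangerous" and can flip a borderline decision.

```python
from dataclasses import dataclass

@dataclass
class SafetyPolicy:
    """Hypothetical deployment-time policy: no learned weights, two scalars."""
    alpha: float = 1.0       # calibration strength (0 = raw probabilities)
    threshold: float = 0.5   # block when calibrated unsafe probability exceeds this

    def calibrate(self, p_unsafe: float, prior_unsafe: float) -> float:
        # Divide out the verbalizer's content-free prior, scaled by alpha,
        # then renormalize the two-class distribution.
        u = p_unsafe / (prior_unsafe ** self.alpha)
        s = (1.0 - p_unsafe) / ((1.0 - prior_unsafe) ** self.alpha)
        return u / (u + s)

    def decide(self, p_unsafe: float, prior_unsafe: float) -> str:
        return "BLOCK" if self.calibrate(p_unsafe, prior_unsafe) >= self.threshold else "ALLOW"
```

For example, with a verbalizer whose content-free prior already leans 0.7 toward "unsafe", a raw score of 0.6 blocks at alpha=0 but is allowed at alpha=1, since calibration attributes most of the score to verbalizer bias rather than the input.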
Frequently asked
- How does ProbeLogits differ from Llama Guard 3? ProbeLogits reads a single logit value before token generation, while Llama Guard 3 generates a full classification token sequence. ProbeLogits is 2.5x faster and requires no learned parameters, only a calibration scalar (alpha). On ToxicChat, ProbeLogits achieves F1=0.812 versus Llama Guard 3's baseline, with some model-verbalizer pairs exceeding it by 4.4 percentage points.
Cite
APA
Daeyeon Son. (2026, April 21). Kernel-Level LLM Safety via Logit Inspection. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/kernel-level-llm-safety-via-logit-inspection-cdcdf8
MLA
Daeyeon Son. "Kernel-Level LLM Safety via Logit Inspection." Astrobobo Content Engine, 21 Apr 2026, https://astrobobo-content-engine.vercel.app/article/kernel-level-llm-safety-via-logit-inspection-cdcdf8. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.11943.
BibTeX
@misc{astrobobo_kernel-level-llm-safety-via-logit-inspection-cdcdf8_2026,
author = {Daeyeon Son},
title = {Kernel-Level LLM Safety via Logit Inspection},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/kernel-level-llm-safety-via-logit-inspection-cdcdf8},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.11943},
}