engineering · 4 min read · Apr 21, 2026

Kernel-Level LLM Safety via Logit Inspection

ProbeLogits reads token logits before generation to enforce safety policies at the OS level, achieving parity with learned classifiers at 2.5x speed.

Source: arxiv/cs.LG · Daeyeon Son

A kernel primitive that inspects LLM logits before token generation to classify and block unsafe outputs without learned parameters.

  • ProbeLogits performs one forward pass and reads specific token logits to detect unsafe agent actions.
  • Achieves 97–99% block rate on HarmBench and F1=0.812 on ToxicChat, matching or exceeding Llama Guard 3.
  • Runs 2.5x faster than token-generation classifiers; bare-metal latency is 65 ms.
  • Uses calibration strength alpha as a deployment-time policy knob instead of learned weights.
  • Implemented in Anima OS (86k lines of Rust); operates below the WASM sandbox, making it harder to circumvent.
  • Contextual calibration corrects verbalizer bias asymmetry across model and prompt pairs.
  • Tested on Qwen 2.5-7B, Llama 3 8B, and Mistral 7B with three external benchmarks.
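The probing idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token ids, logit values, and the exact way alpha shifts the decision boundary (here, a log-space offset on the unsafe verbalizer logit) are assumptions for the sake of the example.

```python
import math

def probe_logits(logits, safe_id, unsafe_id, alpha=1.0):
    """Classify a prompt as unsafe by comparing two verbalizer logits.

    logits: dict mapping token id -> raw logit from a single forward pass
    alpha:  calibration strength, a deployment-time policy knob; larger
            values bias the decision toward blocking (assumed log-offset form)
    """
    z_safe = logits[safe_id]
    z_unsafe = logits[unsafe_id] + math.log(alpha)  # shift decision boundary
    # softmax over the two verbalizer tokens only
    m = max(z_safe, z_unsafe)
    p_unsafe = math.exp(z_unsafe - m) / (
        math.exp(z_safe - m) + math.exp(z_unsafe - m)
    )
    return p_unsafe, p_unsafe > 0.5

# toy logits from one forward pass (hypothetical token ids and values):
# 101 = "Safe" verbalizer token, 202 = "Dangerous" verbalizer token
logits = {101: 2.0, 202: 1.0}
p, blocked = probe_logits(logits, safe_id=101, unsafe_id=202, alpha=1.0)
```

Raising alpha makes the same logits more likely to trip the block threshold, which is the sense in which it acts as a policy knob rather than a learned weight.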

Astrobobo tool mapping

  • Knowledge Capture Record the three verbalizer types (Safe/Dangerous, Yes/No, etc.) and their F1 scores on your domain. Note which model-verbalizer pair matches your safety bar.
  • Focus Brief Summarize the alpha calibration concept: a single scalar knob that tunes safety strictness at deployment time without retraining. Clarify how your team would set this in production.
  • Reading Queue Queue the Anima OS architecture paper and Llama Guard 3 technical report to understand the kernel-level enforcement model and the baseline classifier you are comparing against.
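The contextual calibration step mentioned above can also be sketched. Under the standard contextual-calibration recipe, the model is first run on a content-free prompt (e.g. "N/A") to expose its prior bias toward one verbalizer token, and real-prompt probabilities are rescaled by that bias; the specific numbers below are hypothetical.

```python
def contextual_calibrate(p, p_null):
    """Correct verbalizer bias using probabilities from a content-free prompt.

    p:      class probabilities for the real input, e.g. [p_safe, p_unsafe]
    p_null: probabilities the model assigns on a null prompt, which expose
            its prior bias toward one verbalizer token
    """
    # divide out the prior bias, then renormalize to a distribution
    scores = [pi / bi for pi, bi in zip(p, p_null)]
    total = sum(scores)
    return [s / total for s in scores]

# toy numbers: the model leans "Safe" even on empty input (hypothetical)
p_null = [0.8, 0.2]
p = [0.6, 0.4]                         # raw probabilities on a real prompt
cal = contextual_calibrate(p, p_null)  # unsafe mass rises after calibration
```

Because the bias is estimated per model-verbalizer pair, this is the mechanism that corrects the asymmetry noted in the summary bullets without retraining anything.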

Frequently asked

  • How does ProbeLogits differ from Llama Guard 3? ProbeLogits reads a single logit value before token generation, while Llama Guard 3 generates a full classification token sequence. ProbeLogits is 2.5x faster and requires no learned parameters, only a calibration scalar (alpha). On ToxicChat, ProbeLogits achieves F1=0.812, matching the Llama Guard 3 baseline, with some model-verbalizer pairs exceeding it by 4.4 percentage points.
cite
APA
Daeyeon Son. (2026, April 21). Kernel-Level LLM Safety via Logit Inspection. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/kernel-level-llm-safety-via-logit-inspection-cdcdf8
MLA
Daeyeon Son. "Kernel-Level LLM Safety via Logit Inspection." Astrobobo Content Engine, 21 Apr 2026, https://astrobobo-content-engine.vercel.app/article/kernel-level-llm-safety-via-logit-inspection-cdcdf8. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.11943.
BibTeX
@misc{astrobobo_kernel-level-llm-safety-via-logit-inspection-cdcdf8_2026,
  author       = {Daeyeon Son},
  title        = {Kernel-Level LLM Safety via Logit Inspection},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/kernel-level-llm-safety-via-logit-inspection-cdcdf8},
  note         = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.11943},
}
