LLMs use hidden confidence signals to detect and fix their own errors
Research shows large language models maintain a second-order evaluative signal that predicts error detection and self-correction beyond what their output probabilities reveal.
LLMs detect errors via internal confidence signals independent of output probabilities, enabling self-correction without external feedback.
- Models cache a confidence representation at the post-answer newline (PANL) token that drives error detection.
- PANL activations predict which errors the model can correct, outperforming verbal confidence reports.
- The second-order confidence architecture mirrors decision-neuroscience frameworks with independent evaluative signals.
- Causal interventions show PANL signals rescue error detection when answer information is corrupted.
- Findings replicate across Gemma 3 27B, Qwen 2.5 7B, and tasks such as TriviaQA and MNLI.
- Verbal confidence alone fails to predict correctable errors; internal signals encode fixability.
- First-order models cannot explain error detection, since first-order confidence would always favor the chosen response.
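The core claim above is that a linear readout of hidden activations at the post-answer newline token predicts correctness better than the model's stated confidence. A minimal sketch of that probing idea, using synthetic activations in place of real hidden states (the paper's actual probe setup, layer choice, and models are not reproduced here):

```python
# Hypothetical sketch: reading correctness off post-answer-newline (PANL)
# activations with a linear probe. Synthetic data stands in for hidden states.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_train, n_test = 64, 1000, 1000

# Assumption: correctness shifts activations along one "confidence" direction.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_data(n):
    labels = rng.integers(0, 2, size=n)               # 1 = answer was correct
    acts = rng.normal(size=(n, d_model))              # baseline activation noise
    acts += np.outer(2.0 * labels - 1.0, direction)   # inject the signal
    return acts, labels

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Mass-mean probe: classify by projection onto the difference of class means.
probe_dir = X_train[y_train == 1].mean(0) - X_train[y_train == 0].mean(0)
proj = X_train @ probe_dir
midpoint = (proj[y_train == 1].mean() + proj[y_train == 0].mean()) / 2
preds = (X_test @ probe_dir > midpoint).astype(int)
accuracy = (preds == y_test).mean()
print(f"probe accuracy: {accuracy:.2f}")  # well above the 0.50 chance level
```

If a probe like this beats verbal confidence on held-out answers, that is evidence the activation site carries evaluative information the output text does not.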
Astrobobo tool mapping
- Knowledge Capture: Document the PANL signal concept and the second-order confidence framework as a reference model for evaluating LLM reliability in your domain.
- Focus Brief: Summarize the causal intervention results (PANL rescues error detection when answers are corrupted) as a design principle for robust prompting strategies.
- Reading Queue: Queue the full arXiv paper and Kumaran et al. (2026) for deeper study of activation patterns and mechanistic interpretability techniques.
Frequently asked
- How do LLMs detect their own errors without external feedback? LLMs maintain an internal confidence signal at the post-answer newline (PANL) that operates independently of output probabilities. This second-order evaluative signal can disagree with the model's chosen response, allowing it to recognize when an answer is likely wrong. The signal encodes not only error likelihood but also whether the model has the knowledge to fix it, enabling self-correction without human input.
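The causal claim (that patching the PANL activation rescues error detection when answer information is corrupted) can be sketched as activation patching. Everything below is illustrative: the `readout` direction, threshold, and activation shapes are stand-ins, not the paper's actual setup:

```python
# Hypothetical sketch of activation patching at the PANL site: corrupt the
# representation, then restore ("patch") the clean PANL activation and check
# that an error-detection readout recovers.
import numpy as np

rng = np.random.default_rng(1)
d = 32
readout = rng.normal(size=d)            # stand-in for a trained error probe
readout /= np.linalg.norm(readout)

clean_panl = 3.0 * readout + rng.normal(scale=0.1, size=d)  # signal present
corrupted_panl = rng.normal(scale=0.1, size=d)              # signal destroyed

def detects_error(panl_activation, threshold=1.0):
    """Probe readout: project onto the learned direction and threshold."""
    return float(panl_activation @ readout) > threshold

print(detects_error(clean_panl))      # True: error signal is readable
print(detects_error(corrupted_panl))  # False: corruption removed the signal
patched = clean_panl.copy()           # patch the clean PANL into the bad run
print(detects_error(patched))         # True again: the PANL site is causal
```

The logic mirrors the paper's reported intervention: if restoring only the PANL activation restores error detection, the signal at that token position, not the answer tokens themselves, carries the evaluative information.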
Cite
Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw. (2026, April 27). LLMs use hidden confidence signals to detect and fix their own errors. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/llms-use-hidden-confidence-signals-to-detect-and-fix-their-own-errors-0bf32d
Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw. "LLMs use hidden confidence signals to detect and fix their own errors." Astrobobo Content Engine, 27 Apr 2026, https://astrobobo-content-engine.vercel.app/article/llms-use-hidden-confidence-signals-to-detect-and-fix-their-own-errors-0bf32d. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.22271.
@misc{astrobobo_llms-use-hidden-confidence-signals-to-detect-and-fix-their-own-errors-0bf32d_2026,
author = {Dharshan Kumaran and Viorica Patraucean and Simon Osindero and Petar Velickovic and Nathaniel Daw},
title = {LLMs use hidden confidence signals to detect and fix their own errors},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/llms-use-hidden-confidence-signals-to-detect-and-fix-their-own-errors-0bf32d},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.22271},
}