Model Architecture Controls Whether Errors Stay Hidden
Transformer design determines whether internal decision signals remain observable after training, independent of output confidence metrics.
Transformer architecture, not just training, determines whether mid-layer activations expose token-level decision quality hidden from confidence scores.
- Output confidence absorbs 57.7% of the raw probe signal, masking true decision quality in frozen activations.
- 24-layer, 16-head configurations collapse to near-zero observability across parameter scales; other configurations maintain healthy signal.
- Observability collapse emerges during training even as loss improves, suggesting architectural constraints erase internal signals.
- Qwen 2.5 and Llama differ by 2.9x in observability at a matched 3B scale, with non-overlapping probe distributions.
- Error-detection probes trained on WikiText catch 10.9–13.4% of the errors that confidence misses across downstream tasks.
- Nonlinear probes and layer sweeps fail to recover signal in collapsed configurations.
- Architecture selection functions as a monitoring decision with measurable consequences for error detection.
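The observability score referenced throughout (rho_partial) is a partial correlation: how well probe output tracks decision quality once confidence and activation norm are regressed out. A minimal sketch of that computation on synthetic data; the variable names and coefficients here are illustrative assumptions, not the paper's code or numbers.

```python
import numpy as np

def partial_correlation(x, y, controls):
    """Correlation between x and y after regressing out control variables.

    Sketch of the rho_partial observability score: residualize both
    series against the controls, then correlate the residuals.
    """
    Z = np.column_stack([np.ones(len(x)), controls])  # controls plus intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic setup: the probe partly re-derives confidence, partly sees a
# hidden quality signal that confidence does not carry.
rng = np.random.default_rng(0)
n = 2000
confidence = rng.normal(size=n)   # stand-in for max-softmax confidence
act_norm = rng.normal(size=n)     # stand-in for activation norm
hidden = rng.normal(size=n)       # decision-quality signal confidence misses
probe_score = confidence + 0.3 * hidden + 0.3 * rng.normal(size=n)
correctness = confidence + hidden + 0.5 * rng.normal(size=n)

raw = np.corrcoef(probe_score, correctness)[0, 1]
rho_partial = partial_correlation(
    probe_score, correctness, np.column_stack([confidence, act_norm])
)
print(f"raw r = {raw:.2f}, rho_partial = {rho_partial:.2f}")
```

The gap between the raw correlation and rho_partial is the share of probe signal that confidence absorbs; only the residual counts as observability.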
Astrobobo tool mapping
- Knowledge Capture: Document the observability scores (rho_partial) for each model architecture you evaluate, and store them alongside accuracy, latency, and other selection criteria.
- Focus Brief: Before committing to a model family, create a one-page observability summary: which layer-head configurations collapse, which preserve signal, and what that means for your error-detection strategy.
- Reading Queue: Add Carmichael's paper and related interpretability work to your queue. Observability is an emerging design criterion; staying current is now part of responsible model selection.
Frequently asked
- Why isn't output confidence enough to detect errors? Confidence (max-softmax) and activation norm absorb approximately 57.7% of the raw signal that probes can extract from mid-layer activations. This means a model can be confident in its output while the internal decision-making process, visible only in frozen activations, shows uncertainty or error. Controlling for these factors reveals hidden signal that confidence alone cannot expose.
Cite
Thomas Carmichael. (2026, April 29). Model Architecture Controls Whether Errors Stay Hidden. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/model-architecture-controls-whether-errors-stay-hidden-ae7584
Thomas Carmichael. "Model Architecture Controls Whether Errors Stay Hidden." Astrobobo Content Engine, 29 Apr 2026, https://astrobobo-content-engine.vercel.app/article/model-architecture-controls-whether-errors-stay-hidden-ae7584. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.24801.
@misc{astrobobo_model-architecture-controls-whether-errors-stay-hidden-ae7584_2026,
author = {Thomas Carmichael},
title = {Model Architecture Controls Whether Errors Stay Hidden},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/model-architecture-controls-whether-errors-stay-hidden-ae7584},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.24801},
}