Model Architecture Controls Whether Errors Stay Hidden
Transformer design determines whether internal decision signals remain observable after training, independent of output confidence metrics.
Transformer architecture, not just training, determines whether mid-layer activations expose token-level decision quality hidden from confidence scores.
- Output confidence absorbs 57.7% of the raw probe signal, masking true decision quality in frozen activations.
- 24-layer, 16-head configurations collapse to near-zero observability across parameter scales; other configurations maintain healthy signal.
- Observability collapse emerges during training even as loss improves, suggesting architectural constraints erase internal signals.
- Qwen 2.5 and Llama differ by 2.9x in observability at a matched 3B scale, with non-overlapping probe distributions.
- Error-detection probes trained on WikiText catch 10.9–13.4% of the errors that confidence misses across downstream tasks.
- Nonlinear probes and layer sweeps fail to recover signal in collapsed configurations.
- Architecture selection functions as a monitoring decision with measurable consequences for error detection.
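The observability score referenced throughout (rho_partial) is a partial correlation: how well probe output tracks decision quality once confidence and activation norm are regressed out. A minimal sketch of that computation on synthetic data; the variable names and coefficients here are illustrative assumptions, not the paper's code or numbers.

```python
import numpy as np

def partial_correlation(x, y, controls):
    """Correlation between x and y after regressing out control variables.

    Sketch of the rho_partial observability score: residualize both
    series against the controls, then correlate the residuals.
    """
    Z = np.column_stack([np.ones(len(x)), controls])  # controls plus intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic setup: the probe partly re-derives confidence, partly sees a
# hidden quality signal that confidence does not carry.
rng = np.random.default_rng(0)
n = 2000
confidence = rng.normal(size=n)   # stand-in for max-softmax confidence
act_norm = rng.normal(size=n)     # stand-in for activation norm
hidden = rng.normal(size=n)       # decision-quality signal confidence misses
probe_score = confidence + 0.3 * hidden + 0.3 * rng.normal(size=n)
correctness = confidence + hidden + 0.5 * rng.normal(size=n)

raw = np.corrcoef(probe_score, correctness)[0, 1]
rho_partial = partial_correlation(
    probe_score, correctness, np.column_stack([confidence, act_norm])
)
print(f"raw r = {raw:.2f}, rho_partial = {rho_partial:.2f}")
```

The gap between the raw correlation and rho_partial is the share of probe signal that confidence absorbs; only the residual counts as observability.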
Astrobobo tool mapping
- Knowledge Capture: Document the observability scores (rho_partial) for each model architecture you evaluate, and store them alongside accuracy, latency, and other selection criteria.
- Focus Brief: Before committing to a model family, create a one-page observability summary: which layer-head configurations collapse, which preserve signal, and what that means for your error-detection strategy.
- Reading Queue: Add Carmichael's paper and related interpretability work to your queue. Observability is an emerging design criterion; staying current is now part of responsible model selection.
Frequently asked
- Why isn't output confidence enough to detect errors? Confidence (max-softmax) and activation norm absorb approximately 57.7% of the raw signal that probes can extract from mid-layer activations. This means a model can be confident in its output while the internal decision-making process, visible only in frozen activations, shows uncertainty or error. Controlling for these factors reveals hidden signal that confidence alone cannot expose.
Cite
Thomas Carmichael. (2026, April 29). Model Architecture Controls Whether Errors Stay Hidden. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/model-architecture-controls-whether-errors-stay-hidden-ae7584
Thomas Carmichael. "Model Architecture Controls Whether Errors Stay Hidden." Astrobobo Content Engine, 29 Apr 2026, https://astrobobo-content-engine.vercel.app/article/model-architecture-controls-whether-errors-stay-hidden-ae7584. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.24801.
@misc{astrobobo_model-architecture-controls-whether-errors-stay-hidden-ae7584_2026,
author = {Thomas Carmichael},
title = {Model Architecture Controls Whether Errors Stay Hidden},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/model-architecture-controls-whether-errors-stay-hidden-ae7584},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.24801},
}