Small Models Match Large Ones via Inference Scaffolding
McClendon et al. show that role-based prompt structuring at inference time raises small-model task completion on complex tasks by roughly 1.6x to 2x, with no retraining or other training overhead.
- Qwen3-8B with three-role scaffolding reaches 8.9% task completion, up from 5.4% baseline.
- Three roles: summarizer (compress history), agent (reason), corrector (fix code without context).
- No retraining required; same frozen weights deployed three times with different prompts.
- 8B model with scaffolding outperforms unscaffolded 33B DeepSeek-Coder on AppWorld benchmark.
- 4-bit quantized version improves from 3.0% to 5.9%, showing gains persist under compression.
- Strongest gains on difficulty-1 tasks: 15.8% to 26.3% (FP16), 5.3% to 14.0% (4-bit).
- The approach is formalized as test-time compute scaling combined with action-space shaping from RL theory.
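The three-role structure in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Generate`, `Scaffold`, the `execute` callback, and all three prompt templates are hypothetical names and wording, and "without context" is read here as the corrector seeing only the failing code and its error, not the task or history.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stand-in for one frozen model: prompt in, text out.
Generate = Callable[[str], str]

# Illustrative prompts only; the paper's actual prompts are not reproduced here.
SUMMARIZER_PROMPT = "Compress the history below, keeping facts needed later.\n\n{history}"
AGENT_PROMPT = "Task: {task}\n\nProgress so far:\n{summary}\n\nWrite the next code action."
CORRECTOR_PROMPT = (
    "The code below failed. Return a fixed version only.\n\n"
    "Code:\n{code}\n\nError:\n{error}"
)


@dataclass
class Scaffold:
    generate: Generate                        # same frozen weights for every role
    execute: Callable[[str], Optional[str]]   # returns an error message, or None on success
    history: List[str] = field(default_factory=list)

    def step(self, task: str) -> str:
        # Role 1, summarizer: compress prior turns (skipped when there are none).
        summary = ""
        if self.history:
            summary = self.generate(
                SUMMARIZER_PROMPT.format(history="\n".join(self.history))
            )

        # Role 2, agent: reason over the task plus the compressed history.
        code = self.generate(AGENT_PROMPT.format(task=task, summary=summary))
        error = self.execute(code)

        # Role 3, corrector: sees only the code and the error (assumption).
        if error is not None:
            code = self.generate(CORRECTOR_PROMPT.format(code=code, error=error))
            error = self.execute(code)

        self.history.append(f"code: {code}\nresult: {error or 'ok'}")
        return code
```

The only thing that changes between roles is the prompt passed to `generate`, which is what lets one set of frozen weights serve all three.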
Astrobobo tool mapping
- Focus Brief: Summarize the three roles (summarizer, agent, corrector) and map them to your current inference pipeline. Identify which role is missing or weak in your setup.
- Knowledge Capture: Document the failure modes you observe in your agent (e.g., repeated API calls, hallucinated credentials, context overflow). Use these to design role-specific prompts.
- Daily Log: Track inference latency and task success rate as you add each role. Log which role contributes most to performance gain in your domain.
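One way to keep the Daily Log comparable across runs is a small per-configuration tracker. The sketch below is an assumption-laden illustration: `RoleLog`, `relative_gain`, and the configuration labels are hypothetical and are not part of any Astrobobo tool.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class RoleLog:
    """Records latency and success per role configuration (hypothetical helper)."""

    def __init__(self) -> None:
        self._runs: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)

    def record(self, config: str, run_task: Callable[[], bool]) -> bool:
        # Time one task attempt and store (seconds, succeeded) under the config label.
        start = time.perf_counter()
        ok = run_task()
        self._runs[config].append((time.perf_counter() - start, ok))
        return ok

    def success_rate(self, config: str) -> float:
        runs = self._runs.get(config, [])
        return sum(ok for _, ok in runs) / len(runs) if runs else 0.0

    def mean_latency(self, config: str) -> float:
        runs = self._runs.get(config, [])
        return sum(sec for sec, _ in runs) / len(runs) if runs else 0.0


def relative_gain(baseline: float, scaffolded: float) -> float:
    """Ratio of scaffolded to baseline success rate."""
    return scaffolded / baseline
```

Applied to the figures reported above, `relative_gain(5.4, 8.9)` is about 1.65 for the FP16 model and `relative_gain(3.0, 5.9)` is about 1.97 for the 4-bit model. Logging one configuration per added role (for example a baseline label, then agent plus corrector, then all three) makes it possible to see which role accounts for most of the change.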
Frequently asked
- No retraining is required. McClendon et al. apply the three-role structure to a frozen Qwen3-8B model without any fine-tuning or additional training. The improvement comes entirely from inference-time prompt engineering and role assignment, so it can be applied directly to existing models.
Cite
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos. (2026, April 17). Small Models Match Large Ones via Inference Scaffolding. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/small-models-match-large-ones-via-inference-scaffolding-dc9f78
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos. "Small Models Match Large Ones via Inference Scaffolding." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/small-models-match-large-ones-via-inference-scaffolding-dc9f78. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.11465.
@misc{astrobobo_small-models-match-large-ones-via-inference-scaffolding-dc9f78_2026,
author = {S. Aaron McClendon and Jorge Gallego-Feliciano and Stavros Zervoudakis and Antonios Saravanos},
title = {Small Models Match Large Ones via Inference Scaffolding},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/small-models-match-large-ones-via-inference-scaffolding-dc9f78},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.11465},
}