Token Importance in On-Policy Distillation: Entropy and Disagreement
Research identifies two regions of high-value tokens in knowledge distillation: high-entropy positions and low-entropy positions where student and teacher disagree, enabling 50–80% token reduction.
On-policy distillation learns most effectively from tokens with high student uncertainty, or from low-entropy tokens where a confident student disagrees with the teacher.
- Student entropy alone captures ~50% of useful tokens; entropy-based sampling matches full-token training with 47% less peak memory.
- Low-entropy, high-divergence tokens (overconfident errors) carry dense corrective signal despite being rare.
- The TIP framework organizes token importance along two axes: student entropy and teacher–student divergence.
- Type-aware selection combining uncertainty and disagreement outperforms entropy-only rules.
- Experiments on Qwen, Llama, and Qwen2.5 show that retaining <20% of tokens can exceed full-token baselines on math and planning tasks.
- Overconfident-wrong tokens are structurally invisible to entropy-only methods but critical for learning.
- Memory savings enable distillation of larger models under constrained GPU budgets.
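The two-axis selection rule summarized above can be sketched in a few lines. This is an illustrative sketch only: the thresholds (`h_hi`, `h_lo`, `d_hi`) and the choice of forward KL as the divergence measure are assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_regions(student_probs, teacher_probs, h_hi=1.2, h_lo=0.5, d_hi=1.0):
    """Classify tokens along the two axes: student entropy and
    teacher-student divergence. Thresholds are illustrative, not from the paper."""
    eps = 1e-12
    # Per-token student entropy: H = -sum_v p(v) log p(v)
    H = -np.sum(student_probs * np.log(student_probs + eps), axis=-1)
    # Per-token forward KL(teacher || student) as the disagreement measure
    D = np.sum(teacher_probs * (np.log(teacher_probs + eps)
                                - np.log(student_probs + eps)), axis=-1)
    uncertain = H >= h_hi                              # region 1: high entropy
    overconfident_wrong = (H <= h_lo) & (D >= d_hi)    # region 2: confident but wrong
    return H, D, uncertain | overconfident_wrong

# Toy batch of three positions over a 4-token vocabulary:
# position 0: student uncertain; position 1: student confident but the
# teacher disagrees; position 2: both confident and in agreement.
student = np.array([[0.25, 0.25, 0.25, 0.25],
                    [0.97, 0.01, 0.01, 0.01],
                    [0.01, 0.01, 0.01, 0.97]])
teacher = np.array([[0.25, 0.25, 0.25, 0.25],
                    [0.01, 0.97, 0.01, 0.01],
                    [0.01, 0.01, 0.01, 0.97]])
H, D, keep = token_regions(student, teacher)
print(keep.tolist())  # → [True, True, False]
```

Only positions in one of the two regions are kept for the distillation loss; the agreed-upon confident position carries little signal and is dropped, which is where the token-reduction and memory savings come from.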
Astrobobo tool mapping
- Knowledge Capture: Log the two-axis taxonomy (entropy vs. divergence) and your model's token importance distribution as a reference table for future distillation projects.
- Focus Brief: Summarize the key insight (overconfident errors matter more than uncertainty) and share it with your ML team to shift token selection strategy from entropy-only to disagreement-aware.
- Experiment Tracker: Record baseline (full-token) and selective (50%, 20% tokens) runs side-by-side with memory, latency, and accuracy metrics to validate the paper's claims on your hardware.
Frequently asked
- Which token regions carry the most learning signal? High-entropy tokens (where the student is uncertain) and low-entropy, high-divergence tokens (where the student is overconfident but wrong). Together, these regions contain the densest learning signal. Entropy alone captures about 50% of useful tokens, but the second region, overconfident errors, carries corrective information that entropy-only methods miss.
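The second region's invisibility to entropy-only filters can be seen on a single toy position. The thresholds here are illustrative assumptions chosen for the example, not values from the paper.

```python
import numpy as np

# One position: the student is confident (low entropy) but puts its mass
# on a different token than the confident teacher.
student = np.array([0.97, 0.01, 0.01, 0.01])
teacher = np.array([0.01, 0.97, 0.01, 0.01])

entropy = -np.sum(student * np.log(student))        # low: ~0.17 nats
kl = np.sum(teacher * np.log(teacher / student))    # high: ~4.4 nats

# An entropy-only rule drops this token; a type-aware rule keeps it.
entropy_only = entropy > 1.0                               # False
type_aware = (entropy > 1.0) or (entropy < 0.5 and kl > 1.0)  # True
print(entropy_only, type_aware)  # → False True
```

An entropy-only filter sees a "confident" token and discards it, even though the large teacher–student divergence marks it as exactly the kind of overconfident error that carries dense corrective signal.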
Cite
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard. (2026, April 17). Token Importance in On-Policy Distillation: Entropy and Disagreement. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/token-importance-in-on-policy-distillation-entropy-and-disagreement-6230f8
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard. "Token Importance in On-Policy Distillation: Entropy and Disagreement." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/token-importance-in-on-policy-distillation-entropy-and-disagreement-6230f8. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.14084.
@misc{astrobobo_token-importance-in-on-policy-distillation-entropy-and-disagreement-6230f8_2026,
author = {Xu, Yuanda and Sang, Hejian and Zhou, Zhengze and He, Ran and Wang, Zhipeng and Geramifard, Alborz},
title = {Token Importance in On-Policy Distillation: Entropy and Disagreement},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/token-importance-in-on-policy-distillation-entropy-and-disagreement-6230f8},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.14084},
}