Distilling Transformers into Mamba via Linearized Attention
A two-stage knowledge transfer method preserves Transformer performance in State Space Models by routing through linearized attention as an intermediate step.
Two-stage distillation through linearized attention enables Mamba models to match Transformer performance without hybrid architectures.
- Naive Transformer-to-Mamba distillation fails; prior work used hybrid models combining both architectures as a workaround.
- Principled initialization of the Mamba weights during distillation recovers the performance that is otherwise lost.
- Stage one: distill the Transformer into a linearized-attention student via a kernel-trick adaptation of softmax attention (see the sketch after this list).
- Stage two: distill the linearized-attention student into a pure Mamba model with no Attention blocks.
- A distilled 1B-parameter Mamba stays close to the teacher's perplexity (14.11 vs. 13.86) and holds up on downstream tasks.
- Ablations cover sequence-mixer variants, model scaling, token allocation between the two stages, and the total distillation budget.
- The method avoids hybrid solutions, enabling deployment of efficient SSM-only models.
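To make the stage-one target concrete, here is a minimal sketch of causal linearized attention for a single head. The ELU+1 feature map, the NumPy setting, and the epsilon in the normalizer are illustrative assumptions rather than the paper's exact design; the point is that once softmax is replaced by a kernel feature map, the attention output can be computed with a running state instead of an n x n attention map.

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = ELU(x) + 1, a common choice for
    # linearized attention (the paper's exact kernel may differ).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V):
    """Causal linearized attention for one head.

    Replaces softmax(Q K^T) V with phi(Q) (phi(K)^T V), computed
    left-to-right with a running state. Shapes: Q, K are (n, d),
    V is (n, d_v).
    """
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    n, d = phi_q.shape
    d_v = V.shape[1]

    S = np.zeros((d, d_v))    # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d)           # running sum of phi(k_t), for normalization
    out = np.zeros((n, d_v))
    for t in range(n):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = phi_q[t] @ S / (phi_q[t] @ z + 1e-6)
    return out

# Tiny usage example with random inputs (hypothetical sizes).
x = np.random.randn(8, 16)
print(linearized_attention(x, x, np.random.randn(8, 32)).shape)  # (8, 32)
```

The running denominator z plays the role of the softmax normalizer, which is what lets the stage-one student imitate the Transformer teacher while already having a recurrent, state-based form.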
Astrobobo tool mapping
- Knowledge Capture: Document the distillation hyperparameters (learning rate, token budget per stage, sequence mixer variant) that work best for your model size and domain, creating a reusable recipe.
- Focus Brief: Summarize the ablation results (scaling curves, token allocation sensitivity) relevant to your inference constraints to guide stage-one vs stage-two budget trade-offs.
- Reading Queue: Queue follow-up papers on linearized attention and SSM initialization to deepen understanding of why the two-stage approach outperforms naive distillation.
Frequently asked
- Why does naive distillation from a Transformer into Mamba fail? The architectural mismatch between Attention (which learns token-to-token relationships) and State Space Models (which use recurrent state transitions) means gradients do not flow effectively, and the student SSM lacks the inductive bias to capture Attention patterns directly. The two-stage approach solves this by using linearized attention as a bridge: a mathematically compatible intermediate form that both Attention and SSM can learn from (see the recurrence sketch below).
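As an illustration of that compatibility, the sketch below writes the sequence mixer's state update as a single recurrence. With decay=None it reduces to plain linearized attention; with a per-step decay, a stand-in for a learned, input-dependent transition like Mamba's, it becomes an SSM-style update. The function name and the scalar decay parameterization are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def recurrent_state_update(phi_k, v, decay=None):
    """State recurrence shared by linearized attention and SSM-style mixers.

    decay is None      ->  S_t = S_{t-1} + outer(phi(k_t), v_t)        (linear attention)
    decay is per-step  ->  S_t = a_t * S_{t-1} + outer(phi(k_t), v_t)  (SSM-style)

    Shapes: phi_k is (n, d), v is (n, d_v), decay is (n,) or None.
    """
    n, d = phi_k.shape
    d_v = v.shape[1]
    S = np.zeros((d, d_v))
    states = []
    for t in range(n):
        a_t = 1.0 if decay is None else decay[t]
        S = a_t * S + np.outer(phi_k[t], v[t])
        states.append(S.copy())
    return states
```

Because the stage-one student already produces its outputs from a state of this shape, the stage-two Mamba student learns a transition and gating on top of a compatible recurrence rather than approximating a full n x n attention map from scratch, which is, informally, why the intermediate student is a more natural teacher for Mamba than the original Transformer.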