Distilling Transformers into Mamba via Linearized Attention
A two-stage knowledge transfer method preserves Transformer performance in State Space Models by routing through linearized attention as an intermediate step.
Two-stage distillation through linearized attention enables Mamba models to match Transformer performance without hybrid architectures.
- Naive Transformer-to-Mamba distillation fails; prior work used hybrid models combining both architectures as a workaround.
- Principled initialization of the Mamba weights during distillation recovers the performance that is otherwise lost.
- Stage one: distill the Transformer into a linearized-attention student via a kernel-trick adaptation of softmax attention (see the sketch after this list).
- Stage two: distill the linearized-attention student into a pure Mamba model with no Attention blocks.
- A distilled 1B-parameter Mamba stays close to the teacher's perplexity (14.11 vs. 13.86) and holds up on downstream tasks.
- Ablations cover sequence-mixer variants, model scaling, token allocation between the two stages, and the total distillation budget.
- The method avoids hybrid solutions, enabling deployment of efficient SSM-only models.
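To make the stage-one target concrete, here is a minimal sketch of causal linearized attention for a single head. The ELU+1 feature map, the NumPy setting, and the epsilon in the normalizer are illustrative assumptions rather than the paper's exact design; the point is that once softmax is replaced by a kernel feature map, the attention output can be computed with a running state instead of an n x n attention map.

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = ELU(x) + 1, a common choice for
    # linearized attention (the paper's exact kernel may differ).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V):
    """Causal linearized attention for one head.

    Replaces softmax(Q K^T) V with phi(Q) (phi(K)^T V), computed
    left-to-right with a running state. Shapes: Q, K are (n, d),
    V is (n, d_v).
    """
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    n, d = phi_q.shape
    d_v = V.shape[1]

    S = np.zeros((d, d_v))    # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d)           # running sum of phi(k_t), for normalization
    out = np.zeros((n, d_v))
    for t in range(n):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = phi_q[t] @ S / (phi_q[t] @ z + 1e-6)
    return out

# Tiny usage example with random inputs (hypothetical sizes).
x = np.random.randn(8, 16)
print(linearized_attention(x, x, np.random.randn(8, 32)).shape)  # (8, 32)
```

The running denominator z plays the role of the softmax normalizer, which is what lets the stage-one student imitate the Transformer teacher while already having a recurrent, state-based form.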
Astrobobo tool mapping
- Knowledge Capture: Document the distillation hyperparameters (learning rate, token budget per stage, sequence mixer variant) that work best for your model size and domain, creating a reusable recipe.
- Focus Brief: Summarize the ablation results (scaling curves, token allocation sensitivity) relevant to your inference constraints to guide stage-one vs stage-two budget trade-offs.
- Reading Queue: Queue follow-up papers on linearized attention and SSM initialization to deepen understanding of why the two-stage approach outperforms naive distillation.
Frequently asked
- Why does naive distillation from a Transformer into Mamba fail? The architectural mismatch between Attention (which learns token-to-token relationships) and State Space Models (which use recurrent state transitions) means gradients do not flow effectively, and the student SSM lacks the inductive bias to capture Attention patterns directly. The two-stage approach solves this by using linearized attention as a bridge: a mathematically compatible intermediate form that both Attention and SSM can learn from (see the recurrence sketch below).
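As an illustration of that compatibility, the sketch below writes the sequence mixer's state update as a single recurrence. With decay=None it reduces to plain linearized attention; with a per-step decay, a stand-in for a learned, input-dependent transition like Mamba's, it becomes an SSM-style update. The function name and the scalar decay parameterization are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def recurrent_state_update(phi_k, v, decay=None):
    """State recurrence shared by linearized attention and SSM-style mixers.

    decay is None      ->  S_t = S_{t-1} + outer(phi(k_t), v_t)        (linear attention)
    decay is per-step  ->  S_t = a_t * S_{t-1} + outer(phi(k_t), v_t)  (SSM-style)

    Shapes: phi_k is (n, d), v is (n, d_v), decay is (n,) or None.
    """
    n, d = phi_k.shape
    d_v = v.shape[1]
    S = np.zeros((d, d_v))
    states = []
    for t in range(n):
        a_t = 1.0 if decay is None else decay[t]
        S = a_t * S + np.outer(phi_k[t], v[t])
        states.append(S.copy())
    return states
```

Because the stage-one student already produces its outputs from a state of this shape, the stage-two Mamba student learns a transition and gating on top of a compatible recurrence rather than approximating a full n x n attention map from scratch, which is, informally, why the intermediate student is a more natural teacher for Mamba than the original Transformer.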