ai · arxiv/cs.LG · 8 min
Distilling Transformers into Mamba via Linearized Attention
A two-stage knowledge transfer method preserves Transformer performance in State Space Models by routing through linearized attention as an intermediate step.
Apr 17, 2026