Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction (arxiv.org)
arXiv:2601.11667v2 Announce Type: replace
Abstract: Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We propose DtR (Distill-then-Replace), which first transfers weights from the pretrained full-attention modules to its linear attention counterparts through blockwise local distillation, and then applies a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. DtR yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
Abstract: Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We propose DtR (Distill-then-Replace), which first transfers weights from the pretrained full-attention modules to its linear attention counterparts through blockwise local distillation, and then applies a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. DtR yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
Comments