Scaling depth capacity via zero/one-layer model expansion (arxiv.org)

arXiv:2511.04981v2 Announce Type: replace
Abstract: Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but incur higher computational cost. To train models efficiently at scale, progressive training (also known as model expansion) scales up model capacity during training and significantly reduces computation with little performance degradation. In this work, we study the depth expansion of large-scale models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training to achieve an optimal tradeoff between computation and loss, supported by comprehensive ablations of our expansion strategy. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ of compute, or equivalently achieve an $\approx 5\times$ acceleration, while attaining a loss comparable to that of a fully trained 60-layer model with 7B parameters, thus demonstrating a mixing behavior in terms of loss. Furthermore, scaling laws on LLAMA3 and DeepSeekV3 models show a $3\sim 5\times$ improvement in compute efficiency, with an increasing advantage at larger scales.
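
The equivalence between "$\approx 80\%$ compute saved" and "$\approx 5\times$ acceleration" is plain arithmetic: if a fraction $s$ of the baseline compute is saved, the remaining cost is $1 - s$, so the speedup is $1/(1 - s)$. A minimal sketch of that conversion follows; the function names are hypothetical and not taken from the paper.

```python
def speedup_from_savings(saved_fraction: float) -> float:
    """Speedup factor implied by saving a fraction of the baseline compute."""
    if not 0.0 <= saved_fraction < 1.0:
        raise ValueError("saved_fraction must be in [0, 1)")
    # Remaining cost is (1 - saved_fraction) of the baseline.
    return 1.0 / (1.0 - saved_fraction)


def savings_from_speedup(speedup: float) -> float:
    """Fraction of baseline compute saved for a given speedup factor."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    print(f"80% saved -> {speedup_from_savings(0.8):.2f}x speedup")
    for factor in (3.0, 5.0):
        print(f"{factor:.0f}x speedup -> {savings_from_speedup(factor):.1%} saved")
```

Under the same conversion, the reported $3\sim 5\times$ efficiency improvement corresponds to saving roughly two thirds to four fifths of the baseline compute.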