MiniCPM-SALA

MiniCPM-SALA (Sparse Attention and Linear Attention) is a 9B hybrid model built from a MiniCPM-4.0 checkpoint via continual training (~2T tokens, 25% of training-from-scratch cost). It interleaves 25% InfLLM-V2 sparse attention and 75% Lightning Attention layers, achieving up to 3.5x inference speed over dense baselines at 256K tokens. With HyPE (Hybrid Positional Encoding) and NoPE in sparse layers, the model extrapolates to 2048K tokens despite a 520K training length, enabling 1M-token inference on consumer GPUs like the RTX 5090.

Context 256K

Benchmarks