Subquadratic Attention
Also known as: subquadratic LLM, linear attention, state space model, SSM, sub-quadratic attention, linear complexity LLM, attention-free LLM
Linear Attention / SSM
Definition
Subquadratic attention refers to any language model architecture whose compute grows more slowly than O(n²) in the sequence length n, where O(n²) is the cost of standard transformer self-attention. Standard transformers compare every token against every other token, which makes long-context processing computationally expensive (in practice, context windows are often limited to roughly 8K–200K tokens before inference cost becomes prohibitive). Subquadratic architectures break this bottleneck via one of four main approaches:
State Space Models (SSMs): Replace attention with a compressed recurrent state that evolves through the sequence. The state size is fixed regardless of context length, yielding O(n) training compute and O(1) memory per generated token at inference. The Mamba architecture (and Mamba-2) is the canonical SSM; models built on Mamba or on related state-space designs, either alone or in hybrid form, include Falcon Mamba, Falcon3-Mamba, Jamba, Zamba, Codestral Mamba, and Liquid Foundation Models.
Recurrent LLMs: The RWKV family (Eagle, Finch, Goose) can be run as a pure RNN with linear complexity, so inference over arbitrarily long sequences uses a fixed memory footprint. Unlike attention-based transformers, RWKV requires no KV cache at all: inference memory does not grow with context length.
Sparse / Selective Attention: Instead of computing all-pairs attention, models select a subset of relevant tokens per query. SubQ (Subquadratic Sparse Attention, 2026) is the most prominent example, claiming O(n) complexity via sparse token selection with a production 1M-token context window.
Hybrid SSM + Attention: Several production models combine attention layers with SSM layers, aiming to pair the stronger exact recall of attention with the lower long-context cost of SSMs. AI21's Jamba family and Zyphra's Zamba family interleave Mamba blocks with standard transformer attention.
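The approaches above differ mainly in how much work and memory each new token requires. The following is a minimal pure-Python sketch, not the implementation of Mamba, RWKV, SubQ, or any other named model. The function names (`ssm_scan`, `full_attention_pair_count`, `sparse_attention_pair_count`) and the diagonal, time-invariant recurrence are illustrative assumptions. The sparse count assumes a fixed budget of k selected tokens per query and ignores the cost of choosing them.

```python
from typing import List


def ssm_scan(xs: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Run a toy diagonal linear state-space recurrence over a scalar sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise, state of fixed size d)
    y_t = sum(c * h_t)
    """
    d = len(a)
    h = [0.0] * d  # fixed-size state: its length never depends on len(xs)
    ys = []
    for x in xs:  # a single pass over the sequence, so O(n) steps in total
        h = [a[i] * h[i] + b[i] * x for i in range(d)]
        ys.append(sum(c[i] * h[i] for i in range(d)))
    return ys


def full_attention_pair_count(n: int) -> int:
    """Query-key scores computed by causal full attention over n tokens."""
    return n * (n + 1) // 2  # token t attends to t positions: 1 + 2 + ... + n


def sparse_attention_pair_count(n: int, k: int) -> int:
    """Query-key scores when each query attends to at most k selected tokens.

    Assumption: selection itself is free; real systems pay extra for it.
    """
    return sum(min(t, k) for t in range(1, n + 1))


if __name__ == "__main__":
    # Impulse response of a one-dimensional state with decay 0.5.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
    for n in (1_000, 100_000, 1_000_000):
        print(n, full_attention_pair_count(n), sparse_attention_pair_count(n, 1024))
```

Under these assumptions, full attention over 1,000,000 tokens computes on the order of 5 × 10¹¹ query-key scores, while a budget of 1,024 tokens per query needs on the order of 10⁹, and the recurrence touches each token exactly once with a state whose size never changes.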
Subquadratic models trade some in-context retrieval precision (transformers with full attention tend to have stronger exact-match recall) for dramatically lower inference cost and memory at long contexts. As context windows extend past 1M tokens, the cost of full attention grows so quickly that subquadratic approaches may become the only economically viable option at scale.
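The memory side of this trade-off can be estimated with simple arithmetic. The sketch below compares the KV cache of a full-attention transformer with the fixed state of a recurrent or SSM model. The layer count, head count, head dimension, and state size are hypothetical values chosen for illustration, not figures for any model named above, and the function names are likewise assumptions.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per token, per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens  # grows linearly with context length


def recurrent_state_bytes(n_layers: int, state_values_per_layer: int,
                          bytes_per_value: int = 2) -> int:
    """Approximate recurrent state size: independent of context length."""
    return n_layers * state_values_per_layer * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configuration: 32 layers, 8 KV heads of dimension 128, 16-bit values.
    for n in (8_000, 200_000, 1_000_000):
        gb = kv_cache_bytes(n, n_layers=32, n_kv_heads=8, head_dim=128) / 1e9
        print(f"{n:>9} tokens: KV cache of about {gb:.1f} GB")
    # Hypothetical recurrent state of 1,000,000 values per layer.
    mb = recurrent_state_bytes(n_layers=32, state_values_per_layer=1_000_000) / 1e6
    print(f"recurrent state of about {mb:.1f} MB at any context length")
```

With this hypothetical configuration the KV cache costs 131,072 bytes per token, which is about 131 GB for a single 1M-token sequence, while the recurrent state stays at 64 MB regardless of how many tokens have been processed.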