architecture · intermediate

Subquadratic Attention

Also known as: subquadratic LLM, linear attention, state space model, SSM, sub-quadratic attention, linear complexity LLM, attention-free LLM

Linear Attention / SSM


Definition

Subquadratic attention refers to any language model architecture whose compute scales with sequence length n more slowly than O(n²), the quadratic cost of standard transformer self-attention. Standard transformers compare every token against every other token, which makes long-context processing computationally expensive (in practice, context windows are typically limited to roughly 8K–200K tokens before inference cost becomes prohibitive). Subquadratic architectures break this bottleneck via one of four main approaches:

State Space Models (SSMs): Replace attention with a compressed recurrent state that evolves through the sequence. The state size is fixed regardless of context length, yielding O(n) training compute and constant, O(1) memory per generated token at inference. The Mamba architecture (and Mamba-2) is the canonical SSM, used in models like Falcon Mamba, Falcon3-Mamba, Jamba, Zamba, Codestral Mamba, and Liquid Foundation Models.
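The fixed-size state can be illustrated with a minimal sketch of a scalar linear recurrence. The function name `ssm_scan` and the constants `a`, `b`, `c` are illustrative assumptions, not taken from any model; Mamba additionally makes its parameters depend on the input, which this sketch leaves out.

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Scalar linear state space recurrence (illustrative only).

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)
    """
    h = 0.0          # the entire "memory" of the sequence: one number
    ys = []
    for x in xs:     # one update per token, so time is linear in len(xs)
        h = a * h + b * x
        ys.append(c * h)
    return ys, h


# The state stays a single float no matter how long the input is.
outputs, final_state = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(outputs, final_state)
```

Each step touches only the previous state and the current input, which is why generation does not need to revisit earlier tokens.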

Recurrent LLMs: The RWKV family (Eagle, Finch, Goose) treats the model as a pure RNN with linear complexity, so inference can in principle run over arbitrarily long sequences with a fixed memory footprint. Unlike transformers, and unlike hybrid models that retain some attention layers, RWKV requires no KV cache at all: inference memory does not grow with context length.
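A back-of-the-envelope sketch shows why the absence of a KV cache matters. The cache formula is the usual one (keys plus values, per layer, per KV head, per token); every configuration number below (32 layers, 8 KV heads, head dimension 128, 16-bit values, the per-layer recurrent state size) is hypothetical and does not describe any particular model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values; grows linearly with n_tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def recurrent_state_bytes(n_layers, state_values_per_layer, bytes_per_value=2):
    """Memory for a fixed recurrent state; independent of context length."""
    return n_layers * state_values_per_layer * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
STATE_VALUES = 64 * 4096  # assumed per-layer state size

for n in (8_000, 200_000, 1_000_000):
    cache = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM)
    state = recurrent_state_bytes(LAYERS, STATE_VALUES)
    print(f"{n:>9,} tokens: KV cache {cache / 2**30:8.2f} GiB, "
          f"recurrent state {state / 2**20:6.2f} MiB")
```

The cache column grows in proportion to the token count while the state column stays constant, which is the property the paragraph above describes.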

Sparse / Selective Attention: Instead of computing all-pairs attention, models select a subset of relevant tokens per query. SubQ (Subquadratic Sparse Attention, 2026) is the most prominent example, claiming O(n) complexity via sparse token selection with a production 1M-token context window.
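The idea of attending to a selected subset can be sketched for a single query with top-k selection. This is a generic illustration and not SubQ's method. The function name `topk_sparse_attention` is an assumption, and because this sketch scores every key before selecting, it does not itself achieve subquadratic cost; a real system would need a cheaper selection step.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend to only the k highest-scoring keys for one query (illustrative)."""
    scale = math.sqrt(len(query))
    # Scaled dot-product score for every key (the expensive part in this sketch).
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Keep the indices of the k best-scoring keys.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected subset only (shifted by the max for stability).
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]
    total = sum(weights)
    # Weighted sum of the selected value vectors.
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, sorted(top)


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
print(topk_sparse_attention([1.0, 0.0], keys, values, k=2))
```

With `k` equal to the number of keys this reduces to ordinary softmax attention for that query; smaller `k` shrinks the softmax and the weighted sum to the selected tokens.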

Hybrid SSM + Attention: Several production models combine attention layers with SSM layers for the best of both: AI21's Jamba family and Zyphra's Zamba family interleave Mamba blocks with standard transformer attention.
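Interleaving can be pictured as a layer schedule. The sketch below is hypothetical: the function name `hybrid_layer_schedule` and the `attention_every` parameter are assumptions for illustration, and the ratio used in the example is not claimed to match Jamba, Zamba, or any other model.

```python
def hybrid_layer_schedule(n_layers: int, attention_every: int) -> list[str]:
    """Return a layer-type list with one attention layer every `attention_every` layers."""
    if attention_every < 1:
        raise ValueError("attention_every must be at least 1")
    return [
        "attention" if (i + 1) % attention_every == 0 else "ssm"
        for i in range(n_layers)
    ]


# Example: 8 layers with an attention layer in every 4th position.
print(hybrid_layer_schedule(8, attention_every=4))
```

Only the attention layers in such a stack need a KV cache, so the cache grows with a fraction of the layer count rather than all of it.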

Subquadratic models trade some in-context retrieval precision (transformers with full attention tend to have stronger exact-match recall) for dramatically lower inference cost and memory at long contexts. As context windows extend past 1M tokens, the quadratic cost of full attention grows fastest, and subquadratic approaches are positioned as the only economically viable option at scale.
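The scaling gap behind this trade-off can be made concrete with simple arithmetic. The sketch below counts only token-level operations and ignores constant factors such as hidden dimension or state size, so it indicates growth rates rather than real costs; both function names are illustrative.

```python
def attention_pair_count(n: int) -> int:
    """Query-key score computations in full self-attention over n tokens."""
    return n * n


def linear_step_count(n: int) -> int:
    """State updates in a linear-time scan over n tokens (one per token)."""
    return n


for n in (8_000, 200_000, 1_000_000):
    ratio = attention_pair_count(n) // linear_step_count(n)
    print(f"n={n:>9,}: full attention needs {ratio:,}x more token-level operations")
```

The ratio equals n itself, so the relative advantage of a linear-time method keeps widening as the context grows.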

Models Mentioning Subquadratic Attention (12)

Qwen3.6-35B-A3B (2026-04)
Holotron-12B (2026-03)
Granite 4.0 H 1B (2025-10)
Granite 4.0 H 350M (2025-10)
Granite 4.0 1B (2025-10)
Jamba Mini 1.6 (2025-03)
Jamba Large 1.6 (2025-03)
Jamba-Instruct (2024-05)
Palmyra Fin 56B (2024-01)
StripedHyena Hessian 7B (2023-12)
Mamba 2 780M (2023-12)
Mamba 2 370M (2023-12)