Efficient Long-context Language Model Training by Core Attention Disaggregation

Yonghao Zhuang; Jiajia Chen; Bo Pang; Yi Gu; Yibo Zhu; Yimin Jiang; Ion Stoica; Eric P. Xing; Hao Zhang

doi:10.48550/arxiv.2510.18121

Abstract

We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
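The load-imbalance argument can be made concrete with a small sketch. It is an illustration under stated assumptions, not DistCA's scheduler: cost is modeled as the number of query-key pairs under a causal mask (constant factors such as head dimension are ignored), communication cost is not modeled, and the greedy longest-processing-time assignment is a stand-in for whatever rebatching policy the attention servers actually use. The function names, shard length, and document mix are all hypothetical.

```python
import heapq


def causal_cost(start, end):
    """Query-key pairs for queries [start, end) of one document (causal mask).

    The query at position p attends to p + 1 keys, so the cost of a shard is
    the sum of (p + 1) for p in [start, end). This is an assumed cost model.
    """
    return (end * (end + 1) - start * (start + 1)) // 2


def shard_document(doc_id, length, shard_len):
    """Split one document's core attention into token-level query shards."""
    tasks = []
    for start in range(0, length, shard_len):
        end = min(start + shard_len, length)
        tasks.append((causal_cost(start, end), doc_id, start, end))
    return tasks


def balance(tasks, num_servers):
    """Greedy assignment: give the largest remaining task to the least-loaded server."""
    heap = [(0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment = {server: [] for server in range(num_servers)}
    for task in sorted(tasks, reverse=True):
        load, server = heapq.heappop(heap)
        assignment[server].append(task)
        heapq.heappush(heap, (load + task[0], server))
    loads = [sum(t[0] for t in assignment[s]) for s in range(num_servers)]
    return assignment, loads


def imbalance(loads):
    """Ratio of the most loaded worker to the mean load (1.0 is perfect balance)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical batch: four data-parallel ranks, each packed with 65,536 tokens,
# but with very different document lengths.
ranks = [[65536], [32768] * 2, [8192] * 8, [1024] * 64]

# Colocated: each rank computes core attention for its own documents.
colocated = [sum(causal_cost(0, n) for n in docs) for docs in ranks]

# Disaggregated: shard every document and balance shards over the same number
# of workers. Attention has no parameters, so any worker can run any shard.
tasks = []
doc_id = 0
for docs in ranks:
    for n in docs:
        tasks.extend(shard_document(doc_id, n, shard_len=1024))
        doc_id += 1
_, balanced = balance(tasks, num_servers=len(ranks))

print(f"colocated imbalance:     {imbalance(colocated):.3f}")
print(f"disaggregated imbalance: {imbalance(balanced):.3f}")
```

Every rank holds the same number of tokens, so the near-linear layers are already balanced. Only the quadratic core attention cost differs, and that is the part the sketch redistributes.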
