
Efficient Long-context Language Model Training by Core Attention Disaggregation

Published 20 Oct 2025 in cs.LG and cs.DC (arXiv:2510.18121v1)

Abstract: We present core attention disaggregation (CAD), a technique that improves long-context LLM training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.

Summary

  • The paper demonstrates that decoupling core attention with CAD significantly reduces compute and memory imbalances in long-context LLM training.
  • CAD leverages the statelessness and composability of core attention, using token-level sharding to achieve up to 1.35x throughput improvement.
  • Through ping-pong execution and optimized communication, DistCA attains near-linear scaling and cuts memory requirements by 20-25%.

Core Attention Disaggregation for Efficient Long-context LLM Training

Motivation and Problem Analysis

The paper introduces Core Attention Disaggregation (CAD), a system-level technique for mitigating compute and memory load imbalance in long-context LLM training. The imbalance arises from the quadratic scaling of core attention (CA) computation with sequence length, contrasted with the near-linear scaling of other transformer components. Document packing, a standard throughput optimization, exacerbates this issue by creating microbatches with highly variable attention FLOPs, resulting in stragglers in both data parallel (DP) and pipeline parallel (PP) regimes (Figure 1).

Figure 1: Transformer and its workload imbalance caused by core attention.
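To make the imbalance concrete, the following back-of-the-envelope sketch (using an assumed 2·L²·d cost model for core attention; the constant and hidden dimension are illustrative) compares two packed microbatches with identical token counts but very different attention FLOPs:

```python
# Illustrative only: two microbatches with the same number of packed tokens can
# differ sharply in core-attention FLOPs, while the near-linear layers see
# identical work. The 2 * L^2 * d cost model and hidden_dim are assumptions.
def attention_flops(doc_lengths, hidden_dim=4096):
    return sum(2 * L * L * hidden_dim for L in doc_lengths)

packed_a = [64_000] * 2    # 128k tokens packed from two long documents
packed_b = [4_000] * 32    # 128k tokens packed from many short documents
print(attention_flops(packed_a) / attention_flops(packed_b))  # ~16x more CA work in A
```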

Existing remedies—variable-length data chunking and per-document context parallelism (CP)—address either compute or memory balance, but not both. Variable-length chunking equalizes attention FLOPs at the cost of memory divergence, while CP introduces significant all-gather communication and memory overhead, especially at scale (Figures 2 and 3).

Figure 2: All-gather latency percentage increases with CP degree, limiting scalability.

Figure 3: Memory divergence in variable-length chunking grows with DP size, leading to memory imbalance.

Core Attention Disaggregation: Design and Implementation

CAD exploits two properties of core attention: statelessness and composability. CA is parameter-free and generates negligible intermediate state, allowing its computation to be scheduled independently. Furthermore, CA can be partitioned at token granularity and recombined into high-occupancy kernels, leveraging modern attention implementations (e.g., FlashAttention) for efficient execution.
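A minimal PyTorch sketch (not DistCA's kernels; shapes, names, and the per-shard loop are illustrative assumptions) of these two properties: core attention is a pure, parameter-free function of (Q, K, V), and shards of arbitrary lengths can be packed into one fused call using cumulative-length offsets, as varlen FlashAttention-style kernels do in a single launch.

```python
import torch

def core_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for one shard: no parameters, no state."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("qhd,khd->hqk", q, k) * scale
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)

def fused_core_attention(q, k, v, cu_seqlens):
    """Apply core attention to a fused batch of token-level shards.

    Tensors are packed along dim 0; cu_seqlens holds cumulative shard
    boundaries, e.g. [0, 512, 2048, 2304]. A real varlen kernel (e.g.
    FlashAttention) handles this in one launch; the loop is only for clarity.
    For simplicity each shard attends to itself here; in CAD a query shard
    attends to its document's full causal KV context.
    """
    out = torch.empty_like(q)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        out[start:end] = core_attention(q[start:end], k[start:end], v[start:end])
    return out

# Three shards of very different lengths fused into one "task batch".
heads, dim = 8, 64
cu = [0, 512, 2048, 2304]
q, k, v = (torch.randn(cu[-1], heads, dim) for _ in range(3))
out = fused_core_attention(q, k, v, cu)
```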

The CAD system, DistCA, disaggregates CA from the rest of the model and schedules CA tasks on a pool of attention servers. Each CA task corresponds to the computation for a shard of query tokens and the key-value states of its context. The runtime alternates between context-independent layers and CA, inserting communication as needed. To maximize resource utilization, DistCA implements in-place attention servers, allowing GPUs to time-share between CA and the other layers (Figure 4).

Figure 4: Ping-pong computation and communication for the in-place attention server, overlapping communication with compute.
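The sketch below illustrates (with assumed names, fields, and shard size; this is not DistCA's actual API) how a packed microbatch could be decomposed into token-level CA tasks, each covering a shard of a document's query tokens together with the causal KV prefix it attends to:

```python
from dataclasses import dataclass

@dataclass
class CATask:
    doc_id: int
    q_start: int   # first query token of the shard within its document
    q_len: int     # number of query tokens in the shard
    kv_len: int    # causal context: the shard attends to document tokens [0, q_start + q_len)

def make_tasks(doc_lengths, shard_size=8192):
    """Split each packed document into fixed-size query shards (illustrative)."""
    tasks = []
    for doc_id, length in enumerate(doc_lengths):
        for q_start in range(0, length, shard_size):
            q_len = min(shard_size, length - q_start)
            tasks.append(CATask(doc_id, q_start, q_len, kv_len=q_start + q_len))
    return tasks

# A 128k-token microbatch packed from one 96k-token and one 32k-token document.
tasks = make_tasks([96_000, 32_000])
print(len(tasks), tasks[0], tasks[-1])
```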

DistCA employs a ping-pong execution scheme to overlap communication and computation, dividing each microbatch into two nano-batches and interleaving their execution. This design hides communication latency and maintains high throughput; Figure 5 shows how the disaggregated schedule integrates with pipeline parallelism.

Figure 5: Pipeline-parallel schedule for standard 1F1B and for disaggregated attention, showing the integration of CAD with PP.
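A structural sketch of the ping-pong idea (all function bodies are placeholders, and the stream handling is an assumption about how the overlap could be expressed, not DistCA's runtime): while one nano-batch's core-attention traffic to and from the attention servers is in flight, the other nano-batch runs its context-independent layers, and the two swap roles each step.

```python
import torch

comm_stream = torch.cuda.Stream()
comp_stream = torch.cuda.Stream()

def attention_roundtrip(qkv):
    # Placeholder for the async send of QKV shards to attention servers and the
    # matching receive of attention outputs (e.g. an all-to-all in a real system).
    return qkv.clone()

def context_independent_layers(x):
    # Placeholder for the MLP / norm / projection compute kept on the training GPUs.
    return 2.0 * x

def ping_pong_step(nano_a, nano_b):
    """One interleaving step: nano_a communicates while nano_b computes."""
    with torch.cuda.stream(comm_stream):
        attn_a = attention_roundtrip(nano_a)
    with torch.cuda.stream(comp_stream):
        nano_b = context_independent_layers(nano_b)
    torch.cuda.synchronize()   # join both streams before the roles swap
    return attn_a, nano_b      # next step: nano_b communicates, nano_a computes
```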

A central scheduler partitions documents into shards and assigns CA tasks to attention servers, optimizing for both load balance (FLOPs) and communication volume (bytes). The scheduling algorithm uses a cost-benefit heuristic to migrate shards between servers, minimizing communication per unit of compute transferred.
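A sketch of one plausible cost-benefit balancing loop (the data structures, tolerance factor, and migration rule are assumptions; the paper's scheduler models placement and communication volume in more detail): shards are greedily migrated from the most-loaded to the least-loaded attention server, preferring moves that shift the most FLOPs per byte of extra communication.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    flops: float        # core-attention FLOPs of this token-level shard
    comm_bytes: float   # extra Q/KV/output bytes if it leaves its home server

@dataclass
class Server:
    shards: list = field(default_factory=list)
    def load(self):
        return sum(s.flops for s in self.shards)

def balance(servers, tolerance=1.05, max_moves=1000):
    """Greedy cost-benefit migration until loads are within the tolerance factor."""
    for _ in range(max_moves):
        servers.sort(key=Server.load)
        dst, src = servers[0], servers[-1]
        if not src.shards or src.load() <= tolerance * dst.load():
            break
        # benefit = FLOPs moved off the straggler; cost = bytes added on the wire
        best = max(src.shards, key=lambda s: s.flops / max(s.comm_bytes, 1.0))
        src.shards.remove(best)
        dst.shards.append(best)
    return servers
```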

Experimental Evaluation

DistCA is evaluated on LLaMA-8B and LLaMA-34B models with context lengths up to 512K tokens, using up to 512 H200 GPUs. The experiments compare DistCA against WLB-LLM, a state-of-the-art workload-balanced parallelism baseline (Figures 6 and 7).

Figure 6: 3D-parallel (no PP) experiments. DistCA achieves a 1.07–1.20x speedup over WLB-LLM on Pretrain and 1.05–1.12x on ProLong.

Figure 7: 4D-parallel (with PP) experiments. DistCA achieves up to a 1.35x speedup over WLB-LLM, with the advantage growing at larger scale and longer context.

DistCA consistently outperforms WLB-LLM, with speedups up to 1.35x and near-linear weak scaling. The throughput advantage increases with context length and model size, as WLB-LLM suffers from compounded memory and communication bottlenecks. DistCA's flexible token-level scheduling and communication overlap enable near-perfect compute and memory balance (Figure 8).

Figure 8: Throughput of core attention remains high for fused shards above the kernel tile size, validating composability.

Ablation studies demonstrate that DistCA's ping-pong execution fully hides communication overhead, and that tuning the scheduler's imbalance tolerance factor can reduce memory requirements by 20–25% without impacting latency (Figures 9 and 10).

Figure 9: Throughput for different communication patterns, showing that ping-pong execution eliminates communication bottlenecks.

Figure 10: Impact of the compute imbalance tolerance factor; an optimal range minimizes memory without increasing iteration latency.

System and Algorithmic Trade-offs

DistCA's design choices—token-level sharding, in-place attention servers, and ping-pong scheduling—address the fundamental mismatch between CA and other transformer components. The system achieves high memory utilization and compute balance without the scalability limitations of CP or the memory divergence of variable-length chunking. The scheduler's cost-benefit heuristic enables fine-grained control over the trade-off between load balance and communication volume.

The main limitation is memory fragmentation due to variable tensor shapes in CA requests, which introduces CPU overhead and degrades performance in large-scale experiments. Static memory allocation and CUDA Graphs are proposed as future optimizations.

Implications and Future Directions

CAD provides a principled approach to decoupling compute and memory scaling in LLM training, enabling efficient utilization of heterogeneous resources. The stateless and composable nature of CA suggests further opportunities for hardware specialization and dynamic resource allocation. Dedicated attention server pools could enhance fault tolerance and performance isolation. Extending the scheduler to support partial context sharding and more precise communication modeling could further improve efficiency.

The separation of CA from context-independent layers may also inform future model architectures and training paradigms, particularly for ultra-long-context or multi-modal models where attention bottlenecks dominate. The demonstrated throughput improvements and scalability suggest that CAD can be foundational for next-generation LLM training systems.

Conclusion

Core Attention Disaggregation, as implemented in DistCA, addresses the critical challenge of load imbalance in long-context LLM training by decoupling and independently scheduling core attention computation. The system leverages the statelessness and composability of CA, achieving up to 1.35x throughput improvement and near-linear scaling on large GPU clusters. The approach eliminates DP/PP stragglers and maintains high resource utilization, with practical implications for scalable, efficient LLM training. Future work may extend CAD to dedicated hardware, more flexible scheduling, and broader model classes.
