Sequential Chunk-wise Optimization (SeCO)

Updated 24 June 2026

SeCO is a computational framework that divides large inputs into sequential chunks, enabling efficient optimization while preserving global context via cross-chunk coordination.
It balances tradeoffs between memory, computational parallelism, and global optimality through techniques such as the SSC scheme, C2Conv, and gradient checkpointing.
Applications span long-context language models, streaming ASR, and trajectory optimization, achieving significant speedups and scalability improvements across domains.

Sequential Chunk-wise Optimization (SeCO) refers to a family of computational and algorithmic strategies that partition an input—either data or a control problem—into manageable sequential “chunks,” and optimize, process, or solve each chunk in a manner that trades off memory, computational parallelism, information flow, and global optimality. While the term "SeCO" appears across diverse domains such as deep learning for long-context LLMs, efficient streaming sequence modeling (e.g., speech recognition), and real-time optimal control, its unifying principle is sequential chunk processing with cross-chunk coordination for scalability and performance.

1. Core Principles and Problem Motivations

Traditional models or solvers struggle with unmanageably large inputs due to quadratic complexity, global self-attention, or prohibitive memory requirements for storing forward activations or solving all-at-once optimization problems. SeCO approaches address these challenges by chunking inputs and exploiting sequential, localized computation, with carefully designed schemes for inter-chunk context propagation or constraint enforcement. The major motivating problems include:

Long-context model training: In LLMs, attention and activation memory scale linearly or quadratically with sequence length, making standard backpropagation infeasible for long sequences on limited hardware (Li et al., 22 May 2025).
Streaming sequence modeling: In streaming ASR, chunk-wise attention enables linear cost and low-latency inference but loses global context, limiting accuracy (Wang et al., 2022).
Trajectory optimization: Multi-phase optimal control problems, such as rocket landing, can become intractable without chunking (phasing) and sequential solution (Kamath et al., 2022).

2. SeCO in Streaming Sequence Models and Speech Recognition

A canonical example is SSCFormer (Wang et al., 2022), where SeCO is instantiated for streaming ASR:

a. Sequential Sampling Chunk (SSC) Scheme:

Vanilla chunk-wise models split sequences into fixed, non-overlapping chunks, applying multi-head self-attention (MHSA) independently in each block, yielding linear complexity but no cross-chunk information. SSC re-partitions chunks at every other layer so that each new chunk comprises tokens interleaved across the previous chunks, permitting cross-chunk attention at constant per-layer cost. Over multiple layers, every token gains global context exposure without sacrificing parallel training or incurring quadratic attention cost. This process is implemented via index-gather operations and matrix reshaping, fully utilizing GPU hardware.
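The repartitioning step can be illustrated with plain index arithmetic. The following is a minimal sketch, assuming a strided sampling pattern and a sequence length divisible by the chunk size; the function names `regular_chunks` and `sampled_chunks` are hypothetical, and the exact sampling order used in SSCFormer may differ.

```python
def regular_chunks(n_tokens, chunk_size):
    """Contiguous, non-overlapping chunks of token indices."""
    assert n_tokens % chunk_size == 0, "sketch assumes an exact split"
    return [list(range(s, s + chunk_size))
            for s in range(0, n_tokens, chunk_size)]


def sampled_chunks(n_tokens, chunk_size):
    """Strided repartition (assumed pattern): chunk j takes every
    n_chunks-th token starting at j, so its members come from
    different contiguous chunks while the chunk size stays the same."""
    assert n_tokens % chunk_size == 0, "sketch assumes an exact split"
    n_chunks = n_tokens // chunk_size
    return [list(range(j, n_tokens, n_chunks)) for j in range(n_chunks)]


if __name__ == "__main__":
    print(regular_chunks(12, 4))
    print(sampled_chunks(12, 4))
```

With 12 tokens and a chunk size of 4, the regular chunks are `[0, 1, 2, 3]`, `[4, 5, 6, 7]`, `[8, 9, 10, 11]`, and the sampled chunks are `[0, 3, 6, 9]`, `[1, 4, 7, 10]`, `[2, 5, 8, 11]`. Attention inside a sampled chunk therefore mixes tokens from all three regular chunks at the same per-chunk cost.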

b. Chunked Causal Convolution (C2Conv):

To provide both unlimited left context (across all past chunks) and limited right (future-within-chunk) context, C2Conv fuses standard causal convolution with chunked convolution, parameterized by an interpolation factor $\lambda$. This module enables the model to access both streaming (future-restricted) and batch (local-future) information, improving error rates without additional inference latency.
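One way to picture the fusion is as a convex combination of two 1-D convolutions. The sketch below is an illustration under stated assumptions, not the SSCFormer implementation: it assumes the blend is `lam * causal + (1 - lam) * chunked`, that the chunked kernel has odd length and is zero-padded at chunk boundaries, and that inputs are scalar sequences. All function names are hypothetical.

```python
def causal_conv(x, w):
    """y[t] = sum_i w[i] * x[t - i]; reads only the current and past
    samples, so left context is not limited by chunk boundaries."""
    return [sum(w[i] * x[t - i] for i in range(len(w)) if t - i >= 0)
            for t in range(len(x))]


def chunked_conv(x, w, chunk_size):
    """Centered kernel that may look ahead, but never outside the
    chunk that contains position t (assumed zero padding at the edges)."""
    assert len(w) % 2 == 1, "sketch assumes an odd-length kernel"
    half = len(w) // 2
    out = []
    for t in range(len(x)):
        lo = (t // chunk_size) * chunk_size
        hi = min(lo + chunk_size, len(x))
        out.append(sum(w[i + half] * x[t + i]
                       for i in range(-half, half + 1)
                       if lo <= t + i < hi))
    return out


def c2conv(x, w_causal, w_chunk, chunk_size, lam):
    """Assumed interpolation between the two branches with weight lam."""
    a = causal_conv(x, w_causal)
    b = chunked_conv(x, w_chunk, chunk_size)
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


if __name__ == "__main__":
    print(c2conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], [1.0, 1.0, 1.0], 2, 0.5))
```

Because the look-ahead never crosses the current chunk, the output for a chunk can be computed as soon as that chunk has arrived, which is consistent with the claim of no additional inference latency.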

c. Architectural Integration:

SeCO in SSCFormer alternates regular chunk-MHSA layers (with C2Conv) and SSC-repartitioned MHSA layers (with C2Conv), ensuring each token receives local, global, and chunk-future context in a linear-complexity pipeline. Empirically, this delivers state-of-the-art performance: on AISHELL-1, SSCFormer achieves CER 5.33% with $W=16$, $\lambda=0.7$, outperforming several strong baselines, with a flat inference real-time factor regardless of utterance length.

3. SeCO in Long-Context LLM Training

In the domain of long-context LLMs, SeCO serves as a training memory optimization strategy (Li et al., 22 May 2025):

a. Sequence-Chunked Gradient Checkpointing:

SeCO partitions the input token sequence into $k$ chunks of size $C=L/k$. During a first forward pass, only the minimal key-value (KV) caches for each chunk are kept, dropping intermediate activations. In the backward pass, each chunk is replayed in reverse order, reconstructing its forward activations from the stored KV caches and model parameters, and the local chunk loss is backpropagated with gradients accumulated at every step.
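The forward/replay/backward pattern can be demonstrated on a toy model with hand-written gradients. The sketch below is an analogy rather than the SeCO code: it assumes a scalar linear recurrence $h_t = a\,h_{t-1} + x_t$ with loss $\sum_t \tfrac{1}{2} h_t^2$, and the state entering each chunk plays the role that the stored KV cache plays in a Transformer. The names `full_grad` and `chunked_grad` are hypothetical.

```python
def full_grad(a, xs):
    """Reference: store every state, then backpropagate through all of them."""
    hs = [0.0]
    for x in xs:
        hs.append(a * hs[-1] + x)
    g, grad = 0.0, 0.0
    for t in range(len(xs), 0, -1):
        g = hs[t] + a * g            # dL/dh_t, including the path from later steps
        grad += g * hs[t - 1]        # contribution of step t to dL/da
    return grad


def chunked_grad(a, xs, chunk_size):
    """Sequence-chunked checkpointing: keep only the state entering each chunk."""
    starts = list(range(0, len(xs), chunk_size))
    boundary, h = [], 0.0
    for s in starts:                 # first forward pass, activations discarded
        boundary.append(h)
        for x in xs[s:s + chunk_size]:
            h = a * h + x
    g, grad = 0.0, 0.0
    for s, h0 in zip(reversed(starts), reversed(boundary)):
        hs = [h0]                    # replay this chunk from its checkpoint
        for x in xs[s:s + chunk_size]:
            hs.append(a * hs[-1] + x)
        for i in range(len(hs) - 1, 0, -1):
            g = hs[i] + a * g        # gradient carried in from later chunks
            grad += g * hs[i - 1]    # accumulate the parameter gradient
    return grad


if __name__ == "__main__":
    xs = [1.0, -2.0, 0.5, 3.0, 1.5, -1.0, 2.0]
    print(full_grad(0.9, xs), chunked_grad(0.9, xs, 3))
```

Both functions return the same gradient, while `chunked_grad` holds at most `chunk_size + 1` states in memory during the backward pass, plus one checkpoint per chunk.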

b. Exact Gradient and Memory Tradeoffs:

This method achieves $O(1)$-in-$L$ activation memory (only one chunk's activations are live at a time), compared with the standard $O(L)$, by checkpointing along the sequence dimension; it is described as the first approach to do so. Recomputation adds computational overhead, but the time overhead is modest ($\sim 30\%$ in practice with a large chunk size). The gradients are theoretically exact.
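As a rough accounting of this tradeoff, assume hypothetical per-token memory costs $a$ for intermediate activations and $\kappa$ for the KV cache (neither symbol appears in the original description). Peak memory then behaves approximately as

$$M_{\text{standard}} \approx a\,L, \qquad M_{\text{SeCO}} \approx a\,C + \kappa\,L, \qquad C = L/k.$$

The activation term depends only on the chunk size $C$, which is the sense in which activation memory is constant in $L$. The KV caches that are kept across chunks still grow linearly with $L$, so the saving is largest when $\kappa$ is much smaller than $a$.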

c. Sparse Chunk-wise Optimization (SpaCO):

SpaCO extends SeCO by randomly selecting a subset of $t<k$ chunks per iteration for backpropagation and rescaling their contributions by a compensation factor that depends on the gradient path length, so that the gradient estimate remains unbiased. This decouples training compute from context length and speeds up training at a negligible increase in LM loss.
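The unbiasedness argument can be checked on a simplified case. The sketch below deliberately swaps in plain uniform subsampling with a $k/t$ rescaling and treats each chunk's gradient contribution as an independent scalar; it does not reproduce SpaCO's actual compensation factor, which also accounts for gradient paths through the KV caches. The function names are hypothetical.

```python
from itertools import combinations


def sparse_estimate(chunk_grads, selected):
    """Rescaled sum over the selected chunks (assumed k/t scaling)."""
    k, t = len(chunk_grads), len(selected)
    return (k / t) * sum(chunk_grads[i] for i in selected)


def expected_estimate(chunk_grads, t):
    """Exact expectation over all equally likely subsets of size t."""
    subsets = list(combinations(range(len(chunk_grads)), t))
    return sum(sparse_estimate(chunk_grads, s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    grads = [0.5, -1.0, 2.0, 4.0]
    print(sum(grads), expected_estimate(grads, 2))
```

Averaged over all subsets, the rescaled estimate equals the full sum, while each iteration backpropagates through only `t` of the `k` chunks.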

d. Empirical Impact:

SeCO allows fine-tuning of 8B-parameter models with 16K context length on a single 24GB GPU, where naive approaches are limited to 1K. Code is open-sourced and amenable to integration with LoRA adapters (Li et al., 22 May 2025).

4. SeCO Variants for Long-sequence Processing in Transformers

Alternative SeCO-style frameworks such as SimCAS—Chunk, Align, Select—structure processing in chunks for manageable compute, and introduce explicit alignment and selection stages (Xie et al., 2023):

Chunking and Batch Processing: Input sequences are chunked by length or sentence boundary, each chunk prepended/appended with special tokens, then parallel-encoded.
Inter-chunk Alignment: At each encoder layer, the start/end token embeddings are batch-averaged across all chunks, propagating global context.
Learned Selection: After all layers, a token selection policy chooses a small output subset for the decoder using actor-critic reinforcement learning, further compressing representation and compute.
Complexity: All stages scale linearly in input length, with actual decoder attention cost capped at the number of selected tokens. Empirical results show strong ROUGE/BERTScore gains on multi-document summarization/QA.
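The inter-chunk alignment step above can be sketched as a simple averaging operation. This is a minimal illustration, assuming each chunk is a list of embedding vectors whose first and last entries are the special start and end tokens; the function name `align_chunks` and the data layout are hypothetical, and the real method operates on batched tensors inside each encoder layer.

```python
def align_chunks(chunks):
    """chunks[c][p] is the embedding (list of floats) at position p of chunk c.

    Returns new chunks whose start and end embeddings are replaced by the
    mean of those embeddings across all chunks; inner tokens are unchanged.
    """
    n = len(chunks)

    def mean(vectors):
        return [sum(vals) / n for vals in zip(*vectors)]

    start = mean([c[0] for c in chunks])
    end = mean([c[-1] for c in chunks])
    return [[list(start)] + [list(v) for v in c[1:-1]] + [list(end)]
            for c in chunks]


if __name__ == "__main__":
    chunks = [[[0.0], [1.0], [2.0]],
              [[4.0], [5.0], [6.0]]]
    print(align_chunks(chunks))
```

After alignment every chunk carries the same start and end embeddings, so the next layer's attention inside a chunk can read a summary of all other chunks at no extra attention cost.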

5. SeCO in Trajectory Optimization and Control

Sequential conic optimization in optimal control (SeCO for OCPs) (Kamath et al., 2022) exemplifies the framework for dynamic systems:

Phasing and Time-interval Dilation: The trajectory is split into temporal phases, with each phase mapped to a normalized interval by introducing dilated time variables as optimization parameters.
Successive Convexification and Discretization: Each phase’s nonlinear dynamics and constraints are repeatedly linearized/discretized around a reference, forming a sequence of strongly convex conic subproblems.
Virtual States and Trust Regions: To guarantee feasibility, a virtual state is used for constraint satisfaction, and soft penalties drive the solution towards the nominal trajectory.
Efficient Solution: Each subproblem is solved by an extrapolated proportional-integral projected gradient (PIPG) method, exploiting sparsity and cheap projections. Empirical results show a speedup over ECOS with negligible accuracy loss.
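The phasing and time-dilation idea from the list above can be illustrated in isolation. The sketch below assumes a 1-D double integrator with constant acceleration per phase and forward-Euler integration; each phase is integrated on the normalized interval $\tau \in [0, 1]$ using $dx/d\tau = s\,f(x, u)$, where $s$ is the phase duration. Here $s$ is a fixed input, whereas in the optimization it would be a decision variable. The function name `simulate_phases` is hypothetical, and the sketch does not implement successive convexification or PIPG.

```python
def simulate_phases(durations, accels, steps=100):
    """Integrate position and velocity phase by phase on tau in [0, 1].

    durations: dilation factor s for each phase (physical phase length)
    accels:    constant acceleration u applied during each phase
    """
    pos, vel = 0.0, 0.0
    dtau = 1.0 / steps
    for s, u in zip(durations, accels):
        for _ in range(steps):
            pos += s * vel * dtau    # d(pos)/dtau = s * vel
            vel += s * u * dtau      # d(vel)/dtau = s * u
    return pos, vel


if __name__ == "__main__":
    # Accelerate for 2 time units, then brake for 1 time unit.
    print(simulate_phases([2.0, 1.0], [1.0, -2.0]))
```

Every phase uses the same normalized grid regardless of its physical length, and the final state of one phase is the initial state of the next, which is the linking condition between chunks in this setting.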

6. Comparative Analysis and Practical Implementation

Domain	Chunking Axis	Key Innovation
LLM training (Li et al., 22 May 2025)	Sequence tokens	Sequential KV-cache checkpointing
Streaming ASR (Wang et al., 2022)	Temporal frames	Cross-layer chunk repartition, C2Conv
Trajectory optimization (Kamath et al., 2022)	Temporal phases	Dilation, virtual state, PIPG
Long-sequence Transformers (Xie et al., 2023)	Tokens/Segments	Chunk-alignment, RL selection

Each domain applies the same core ideas: sequential chunking, localized compute with cross-chunk dependencies or caching, and a memory/computation tradeoff. The details are guided by domain-specific requirements (e.g., hardware efficiency, low latency, phase linking). In all cases, the reported result is linear or sublinear scaling in problem size with little or no loss of accuracy, and the methods are implementable with simple code wrappers or efficient solvers.

7. Significance and Future Directions

SeCO-type methodologies offer a scalable, efficient approach to modeling and optimization for long-context and real-time problems where standard approaches are computationally infeasible. In LLM training, the gradients remain theoretically exact despite the decomposition and recomputation, and in optimal control the speedup comes with negligible accuracy loss. Practical results indicate that SeCO can extend input lengths by more than an order of magnitude on fixed hardware, accelerate large-scale optimal control, and substantially improve real-time sequence recognition. Further research is likely to focus on hybridization (e.g., with sparsified gradient flows), more sophisticated cross-chunk context propagation, and applications in domains such as continual learning, time-series forecasting, and multi-agent coordination.