Channel-Aggregative Sequential Processing (CASP)
- Channel-Aggregative Sequential Processing (CASP) is a neural network module that fuses multi-channel sequential data via specialized attention mechanisms for long-range prediction tasks.
- It alternates multi-head channel attention and masked self-attention layers to dynamically integrate spatial, kinematic, and semantic features across long AIS trajectories.
- Empirical results indicate that CASP boosts vessel destination prediction accuracy by over 14 percentage points compared to simpler fusion approaches, showcasing its effectiveness and scalability.
Channel-Aggregative Sequential Processing (CASP) is a neural network module for deep learning on multi-channel sequential data. It is designed to efficiently transform and propagate complex, multi-aspect trajectory and time series information within long sequences. CASP alternates multi-head channel attention and masked multi-head self-attention to jointly aggregate, weight, and integrate disparate input channels over time, enabling highly expressive and scalable modeling of spatio-temporally structured signals such as those encountered in vessel trajectory analytics (Kim et al., 15 Dec 2025).
1. Motivations and Problem Domain
Channel-Aggregative Sequential Processing was introduced to address key challenges in long-range sequential prediction tasks entailing heterogeneous input representations. The canonical application motivating CASP involves destination prediction for vessels based on Automatic Identification System (AIS) trajectory data. In this scenario, multi-year global AIS records are recast as long sequences indexed by spatial grid cells, with each trajectory step represented as a multi-channel vector comprising spatial, kinematic, semantic, and temporal-regularity encodings. CASP addresses two main requirements:
- Channel aggregation: At each sequence position, inputs from conceptually different channels must be weighted and fused according to their contextual relevance for downstream prediction.
- Long-range sequential integration: Information from all earlier sequence steps—after channel aggregation—must be efficiently and contextually delivered to the current step, supporting fully autoregressive modeling over hundreds of steps, without the bottlenecks or vanishing gradients of recurrent architectures.
This dual function is realized via an alternation of channel and sequence attention mechanisms, stacked in deep blocks (Kim et al., 15 Dec 2025).
2. Multi-Channel Sequence Representation
Let N be the trajectory length, C the number of channels (C = 4 in the reference implementation), and d the hidden dimensionality. The CASP input tensor at layer ℓ is X^(ℓ) of shape C×N×d,
where each of the C channels represents a distinct aspect. In the principal application:
- Channel 1: Spatial encoding (vector representation of the current grid-cell center).
- Channel 2: Kinematic encoding (final hidden state of a GRU applied to raw AIS messages within the cell).
- Channel 3: Semantic context (sum of departure-port and vessel-type embeddings, along with a time encoding).
- Channel 4: Optionally reserved for further semantic features.
These multi-channel representations allow the network to separately process and flexibly combine features that capture static geography, local movement statistics, vessel/class meta-data, and trajectory timing.
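As an illustrative sketch (not the reference implementation), the C×N×d input tensor can be assembled from per-step channel encodings as follows; the random arrays stand in for the learned spatial, kinematic, and semantic encoders described above:

```python
import numpy as np

# Shapes follow the text: N sequence steps, C = 4 channels, hidden size d.
N, C, d = 6, 4, 8
rng = np.random.default_rng(0)

spatial   = rng.normal(size=(N, d))  # grid-cell center encoding
kinematic = rng.normal(size=(N, d))  # stand-in for a GRU's final hidden state
semantic  = rng.normal(size=(N, d))  # departure-port + vessel-type + time encoding
reserved  = np.zeros((N, d))         # optional extra semantic channel

X = np.stack([spatial, kinematic, semantic, reserved])  # shape (C, N, d)
print(X.shape)  # (4, 6, 8)
```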
3. Structural Elements of the CASP Block
Each CASP block consists of three major sublayers:
- Multi-Head Channel Attention (MCA): Aggregates the channels at each of the sequence positions, dynamically re-weighting channels per step based on content and context.
- Masked Multi-Head Self-Attention (MSA): Propagates contextual information along the temporal axis using transformer-style attention, with masking to enforce auto-regressive flow.
- Shared Feed-Forward Network (SFF): Positionwise MLP, enabling non-linear transformations.
Each sublayer is wrapped with residual connections and layer normalization. MCA and subsequent operations are computed in parallel across all sequence positions, leveraging GPU acceleration for scalability.
Blockwise Operation
The functional flow for a single CASP block is:
- Compute MCA at each step to obtain an aggregated representation; replace one of the C channels (typically the first) by this output.
- Normalize via LayerNorm; apply MSA on the flattened selected channel (shape N×d).
- Replace the designated channel with the MSA output and apply LayerNorm.
- Apply the shared feed-forward network independently to all positions, with residual and normalization.
- The output is the updated tensor of shape C×N×d.
Pseudocode for a single CASP block is provided in (Kim et al., 15 Dec 2025):
function CASP_Block(X : C×N×d) → C×N×d
    # Multi-Head Channel Attention (MCA)
    for n in 1…N in parallel:
        x_step = X[:,n,:]        # shape C×d
        Z[n,:] = MCA(x_step)     # shape d
    end
    # Replace designated channel by Z
    U = X; U[1,:,:] = Z          # shape C×N×d
    U = LayerNorm(U)
    # Masked Self-Attention
    S = U[1,:,:]
    H = MSA(S)
    V = U; V[1,:,:] = H
    V = LayerNorm(V)
    # Shared feed-forward network
    for c in 1…C, n in 1…N in parallel:
        v  = V[c,n,:]
        v2 = ReLU(W1 v)
        v3 = W2 v2
        Y[c,n,:] = LayerNorm(v + v3)
    end
    return Y
end
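A minimal runnable sketch of this block flow, with a simplified softmax channel attention standing in for the full MCA sub-layer (all weights and dimensions here are illustrative placeholders, not the paper's trained parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def casp_block(X, rng):
    """Simplified CASP block: X has shape (C, N, d)."""
    C, N, d = X.shape
    # -- Channel attention (simplified single-head stand-in for MCA):
    # score each channel per step, softmax over channels, weighted sum.
    scores = X.mean(-1)                      # (C, N)
    w = softmax(scores, axis=0)[..., None]   # (C, N, 1)
    Z = (w * X).sum(0)                       # (N, d) aggregated representation
    U = X.copy(); U[0] = Z                   # replace the designated channel
    U = layer_norm(U)
    # -- Masked self-attention on the aggregated channel:
    S = U[0]                                 # (N, d)
    mask = np.triu(np.full((N, N), -1e9), k=1)
    A = softmax(S @ S.T / np.sqrt(d) + mask) # causal attention weights
    H = A @ S
    V = U.copy(); V[0] = H
    V = layer_norm(V)
    # -- Shared position-wise feed-forward with residual:
    W1 = rng.normal(size=(d, d)); W2 = rng.normal(size=(d, d))
    Y = layer_norm(V + np.maximum(V @ W1, 0) @ W2)
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6, 8))
Y = casp_block(X, rng)
print(Y.shape)  # (4, 6, 8)
```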
4. Multi-Head Channel Attention (MCA) Sub-Layer
At each sequence step n, MCA processes the C×d matrix X[:,n,:] of per-channel features as follows:
- For each of H heads, transform each channel c with a head-specific linear projection, p_{h,c} = W_h x_{n,c}.
- Compute channel-level pooling: average and max of the projected features across channels for each head, then stack these pooled statistics into a joint descriptor.
- Pass through a 2-layer bottleneck MLP (“squeeze” and “excitation”) to produce channel weights α_h ∈ (0,1)^C via sigmoid.
- Compute the weighted sum of projected channels per head, z_h = Σ_c α_{h,c} p_{h,c}; concatenate all heads, and affinely project back to d dimensions:
Z_n = W_O [z_1; …; z_H],
where W_O is a learned output projection. This mechanism adapts the relative channel importance at each location, informed by the joint channel context (Kim et al., 15 Dec 2025).
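The steps above can be sketched as follows; the head count, head width dh, bottleneck width r, and the exact pooling/stacking layout are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mca_step(x, Wh, W_sq, W_ex, Wo):
    """One MCA application at a single sequence step.
    x: (C, d) per-channel features; Wh: (H, d, dh) head projections;
    W_sq: (H, r, 2*dh) "squeeze"; W_ex: (H, C, r) "excitation";
    Wo: (H*dh, d) output projection. Returns the aggregated (d,) vector."""
    H = Wh.shape[0]
    heads = []
    for h in range(H):
        p = x @ Wh[h]                                   # (C, dh) projected channels
        pooled = np.concatenate([p.mean(0), p.max(0)])  # avg+max across channels
        hidden = np.maximum(W_sq[h] @ pooled, 0.0)      # bottleneck "squeeze"
        alpha = sigmoid(W_ex[h] @ hidden)               # (C,) channel weights
        heads.append(alpha @ p)                         # weighted channel sum, (dh,)
    return np.concatenate(heads) @ Wo                   # affine map back to d

rng = np.random.default_rng(0)
C, d, H, dh, r = 4, 8, 2, 4, 4
x = rng.normal(size=(C, d))
Wh = rng.normal(size=(H, d, dh))
W_sq = rng.normal(size=(H, r, 2 * dh))
W_ex = rng.normal(size=(H, C, r))
Wo = rng.normal(size=(H * dh, d))
z = mca_step(x, Wh, W_sq, W_ex, Wo)
print(z.shape)  # (8,)
```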
5. Masked Multi-Head Self-Attention and Sequential Propagation
Flattening the aggregated channel yields a sequence S of shape N×d. The subsequent MSA applies transformer-style sequence modeling:
- Queries, keys, and values are constructed via learned projections per head.
- Attention weights are computed with a lower-triangular (causal) mask so that each step attends only to itself and earlier steps: Attention(Q, K, V) = softmax(Q Kᵀ / √d_k + M) V, where M_ij = 0 for j ≤ i and −∞ otherwise.
- Per-head outputs are aggregated and projected.
- The MSA thus propagates information from all previous steps to the current, bypassing the need for recurrence and circumventing vanishing gradients.
This alternation of per-step channel attention and sequence-wide masked self-attention within each block enables deep integration of both spatial/semantic channel features and long-range sequential context.
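A minimal sketch of the masked self-attention sub-layer, including a check that the causal mask really blocks rightward information flow (all weights are random placeholders for the learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_msa(S, Wq, Wk, Wv, Wo):
    """Masked multi-head self-attention. S: (N, d); Wq/Wk/Wv: (H, d, dk)."""
    N, d = S.shape
    Hn, _, dk = Wq.shape
    mask = np.triu(np.full((N, N), -np.inf), k=1)  # forbid attending to j > i
    heads = []
    for h in range(Hn):
        Q, K, V = S @ Wq[h], S @ Wk[h], S @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(dk) + mask)  # causal attention weights
        heads.append(A @ V)
    return np.concatenate(heads, axis=1) @ Wo      # (N, d)

rng = np.random.default_rng(1)
N, d, Hn, dk = 5, 8, 2, 4
S = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(Hn, d, dk)) for _ in range(3))
Wo = rng.normal(size=(Hn * dk, d))
out1 = masked_msa(S, Wq, Wk, Wv, Wo)
S2 = S.copy(); S2[-1] += 1.0                # perturb only the final step
out2 = masked_msa(S2, Wq, Wk, Wv, Wo)
assert np.allclose(out1[:-1], out2[:-1])    # earlier outputs unchanged: causal
```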
6. Empirical Performance, Ablation, and Computational Characteristics
Quantitative results from (Kim et al., 15 Dec 2025) indicate:
- Removing the local kinematic (GRU) channel reduces accuracy by 14.75 percentage points, underlining the necessity of multi-aspect input.
- Removal of semantic embeddings (departure port or ship type) degrades performance by several percentage points.
- Substituting MCA with simple concatenation, or with single-head channel attention, yields lower accuracy than the full multi-head MCA, which performs best.
- Adding the specialized Gradient Dropout technique further increases accuracy and macro-F1, suggesting effectiveness under variable sequence lengths.
In terms of computational complexity, MCA at each step is dominated by the per-head channel projections, roughly O(C·d²) per step, and is performed in parallel across all N steps. MSA costs O(N²·d) over the sequence. The absence of recurrence permits full GPU parallelization over long sequences, ensuring scalability for lengthy AIS trajectories (Kim et al., 15 Dec 2025).
7. Connections to Broader Channel Combination Paradigms
CASP’s channel-aggregative design is closely related to self-attention channel combination strategies applied in other domains. For example, in distant-automatic speech recognition (ASR), the Self-Attention Channel Combiner (SACC) algorithm is used post-beamforming to select and fuse multiple beam outputs, with weights adaptively learned via softmax attention over each frame (Sharma et al., 2022). Both CASP and SACC exhibit the following:
- Channel-wise content-driven weighting at each temporal step.
- Integration into deep learning pipelines where multi-view or multi-channel context is critical for robust prediction.
- Replacing static or concatenation-based fusion with trainable, context-adaptive mechanisms.
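A minimal sketch of such a content-driven softmax channel combiner; the scoring vector w is a stand-in for SACC's learned attention parameters, not the published architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def softmax_channel_combine(X, w):
    """Content-driven channel fusion per frame (SACC-style sketch).
    X: (C, N, d) per-channel features; w: (d,) learned scoring vector."""
    scores = X @ w                              # (C, N): one score per channel per frame
    alpha = softmax(scores, axis=0)             # normalize across channels per frame
    return (alpha[..., None] * X).sum(axis=0)   # fused sequence, (N, d)

rng = np.random.default_rng(2)
C, N, d = 3, 5, 8
X = rng.normal(size=(C, N, d))
w = rng.normal(size=d)
fused = softmax_channel_combine(X, w)
print(fused.shape)  # (5, 8)
```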
A plausible implication is that the success of CASP in long-range trajectory prediction and SACC in multi-channel spatial audio processing highlights the generality of channel-attention paradigms across diverse temporal and spatially-indexed modalities. Each method’s architecture reflects task-adaptive choices in the sequence and channel structure, masking strategy, and aggregation projection.
CASP formalizes this as a repeatable block, facilitating deep end-to-end architectures for complex data fusion scenarios.