Intra/Inter Secondary Transforms (IST)
- IST is a class of secondary transforms that refine primary block-wise or vectorized transforms through non-separable processing for enhanced energy compaction.
- They employ learned non-separable kernels on selected low-frequency coefficients, boosting rate-distortion performance with minimal signaling overhead.
- Adapted for both video coding and neuromorphic/transformer architectures, ISTs optimize intra-token and inter-token processing for energy-efficient computation.
Intra/Inter Secondary Transforms (IST) refer to a class of signal and neural network transformations that enhance primary block-wise or vectorized transforms by applying additional, typically non-separable, processing stages. Initially developed for video coding—where they target both intra-frame (spatial prediction) and inter-frame (temporal prediction) residuals—IST principles have also been adapted in neuromorphic and transformer-based AI, where they formalize a systematic distinction between operations within elements of a vector (intra-token) and across vectors in a sequence (inter-token). IST modules aim to improve energy compaction, contextual processing, or energy-efficient computation by introducing a secondary transform or mixing stage after a primary transform or embedding, but restricted to carefully targeted subspaces or axes.
1. Formal Definitions and Theoretical Structure
In video coding, ISTs are non-separable transforms applied after a separable primary transform (such as DCT or ADST) and before quantization. The process involves applying a learned orthonormal kernel only to a small support of low-frequency coefficients, typically in the upper-left corner of the transformed block, to exploit residual correlations left by the primary transform while keeping arithmetic and signaling overhead minimal. Mathematically, for a primary-transformed block $C$, a mask operator $M$ selects a vector $c = M(C)$ containing the coefficients to which a kernel $K$ is applied: $y = Kc$. Only $y$ is quantized; inverse IST reconstructs the vector by $\hat{c} = K^{\top}\hat{y}$ and returns it into the original coefficient positions (Nalci et al., 6 Jan 2026, Pakiyarajah et al., 21 May 2025).
In neuromorphic AI and transformer-based architectures, ISTs are generalized to encompass intra-token (mixing channels/features within a token) and inter-token (mixing information across tokens) secondary transforms after the initial embedding. Intra-token transforms are typically channel-mixing operations such as pointwise feedforward or spiking neural networks applied independently to each token $x_t \in \mathbb{R}^{d}$: $z_t = f_{\text{intra}}(x_t)$. Inter-token transforms operate along the token dimension; for each channel $j$, $y_{1:T,j} = f_{\text{inter}}(z_{1:T,j})$.
Both stages preserve the notion of a primary/secondary distinction, with the latter targeting refinement or higher-level abstraction (Simeone, 1 Jan 2026).
2. Motivation and Algorithmic Rationale
Primary transforms in video codecs (DCT, ADST, path-graph KLT) are separable and tailored to efficiently decorrelate pixels in smooth or structured blocks, but they are sub-optimal for residuals featuring non-axis-aligned or highly directional textures. Applying a full non-separable KLT is infeasible due to memory and compute costs. IST addresses this by overlaying a small, non-separable transform specifically trained (e.g., via PCA or cluster-based KLT) on the significant low-frequency subspace, thereby enhancing energy compaction and reducing bitrate (Nalci et al., 6 Jan 2026, Pakiyarajah et al., 21 May 2025).
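As a back-of-envelope illustration of the cost gap, the following arithmetic (illustrative only; these counts are not taken from the cited papers) compares a full non-separable KLT against a separable primary transform plus an AV2-style secondary kernel:

```python
# Illustrative multiply counts for an 8x8 residual block (not from the papers).
N = 8
pixels = N * N                 # 64 coefficients

full_klt  = pixels * pixels    # one 64x64 non-separable matrix: 4096 multiplies
separable = 2 * N * pixels     # row pass + column pass with NxN matrices: 1024
ist_extra = 32 * 48            # AV2-style 32x48 secondary kernel: 1536 extra

# For a 32x32 block a full KLT needs a 1024x1024 matrix (~1M multiplies and
# ~1M stored weights per mode/cluster), while the IST kernel stays bounded
# at 48 input coefficients regardless of block size.
full_klt_32 = (32 * 32) ** 2
```

The memory cost is the sharper constraint: a distinct full KLT per prediction mode or cluster scales quadratically in block area, whereas the IST support is fixed.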
In neuromorphic architectures and energy-efficient AI, a similar principle holds: the initial token embedding disperses features into a vector space; secondary (IST) transforms then refine intra-token structure (via spiking neural elements or feedforward subnets) and build or update context along the inter-token axis (via state-space recurrences, attention, or neuromorphic approximation of these) (Simeone, 1 Jan 2026).
3. Mathematical Formulation and Implementation
Video Coding IST (AV2/AVM/VVC)
Let $C$ be the coefficient block post-primary transform. IST applies as follows:
- Define a support mask $M$: $M$ selects the $n$ low-frequency coefficients.
- Let $c = M(C) \in \mathbb{R}^{n}$, and apply the learned kernel $K \in \mathbb{R}^{m \times n}$ ($m \le n$): $y = Kc$.
- $y$ is quantized and encoded; on decode, obtain $\hat{y}$, reconstruct $\hat{c} = K^{\top}\hat{y}$, and insert back into the original coefficient positions via $M^{\top}$.
- $K$ is cluster- or mode-dependent, trained via offline residual statistics (e.g., non-separable KLT on cluster data) and orthonormalized such that $K K^{\top} = I$ (Nalci et al., 6 Jan 2026).
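The forward/inverse pipeline above can be sketched in a few lines of numpy. This is a minimal stand-in, not codec code: the kernel here is a random row-orthonormal matrix rather than a trained one, and a raster scan stands in for the codec's coefficient scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes mirroring the AV2-style configuration for TB >= 8x8 with a DCT-2
# primary transform: n = 48 low-frequency coefficients, kernel K is 32x48.
n, m = 48, 32

# Stand-in "learned" kernel: rows orthonormalized via QR so that K K^T = I_m.
# (In the codec, K comes from offline training on clustered residuals.)
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
K = Q.T                        # shape (m, n)

def ist_forward(C, scan_idx):
    """Apply the secondary transform to the masked low-frequency coeffs."""
    c = C.flat[scan_idx]       # mask M: gather the n selected coefficients
    return K @ c               # y = K c; only y is quantized and coded

def ist_inverse(y, C_shape, scan_idx):
    """Reconstruct the masked coefficients via the transpose kernel."""
    c_hat = K.T @ y            # c_hat = K^T y (rows of K are orthonormal)
    C_hat = np.zeros(C_shape)
    C_hat.flat[scan_idx] = c_hat   # scatter back: the M^T step
    return C_hat

# Example on an 8x8 primary-transformed block; raster order stands in
# for the codec's low-frequency scan.
C = rng.standard_normal((8, 8))
scan = np.arange(n)
y = ist_forward(C, scan)
C_hat = ist_inverse(y, C.shape, scan)
```

Because $m < n$, the round trip is a projection onto the kernel's row space, not an exact inverse; the codec recovers exactness only up to quantization within that subspace.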
Signaling syntax differs between intra- and inter-modes:
- Intra IST: several kernel sets, signaled via set and kernel indices.
- Inter IST: a single kernel set, so only the kernel index is signaled.
Memory and compute costs are tightly bounded. For $8 \times 8$ blocks or larger, the dominant kernel sizes are $32 \times 48$ (DCT-2) or $20 \times 48$ (ADST), with under 15 multiplies per pixel in the worst case, and no FFT-style acceleration is required (Nalci et al., 6 Jan 2026).
Neuromorphic/Transformer IST
After embedding the input into tokens $x_1, \ldots, x_T \in \mathbb{R}^{d}$:
- Intra-token (per-token channel mixing): $z_t = f_{\text{intra}}(x_t)$, applied independently for each $t$. Realized as spiking neural networks (SNNs) with leaky integrate-and-fire (LIF) neurons or feedforward subnets.
- Inter-token (across-token position mixing): for each channel $j$, $y_{1:T,j} = f_{\text{inter}}(z_{1:T,j})$. Realized as SSMs (state-space models), softmax self-attention, or neuromorphic approximations thereof using spike-based coincidence or stochastic codes.
A schematic architecture places intra-token SNNs/FFNs before and after an inter-token (SSM or self-attention) module within each layer (Simeone, 1 Jan 2026).
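A toy numpy sketch of this layer layout, with an FFN standing in for the intra-token stage and a diagonal state-space recurrence standing in for the inter-token stage (all parameter names and sizes are illustrative, not from the cited work):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 8   # sequence length, embedding dimension (toy sizes)

# Illustrative FFN weights for the intra-token stage.
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

def intra_token(X):
    """Channel mixing applied to each token independently (FFN-style)."""
    return np.maximum(X @ W1, 0.0) @ W2        # (T, d) -> (T, d)

def inter_token(X, a=0.9):
    """Position mixing per channel: a diagonal state-space recurrence
    h_t = a * h_{t-1} + x_t, run independently for every channel."""
    H = np.zeros_like(X)
    h = np.zeros(X.shape[1])
    for t in range(X.shape[0]):
        h = a * h + X[t]
        H[t] = h
    return H

X = rng.standard_normal((T, d))                 # token embeddings
Y = intra_token(inter_token(intra_token(X)))    # intra -> inter -> intra
```

The composition order mirrors the schematic: channel mixing before and after the context-building stage, with only the inter-token stage coupling positions.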
4. Rate-Distortion and Coding Efficiency
Joint optimization of primary and IST kernels via rate-distortion objectives is central. In AVM codec experiments, joint clustering and separate path-graph transform (SPGT) design yielded the lowest total RD cost. Explicit improvements are:
- Smaller blocks (mean over 12 intra modes): BD-rate savings using joint/SPGT IST over a DCT/ADST-only baseline.
- Larger blocks: further BD-rate savings with joint/SPGT IST (Pakiyarajah et al., 21 May 2025). In AV2, IST alone produces BD-rate reductions (PSNR) in all-intra natural video, random-access, and low-delay configurations, making it the largest contributor among residual-related tools (Nalci et al., 6 Jan 2026).
IST integration yields only minimal run-time overhead; all heavy optimization and kernel learning occur offline, with negligible signaling (2 bits for the primary transform, 1 bit for the IST apply/bypass flag) (Pakiyarajah et al., 21 May 2025).
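The BD-rate figures reported throughout this section follow the standard Bjøntegaard metric. A minimal sketch of that computation (the function name and the conventional cubic fit of log-rate against PSNR are ours):

```python
import numpy as np

def bd_rate(rate_a, psnr_a, rate_b, psnr_b):
    """Bjontegaard delta-rate: average % bitrate change of codec B vs. A
    at equal quality, via cubic fits of log-rate as a function of PSNR."""
    la, lb = np.log(rate_a), np.log(rate_b)
    pa = np.polyfit(psnr_a, la, 3)
    pb = np.polyfit(psnr_b, lb, 3)
    # Integrate both fits over the overlapping PSNR range.
    lo = max(min(psnr_a), min(psnr_b))
    hi = min(max(psnr_a), max(psnr_b))
    ia, ib = np.polyint(pa), np.polyint(pb)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_b = (np.polyval(ib, hi) - np.polyval(ib, lo)) / (hi - lo)
    return (np.exp(avg_b - avg_a) - 1.0) * 100.0
```

A negative return value means codec B (e.g., the IST-enabled encoder) needs less bitrate than codec A at the same PSNR.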
5. Comparative Analysis in Neuromorphic and Transformer Models
IST divides post-embedding processing into intra-token (FFN/SNN, independent per token) and inter-token (attention/SSM, context-building) axes. Intra-token SNNs, operating over a virtual time axis for each token, support highly sparse and energy-efficient computation, approaching theoretical improvements of 30–100× in certain tasks (e.g., keyword spotting, image classification) versus dense GPU approaches. Inter-token modules enable online context integration with O(N) or O(N²) cost, with sparse attention or SSMs enabling further energy reduction. Both are compatible with surrogate-gradient training or local/plasticity rules, maintaining >95% accuracy relative to dense transformer or SSM baselines while reducing computational overhead by more than an order of magnitude (Simeone, 1 Jan 2026).
6. Extensions, Application Domains, and Relationship to Prior Art
ISTs in video coding generalize and improve upon prior non-separable transforms such as NSST (HEVC) and LFNST (VVC) by allowing efficient inclusion in both intra- and inter-modes, restricting support to low-frequency coefficients, and minimizing signaling/compute overhead. AV1 (predecessor of AV2) did not deploy secondary transforms, representing a substantial step forward in AV2 (Nalci et al., 6 Jan 2026).
In neuromorphic AI, the intra/inter IST distinction is reflected in advances in spiking transformers (Spikformer, SpikeGPT) and SSM-based SNNs (P-SpikeSSM), supporting applications in sequence modeling, language, and energy-constrained domains. ISTs, as modular processing units, provide a scalable template for balancing energy use, context integration, and representational flexibility, central desiderata in next-generation AI hardware and software (Simeone, 1 Jan 2026).
7. Summary Table: IST Variants in Video Coding
| Context | IST Support Size | Kernel Shape | Application Mode(s) |
|---|---|---|---|
| TB < 8×8 | 16 coeffs | 8×16 | Intra/Inter (AV2) |
| TB ≥ 8×8, DCT-2 | 48 coeffs | 32×48 | Intra/Inter (AV2) |
| TB ≥ 8×8, ADST | 48 coeffs | 20×48 | Intra (AV2) |
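The table above can be encoded directly as a lookup (function and key names are illustrative):

```python
def ist_config(tb_min_dim, primary, mode):
    """Map (transform-block min dimension, primary transform, prediction
    mode) to the IST support size and kernel shape, per the AV2 table."""
    if tb_min_dim < 8:
        return {"support": 16, "kernel": (8, 16)}    # intra and inter
    if primary == "DCT-2":
        return {"support": 48, "kernel": (32, 48)}   # intra and inter
    if primary == "ADST" and mode == "intra":
        return {"support": 48, "kernel": (20, 48)}
    return None   # no IST for this combination
```

Each kernel shape is (m, n): n masked input coefficients mapped to m secondary-transform outputs.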
ISTs are now established as a core design pattern for improving compaction and energy efficiency both in video codecs and in AI hardware/software, with mature methodologies for kernel learning, signaling, mode selection, and integration into block-by-block pipelines (Nalci et al., 6 Jan 2026, Pakiyarajah et al., 21 May 2025, Simeone, 1 Jan 2026).