Intra/Inter Secondary Transforms (IST)

Updated 13 January 2026
  • IST is a class of secondary transforms that refine primary block-wise or vectorized transforms through non-separable processing for enhanced energy compaction.
  • They employ learned non-separable kernels on selected low-frequency coefficients, boosting rate-distortion performance with minimal signaling overhead.
  • Adapted for both video coding and neuromorphic/transformer architectures, ISTs optimize intra-token and inter-token processing for energy-efficient computation.

Intra/Inter Secondary Transforms (IST) refer to a class of signal and neural network transformations that enhance primary block-wise or vectorized transforms by applying additional, typically non-separable, processing stages. Initially developed for video coding—where they target both intra-frame (spatial prediction) and inter-frame (temporal prediction) residuals—IST principles have also been adapted in neuromorphic and transformer-based AI, where they formalize a systematic distinction between operations within elements of a vector (intra-token) and across vectors in a sequence (inter-token). IST modules aim to improve energy compaction, contextual processing, or energy-efficient computation by introducing a secondary transform or mixing stage after a primary transform or embedding, but restricted to carefully targeted subspaces or axes.

1. Formal Definitions and Theoretical Structure

In video coding, ISTs are non-separable transforms applied after a separable primary transform (such as DCT or ADST) and before quantization. A learned orthonormal kernel is applied only to a small support of low-frequency coefficients, typically in the upper-left region of the residual block, to exploit residual correlations left by the primary transform while keeping arithmetic and signaling overhead minimal. Mathematically, for a primary-transformed block $R \in \mathbb{R}^{B \times B}$, a mask operator $\mathcal{M}$ selects a vector $v \in \mathbb{R}^{N}$ of the coefficients to which a kernel $K \in \mathbb{R}^{M \times N}$ is applied:

$$u = K v, \quad u \in \mathbb{R}^{M}$$

Only $u$ is quantized; the inverse IST reconstructs $\hat{v} = K^T \hat{u}$ and returns the coefficients to their original positions (Nalci et al., 6 Jan 2026; Pakiyarajah et al., 21 May 2025).
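The forward/inverse pair above can be sketched in NumPy. This is a toy illustration, not codec code: the kernel here is a random orthonormal matrix standing in for a trained one, and the diagonal-order mask is a stand-in for the codec's actual low-frequency scan.

```python
import numpy as np

B, N, M = 8, 48, 32        # block size, IST support size, kept coefficients
rng = np.random.default_rng(0)

# Random orthonormal kernel (rows orthonormal: K @ K.T = I_M); a real codec
# would use a kernel trained offline on clustered residual statistics.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
K = Q[:M, :]

R = rng.standard_normal((B, B))   # block after the primary transform

# Support mask: pick the N lowest-frequency coefficients by diagonal order.
freq = np.add.outer(np.arange(B), np.arange(B)).ravel()
mask_idx = np.argsort(freq, kind="stable")[:N]

v = R.ravel()[mask_idx]           # masked coefficient vector v
u = K @ v                         # forward IST: only u is quantized/coded
v_hat = K.T @ u                   # inverse IST: projection onto the kernel's span

R_hat = R.ravel().copy()
R_hat[mask_idx] = v_hat           # reinsert into original positions
R_hat = R_hat.reshape(B, B)
```

Because $K$ has orthonormal rows, the inverse is just the transpose, and the round trip is the orthogonal projection of $v$ onto the kernel's row space.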

In neuromorphic AI and transformer-based architectures, ISTs are generalized to encompass intra-token (mixing channels/features within a token) and inter-token (mixing information across tokens) secondary transforms after the initial embedding. Intra-token transforms are typically channel-mixing operations such as pointwise feedforward networks: for each token $n$,

$$Y_{d,n} = f_{d,n}(X_{:,n})$$

Inter-token transforms operate along the token dimension: for each channel $d$,

$$Y_{d,n} = g_{d,n}(X_{d,:})$$

Both stages preserve the primary/secondary distinction, with the secondary stage targeting refinement or higher-level abstraction (Simeone, 1 Jan 2026).
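A minimal sketch of the two mixing axes follows; shapes only are illustrative, and the single matrices `W_intra` and `W_inter` are stand-ins for the learned modules described above.

```python
import numpy as np

D, N = 16, 10                      # channels per token, sequence length
rng = np.random.default_rng(1)
X = rng.standard_normal((D, N))    # embedded sequence, one column per token

# Intra-token transform: mixes channels within each token (column-wise);
# each output column n depends only on input column n.
W_intra = rng.standard_normal((D, D)) / np.sqrt(D)
Y_intra = np.tanh(W_intra @ X)

# Inter-token transform: mixes positions within each channel (row-wise);
# each output row d depends only on row d of the intermediate.
W_inter = rng.standard_normal((N, N)) / np.sqrt(N)
Y = Y_intra @ W_inter
```

The key structural point is the axis of independence: the intra-token stage never looks across columns, and the inter-token stage never looks across rows.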

2. Motivation and Algorithmic Rationale

Primary transforms in video codecs (DCT, ADST, path-graph KLT) are separable and tailored to efficiently decorrelate pixels in smooth or structured blocks, but they are sub-optimal for residuals featuring non-axis-aligned or highly directional textures. Applying a full non-separable KLT is infeasible due to memory and compute costs. IST addresses this by overlaying a small, non-separable transform specifically trained (e.g., via PCA or cluster-based KLT) on the significant low-frequency subspace, thereby enhancing energy compaction and reducing bitrate (Nalci et al., 6 Jan 2026, Pakiyarajah et al., 21 May 2025).

In neuromorphic architectures and energy-efficient AI, a similar principle holds: the initial token embedding disperses features into a vector space; secondary (IST) transforms then refine intra-token structure (via spiking neural elements or feedforward subnets) and build or update context along the inter-token axis (via state-space recurrences, attention, or neuromorphic approximation of these) (Simeone, 1 Jan 2026).

3. Mathematical Formulation and Implementation

Video Coding IST (AV2/AVM/VVC)

Let $R$ be the $B \times B$ coefficient block after the primary transform. IST applies as follows:

  • Define a support mask $\mathcal{M}$ that selects the $N$ low-frequency coefficients.
  • Let $v = \mathcal{M}(R) \in \mathbb{R}^N$, and apply the learned kernel $K \in \mathbb{R}^{M \times N}$ ($M < N$): $u = K v$.
  • $u$ is quantized and encoded; on decode, obtain $\hat{u}$, reconstruct $\hat{v} = K^T \hat{u}$, and insert it back via $\mathcal{M}^{-1}(\hat{v})$.
  • $K$ is cluster- or mode-dependent, trained offline on residual statistics (e.g., a non-separable KLT per cluster) and orthonormalized such that $K K^T = I_M$ (Nalci et al., 6 Jan 2026).
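The offline kernel-learning step in the last bullet can be sketched as a PCA/KLT on a cluster of masked residual vectors. The data here is synthetic (a random linear mixing to induce correlation), whereas the codecs train on real residual statistics.

```python
import numpy as np

N, M = 48, 32
rng = np.random.default_rng(2)

# Synthetic "cluster" of masked residual coefficient vectors with
# correlated components (stand-in for offline residual statistics).
A = rng.standard_normal((N, N))
samples = rng.standard_normal((10000, N)) @ A.T

# Non-separable KLT: eigen-decompose the sample covariance and keep the
# M leading eigenvectors as the kernel rows.
C = np.cov(samples, rowvar=False)        # (N, N) covariance
eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalue order
K = eigvecs[:, ::-1][:, :M].T            # (M, N), orthonormal rows

# Energy compaction: fraction of variance captured by the kept subspace.
kept = eigvals[::-1][:M].sum() / eigvals.sum()
```

Since the eigenvectors of a symmetric matrix are orthonormal, $K K^T = I_M$ holds by construction, matching the orthonormality requirement above.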

Signaling syntax differs between intra and inter modes:

  • Intra IST: several kernel sets are available, signaled via set and kernel indices.
  • Inter IST: a single kernel set is used, so only the kernel index is signaled.

Memory and compute costs are tightly bounded. For 8×8 blocks or larger, the dominant kernel sizes are 32×48 (DCT) or 20×48 (ADST), with under 15 multiplies per pixel in the worst case, and no FFT-style acceleration is required (Nalci et al., 6 Jan 2026).

Neuromorphic/Transformer IST

After the embedding $E \in \mathbb{R}^{D_{\text{emb}} \times N}$ (with $X$ denoting the layer input below):

  • Intra-token (per-token channel mixing):

$$Y_{d,n} = f_{d,n}(X_{:,n})$$

Realized as spiking neural networks (SNNs) with leaky integrate-and-fire (LIF) neurons or feedforward subnets.

  • Inter-token (across-token position mixing):

$$Y_{d,n} = g_{d,n}(X_{d,:})$$

Realized as SSMs (state-space models), softmax self-attention, or neuromorphic approximations thereof using spike-based coincidence or stochastic codes.

A schematic architecture places intra-token SNNs/FFNs before and after an inter-token (SSM or self-attention) module within each layer (Simeone, 1 Jan 2026).
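The schematic layer layout above can be sketched as follows. The diagonal linear recurrence standing in for the SSM and the single-matrix ReLU subnets are toy stand-ins for the spiking/state-space modules described in the source.

```python
import numpy as np

D, N = 8, 12
rng = np.random.default_rng(3)
X = rng.standard_normal((D, N))

def intra_token(X, W):
    """Channel mixing applied independently to each token (column)."""
    return np.maximum(W @ X, 0.0)          # ReLU FFN stand-in

def inter_token_ssm(X, a):
    """Per-channel causal recurrence h[n] = a*h[n-1] + x[n] (toy SSM)."""
    Y = np.zeros_like(X)
    h = np.zeros(X.shape[0])
    for n in range(X.shape[1]):
        h = a * h + X[:, n]
        Y[:, n] = h
    return Y

W1 = rng.standard_normal((D, D)) / np.sqrt(D)
W2 = rng.standard_normal((D, D)) / np.sqrt(D)

# One layer: intra-token -> inter-token -> intra-token.
Y = intra_token(inter_token_ssm(intra_token(X, W1), a=0.9), W2)
```

The recurrent inter-token stage is causal, so a perturbation at position $n$ can only affect outputs at positions $\geq n$; the intra-token stages do not change that, since they act per column.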

4. Rate-Distortion and Coding Efficiency

Joint optimization of primary and IST kernels via rate-distortion objectives is central. In AVM codec experiments, joint clustering and separate path-graph transform (SPGT) design yielded the lowest total RD cost. Reported improvements include:

  • 8×8 blocks (mean over 12 intra modes): up to −7.56% BD-rate savings using joint/SPGT IST over a DCT/ADST-only baseline.
  • 16×16 blocks: up to −7.82% with joint/SPGT IST (Pakiyarajah et al., 21 May 2025).

In AV2, IST alone produces BD-rate reductions of −3.85% (PSNR, all-intra natural video), −1.76% (random access), and −1.09% (low-delay), making it the largest contributor among residual-related tools (Nalci et al., 6 Jan 2026).

IST integration yields only minimal run-time overhead; all heavy optimization and kernel learning occurs offline, with negligible signaling (2 bits for primary transform, 1 bit for IST apply or bypass) (Pakiyarajah et al., 21 May 2025).

5. Comparative Analysis in Neuromorphic and Transformer Models

IST divides post-embedding processing into intra-token (FFN/SNN, independent per token) and inter-token (attention/SSM, context-building) axes. Intra-token SNNs, operating over a virtual time axis for each token, support highly sparse and energy-efficient computation, approaching theoretical improvements of 30–100× in certain tasks (e.g., keyword spotting, image classification) versus dense GPU approaches. Inter-token modules enable online context integration with O(N) or O(N²) cost, with sparse attention or SSMs enabling further energy reduction. Both are compatible with surrogate-gradient training or local plasticity rules, maintaining >95% accuracy relative to dense transformer or SSM baselines while reducing computational overhead by more than an order of magnitude (Simeone, 1 Jan 2026).

6. Extensions, Application Domains, and Relationship to Prior Art

ISTs in video coding generalize and improve upon prior non-separable transforms such as NSST (HEVC) and LFNST (VVC) by allowing efficient inclusion in both intra and inter modes, restricting support to low-frequency coefficients, and minimizing signaling/compute overhead. AV1, AV2's predecessor, did not deploy secondary transforms at all, so their inclusion represents a substantial step forward in AV2 (Nalci et al., 6 Jan 2026).

In neuromorphic AI, the intra/inter IST distinction is reflected in advances in spiking transformers (Spikformer, SpikeGPT) and SSM-based SNNs (P-SpikeSSM), supporting applications to sequence modeling, language, and energy-constrained domains. ISTs, as modular processing units, provide a scalable template for balancing energy use, context integration, and representational flexibility, central desiderata in next-generation AI hardware and software (Simeone, 1 Jan 2026).

7. Summary Table: IST Variants in Video Coding

| Context | IST Support Size | Kernel Shape | Application Mode(s) |
| --- | --- | --- | --- |
| TB < 8×8 | 16 coeffs | 8×16 | Intra/Inter (AV2) |
| TB ≥ 8×8, DCT-2 | 48 coeffs | 32×48 | Intra/Inter (AV2) |
| TB ≥ 8×8, ADST | 48 coeffs | 20×48 | Intra (AV2) |
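The selection logic in the table can be restated as a small lookup. This is a sketch only: the function name and the "small"/"large" classification by transform-block (TB) dimensions are illustrative conveniences, not codec syntax.

```python
# Kernel shape (M, N) = (output size, support size) per the table above.
IST_KERNELS = {
    ("small", "any"):   (8, 16),    # TB < 8×8: 16-coeff support
    ("large", "DCT-2"): (32, 48),   # TB ≥ 8×8, DCT-2 primary
    ("large", "ADST"):  (20, 48),   # TB ≥ 8×8, ADST primary (intra only)
}

def ist_kernel_shape(tb_w, tb_h, primary):
    """Return the (M, N) IST kernel shape for a transform block (hypothetical helper)."""
    if tb_w < 8 or tb_h < 8:
        return IST_KERNELS[("small", "any")]
    return IST_KERNELS[("large", primary)]
```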

ISTs are now established as a core design pattern for improving compaction and energy efficiency both in video codecs and in AI hardware/software, with mature methodologies for kernel learning, signaling, mode selection, and integration into block-by-block pipelines (Nalci et al., 6 Jan 2026, Pakiyarajah et al., 21 May 2025, Simeone, 1 Jan 2026).
