Info Injection vs Structural Stacking

Updated 14 December 2025

The paper introduces a paradigm that replaces deep, multi-layered models with targeted injection of domain-specific priors into flat neural networks.
Empirical results illustrate that models like FlatFormer achieve higher accuracy with fewer parameters and faster convergence compared to stacked architectures.
The approach highlights trade-offs between engineered injections and traditional structural stacking, offering insights into effective model design and potential adversarial countermeasures.

Information Injection over Structural Stacking refers to a neural architecture and design paradigm that emphasizes directly incorporating domain or task-specific information (“information injection”) into otherwise flat or minimally layered models, rather than relying on deep, multi-stage hierarchical architectures that achieve the same inductive bias through “structural stacking.” The core principle asserts that inductive priors typically realized through complex stacking of encoders, specialized heads, or sequence of structural transformations can often be replaced—at reduced computational and parameter cost—by injecting a compact set of specifically engineered features or biases at critical points in the network. This paradigm is distinguished from related but orthogonal techniques such as conventional architectural stacking in recurrent or graph models, and stands in contrast to deep reservoir stacking, hierarchical session encoders, or layered graph modules.

1. Conceptual Framework: Information Injection versus Structural Stacking

Information Injection over Structural Stacking posits that it is possible to match or exceed the functional performance of heavy, multi-level models by judiciously introducing session, memory, or structural domain priors through lightweight, targeted pathways in a flat network. Structural stacking, as typified by models like HiTSKT or RCNet, arranges modules hierarchically: encoders for intra-session and inter-session, concept-level memory, or signal transformations, each stacked to progressively refine representations. In contrast, information injection operates by encoding such priors as static input features, bias matrices, or compositional embeddings—integrated either before or within the attention/processing backbone. This approach aims to break the “performance–complexity trap,” wherein deeper or more stacked architectures incur significant inference cost, high parameter count, and slow convergence, ostensibly for superior expressivity or generalization.

2. Injection Mechanisms: FlatFormer as Archetype

FlatFormer exemplifies this paradigm in the domain of knowledge tracing. It employs two primary injection mechanisms in a flat Transformer, dispensing entirely with multi-stage session or memory modules (Xia et al., 7 Dec 2025):

Hybrid Input Encoding (“Injection-i”): For each timestep $t$ , a session identifier $s_t$ and within-session step index $\tau_t$ are computed, with session boundaries detected by temporal discontinuities (gap exceeding $\Delta_\text{gap}$ ). The session ID is embedded via a learnable matrix $E_S$ , and the intra-session step is encoded using sinusoidal positional encodings. The final input embedding at $t$ is

$X^{(0)}_t = E_{Q}(q_t) + E_A(a_{t-1}) + E_S(s_t) + PE(\tau_t).$

Power-Law Forgetting Bias (“Injection-ii”): To encode memory decay akin to the Ebbinghaus forgetting curve, a pre-computed bias is added to the Transformer attention logits. For timestamps $ts_t$ and $ts_j$ , the normalized lag $\Delta t'_{t,j}$ is computed and the bias

$M_{\mathrm{forget}}[t,j] = -\beta \ln(\Delta t'_{t,j} + 1)$

is directly added to each logit. This is non-parametric and computed offline.

Both mechanisms are applied without changing the core $O(L^2)$ attention complexity, avoiding the introduction of deep session-level or temporal modules.

3. Structural Stacking in Deep Reservoir and Graph Models

Structural stacking remains central in several sequence and signal processing architectures. In RCNet (Zhou et al., 2020), deep RNNs for MIMO-OFDM symbol detection are built by stacking small reservoir computing (RC) modules, each capturing time or time-frequency structure. Stacking alternates time-domain RC blocks (handling temporal memory/nonlinearity) with time-frequency RC blocks (inserting FFT layers, enforcing cyclic prefix and subcarrier orthogonality constraints). This layer-wise injection of domain structure achieves progressive error reduction, each layer functioning as an iterative interference canceller. While this approach provides strong structural priors and robust generalization—especially under limited-data regimes—it incurs higher latency and complexity than designs which inject such priors directly.

4. Formalization and Mathematical Properties

In information injection, the design explicitly engineers embeddings or attention biases to encode structural knowledge, rather than learning them implicitly via depth. For FlatFormer, session boundaries and forgetting are not emergent but are explicitly mapped to embedding space and attention logit masks, respectively:

Given interaction sequence $(q_t, a_{t-1})$ with timestamps $ts_t$ , sessions are segmented, and each uncovered session is marked in $E_S$ .
Memory decay is mathematically simulated in attention via $M_{\mathrm{forget}}$ , directly enforcing exponential or power-law attenuation of long-range contributions.

In structural stacking (RCNet), each layer—parameterized by its own fixed random reservoir matrices—enforces a specific layer of structural prior (e.g., cyclic prefix alignment, frequency orthogonality) by design, and each readout is learned to minimize the squared error or bit error rate (BER) at that stage. The design remains modular and regularized, with ridge-penalized closed-form solutions available for all readout matrices. No backpropagation or end-to-end deep weight updates are required.

5. Comparative Advantages: Empirical and Theoretical Insights

Recent ablation and benchmarking studies demonstrate the following distinctions:

FlatFormer achieves superior AUC on large knowledge tracing benchmarks (e.g., EdNet: 0.846 vs. HiTSKT’s 0.763, a gain of +8.3 points) while using less than 15% of the parameters, and providing over 3 $\times$ faster inference latency (Xia et al., 7 Dec 2025).
In RCNet, stacking $L$ RC blocks provides as much as 20% lower BER than a shallow RC network, due to gradual, cumulative injection and enforcement of OFDM structure (Zhou et al., 2020).
FlatFormer’s ablation shows both session and forgetting injections yield additive gains and preserve attention locality and decay even with a flat backbone.
A plausible implication is that high cognitive or domain fidelity does not inherently require increased model depth, provided the relevant biases can be captured by appropriately engineered injection mechanisms.

Empirical results for FlatFormer indicate that both injection strategies are synergistic, with the loss landscape flattening and convergence accelerating compared to stacked architectures.

6. Applications and Extensions

Knowledge Tracing: FlatFormer demonstrates that session boundary and decay priors can be encoded in a flat attention model to improve student performance modeling, eliminating the need for hierarchical encoders (Xia et al., 7 Dec 2025).
Sequence Signal Detection: In communication, stacked RC blocks in RCNet inject temporal/frequency priors at each layer for robust OFDM symbol recovery in adverse nonlinearities (Zhou et al., 2020).
Prompt Injection in Tabular Agents: StruPhantom shows that even adversarial payloads must pass through multilayered structure (rows, columns, nested tables), demonstrating how information injection can be adversarially repurposed; here, evolutionary search and MCTS are hybridized to optimize "stacked" injection paths, maximizing attack success without breaking format (Feng et al., 14 Apr 2025).

7. Limitations, Countermeasures, and Open Questions

Limitations include the requirement to identify appropriate injection points and biases with sufficient domain generality, as poorly chosen priors may restrict model expressivity or degrade under distributional shift. In adversarial applications (StruPhantom), attacks exploiting stacked information injection incur heavy API cost and are susceptible to robust file-level sanitization or strict schema adherence. In knowledge tracing, the long-term limit of flat models with injected priors over more complex, graph-structured hierarchies is not fully characterized.

Potential countermeasures against adversarial information injection include strong input validation, schema enforcement, behavioral auditing, and disallowing content injection from raw cells into LLM contexts (Feng et al., 14 Apr 2025).

A plausible implication is that as injection mechanisms become more expressive and automatically learned, the practical distinction between injected and stacked priors may blur, warranting further investigation into the theoretical and empirical boundaries of this paradigm.