
Test-Time Padding (TTP) in Modern ML

Updated 25 December 2025
  • Test-Time Padding (TTP) is the process of modifying inputs at inference—by appending spatial borders, PAD tokens, or dummy packets—to satisfy fixed input constraints and improve model robustness.
  • It is used across various domains including vision-language models, transformers, and network protocols, where fixed or trainable padding methods help counter adversarial attacks and optimize computational efficiency.
  • Empirical results show that TTP can raise adversarial accuracy to roughly 39.7% (from near zero) in vision models and significantly reduce the success of traffic-analysis attacks in privacy networks.

Test-Time Padding (TTP) denotes any modification to the input—whether in data, token streams, or network traffic—at inference or deployment, by appending or integrating additional elements such as spatial borders, PAD tokens, or dummy packets. TTP is used to satisfy fixed input constraints, regularize or adapt deep network behavior, defend against adversarial attacks, mask information from adversaries, and systematically alter the computational power or parallelism of inference in modern machine learning architectures. Its use spans vision-language models (VLMs), sequence models, transformer LLMs, and privacy/network systems.

1. Conceptual Foundations and Taxonomy

Test-Time Padding arises as a response to the frequent architectural need for fixed-size or shape-aligned inputs at inference, and as a defense mechanism in adversarial or privacy-challenging settings.

Three principal TTP contexts emerge in current research:

  • Machine Learning Models: In computer vision, NLP, and multi-modal models, TTP typically manifests as spatial or token padding for compatibility, computational regularization, or robustness. VLMs (e.g., CLIP) and deep CNN/RNN architectures all utilize TTP, in forms ranging from static zero/reflection padding to learned or adaptive border augmentation (“trainable padding module”) (Li et al., 18 Dec 2025, Alrasheedi et al., 2023, Dwarampudi et al., 2019).
  • Transformers and LLMs: TTP compensates for variable-length sequences during batch inference by introducing PAD tokens. If mismanaged—through incomplete masking or embedding leaks—it can cause activation drift and degrade generation quality, bias, and safety (Himelstein et al., 23 Sep 2025).
  • Privacy and Security/Network Protocols: In low-latency anonymous communication (e.g., Tor), TTP corresponds to circuit-padding at runtime, where dummy packets are emitted according to stochastic policies to defeat traffic analysis or fingerprinting attacks (Kadianakis et al., 2021, Pulls, 2020).

In all regimes, TTP is distinct from training-time padding and may exert unique effects not mirrored during model fitting.

2. Methods and Algorithms

Machine Learning and Vision-Language Models

In vision-language models (e.g., CLIP), TTP includes both fixed and trainable spatial padding. The general pipeline proceeds as follows (Li et al., 18 Dec 2025):

  1. Pad the input image $X \in \mathbb{R}^{H \times W \times 3}$ with a border to get $P^{\text{fix}}(X)$ (fixed padding) or $P_\theta(X)$ (trainable, with border parameter $\theta$).
  2. Compute cosine similarity between embeddings of $X$ and $P^{\text{fix}}(X)$ to detect adversarial manipulation.
  3. Update the trainable padding via entropy minimization over stochastic augmentations if $X$ is deemed adversarial.
  4. Employ a similarity-aware weighted ensemble for prediction on adversarial samples.
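The detection step of this pipeline can be sketched in a few lines. The following is a minimal numpy illustration, not the authors' implementation: `embed` is a hypothetical stand-in for a frozen image encoder (a real pipeline would use the VLM's embedding), and the decision threshold is a tunable hyperparameter.

```python
import numpy as np

def pad_border(x, width=8, value=0.0):
    """Fixed spatial padding: append a constant border around an H x W x C image."""
    return np.pad(x, ((width, width), (width, width), (0, 0)),
                  mode="constant", constant_values=value)

def embed(x, dim=64):
    """Hypothetical stand-in for a frozen image encoder (e.g. CLIP's).
    Pools the image into a fixed-length vector and L2-normalizes it,
    so cosine similarity reduces to a dot product."""
    flat = x.mean(axis=2).ravel()
    chunks = np.array_split(flat, dim)          # crude fixed-length summary
    v = np.array([c.mean() for c in chunks])
    return v / (np.linalg.norm(v) + 1e-8)

def padding_consistency(x, width=8):
    """Step 2 of the pipeline: cosine similarity between embeddings of the
    input and its fixed-padded copy; low similarity flags likely adversarial
    manipulation."""
    return float(embed(x) @ embed(pad_border(x, width)))

rng = np.random.default_rng(0)
clean = rng.random((224, 224, 3))
score = padding_consistency(clean)
```

In the actual defense, inputs whose score falls below a calibrated threshold are routed to the trainable-padding adaptation and ensemble steps.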

Trainable padding modules in generic CNNs (e.g., VGG16, ResNet50V2) use a local self-supervised MSE loss to learn convolutional predictors for plausible, data-adaptive borders, decoupled from the global task loss (Alrasheedi et al., 2023).
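As a toy illustration of a locally self-supervised border predictor, the sketch below fits a linear predictor for each border pixel from its nearest interior pixels by closed-form least squares. This substitutes least squares for the cited convolutional predictor trained by SGD; the setup is illustrative, not the paper's module.

```python
import numpy as np

def fit_border_predictor(img):
    """Learn weights w that predict each left-column pixel from its three
    nearest interior pixels, minimizing a local self-supervised MSE
    (closed-form least squares stands in for a trained conv predictor)."""
    X = img[:, 1:4]          # (H, 3) interior neighbor intensities
    y = img[:, 0]            # (H,) pixels the predictor should reproduce
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_border(img, w):
    """Synthesize a new, data-adaptive border column from the learned
    local predictor applied to the current outermost pixels."""
    return img[:, 0:3] @ w

rng = np.random.default_rng(1)
img = rng.random((32, 32))
w = fit_border_predictor(img)
border = predict_border(img, w)
```

Because the predictor is fit against a local reconstruction loss rather than the global task loss, it stays decoupled from the classifier, matching the design described above.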

LLMs and Transformers

At inference, TTP typically inserts PAD tokens to align all prompt sequences to a maximum batch length. These tokens should be rendered computationally inert via explicit masking in attention and position embeddings:

$$M_{ij} = \begin{cases} 0 & \text{if } j \text{ is real} \\ -\infty & \text{if } j \text{ is PAD} \end{cases}$$

$$e_{\text{pos}}(i) = \begin{cases} \text{LearnedPos}(i) & i \text{ real} \\ 0 \text{ or PAD embedding} & i \text{ PAD} \end{cases}$$

Without correct masking, PAD tokens can cause activation drift and generation and safety failures (Himelstein et al., 23 Sep 2025). For computational expressivity, systematic TTP (padding width and optional block looping) can expand transformer complexity classes from $\mathsf{TC}^0$ to $\mathsf{TC}^d$ to $\mathsf{NC}$, enabling efficient simulation of parallelizable problems (Merrill et al., 25 May 2025).
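The attention mask above can be demonstrated directly: adding $-\infty$ to every PAD column before the softmax drives its attention weight to exactly zero. A minimal numpy sketch (toy scores, not a real transformer):

```python
import numpy as np

def masked_attention_weights(scores, is_pad):
    """Apply the PAD mask M_ij: add -inf to every column j that is padding
    before the softmax, so PAD tokens receive zero attention weight."""
    masked = np.where(is_pad[None, :], -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)

# toy batch row: 3 real tokens followed by 2 PAD tokens
scores = np.ones((5, 5))                     # uniform raw attention scores
is_pad = np.array([False, False, False, True, True])
w = masked_attention_weights(scores, is_pad)
# PAD columns get weight 0; real columns share weight 1/3 each
```

Omitting the `np.where` line is precisely the failure mode discussed next: PAD tokens would then receive nonzero attention and perturb every downstream activation.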

Tor and Network Padding

In Tor, TTP is formalized as run-time adaptive dummy cell emission per circuit. Each padding machine is a finite-state automaton with probabilistic transitions and parameterized inter-packet gap (Log-Logistic) and burst-length (Pareto) distributions, tuned via genetic programming or manual adjustment for an optimal bandwidth/effectiveness tradeoff (Kadianakis et al., 2021, Pulls, 2020).
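A toy two-state machine conveys the mechanics. This is an illustrative sketch in the spirit of Tor's circuit padding, not the tuned machines from the cited work; all distribution parameters and transition probabilities here are made up.

```python
import random

def log_logistic(scale, shape, rng):
    """Inverse-CDF sample of a Log-Logistic inter-packet gap (seconds)."""
    u = rng.random()
    return scale * (u / (1.0 - u)) ** (1.0 / shape)

def padding_machine(n_events, rng=None):
    """Toy padding machine: a WAIT state schedules dummy cells with
    Log-Logistic gaps; a BURST state emits a Pareto-distributed number of
    back-to-back dummy cells. Parameters are illustrative only."""
    rng = rng or random.Random(42)
    state, schedule, t = "WAIT", [], 0.0
    for _ in range(n_events):
        if state == "WAIT":
            t += log_logistic(scale=0.05, shape=2.0, rng=rng)
            schedule.append(("dummy", round(t, 4)))
            if rng.random() < 0.3:            # probabilistic state transition
                state = "BURST"
        else:
            burst_len = int(rng.paretovariate(1.5)) + 1
            for _ in range(burst_len):
                t += 0.001                    # back-to-back dummy cells
                schedule.append(("dummy", round(t, 4)))
            state = "WAIT"
    return schedule

sched = padding_machine(10)
```

Real deployments search over exactly these knobs—state-transition probabilities and distribution parameters—to trade bandwidth overhead against defense effectiveness.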

3. Empirical Effects and Robustness

Vision-Language and Deep Networks

TTP can dramatically recover adversarial robustness with negligible harm to clean accuracy. In CLIP models under PGD attack ($\epsilon = 4/255$), TTP raises adversarial accuracy from $\sim 0\%$ (vanilla) to $\sim 39.7\%$ (ViT-B/32), beating prior best test-time defenses by $+4.4$ points while maintaining clean accuracy (~90.9% vs 91.4%). Detection accuracy for clean/adversarial separation reaches $98.5$–$98.7\%$ (Li et al., 18 Dec 2025). In generic CNNs, trainable TTP increases test accuracy over zero/reflection pads by 0.44–1.23 pp on ResNet/VGG, respectively (Alrasheedi et al., 2023).

Sequence Models (LSTM/CNN)

For LSTMs on sentiment classification, mismatch between training- and test-time padding drops test accuracy to near chance (50%). Proper alignment (pre-padding throughout) yields $\sim 80\%$ accuracy. In CNNs, TTP direction (pre/post) has minimal effect ($\lesssim 0.1\%$) (Dwarampudi et al., 2019).
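The pre- vs post-padding distinction at stake here is simply where the PAD values go. A minimal pure-Python sketch (mirroring the behavior of utilities like Keras's `pad_sequences`):

```python
def pad_sequences(seqs, maxlen, pad_value=0, pre=True):
    """Pad variable-length token sequences to maxlen. pre=True prepends the
    PAD value (pre-padding); pre=False appends it (post-padding). Per the
    LSTM result above, the same choice must be used at train and test time."""
    out = []
    for s in seqs:
        s = s[:maxlen]                      # truncate overly long sequences
        fill = [pad_value] * (maxlen - len(s))
        out.append(fill + s if pre else s + fill)
    return out

batch = [[5, 6], [7, 8, 9]]
pre = pad_sequences(batch, maxlen=4, pre=True)    # [[0, 0, 5, 6], [0, 7, 8, 9]]
post = pad_sequences(batch, maxlen=4, pre=False)  # [[5, 6, 0, 0], [7, 8, 9, 0]]
```

Pre-padding helps recurrent models because the real tokens then sit closest to the final hidden state; convolutions, being translation-covariant, are largely indifferent to the choice.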

Transformers and Robustness/Safety

Insertion of unmasked PAD tokens even in small quantities (e.g., $k = 4$–$16$) causes BLEU and BERTScore to fall sharply in smaller Llama/Qwen models, with activation similarity $\delta(k)$ dropping (e.g., from $0.98$ at $k=0$ to $0.85$ at $k=32$). Bias and safety metrics also shift: Attack Success Rate for Llama-3.1-8B rises from $8\%$ ($k=0$) to $77.5\%$ ($k=128$) (Himelstein et al., 23 Sep 2025).

Network Privacy

Circuit-padding machines (Interspace) on Tor can reduce Deep Fingerprinting max recall from $0.88$ (no defense) to $0.31$ at $305\%$ overhead, with continuous performance/efficiency tuning (Pulls, 2020). Fractional-delay and zero-delay (PCP) variants can force attacker accuracy to random-guess baselines at modest latency and bandwidth costs (Kadianakis et al., 2021).

4. Theoretical and Complexity Implications

TTP in transformers realizes a direct correspondence with uniform threshold circuit classes:

  • Polynomial padding width, constant depth: recognizes FO-uniform $\mathsf{TC}^0$
  • Polynomial padding + $O(\log^d n)$ block loops: recognizes $\mathsf{L}$-uniform $\mathsf{TC}^d$
  • Polylog depth and padding: converges to $\mathsf{NC}$ (efficiently parallelizable problems)

This establishes TTP as a formal, parallelizable alternative to chain-of-thought prompting, which is more powerful (it can exceed $\mathsf{NC}$) but sequential and thus less parallelizable (Merrill et al., 25 May 2025).

5. Critical Pitfalls, Limitations, and Best Practices

  • In LLMs, improper attention masking, position embedding leakage, or reuse of EOS tokens as PAD can induce severe robustness and safety regressions (Himelstein et al., 23 Sep 2025).
  • Excessive spatial padding in vision models can distort input structure and degrade clean accuracy, while too little reduces adversarial detection power (Li et al., 18 Dec 2025).
  • Randomized shape (e.g., in Interspace) is effective in traffic defense, but adversary retraining on defense-shape data can increase recall, suggesting a continuous arms race (Pulls, 2020).
  • For LSTMs, test-time padding must match the convention used at training; misalignment negates learning (Dwarampudi et al., 2019).

Best practices include rigorous masking, minimal and uniform padding, direct alignment between training- and test-time strategies, explicit test suites for PAD-robustness, dynamic batching (to avoid padding altogether), and, for safety-critical deployments, explicit PAD-masking objectives during model training (Himelstein et al., 23 Sep 2025, Li et al., 18 Dec 2025).
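The dynamic-batching practice can be sketched concretely: sorting sequences by length and batching neighbors means each batch pads only to its own local maximum rather than the global one. A minimal pure-Python illustration (one common strategy, not the only one):

```python
def length_bucketed_batches(seqs, batch_size):
    """Sort sequences by length and batch neighbors, so each batch needs
    padding only up to its local maximum length -- one way to implement
    the 'avoid padding' best practice."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        local_max = max(len(seqs[i]) for i in idx)
        padded = [seqs[i] + [0] * (local_max - len(seqs[i])) for i in idx]
        batches.append((idx, padded))
    return batches

seqs = [[1], [2, 3, 4, 5], [6, 7], [8, 9, 10]]
batches = length_bucketed_batches(seqs, batch_size=2)
```

With naive batching to the global maximum length (4), the example would emit 6 PAD tokens; bucketing reduces that to 2, shrinking the surface on which PAD-induced drift can act.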

6. Open Challenges and Future Directions

  • Scalable and automated tuning of stochastic padding machines in privacy/traffic defense (Pulls, 2020).
  • Adaptive, content-aware, or non-uniform padding in vision models for adversarial or distributional shifts (Li et al., 18 Dec 2025).
  • Optimizing the test-time tradeoff between computational overhead, robustness, and utility (e.g., minimal-iteration trainable border updates, parallel vs sequential adaptation).
  • Modelling side-channel leakage from partial masking or dynamic architectural artifacts (e.g., partial LayerNorm, embedding sharing) not captured by current TTP implementations in LLMs (Himelstein et al., 23 Sep 2025).
  • Integration of TTP as a systematically parameterized knob for inference-time expressivity that maintains parallel efficiency within transformer architectures (Merrill et al., 25 May 2025).

Advances in TTP will likely continue to impact adversarial robustness, fairness, privacy, and the theoretical understanding of computation in modern deep learning systems.
