LLaMA-X Decoder for Visual Recognition
- LLaMA-X Decoder is a decoder-only Transformer adapted from LLaMA for visual recognition, employing a post-sequence [CLS] token to aggregate global image features under causal masking.
- It integrates a soft-mask warmup schedule that smoothly transitions from bidirectional to causal masking, stabilizing training and mitigating early optimization issues.
- Empirical results show competitive accuracy on ImageNet benchmarks along with improved calibration, higher attention-map rank, and efficient computation compared to encoder-only models.
LLaMA-X Decoder refers to a family of decoder-only Transformer architectures adapted from LLaMA, originally designed for language modeling, and repurposed for visual recognition through architectural and training strategies that overcome the inherent limitations of causal masking when applied to images. The canonical implementation, denoted image LLaMA (iLLaMA), integrates a post-sequence class token (PS [CLS]) mechanism and a soft-mask warmup schedule, enabling causal self-attention to function effectively in vision settings while delivering computational advantages and strong empirical performance across multiple benchmarks.
1. LLaMA Decoder-Only Transformer Architecture
The LLaMA decoder-only architecture is constructed exclusively from decoder blocks and features the following core components for an input token sequence $X = (x_1, \ldots, x_N)$, $x_i \in \mathbb{R}^d$ (a minimal sketch follows the list):
- Causal Self-Attention Layer:
- Query, Key, Value projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- Scaled dot-product attention: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d_k}\big)V$
- Causal masking: output $O = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d_k} + M\big)V$, where $M \in \mathbb{R}^{N \times N}$, with $M_{ij} = 0$ for $j \le i$, $M_{ij} = -\infty$ for $j > i$
- Feed-Forward Network (FFN):
- Gated structure using SwiGLU activation: $\mathrm{SiLU}(xW_1) \odot (xW_3)$
- Followed by a linear projection $W_2$, giving $\mathrm{FFN}(x) = \big(\mathrm{SiLU}(xW_1) \odot xW_3\big)W_2$
- Root-Mean-Square Layer Norm (RMSNorm): applied before each sublayer (pre-normalization) in place of standard LayerNorm.
- Rotary Positional Embeddings (RoPE): Incorporated within the attention mechanism.
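The following is a minimal PyTorch sketch of such a decoder block (pre-norm RMSNorm, causal self-attention, SwiGLU FFN). RoPE is omitted for brevity, and the class names and hidden-width choice are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x), no mean subtraction or bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Gated FFN: SiLU(x W1) * (x W3), followed by a linear projection W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class DecoderBlock(nn.Module):
    """Pre-norm block: RMSNorm -> causal self-attention -> RMSNorm -> SwiGLU FFN (RoPE omitted)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ffn = SwiGLUFFN(dim, 4 * dim)  # hidden width chosen for illustration

    def forward(self, x):                                   # x: (batch, N, dim)
        n = x.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # lower-triangular attention
        x = x + attn_out
        return x + self.ffn(self.norm2(x))
```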
While this architecture excels at autoregressive text generation, applying it naively to images (patches as tokens, with a class ([CLS]) token in the first position) causes catastrophic training collapse: causal masking prevents the class token from attending to any subsequent patch tokens, so no useful gradient flows through it and training fails.
2. Post-Sequence Class Token Strategy
To circumvent attention collapse under causal masking, the post-sequence class token (PS [CLS]) strategy positions the [CLS] token at the end of the patch sequence, not at the beginning:
- Input sequence: $X = (x_1, \ldots, x_N, x_{\mathrm{CLS}})$, with the [CLS] token placed at index $N+1$.
- Under causal masking, the [CLS] token at index $N+1$ has attention access to all preceding patch tokens ($x_1, \ldots, x_N$), permitting it to aggregate global image representations.
- No custom or hybrid masks are required—standard lower-triangular masks suffice.
Token Preparation Example:
```python
def prepare_tokens(image, cls_token):
    # Split the image into patch tokens, then append the [CLS] token
    # at the END of the sequence (post-sequence class token).
    patches = split_image_into_patches(image)   # N patch tokens
    sequence = concat(patches, cls_token)       # [CLS] placed last, at index N + 1
    return embed(sequence)
```
Positioning the class token at the head of the sequence under a causal mask restricts its receptive field to itself alone, whereas a PS [CLS] resolves this without architectural or masking exceptions.
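A quick illustration of why the standard lower-triangular mask suffices (the sequence length here is illustrative):

```python
import torch

N = 4  # patch tokens; the PS [CLS] token occupies the last index, N
mask = torch.triu(torch.full((N + 1, N + 1), float("-inf")), diagonal=1)

print(mask[-1])  # all zeros: the post-sequence [CLS] attends to every patch
print(mask[0])   # only the first entry is zero: a head-of-sequence [CLS] would see itself alone
```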
3. Causal Self-Attention and Soft-Mask Warmup
Causal attention employs the additive mask $M^{\mathrm{causal}}$ defined above ($M^{\mathrm{causal}}_{ij} = 0$ for $j \le i$, $-\infty$ for $j > i$).
To stabilize early-stage training, a soft mask interpolates between the bidirectional mask ($M^{\mathrm{bi}} = \mathbf{0}$) and the strict causal mask: $M^{\mathrm{soft}} = \alpha\, M^{\mathrm{bi}} + (1 - \alpha)\, M^{\mathrm{causal}}$, with the warmup scalar $\alpha$ decreasing from $1$ (bidirectional) to $0$ (strictly causal) over a schedule.
Alternatively, in the attention-weight domain: $A^{\mathrm{soft}} = \alpha\, A^{\mathrm{bi}} + (1 - \alpha)\, A^{\mathrm{causal}}$, with attention output $O = A^{\mathrm{soft}} V$, where $A^{\mathrm{bi}}$ and $A^{\mathrm{causal}}$ denote the softmax attention weights computed without and with the causal mask, respectively.
This mechanism smooths optimization, mitigating underfitting and preventing collapse during the convergence process of decoder-only architectures on visual data.
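A minimal sketch of the attention-weight-domain formulation above, together with a hypothetical `warmup_alpha` helper for the two schedule variants (function names and signatures are assumptions, not released code):

```python
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, alpha):
    """A_soft = alpha * A_bi + (1 - alpha) * A_causal, output O = A_soft @ V.
    alpha = 1 gives bidirectional attention, alpha = 0 strictly causal."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    causal = torch.triu(torch.full((n, n), float("-inf"), device=q.device), diagonal=1)
    a_bi = F.softmax(scores, dim=-1)               # bidirectional weights
    a_causal = F.softmax(scores + causal, dim=-1)  # causal weights
    return (alpha * a_bi + (1 - alpha) * a_causal) @ v

def warmup_alpha(epoch, cutoff_epoch, schedule="linear"):
    """Warmup scalar: linear decay to 0 by the cutoff epoch, or constant 1 until the cutoff."""
    if schedule == "linear":
        return max(0.0, 1.0 - epoch / cutoff_epoch)
    return 1.0 if epoch < cutoff_epoch else 0.0
```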
4. Training Protocols and Architectural Variants
The supervised training procedure with soft-mask warmup follows; a condensed training-loop sketch appears after the architectural variants below.
- Algorithm Steps:
- Total epochs $E$, soft-mask cutoff epoch $E_c$, base learning rate $\eta$, warmup epochs $E_w$, schedule type (linear/constant).
- At each epoch $e$:
- Compute $\alpha = \max\!\big(0,\, 1 - e/E_c\big)$ (linear), or $\alpha = 1$ for $e < E_c$, else $0$ (constant).
- Set $M^{\mathrm{soft}} = \alpha\, M^{\mathrm{bi}} + (1 - \alpha)\, M^{\mathrm{causal}}$.
- Forward: causal self-attention with $M^{\mathrm{soft}}$ and PS [CLS].
- Backward: update parameters via AdamW with weight decay $0.05$.
- Learning rate: cosine schedule from $\eta$ to $0$, with linear warmup for $E_w$ epochs.
- Training Hyperparameters for ImageNet-1K:
- Total epochs, batch size, base learning rate, and warmup settings are configured per model scale, with one recipe for the tiny/S/B models and a separate one for large-model pretraining.
- Data augmentations: RandAugment, Mixup (0.1–0.95), CutMix (0.1–1.0), label smoothing 0.1
- Architectural Variants: Four isotropic iLLaMA models are instantiated, mirroring ViT scaling.
| Model | Depth | Embedding Dim | Heads | #Params | MACs |
|---|---|---|---|---|---|
| Tiny (T) | 12 | 192 | 3 | 5.7 M | 1.3 G |
| Small (S) | 12 | 384 | 6 | 21.9 M | 4.6 G |
| Base (B) | 12 | 768 | 12 | 86.3 M | 17.6 G |
| Large (L) | 24 | 1024 | 16 | 310.2 M | 62.8 G |
Notable modifications relative to ViT include: SwiGLU FFN, RMSNorm layers, causal self-attention with PS [CLS] and rotary embeddings, and retention of learnable 2D positional embeddings.
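As a rough sketch of the training loop described in the algorithm steps above (the model, data loader, and `set_soft_mask_alpha` hook are hypothetical placeholders; only the weight decay and label smoothing values are taken from the recipe above):

```python
import math
import torch

def train(model, train_loader, epochs, cutoff_epoch, base_lr, warmup_epochs, device="cuda"):
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

    for epoch in range(epochs):
        # Linear LR warmup, then cosine decay toward 0.
        if epoch < warmup_epochs:
            lr = base_lr * (epoch + 1) / warmup_epochs
        else:
            progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
            lr = 0.5 * base_lr * (1 + math.cos(math.pi * progress))
        for group in optimizer.param_groups:
            group["lr"] = lr

        # Linear soft-mask warmup: alpha goes from 1 (bidirectional) to 0 (causal).
        alpha = max(0.0, 1.0 - epoch / cutoff_epoch)
        model.set_soft_mask_alpha(alpha)   # hypothetical hook exposing the soft mask

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)   # forward: causal attention + PS [CLS]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```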
5. Empirical Performance across Tasks
iLLaMA exhibits competitive results compared to encoder-only ViTs:
- ImageNet-1K (224×224, supervised):
- iLLaMA-T: 75.0% top-1 (vs. DeiT-Ti: 72.2%)
- iLLaMA-S: 79.9% (vs. DeiT-S: 79.8%)
- iLLaMA-B: 81.6% (vs. DeiT-B: 81.8%)
- Fine-tuned at 384×384: iLLaMA-B → 83.0%
- ImageNet-21K Pretraining + 1K Finetuning:
- iLLaMA-B: 83.6% @224, 85.0% @384
- iLLaMA-L: 84.8% @224, 86.0% @384
- Model Calibration (Expected Calibration Error):
- ConvNeXt-B: 0.0281
- DeiT3-B: 0.0415
- iLLaMA-B: 0.0335
- Shape–Texture Bias (shape-preference, higher is better):
- ConvNeXt-B: 33.3%
- DeiT3-B: 39.9%
- iLLaMA-B: 41.5%
- Attention Map Rank (Layer 1, Head 1):
- ViT-T: rank ≈ 81; iLLaMA-T: rank ≈ 129
- The more uniform singular-value distribution of iLLaMA attention maps suggests elevated representational capacity.
- Task Transfer:
- ADE20K semantic segmentation (UperNet): iLLaMA-T: 37.7 mIoU (vs. ViT-T: 39.8); iLLaMA-B: 45.1 (vs. ViT-B: 47.3)
- CIFAR10/100: iLLaMA-T: 97.9%/84.8%; +soft mask: 97.9%/85.5%
- Quantization Robustness: 8-bit weights/activations yield iLLaMA-T at 72.4% top-1, matching DeiT-Ti (32-bit).
6. Computational Efficiency and Representational Properties
Computational analysis for self-attention (sequence length $N$, dimension $d$):
- Bidirectional Attention: the attention map and value aggregation are computed over all $N^2$ token pairs, i.e. $\mathcal{O}(N^2 d)$ FLOPs.
- Causal Attention: only the $N(N+1)/2$ lower-triangular pairs contribute, saving roughly half of the pairwise attention computation relative to the bidirectional case (see the sketch below).
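A back-of-the-envelope check of the savings from computing only the lower-triangular token pairs (the 14×14 patch grid size is illustrative):

```python
def causal_attention_savings(n):
    """Fraction of N x N token-pair computations skipped when only the
    lower triangle of the attention map is evaluated."""
    causal_pairs = n * (n + 1) // 2
    return 1 - causal_pairs / (n * n)

print(causal_attention_savings(196))  # 196 = 14 x 14 patches -> ~0.497, i.e. ~50% saved
```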
Elevated attention-map rank in iLLaMA indicates richer cross-token relationships. The soft-mask warmup smooths the optimization landscape, mitigating underfitting during supervised training of the decoder-only design on visual data.
7. Implications and Outlook
The iLLaMA architecture demonstrates that a decoder-only Transformer, originally developed for textual modalities, can function effectively as a vision backbone with minimal adaptation. The post-sequence class token addresses the collapse caused by naive causal masking. The soft-mask warmup improves training dynamics, yielding models that rival encoder-only ViTs in classification, calibration, and transfer, while securing computational advantages and higher attention-map rank without bespoke architectural exceptions. These findings indicate a viable pathway toward unified multimodal decoders in which both images and text are processed within a common LLaMA-style architecture (Wang et al., 10 Apr 2024).