
LLaMA-X Decoder for Visual Recognition

Updated 15 November 2025
  • LLaMA-X Decoder is a decoder-only Transformer adapted from LLaMA for visual recognition, employing a post-sequence [CLS] token to aggregate global image features under causal masking.
  • It integrates a soft-mask warmup schedule that smoothly transitions from bidirectional to causal masking, stabilizing training and mitigating early optimization issues.
  • Empirical results show competitive accuracy on ImageNet benchmarks along with improved calibration, higher attention-map rank, and efficient computation compared to encoder-only models.

LLaMA-X Decoder refers to a family of decoder-only Transformer architectures adapted from LLaMA, a model family originally designed for language modeling, and repurposed for visual recognition through architectural and training strategies that overcome the inherent limitations of causal masking on images. The canonical implementation, image LLaMA (iLLaMA), combines a post-sequence class token (PS [CLS]) mechanism with a soft-mask warmup schedule, enabling causal self-attention to function effectively in vision settings while delivering computational advantages and strong empirical performance across multiple benchmarks.

1. LLaMA Decoder-Only Transformer Architecture

The LLaMA decoder-only architecture is constructed exclusively from decoder blocks and features the following core components for an input sequence $X \in \mathbb{R}^{N \times d}$ (a minimal block sketch follows the list):

  • Causal Self-Attention Layer:
    • Query, Key, Value projections: $Q = W_q X$, $K = W_k X$, $V = W_v X$
    • Scaled dot-product attention: $A = \frac{Q K^T}{\sqrt{d}}$
    • Causal masking: output $O = \mathrm{Softmax}(A + M_\text{causal})\,V$, where $M_\text{causal} \in \{0, -\infty\}^{N \times N}$ with $M_{i,j} = 0$ for $i \ge j$ and $M_{i,j} = -\infty$ for $i < j$
  • Feed-Forward Network (FFN):
    • Gated structure with SwiGLU activation: $\mathrm{FFN}(X) = W_2\big(\mathrm{SiLU}(W_1 X) \odot W_3 X\big)$, where $W_2$ serves as the output linear projection
  • Root-Mean-Square Layer Norm (RMSNorm): Applied before each attention and FFN sublayer (pre-normalization) in place of standard LayerNorm.
  • Rotary Positional Embeddings (RoPE): Incorporated within the attention mechanism.
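
The following is a minimal PyTorch sketch of such a decoder block, assuming the component layout described above; it is an illustration, not the reference implementation. Rotary embeddings and the learnable positional embeddings mentioned later are omitted for brevity, and all module and parameter names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: x / sqrt(mean(x^2) + eps), scaled by a learned weight."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class CausalSelfAttention(nn.Module):
    def __init__(self, d, heads):
        super().__init__()
        self.h, self.dh = heads, d // heads
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(d, d, bias=False) for _ in range(4))

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = (w(x).view(b, n, self.h, self.dh).transpose(1, 2) for w in (self.wq, self.wk, self.wv))
        a = q @ k.transpose(-2, -1) / self.dh ** 0.5                        # A = Q K^T / sqrt(d_head)
        mask = torch.full((n, n), float("-inf"), device=x.device).triu(1)   # 0 for i >= j, -inf for i < j
        o = (a + mask).softmax(-1) @ v                                      # O = Softmax(A + M_causal) V
        return self.wo(o.transpose(1, 2).reshape(b, n, d))

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w1, self.w3 = nn.Linear(d, hidden, bias=False), nn.Linear(d, hidden, bias=False)
        self.w2 = nn.Linear(hidden, d, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))                     # W2 (SiLU(W1 x) * W3 x)

class DecoderBlock(nn.Module):
    """Pre-norm residual block: x + Attn(RMSNorm(x)), then x + FFN(RMSNorm(x))."""
    def __init__(self, d, heads, hidden):
        super().__init__()
        self.n1, self.n2 = RMSNorm(d), RMSNorm(d)
        self.attn, self.ffn = CausalSelfAttention(d, heads), SwiGLU(d, hidden)

    def forward(self, x):
        x = x + self.attn(self.n1(x))
        return x + self.ffn(self.n2(x))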

While this architecture excels at autoregressive text generation, applying it naively to images, with patches as tokens and a class ([CLS]) token in the first position, leads to training collapse: under causal masking the class token cannot attend to any subsequent patch tokens, so it aggregates no image information and gradient flow through it is ineffective.

2. Post-Sequence Class Token Strategy

To circumvent attention collapse under causal masking, the post-sequence class token (PS [CLS]) strategy positions the [CLS] token at the end of the patch sequence, not at the beginning:

  • Input sequence: $[\text{patch}_1, \text{patch}_2, \ldots, \text{patch}_P, \mathrm{CLS}]$, with $N = P + 1$.
  • Under causal masking, the [CLS] token at index $N$ can attend to all preceding patch tokens (the causal constraint $i \ge j$ is satisfied for every position $j$), permitting it to aggregate a global image representation.
  • No custom or hybrid masks are required; the standard lower-triangular mask suffices.

Token Preparation Example (a minimal sketch; patch_embed and cls_token stand in for a learned patch-embedding module and a learned class-token parameter):

import torch

def prepare_tokens(image, patch_embed, cls_token):
    patches = patch_embed(image)                        # (P, d) patch embeddings
    # Append the class token AFTER the patch tokens (post-sequence [CLS]).
    sequence = torch.cat([patches, cls_token], dim=0)   # shape (P + 1, d)
    return sequence

Positioning the class token at the head of the sequence under a causal mask leaves it with no other tokens to attend to, whereas the PS [CLS] resolves this without any architectural or masking exceptions.
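
This can be verified directly on the standard additive causal mask: the first row permits attention only to position 0, while the last row is fully unmasked. A minimal check (illustrative, not from the source):

import torch

P = 4                                              # toy number of patch tokens
N = P + 1                                          # patches plus the post-sequence [CLS]

# Additive causal mask: 0 where attention is allowed (i >= j), -inf where it is blocked (i < j).
mask = torch.full((N, N), float("-inf")).triu(diagonal=1)

print(mask[0])    # [0, -inf, -inf, -inf, -inf]: a leading [CLS] would see only itself
print(mask[-1])   # [0, 0, 0, 0, 0]: the PS [CLS] attends to every patch token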

3. Causal Self-Attention and Soft-Mask Warmup

Causal attention employs the mask: $M_\text{full}(i, j) = \begin{cases} 0, & i \ge j \\ -\infty, & i < j \end{cases}$

To stabilize early-stage training, a soft mask interpolates between the bidirectional mask ($M_\text{id}$: all zeros) and the strict causal mask ($M_\text{full}$): $M_t = (1 - \alpha_t) M_\text{full} + \alpha_t M_\text{id}$, where the warmup scalar $\alpha_t \in [0, 1]$ decreases from $1$ (fully bidirectional) to $0$ (strictly causal) over a schedule.

Alternatively, in the attention-weight domain: $S_t = \alpha_t \mathbf{1}\mathbf{1}^T + (1 - \alpha_t) C$, where $C_{i,j} = 1$ for $i \ge j$ and $C_{i,j} = 0$ for $i < j$, with attention output $O = \big(\mathrm{Softmax}(A) \odot S_t\big) V$.

This mechanism smooths optimization, mitigating underfitting and preventing collapse during the convergence process of decoder-only architectures on visual data.
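
A minimal sketch of the attention-weight-domain formulation above, assuming tensors q, k, v of shape (..., N, d); this illustrates the interpolation and is not the reference implementation.

import torch

def soft_mask_weights(alpha: float, n: int) -> torch.Tensor:
    """S_t = alpha * (all-ones) + (1 - alpha) * C, with C the lower-triangular causal matrix."""
    causal = torch.tril(torch.ones(n, n))              # C_{ij} = 1 for i >= j, else 0
    return alpha * torch.ones(n, n) + (1.0 - alpha) * causal

def soft_masked_attention(q, k, v, alpha):
    """O = (Softmax(A) * S_t) V with A = Q K^T / sqrt(d), '*' taken elementwise."""
    d = q.shape[-1]
    a = (q @ k.transpose(-2, -1)) / d ** 0.5
    s_t = soft_mask_weights(alpha, q.shape[-2])
    return (a.softmax(dim=-1) * s_t) @ v

Here alpha plays the role of the warmup scalar $\alpha_t$: a value of 1 recovers fully bidirectional attention and 0 the strictly causal case.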

4. Training Protocols and Architectural Variants

The supervised training procedure with soft-mask warmup follows:

  • Algorithm Steps (a minimal training-loop sketch closes this section):
    • Inputs: total epochs $T$, cutoff epoch $t_c$, base learning rate $\eta$, warmup epochs $w$, and schedule type (linear or constant).
    • At each epoch $e = 1, \ldots, T$:
      • Compute $\alpha_e = \max\{0,\, 1 - e/t_c\}$ (linear), or $\alpha_e = 1$ for $e \le t_c$ and $\alpha_e = 0$ otherwise (constant).
      • Set $M_e = (1 - \alpha_e) M_\text{full} + \alpha_e M_\text{id}$.
      • Forward pass: causal self-attention with mask $M_e$ and the PS [CLS] token.
      • Backward pass: update with AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and weight decay $0.05$.
      • Learning rate: cosine decay from $\eta$ to $0$, with linear warmup over the first $w$ epochs.
  • Training Hyperparameters for ImageNet-1K:
    • $T = 300$, $t_c \in \{25, 50, 100\}$, $w = 50$, $\eta = 4 \times 10^{-3}$ (Tiny/Small/Base), $\eta = 1 \times 10^{-3}$ (Large pretraining)
    • Data augmentations: RandAugment, Mixup (0.1–0.95), CutMix (0.1–1.0), label smoothing 0.1
  • Architectural Variants: Four isotropic iLLaMA models are instantiated, mirroring ViT scaling.
Model      Depth   Embedding Dim   Heads   #Params   MACs
Tiny (T)    12         192           3      5.7 M    1.3 G
Small (S)   12         384           6     21.9 M    4.6 G
Base (B)    12         768          12     86.3 M   17.6 G
Large (L)   24        1024          16    310.2 M   62.8 G

Notable modifications relative to ViT include: SwiGLU FFN, RMSNorm layers, causal self-attention with PS [CLS] and rotary embeddings, and retention of learnable 2D positional embeddings.
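
The following compresses the algorithm steps above into a training-loop sketch. It assumes a model whose attention layers accept the soft-mask coefficient through a hypothetical set_soft_mask_alpha hook, a standard cross-entropy classification loss, and a generic data loader; the hyperparameter defaults follow the ImageNet-1K recipe listed above, but this is a sketch rather than the reference training code.

import math
import torch

def alpha_schedule(epoch, t_c, kind="linear"):
    """Soft-mask coefficient: 1 = bidirectional, 0 = strictly causal."""
    if kind == "linear":
        return max(0.0, 1.0 - epoch / t_c)
    return 1.0 if epoch <= t_c else 0.0              # "constant" schedule

def train(model, loader, T=300, t_c=50, w=50, base_lr=4e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            betas=(0.9, 0.999), weight_decay=0.05)
    for epoch in range(1, T + 1):
        # Linear LR warmup for the first w epochs, cosine decay to 0 afterwards.
        if epoch <= w:
            lr = base_lr * epoch / w
        else:
            lr = 0.5 * base_lr * (1 + math.cos(math.pi * (epoch - w) / (T - w)))
        for g in opt.param_groups:
            g["lr"] = lr

        alpha = alpha_schedule(epoch, t_c)           # soft-mask warmup coefficient
        model.set_soft_mask_alpha(alpha)             # hypothetical hook on the attention layers

        for images, labels in loader:
            loss = torch.nn.functional.cross_entropy(model(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()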

5. Empirical Performance across Tasks

iLLaMA exhibits competitive results compared to encoder-only ViTs:

  • ImageNet-1K (224×224, supervised):
    • iLLaMA-T: 75.0% top-1 (vs. DeiT-Ti: 72.2%)
    • iLLaMA-S: 79.9% (vs. DeiT-S: 79.8%)
    • iLLaMA-B: 81.6% (vs. DeiT-B: 81.8%)
    • Fine-tuned at 384×384: iLLaMA-B → 83.0%
  • ImageNet-21K Pretraining + 1K Finetuning:
    • iLLaMA-B: 83.6% @224, 85.0% @384
    • iLLaMA-L: 84.8% @224, 86.0% @384
  • Model Calibration (Expected Calibration Error):
    • ConvNeXt-B: 0.0281
    • DeiT3-B: 0.0415
    • iLLaMA-B: 0.0335
  • Shape–Texture Bias (shape-preference, higher is better):
    • ConvNeXt-B: 33.3%
    • DeiT3-B: 39.9%
    • iLLaMA-B: 41.5%
  • Attention Map Rank (Layer 1, Head 1):
    • ViT-T: rank ≈ 81; iLLaMA-T: rank ≈ 129
    • The more uniform singular-value distribution of iLLaMA attention maps suggests greater representational capacity.
  • Task Transfer:
    • ADE20K semantic segmentation (UperNet): iLLaMA-T: 37.7 mIoU (vs. ViT-T: 39.8); iLLaMA-B: 45.1 (vs. ViT-B: 47.3)
    • CIFAR10/100: iLLaMA-T: 97.9%/84.8%; +soft mask: 97.9%/85.5%
  • Quantization Robustness: 8-bit weights/activations yield iLLaMA-T at 72.4% top-1, matching DeiT-Ti (32-bit).

6. Computational Efficiency and Representational Properties

Computational analysis for attention (sequence length $N$, embedding dimension $D$), with a worked comparison after the list:

  • Bidirectional Attention: $4ND^2 + 2N^2D$ FLOPs
  • Causal Attention: $4ND^2 + N^2D + (\lfloor N^2/2 \rfloor + 1)D$ FLOPs, saving approximately $N^2D/2$ relative to the bidirectional case
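
As a concrete illustration of these per-layer formulas, the sketch below evaluates both for an assumed Base-sized setting (224×224 input, 16×16 patches, so 196 patch tokens plus the PS [CLS], and $D = 768$); these token counts are illustrative, not figures from the source.

def bidirectional_attention_flops(n: int, d: int) -> int:
    return 4 * n * d ** 2 + 2 * n ** 2 * d

def causal_attention_flops(n: int, d: int) -> int:
    return 4 * n * d ** 2 + n ** 2 * d + (n ** 2 // 2 + 1) * d

N, D = 197, 768   # illustrative: 14 x 14 = 196 patches + 1 PS [CLS]; Base-sized embedding dim
bi = bidirectional_attention_flops(N, D)   # ~5.24e8 FLOPs
ca = causal_attention_flops(N, D)          # ~5.09e8 FLOPs
print(bi - ca)                             # ~1.49e7, i.e. roughly N**2 * D / 2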

Elevated attention-map rank in iLLaMA indicates richer cross-token relationships. The soft-mask warmup smooths the optimization landscape, reducing underfitting during supervised training of the decoder-only design on visual data.

7. Implications and Outlook

The iLLaMA architecture demonstrates that a decoder-only Transformer originally designed for text can serve effectively as a vision backbone with minimal adaptation. The post-sequence class token addresses the collapse caused by naive causal masking, and the soft-mask warmup improves training dynamics, yielding models that rival encoder-only ViTs in classification, calibration, and transfer while offering computational savings and higher attention-map rank without bespoke architectural exceptions. These findings indicate a viable pathway toward unified multimodal decoders in which images and text are processed within a common LLaMA-style architecture (Wang et al., 10 Apr 2024).
