Papers
Topics
Authors
Recent
Search
2000 character limit reached

Meta-SecAlign-70B: Robust Open-Weight 70B Model

Updated 26 March 2026
  • The paper integrates SecAlign++ with DPO using LoRA adapters to achieve commercial-grade robustness against indirect prompt injection attacks.
  • Meta-SecAlign-70B is a 70B-parameter open model that leverages domain adaptation and model merging to maintain general language proficiency while reducing SEC text perplexity.
  • Empirical results reveal that the model sustains nearly baseline utility with ASR below 2% on many security tasks, validated across diverse benchmarks.

Meta-SecAlign-70B is a 70-billion-parameter open-weight LLM with integrated model-level defenses against indirect prompt injection (PI) attacks. Building on Meta’s Llama-3.3-70B-Instruct, it implements an advanced form of the SecAlign defense (SecAlign++) using Direct Preference Optimization (DPO) with LoRA adapters. This approach yields commercial-grade robustness to prompt injection while preserving general instruction-following utility and task performance. Meta-SecAlign-70B is also distinguished by its open training protocol, domain adaptation experiments, and comprehensive benchmarking across utility and security tasks (Chen et al., 3 Jul 2025, Siriwardhana et al., 2024).

1. Model Architecture and Foundation

Meta-SecAlign-70B retains the underlying transformer architecture of Llama-3.3-70B-Instruct:

  • Parameter count: ≈70B
  • Depth: 32 transformer layers
  • Hidden dimension: 8,192
  • Feed-forward size: 32,768
  • Attention heads/layer: 64
  • Context window: 4,096 tokens

The only structural modification is at the prompt and embedding layer: an <input>...</input> role embedding is introduced between the trusted user prompt and assistant output. No new parameters or layers are added; instead, a distinct learned embedding for the “input” tag is slotted into the token lookup table (Chen et al., 3 Jul 2025).

2. Defense Mechanism: SecAlign++

SecAlign++ extends the original SecAlign model-level PI defense via two main innovations:

  • Randomized injection position: A “poison” instruction sampled from the dataset is concatenated either before or after the input data during fine-tuning.
  • Self-generated responses: The desirable completion is obtained by running the base model (undefended Llama-3.3-70B-Instruct) on the clean instruction-data pair, rather than relying on existing ground truth, thereby matching the model’s natural style and minimizing distribution shift.

The defense mechanism is implemented via DPO on LoRA adapters. For each adversarial input triplet (x, y_w, y_l) comprising the input x, a “winning” clean response y_w, and a “losing” adversarial completion y_l, DPO maximizes the probability that the model ranks safe completions above adversarial ones:

LDPO(θ)=logσ ⁣[β(logπθ(ywx)logπθ(ylx))]L_\text{DPO}(\theta) = -\log\,\sigma\!\left[\beta\left(\log\pi_\theta(y_w\mid x) - \log\pi_\theta(y_l\mid x)\right)\right]

where σ denotes the sigmoid function, β is a scaling hyperparameter (β=0.1), and πθ\pi_\theta is the model distribution. Adaptation is fast and memory-efficient since only (A,B) in the LoRA parameterization are trained, with r=32, α=8, and dropout=0.1, applied selectively to key transformer submodules (Chen et al., 3 Jul 2025).

3. Training Protocol and Domain Adaptation

3.1 Fine-Tuning for PI Robustness

  • Base: Initialize from Llama-3.3-70B-Instruct
  • Data: Cleaned-Alpaca with explicit data fields; 19,157 training examples
  • Augmentation: Adversarial contexts generated by randomized instruction injection and response self-generation
  • Objective: DPO on LoRA adapters
  • Schedule: 3 epochs, batch size 64, AdamW without weight decay, on 8×NVIDIA H200s for ~7h
  • Inference: α can be varied at inference to control the security-utility trade-off

3.2 Domain-Adaptive Pre-training and Model Merging

Domain adaptation experiments focused on continual pre-training (CPT) with 70B tokens from SEC filings, blended with 1B RedPajama general-domain tokens (constant 98.6:1.4 mix). CPT is performed using AdamW (β₁=0.9, β₂=0.95, ε=1e-8, decay=0.1) and cosine learning-rate scheduling on 128×NVIDIA H100s. To mitigate catastrophic forgetting, TIES model merging [Yadav et al., 2023] is used, mixing CPT and instruction-tuned checkpoints with per-layer mixing coefficients (α) set separately for MLP and attention blocks. Merging yields a model that preserves general language proficiency while retaining domain gains (Siriwardhana et al., 2024).

4. Empirical Results and Evaluation Benchmarks

4.1 Utility Benchmarks

Meta-SecAlign-70B is evaluated on a suite of nine general knowledge and instruction benchmarks (MMLU 0-shot, MMLU-Pro 5-shot, IFEval, BBH, GPQA-Diamond, AlpacaEval2, SEP, AgentDojo utility, WASP utility). Across these, the defended model sacrifices at most 2 percentage points of performance relative to the undefended base, and matches or exceeds commercial closed-source models (e.g., GPT-4o-mini) in several cases.

Benchmark Undefended Meta-SecAlign-70B GPT-4o-mini
MMLU 0-shot (%) 86.3 85.9 82.0
MMLU-Pro 5-shot (%) 67.7 67.6 64.8
AgentDojo Utility (%) 56.7 77.3 67.0
WASP Utility (%) 62.2 59.5 27.0

4.2 Security Benchmarks

Seven prompt injection security benchmarks are used (AlpacaFarm ASR, SEP ASR, TaskTracker ASR, CyberSecEval2 ASR, InjecAgent ASR, AgentDojo ASR, WASP intermediate and end-to-end ASR):

Benchmark Undefended Meta-SecAlign-70B GPT-4o-mini
AlpacaFarm ASR (%) 93.8 1.4 0.5
SEP ASR (%) 88.4 4.8 14.6
TaskTracker ASR (%) 19.6 0.2 0.3
CyberSecEval2 ASR (%) 52.7 1.8 25.5
InjecAgent ASR (%) 53.8 0.5 3.3

Meta-SecAlign-70B consistently achieves ASR below 2% on most tasks, with overall performance matching or exceeding commercial-grade defenses (Chen et al., 3 Jul 2025).

4.3 Domain and Generalization Analysis

Domain CPT (20B tokens) decreases SEC text perplexity by 20% but induces 5–8% drop in general QA. Merging recovers general QA scores (within 1–2% of baseline) while retaining reduced domain perplexity (net 10–12% gain). In downstream SEC tasks, the merged model outperforms both base and CPT-only for ConvFinQA, TAT-QA, and finance classification (Siriwardhana et al., 2024).

5. Robustness Limits and Adversarial Evaluation

While Meta-SecAlign-70B was robust to all black-box and rule-based PI attacks in standard benchmarks, recent work demonstrates that white-box, gradient-based attack optimization completely circumvents its defenses under unbounded compute (Panfilov et al., 25 Mar 2026). Specifically, “Claude_v63” and “Claude_v82,” discovered via an LLM-powered autoresearch pipeline, achieve 100% attack success rate (ASR) versus 56% for prior baselines. These attacks employ ADC+LSGM+adaptive sparsity, targeting the token-forcing loss under a fixed FLOPs budget and requiring no semantic alignment with target tasks. This indicates that current model-level defenses at the architectural and fine-tuning level may be insufficient against strong white-box adversaries and that further work in gradient obfuscation or certified robustness is necessary (Panfilov et al., 25 Mar 2026).

6. Security-Utility Trade-Offs and Limitations

Fine-grained control over the LoRA scaling parameter α at inference allows for modulation between utility and PI robustness. Lower α maximizes model utility but increases ASR, whereas higher α reduces ASR to below 2% with minimal (less than 1 point) utility loss. Closed-source commercial models, when run in secured mode, typically experience at least 6-point utility drops. The approach is currently effective only for indirect PI attacks and has not been extended to multimodal PI, system-level jailbreaks, or model extraction scenarios. Extension to other architectures such as Mixture-of-Experts or specialist models remains nontrivial and may require significant protocol modifications.

7. Release and Reproducibility

Meta-SecAlign-70B is released as an open-weight model with accompanying training and evaluation code to foster open research on prompt injection attacks and defenses. Referenced checkpoints, merge configurations, and datasets are fully documented and available to enable benchmarking and further innovation (Chen et al., 3 Jul 2025, Siriwardhana et al., 2024).


Cited Works:

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Meta-SecAlign-70B.