PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces (2509.21768v1)
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on tasks such as mathematics and code generation. Motivated by these strengths, recent work has empirically demonstrated the effectiveness of LRMs as guard models in improving harmful query detection. However, LRMs typically generate long reasoning traces during inference, causing substantial computational overhead. In this paper, we introduce PSRT, a method that replaces the model's reasoning process with a Prefilled Safe Reasoning Trace, thereby significantly reducing the inference cost of LRMs. Concretely, PSRT prefills "safe reasoning virtual tokens" from a constructed dataset and learns over their continuous embeddings. With the aid of indicator tokens, PSRT enables harmful-query detection in a single forward pass while preserving the classification effectiveness of LRMs. We evaluate PSRT on 7 models, 13 datasets, and 8 jailbreak methods. In terms of efficiency, PSRT completely removes the overhead of generating reasoning tokens during inference. In terms of classification performance, PSRT achieves nearly identical accuracy, with only a minor average F1 drop of 0.015 across 7 models and 5 datasets.
Explain it Like I'm 14
Overview
This paper is about making AI “safety filters” faster. These filters sit in front of a chatbot and check whether a user’s message is harmful (like asking for instructions to hurt someone) or harmless. The authors use special AI models called Large Reasoning Models (LRMs), which are good at “thinking through” problems step by step. LRMs can catch tricky harmful requests, but they usually write long explanations before deciding, which is slow. The paper introduces a method called PSRT that lets these models make safe/unsafe decisions quickly, without writing out their full reasoning each time.
Key Questions
The paper explores three simple questions:
- Can we keep the strong safety detection of reasoning-heavy models while removing their slow “think-out-loud” steps?
- Can a learned shortcut (a “safe reasoning trace”) help the model decide safe vs. unsafe in one quick go?
- Will this work across many models, datasets, and “jailbreak” tricks that try to fool the model?
How the Method Works
Think of an LRM like a student who writes a full solution before giving an answer. That’s accurate but slow. PSRT gives the student a high-quality, prewritten study note they can rely on, so they can answer quickly without rewriting everything.
The approach has three main steps:
- Build a reasoning dataset: For each message, the model first analyzes the user’s intent (“What is the person really asking?”), then explains why it’s safe or unsafe. Answers start with tags like <safe> or <unsafe>.
- Train the model: Use supervised fine-tuning (SFT) so the model learns to produce good safety reasoning and the correct tag.
- Prefill a “safe reasoning trace”: Instead of generating brand-new reasoning for every message, the system learns a set of “virtual tokens” (think of them as a compact, learned cheat sheet written in the model’s internal language of numbers, called embeddings). These tokens are appended after the user’s message and optimized so the model can rely on them as if it had written out the full reasoning. Then, in a single forward pass (one quick look), the model picks either <safe> or <unsafe>; a minimal code sketch follows the list below.
Key technical ideas explained simply:
- “Reasoning trace” = the model’s “thinking out loud.”
- “Virtual tokens” = invisible helper notes the model understands.
- “Embeddings” = how the model represents words/ideas as numbers.
- “Single forward pass” = the model looks once and decides, instead of typing a long explanation and then deciding.
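To make the single-forward-pass idea concrete, here is a minimal PyTorch/Transformers sketch. It assumes a guard LRM already fine-tuned so that answers begin with the <safe>/<unsafe> indicator tokens, plus a learned tensor of virtual-token embeddings (`safe_trace`); the model path, file name, and the simple argmax decision are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative names/paths; the released PSRT code may differ.
MODEL_NAME = "path/to/sft-guard-lrm"   # guard LRM fine-tuned with <safe>/<unsafe> indicators
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Learned "prefilled safe reasoning trace": l virtual tokens in embedding space,
# optimized offline; shape [l, hidden_size]. Loaded here as a hypothetical artifact.
safe_trace = torch.load("safe_trace.pt")

safe_id = tokenizer.convert_tokens_to_ids("<safe>")
unsafe_id = tokenizer.convert_tokens_to_ids("<unsafe>")

@torch.no_grad()
def classify(query: str) -> str:
    # In practice the query would first be wrapped in the model's chat/prompt template.
    ids = tokenizer(query, return_tensors="pt").input_ids
    query_emb = model.get_input_embeddings()(ids)             # [1, T, H]
    trace_emb = safe_trace.to(query_emb.dtype).unsqueeze(0)   # [1, l, H]
    inputs_embeds = torch.cat([query_emb, trace_emb], dim=1)  # append virtual tokens

    # Single forward pass: compare the next-token logits of the two indicator tokens.
    logits = model(inputs_embeds=inputs_embeds).logits[0, -1]
    return "<unsafe>" if logits[unsafe_id] > logits[safe_id] else "<safe>"
```

No reasoning tokens are generated: the decision costs one forward pass over the query plus the fixed-length prefilled trace.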
Main Findings and Why They Matter
Across 7 different models, 13 datasets, and 8 jailbreak attacks, PSRT:
- Removes the time spent generating long reasoning: It cuts hundreds of output tokens per query. Less text generation means faster, cheaper safety checks.
- Keeps accuracy nearly the same: On mixed datasets, the average F1 score drops by only about 0.015, which is very small. In some cases, detection even improves (especially for certain model families).
- Works on both harmful and harmless messages: It keeps a high true positive rate (catching harmful queries) and a low false positive rate (not flagging harmless ones).
- Handles “jailbreaks”: It maintains strong performance against tricky prompts designed to bypass safety.
In short, PSRT makes safety filters faster while preserving their ability to catch harmful requests—even clever ones.
Implications and Impact
- Faster, cheaper safety: Apps and websites using AI can screen messages in real time with less delay and lower computing cost.
- A new way to use “reasoning”: The paper shows you don’t always need models to write out their thoughts—those thoughts can be “condensed” into learned helper notes.
- Broadly useful: The idea could be applied to other tasks where models normally write long reasoning (like math or coding) but you need quick decisions.
- Practical caution: Good results depend on having a well-built dataset—especially for diverse, sensitive topics—so future work should expand and refine training data.
Overall, PSRT is a smart shortcut: it keeps the benefits of deep reasoning for safety, but delivers decisions quickly, making AI systems safer and more practical to use.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The following points identify what remains missing, uncertain, or unexplored, framed to guide future research on PSRT and LRM-based guard models:
- Data and annotation quality
- Reliance on synthetic chain-of-thought traces produced by DeepSeek-V3.1 without human validation; quantify label fidelity, bias, and cross-model sensitivity, and compare against human-annotated reasoning traces.
- Limited coverage of “sensitive but harmless” content acknowledged by the authors; build broader, diversified datasets across cultures, slang, obfuscations, and evolving policy definitions.
- Design of the safe reasoning trace
- Use of a single, global safe reasoning trace r_s for all queries and domains; investigate conditional/multi-trace approaches (e.g., per-domain, per-attack, or mixture-of-experts traces with learned gating) and measure benefits over a global average.
- Heuristic selection of the trace length l via validation; develop principled, adaptive length selection (e.g., learned or confidence-based) and characterize the performance–compute Pareto frontier.
- Append-only placement of soft tokens after the query; study alternative insertion strategies (prefix, interleaved, or layer-wise soft prompts) and positions that optimize performance or robustness.
- Averaging truncated/padded embeddings to initialize r_s; evaluate alternative initializations (e.g., K-means centroids, PCA/low-rank bases, attention-pooled prototypes, or learned meta-initializations) and their effect on convergence and final accuracy (a minimal sketch of the averaging scheme appears at the end of this list).
- Single r_s shared across harmful and harmless decisions; explore separate or conditional traces per class, severity, or taxonomy.
- Methodological and theoretical aspects
- Indicator-token dependence (<safe>, <unsafe>) for binary classification; extend to multi-label safety taxonomies, severity tiers, policy attributes, and calibrated confidence scores.
- Vulnerability of indicator-token probabilities to prompt injection and adversarial manipulation; develop margin-based decision rules, calibrated thresholds, logit regularization, or robust decoding to harden against targeted attacks on the classification mechanism.
- ELBO-based point-estimate interpretation of r_s and the L-Lipschitz continuity assumption; validate assumptions empirically, quantify bound tightness, and compare against variational or amortized inference over latent reasoning (e.g., learn q(r|q) rather than a single point r_s).
- Continual learning and drift: define procedures to update r_s as content and attack distributions evolve, including online adaptation, periodic re-initialization, and stability–plasticity trade-offs.
- Robustness and security
- Attack surface specific to PSRT (fixed soft embeddings appended at inference): assess susceptibility to adversarial queries that steer logits of <safe>/<unsafe> or exploit the appended soft tokens, and test defenses (randomized ensembles of traces, per-batch perturbations, or gating).
- Red-teaming beyond the 8 jailbreak methods, especially attacks that target PSRT’s mechanism (e.g., indicator-token hijacking, adversarial suffixes that neutralize r_s, or training-time poisoning of trace initialization).
- Error analysis and failure modes: systematically characterize cases where PSRT underperforms (e.g., the Llama-based drop on jailbreak datasets), identify model-specific properties predicting success/failure, and derive mitigation strategies.
- Evaluation scope and metrics
- Latency proxy via “number of tokens” only; report wall-clock inference time, throughput, and energy across hardware (GPU/CPU, batch sizes), and include end-to-end pipeline measurements with guard integration.
- VRAM/memory footprint and context utilization of appending l≈250 soft tokens; quantify impact on maximum effective context, batching efficiency, and memory-bound scenarios.
- Multilingual, code-mixed, and low-resource settings; evaluate cross-lingual generalization and domain shifts beyond primarily English datasets.
- Multi-turn conversations and contextual moderation; extend evaluation to dialogue histories where harmfulness depends on prior turns.
- Calibration and selective prediction: add AUROC, AUPRC, ECE/calibration error, abstention strategies, and cost-sensitive metrics to complement TPR/FPR/F1.
- Helpfulness impact: measure effects on benign task quality and downstream system behavior when PSRT is embedded in a guard pipeline (e.g., false positives blocking helpful responses, recovery mechanisms).
- Comparisons and baselines
- Head-to-head comparison with alternative acceleration methods (e.g., reasoning-length pruning, summarization, speculative decoding, caching, or retrieval-augmented guards) under matched compute budgets.
- Benchmark against standard soft-prompt/prefix-tuning baselines tailored to safety classification to isolate PSRT’s contribution over established methods.
- Generalization and portability
- Scalability to much larger/smaller LRMs (e.g., O-series, DeepSeek R models >70B, mobile-scale models) and portability across architectures; test whether r_s or learned procedures transfer between model families.
- Cross-task transfer: can a trace learned for input moderation transfer to output moderation (response filtering) or to adjacent safety tasks (policy compliance, privacy, misinformation)?
- Interoperability with RLHF/preference training (DPO, GRPO): characterize interactions (synergy or interference) between reinforcement objectives and PSRT, including training schedules and stability.
- Deployment and operations
- Thresholding and confidence policies for indicator tokens in production, including fallback behavior (abstain/escalate) under uncertainty; provide guidance for tuning per-application risk tolerance.
- Cost analysis: quantify the training overhead (SFT + PSRT optimization) versus inference-time savings, and provide amortization curves under realistic traffic.
- Integration with streaming/batched APIs and output filtering pipelines; address practical concerns like batching with varying r_s lengths, logging/auditing requirements, and compliance workflows.
- Interpretability and governance
- Loss of explicit reasoning traces reduces transparency and auditability; investigate methods to recover faithful post-hoc explanations, confidence rationales, or verifiable decision artifacts that meet compliance needs.
- Human-in-the-loop review and governance: define protocols for auditing PSRT decisions, triaging borderline cases, and monitoring for drift or policy changes.
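As referenced in the bullet above on averaging truncated/padded embeddings, one concrete way to realize that initialization of r_s is sketched below: each training trace's embedding sequence is truncated or padded to a fixed length l, then averaged elementwise. The function name and exact padding choice are assumptions for illustration, and the result would subsequently be refined by gradient-based optimization in embedding space.

```python
import torch

def init_safe_trace(trace_embeddings, l, pad_embedding):
    """Initialize r_s by truncating/padding each reasoning-trace embedding
    sequence to length l and averaging elementwise (a sketch of the scheme
    described in the paper; exact details may differ).

    trace_embeddings: list of tensors, each [T_i, hidden_size]
    pad_embedding:    tensor [hidden_size], e.g. the embedding of the PAD token
    """
    aligned = []
    for emb in trace_embeddings:
        if emb.size(0) >= l:
            emb = emb[:l]                                      # truncate long traces
        else:
            pad = pad_embedding.unsqueeze(0).expand(l - emb.size(0), -1)
            emb = torch.cat([emb, pad], dim=0)                 # pad short traces
        aligned.append(emb)
    # Elementwise mean over the dataset gives the starting point for r_s,
    # which is then optimized further in continuous embedding space.
    return torch.stack(aligned).mean(dim=0)                    # [l, hidden_size]

# Usage (hypothetical): r_s = init_safe_trace(embs, l=250, pad_embedding=e_pad)
```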
Practical Applications
Overview
The paper proposes PSRT, a practical method to accelerate Large Reasoning Model (LRM)–based guard models for harmful query detection by replacing explicit reasoning generation with prefilled “safe reasoning” virtual tokens (continuous embeddings). PSRT enables single-forward-pass classification via indicator tokens (e.g., <safe>, <unsafe>), delivering near-identical detection performance while eliminating reasoning-token generation during inference across diverse models and datasets.
Below are actionable applications derived from the paper’s findings, methods, and innovations.
Immediate Applications
These applications can be deployed now with current models and tooling; they focus on accelerating guardrails, reducing latency/cost, and improving throughput while maintaining accuracy.
- Drop-in single-pass safety gate for LLM APIs
- Sector: software/platforms, cloud AI services
- Tools/Products/Workflows: “PSRT Guard” module in serving stacks; pre-filtering unsafe prompts/outputs with indicator-token classification; optional cascade to full reasoning only for uncertain cases (a minimal routing sketch appears at the end of this list)
- Assumptions/Dependencies: Availability of LRM-based guard models and a domain-appropriate safe reasoning dataset; model-specific tuning (safe-trace length l); indicator-token instrumentation; periodic retraining for drift
- Real-time content moderation for chat/social platforms
- Sector: content moderation, trust & safety
- Tools/Products/Workflows: PSRT-powered moderation services for live chat, comments, and community platforms; faster triage and enforcement; batch and streaming pipelines
- Assumptions/Dependencies: Coverage of harmful categories and multilingual data; monitoring for evolving adversarial jailbreaks; integration with existing policy taxonomies
- On-device safety gating for assistants and edge deployments
- Sector: consumer devices, robotics, IoT
- Tools/Products/Workflows: Embedded PSRT classification to gate voice/text commands; low-latency rejection of unsafe instructions; hybrid mode that escalates to cloud for borderline cases
- Assumptions/Dependencies: Edge-compatible LRM variants (quantized/smaller models); energy and memory constraints; on-device inference optimizations
- Enterprise compliance firewall for internal LLM use
- Sector: finance, healthcare, legal, HR
- Tools/Products/Workflows: PSRT-based safety proxy (reverse proxy/gateway) for internal LLM traffic; DLP and policy checks; audit logs keyed to indicator tokens; risk dashboards (TPR/FPR)
- Assumptions/Dependencies: Domain-specific policy alignment and training data; regulatory mapping; documented escalation paths to human review
- Safer developer copilots and code assistants
- Sector: software engineering
- Tools/Products/Workflows: IDE plugins that gate harmful code suggestions (e.g., malware, exploits) using PSRT prefilters; block prompt-injection/jailbreak attempts
- Assumptions/Dependencies: Security-focused training data; integration with dev workflows; tuning thresholds to balance false positives/negatives
- Safety triage router in multi-agent systems
- Sector: AI orchestration, agent frameworks
- Tools/Products/Workflows: Single-pass PSRT classifier as a router that blocks unsafe tasks or routes uncertain cases to a “full-reasoning” agent or human
- Assumptions/Dependencies: Calibrated confidence thresholds; fallbacks for edge cases; logging and traceability
- Academic benchmarking and reproducible research on efficient safety reasoning
- Sector: academia
- Tools/Products/Workflows: Using the released code/dataset to study condensed reasoning embeddings, model differences (e.g., Qwen vs Llama), ELBO-based training, and Lipschitz error bounds
- Assumptions/Dependencies: Access to LRMs and compute; licensing/compliance for datasets; standardized evaluation protocols
- Customer support chat safety at scale
- Sector: customer service, BPO
- Tools/Products/Workflows: PSRT moderation for agent and user interactions; minimal latency added to large volumes; automated escalation and templated refusals
- Assumptions/Dependencies: Domain adaptation; maintaining service-level agreements; multilingual coverage
- Parental controls and daily content filters
- Sector: daily life, consumer apps
- Tools/Products/Workflows: Mobile/desktop filters for AI assistants and messaging that quickly flag unsafe content; configurable strictness levels
- Assumptions/Dependencies: Age-appropriate policy settings; minimizing false positives for benign sensitive topics; offline models for privacy
- Policy evaluation pilots and regulatory sandboxes
- Sector: public policy, regulation
- Tools/Products/Workflows: Pilot deployments demonstrating minimal-latency guardrails; indicator-token based auditability; standardized reporting of TPR/FPR across mixed datasets
- Assumptions/Dependencies: Regulator-approved metrics and test suites; transparency and documentation; governance for model updates
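Several workflows above mention cascading uncertain cases to a slower full-reasoning path. A hedged sketch of such a router, using the margin between the indicator-token logits as the uncertainty signal, is shown below; the threshold value, function names, and escalation target are illustrative and would need per-model tuning, and the cascade itself is a deployment pattern rather than something the paper evaluates.

```python
import torch

MARGIN_THRESHOLD = 2.0  # illustrative; would be tuned per model and application risk tolerance

@torch.no_grad()
def route(query, model, tokenizer, safe_trace, full_reasoning_guard):
    """Single-pass PSRT check first; escalate low-margin cases to a slower
    full-reasoning guard (or a human reviewer). A workflow sketch, not the
    authors' evaluated pipeline."""
    ids = tokenizer(query, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([emb, safe_trace.to(emb.dtype).unsqueeze(0)], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits[0, -1]

    safe_logit = logits[tokenizer.convert_tokens_to_ids("<safe>")]
    unsafe_logit = logits[tokenizer.convert_tokens_to_ids("<unsafe>")]
    margin = (safe_logit - unsafe_logit).abs()

    if margin >= MARGIN_THRESHOLD:
        return "block" if unsafe_logit > safe_logit else "allow"
    # Uncertain: fall back to the expensive path (explicit reasoning generation,
    # a secondary guard model, or human review).
    return full_reasoning_guard(query)
```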
Long-Term Applications
These applications require additional research, scaling, adaptation, or standardization before widespread deployment.
- Extending “condensed reasoning” to other tasks (math, code debugging, safety-critical decision support)
- Sector: education, software, professional services
- Tools/Products/Workflows: “Reasoning Capsules” (task-specific virtual token embeddings) for fast inference in non-safety tasks
- Assumptions/Dependencies: Demonstrating parity or acceptable trade-offs vs full chain-of-thought; task-tailored datasets and evaluation
- Adaptive/domain-specific PSRT profiles and personalization
- Sector: enterprise, consumer
- Tools/Products/Workflows: Multiple safe-trace embeddings tied to policies (e.g., medical compliance, workplace harassment), dynamically selected per user or task
- Assumptions/Dependencies: MLOps for profile management; data governance; conflict resolution among policies
- Continual learning against evolving jailbreaks and prompt injection
- Sector: cybersecurity, trust & safety
- Tools/Products/Workflows: Automated pipelines to mine new attacks, update safe-traces, and re-tune indicator thresholds; integration with red-teaming and anomaly detection
- Assumptions/Dependencies: Robust data collection; safe retraining loops; monitoring for catastrophic forgetting
- Safety standards and certifications for guard models
- Sector: policy/regulation, industry consortia
- Tools/Products/Workflows: Standardized benchmarks and certification criteria (F1/TPR/FPR targets, latency budgets); compliance reports using indicator tokens and documented fallback behavior
- Assumptions/Dependencies: Multistakeholder agreement; cross-vendor interoperability; periodic recertification
- Hardware/software co-design for PSRT acceleration
- Sector: semiconductors, edge devices
- Tools/Products/Workflows: ASIC/SoC support for single-pass classification and safe-trace embeddings; optimized memory layouts and kernel fusion
- Assumptions/Dependencies: Hardware roadmap alignment; cost-benefit for device makers; standardized model interfaces
- Cross-lingual and multimodal PSRT (text+image+audio+code)
- Sector: media, robotics, healthcare
- Tools/Products/Workflows: Unified safety gating across modalities (e.g., unsafe image prompts, audio commands); multimodal indicator tokens
- Assumptions/Dependencies: Availability of multimodal LRMs; multimodal safety datasets; robust multilingual coverage
- Commercial productization: PSRT Safety Gateway and SDKs
- Sector: software tooling
- Tools/Products/Workflows: Managed PSRT services; SDKs for embedding integration; observability dashboards (TPR/FPR, drift, latency); “Safety Router” for cascaded workflows
- Assumptions/Dependencies: Customer adoption; SLAs; integration with existing moderation and logging stacks
- Nuanced policy actions via richer indicator-token taxonomies
- Sector: policy, enterprise governance
- Tools/Products/Workflows: Expanded indicator set (<safe>, <needs_review>, <prohibited>, <sensitive_but_permissible>) enabling differentiated responses (refuse/transform/escalate); a toy classification sketch appears at the end of this list
- Assumptions/Dependencies: Training with multi-label indicators; policy versioning; human-in-the-loop for gray areas
- Privacy-preserving PSRT via federated training
- Sector: healthcare, finance, public sector
- Tools/Products/Workflows: Federated learning of safe-traces using local sensitive data; differential privacy to protect user content
- Assumptions/Dependencies: Federated infrastructure; privacy compliance; robust aggregation across heterogeneous clients
- Deeper theoretical and cognitive investigations
- Sector: academia
- Tools/Products/Workflows: Formal analysis of ELBO-based training for safe-traces; Lipschitz bounds in classification error; studies of model-specific behavior (e.g., Qwen vs Llama attention patterns)
- Assumptions/Dependencies: Research funding; access to diverse architectures; standardized protocols and datasets
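For the richer indicator-token taxonomy mentioned above, a toy sketch of deriving differentiated actions from per-class indicator probabilities follows; the token names, the action mapping, and the assumption that the model was trained with these extra indicators are hypothetical extensions of the paper's binary scheme.

```python
import torch

# Hypothetical extension of the binary <safe>/<unsafe> scheme to a richer taxonomy.
INDICATORS = ["<safe>", "<sensitive_but_permissible>", "<needs_review>", "<prohibited>"]
ACTIONS = {
    "<safe>": "pass through",
    "<sensitive_but_permissible>": "pass with softened response",
    "<needs_review>": "escalate to human review",
    "<prohibited>": "refuse",
}

@torch.no_grad()
def classify_policy(query, model, tokenizer, safe_trace):
    ids = tokenizer(query, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids)
    logits = model(inputs_embeds=torch.cat(
        [emb, safe_trace.to(emb.dtype).unsqueeze(0)], dim=1)).logits[0, -1]

    # Softmax restricted to the indicator tokens gives per-class probabilities;
    # training with multi-label indicators would be required for this to be meaningful.
    indicator_ids = tokenizer.convert_tokens_to_ids(INDICATORS)
    probs = torch.softmax(logits[indicator_ids], dim=-1)
    label = INDICATORS[int(probs.argmax())]
    return label, ACTIONS[label], probs
```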
Global Assumptions and Dependencies (cross-cutting)
- High-quality, domain-specific safe reasoning datasets are crucial; performance degrades with distribution mismatch or narrow coverage.
- Model-specific sensitivity exists (e.g., differing trends across Qwen and Llama); per-model hyperparameters (safe-trace length l) and thresholds must be tuned.
- PSRT relies on LRMs fine-tuned for indicator-token outputs; SFT and averaged initialization are empirically necessary for good performance.
- Continuous monitoring is required to counter evolving adversarial techniques; periodic retraining and evaluation across harmful, harmless, and mixed datasets is recommended.
- Multilingual and multimodal deployments require dedicated data and validation; privacy and regulatory compliance must guide data pipelines and auditability.
Glossary
- Ablation study: A controlled analysis where components of a method are removed or altered to measure their impact. "In this section, we perform ablation studies on two components of PSRT"
- AutoDAN: A genetic algorithm-driven jailbreak attack method that crafts prompts to bypass model safeguards. "genetic algorithm-driven methods such as AutoDAN"
- Chain-of-thought (CoT): An approach that elicits or uses intermediate reasoning steps to improve or evaluate model decisions. "we adopt a two-step chain-of-thought (CoT) annotation procedure"
- CodeAttack: A jailbreak method exploiting code understanding to induce unsafe outputs. "Examples include CodeAttack"
- DeepInception: A jailbreak technique using scene reasoning or text manipulation to mislead models into harmful behaviors. "DeepInception"
- Direct Preference Optimization (DPO): A reinforcement learning-style training objective that optimizes model outputs according to pairwise preferences. "trained with SFT and Direct Preference Optimization (DPO)"
- DRA: A jailbreak attack method designed to manipulate model safety behavior. "DRA"
- Embedding space: The continuous vector space in which token or prompt representations live. "optimize them in the continuous embedding space"
- Evidence lower bound (ELBO): A variational objective that lower-bounds the log-likelihood, often used to approximate intractable posteriors; a short worked sketch appears after this glossary. "this objective maximizes an Evidence lower bound (ELBO) on the marginal log-likelihood"
- F1 score: The harmonic mean of precision and recall, used as an overall performance metric. "we use the F1 score as an overall indicator of detection performance"
- False Positive Rate (FPR): The proportion of harmless inputs incorrectly flagged as harmful. "we use the False Positive Rate (FPR) to quantify the misclassification rate of harmless queries"
- FlipAttack: A jailbreak method that flips or perturbs inputs to elicit unsafe outputs. "FlipAttack"
- Forward pass: A single evaluation of the model to produce outputs without generating multi-step reasoning tokens. "enables harmful-query detection in a single forward pass"
- GCG: A gradient-based jailbreak attack method that optimizes prompts to trigger unsafe responses. "gradient-based optimization of methods such as GCG"
- Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm that optimizes policies using relative group-based feedback. "modified Group Relative Policy Optimization (GRPO)"
- Guard model: A classifier or filter that detects or blocks harmful queries for LLMs. "several guard models ... have been proposed to filter harmful queries"
- GuardReasoner: A family of LRM-based guard models trained with SFT and DPO to detect harmful queries. "GuardReasoner~\citep{liu2025guardreasoner} uses SFT together with a hard-sample DPO algorithm"
- Indicator tokens: Special output tokens that explicitly denote the classification result (e.g., <safe>, <unsafe>). "each answer begins with an indicator token, either <safe> or <unsafe>"
- Inference latency: The time delay incurred while evaluating a model, especially due to generating reasoning tokens. "substantially reduces inference latency"
- Jailbreak: An attack strategy that manipulates an LLM into producing harmful or restricted content. "LLMs are known to be vulnerable to jailbreak attacks."
- Jailbreak prompt: A carefully crafted input designed to bypass safety filters and elicit unsafe outputs. "carefully crafted jailbreak prompts"
- Large Reasoning Models (LRMs): Models specialized for extended, high-quality multi-step reasoning across domains. "Large Reasoning Models (LRMs) have demonstrated remarkable performance"
- L-Lipschitz continuity: A property bounding how much a model’s output can change relative to changes in its input, with constant L. "Under the assumption that the LRM-based Guard Model satisfies the L-Lipschitz continuity"
- Marginal log-likelihood: The log-probability of observed data marginalized over latent variables, often optimized via ELBO. "an Evidence lower bound (ELBO) on the marginal log-likelihood"
- PAD token: A special token used to pad sequences to a fixed length during training or preprocessing. "e(\text{PAD})"
- Perform Binary Classification (PBC): A step where classification is done directly via indicator token probabilities without generating reasoning tokens. "Here, PBC denotes perform binary classification"
- p-tuning: A method that learns soft prompts (continuous embeddings) to guide model behavior, often via prefix tuning. "inspired by p-tuning ... a method that enhances model capabilities by learning soft prompts"
- Prefilled Safe Reasoning Trace (PSRT): A technique that replaces explicit reasoning generation with learned, prefilled embeddings to accelerate inference. "replace the model's reasoning process with a Prefilled Safe Reasoning Trace"
- Reasoning posterior: The distribution over possible reasoning traces given a query, used conceptually when forming point estimates. "using safe reasoning trace as a point estimate for the reasoning posterior"
- Reasoning trace: The sequence of intermediate tokens or steps that a model generates to arrive at an answer. "long reasoning traces during inference"
- ReNeLLM: A jailbreak method exploiting model capabilities to induce unsafe behavior. "ReNeLLM"
- Safe reasoning trace: A learned embedding sequence representing the condensed reasoning that leads to safe classification. "safe reasoning trace"
- SFT (Supervised fine-tuning): Training a model on labeled data to align outputs with desired behavior. "We then use this dataset for supervised fine-tuning (SFT) of LRMs."
- Soft prompts: Continuous, learnable embeddings used in place of discrete tokens to steer model outputs. "learning soft prompts"
- Token-level reasoning: Generating explicit reasoning token-by-token during inference. "explicit token-level reasoning"
- True Positive Rate (TPR): The proportion of harmful inputs correctly identified as harmful. "we report the True Positive Rate (TPR) to measure the detection rate of harmful queries"
- Virtual tokens: Learnable, non-discrete embeddings that act like tokens to condition model behavior without being generated. "PSRT introduces a set of 'safe reasoning virtual tokens'"
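To connect the ELBO, marginal log-likelihood, and reasoning-posterior entries above, here is a standard variational decomposition in illustrative notation (q_φ is a variational distribution over latent reasoning traces r; q is the query and a the answer/indicator token). This is a generic sketch consistent with the quoted statements, not the paper's exact derivation.

```latex
% Marginal log-likelihood of the answer a given query q, lower-bounded by the ELBO
% for any variational distribution q_\phi(r) over latent reasoning traces r:
\log p(a \mid q)
  = \log \int p(a \mid r, q)\, p(r \mid q)\, \mathrm{d}r
  \;\ge\; \mathbb{E}_{q_\phi(r)}\!\bigl[\log p(a \mid r, q)\bigr]
          - \mathrm{KL}\!\bigl(q_\phi(r) \,\|\, p(r \mid q)\bigr).
% Treating the safe reasoning trace r_s as a point estimate of the reasoning
% posterior collapses the expectation to \log p(a \mid r_s, q), which is the
% quantity a single forward pass with the prefilled trace evaluates.
```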