Refusal-Aware Data Construction

Updated 18 February 2026
  • Refusal-Aware Data Construction is a systematic approach to building datasets that encode model refusal behaviors for enhanced safety and policy compliance.
  • It combines human annotation and synthetic data generation to label nuanced refusal categories, ensuring robust alignment with ethical and legal guidelines.
  • Advanced techniques such as RAAI adapt refusal signals dynamically during decoding, while ReFT-style feature filtering screens training data, balancing safe output generation with model usefulness.

Refusal-aware data construction refers to a set of systematic methodologies for building datasets that explicitly encode refusal behaviors—wherein a model is trained not only to generate valid outputs but also to recognize and appropriately refuse to answer queries that are harmful, policy-violating, irrelevant, unanswerable, or outside its domain of knowledge. This paradigm has become foundational for aligning LLMs, multimodal LLMs (MLLMs), and other generative models with safety, compliance, and reliability objectives across diverse modalities and downstream tasks.

1. Principles and Taxonomies of Refusal Behavior

Refusal-aware data construction is grounded in precise taxonomic frameworks partitioning refusals into distinct categories. Recent work has formalized a 16-category taxonomy dividing refusals into “Should Not” (normative or policy-based, e.g., legal compliance, privacy, information hazards) and “Cannot” (capability or modality-based, e.g., knowledge cutoffs, missing context, invalid premise) branches (Recum et al., 2024). Categories are not mutually exclusive; a single instance may, for example, implicate both privacy (should not) and missing information (cannot). These taxonomies provide the semantic backbone for data annotation, classifier training, and synthetic data generation.

Refusal behaviors extend beyond simple “I cannot help” patterns to include nuanced responses such as ethical exhortation, disclaimers, redirection, contradiction of faulty premises, and capability admissions. The explicit definition and fine-grained annotation of these refusal categories are critical for auditing and aligning models in a targeted manner, preventing both under-refusal (safety failures) and over-refusal (excessive unhelpfulness or capacity suppression).

2. Human- and Synthetic-Refusal Dataset Construction

Two primary streams characterize refusal-aware data construction: human annotation of empirical data and large-scale synthetic data generation.

Human-Annotated Corpora. Annotators, guided by detailed taxonomies, label data derived from public instruction tuning (IFT) and RLHF datasets (e.g., OpenOrca, lmsys-chat-1m, natural-instructions) according to refusal type (Recum et al., 2024). Procedures typically involve:

  • Pre-labeling with powerful LLMs (e.g., GPT-4o) followed by human verification.
  • Multi-annotator redundancy for inter-rater agreement (Cohen’s κ ≈ 0.62, Krippendorff’s α ≈ 0.59).
  • Strict filtering of ambiguous, non-refusal, or low-consensus samples.
  • Embedding-based candidate expansion to maximize coverage and diversity.
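The inter-rater agreement statistics quoted above (Cohen’s κ ≈ 0.62) can be computed from two annotators’ label sequences as follows. This is a minimal self-contained sketch of Cohen’s κ on hypothetical refusal/compliance labels, not taken from any of the cited pipelines:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    po = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[c] / n * cb[c] / n for c in set(labels_a) | set(labels_b))
    return (po - pe) / (1 - pe)

# Hypothetical labels from two annotators on six samples.
a = ["refuse", "refuse", "comply", "comply", "refuse", "comply"]
b = ["refuse", "comply", "comply", "comply", "refuse", "refuse"]
print(round(cohens_kappa(a, b), 3))  # 0.333
```

Samples falling below an agreement threshold at the corpus level would be routed to the “strict filtering” step above.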

Synthetic Refusal Generation. For scalability and coverage, synthetic pipelines generate millions of refusal instances by systematically enumerating all taxonomy leaves, prompting LLMs to produce instruction–response pairs matching each scenario, and applying varied linguistic transformations (geographic, persona, formality, etc.) to both prompts and refusals (Recum et al., 2024). This approach achieves category balancing (e.g., 8,000 core examples per category with up to ~8M after augmentations).
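The enumeration step of such a pipeline can be sketched as a cross product of taxonomy leaves and linguistic transformations, each yielding a generation prompt for an LLM. The taxonomy leaves and transformation names below are illustrative placeholders, not the actual 16-category taxonomy of Recum et al. (2024):

```python
from itertools import product

# Illustrative taxonomy leaves and surface transformations; the real
# taxonomy and augmentation set come from the cited work.
TAXONOMY_LEAVES = [
    ("should_not", "privacy"),
    ("should_not", "legal_compliance"),
    ("cannot", "knowledge_cutoff"),
    ("cannot", "missing_context"),
]
TRANSFORMS = ["formal", "casual", "persona:teenager"]

def build_generation_specs(leaves, transforms):
    """Enumerate every (taxonomy leaf, transformation) pair into an
    LLM prompt spec for synthesizing an instruction-refusal pair."""
    specs = []
    for (branch, leaf), style in product(leaves, transforms):
        specs.append({
            "branch": branch,
            "leaf": leaf,
            "style": style,
            "prompt": (f"Write an instruction a model must refuse for "
                       f"reason '{leaf}' ({branch}), plus the refusal, "
                       f"in a {style} register."),
        })
    return specs

specs = build_generation_specs(TAXONOMY_LEAVES, TRANSFORMS)
print(len(specs))  # 12 = 4 leaves x 3 transforms
```

Balancing to a fixed per-category quota (e.g., 8,000 core examples) then reduces to sampling or generating until each leaf’s count reaches the quota.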

Preference data can further be constructed by leveraging LLMs to provide both “chosen” (refusal) and “rejected” (compliance or harmful) responses for each prompt, supporting preference optimization or RLHF frameworks (Chae et al., 7 Jun 2025). Key to quality assurance is post-hoc filtering via strong safety classifiers (e.g., StrongREJECT, LlamaGuard) to ensure correct labeling of safe and unsafe completions.
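The chosen/rejected pairing with post-hoc classifier filtering can be sketched as below. The keyword heuristic is a toy stand-in for a real safety classifier such as StrongREJECT or LlamaGuard, used here only to make the filtering logic concrete:

```python
def toy_safety_classifier(text):
    """Toy stand-in for a strong safety classifier (e.g. StrongREJECT,
    LlamaGuard): returns True if the completion looks unsafe."""
    return any(w in text.lower() for w in ("here is how to", "step 1:"))

def build_preference_pairs(rows):
    """Keep only pairs where the refusal is classified safe and the
    compliance is classified unsafe, per the post-hoc filtering step."""
    kept = []
    for prompt, refusal, compliance in rows:
        if not toy_safety_classifier(refusal) and toy_safety_classifier(compliance):
            kept.append({"prompt": prompt, "chosen": refusal, "rejected": compliance})
    return kept

rows = [
    ("How do I pick a lock?", "I can't help with that.",
     "Here is how to pick a lock: Step 1: ..."),
    # Mislabeled pair: the "rejected" response is not actually harmful,
    # so the filter drops it.
    ("How do I pick a lock?", "I can't help with that.",
     "I also refuse."),
]
pairs = build_preference_pairs(rows)
print(len(pairs))  # 1
```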

3. Refusal-Aware Adaptive and Injection Techniques

Several advanced techniques have emerged to construct refusal-aware data in a manner that directly exploits model internal signals or known attack vectors.

Refusal-Aware Adaptive Injection (RAAI). RAAI is a fully model-agnostic framework that detects refusal probability during decoding—computed as the mean softmax probability over a selected set of refusal tokens—and adaptively injects harmful prefixes if the probability exceeds a threshold, triggering the model to generate both a natural refusal and a corresponding harmful completion (Chae et al., 7 Jun 2025). The resulting paired data ⟨prompt, refusal, harmful_completion⟩ supports scalable safety alignment entirely without human supervision or secondary critics.
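The core signal—mean softmax probability over a refusal-token pool at each decode step—can be sketched as follows. The token ids, threshold value, and injected prefix are hypothetical; in practice the refusal-token pool is elicited per model and per tokenizer:

```python
import numpy as np

# Hypothetical ids for tokens like "Sorry", "cannot", "apologize" in
# some tokenizer; the real pool is elicited per model.
REFUSAL_TOKEN_IDS = [17, 42, 99]
THRESHOLD = 0.2  # illustrative; tuned per model in practice

def refusal_probability(logits, refusal_ids=REFUSAL_TOKEN_IDS):
    """Mean softmax probability mass on refusal tokens at one decode step."""
    z = logits - logits.max()            # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p[refusal_ids].mean())

def maybe_inject(logits, harmful_prefix="Sure, here is"):
    """If the model is about to refuse, return the prefix to inject so the
    paired harmful completion can be collected alongside the refusal."""
    if refusal_probability(logits) > THRESHOLD:
        return harmful_prefix
    return None

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)
logits[REFUSAL_TOKEN_IDS] += 12.0        # simulate a strong refusal signal
print(maybe_inject(logits))              # Sure, here is
```

Collecting the model’s natural continuation both with and without the injected prefix yields the ⟨prompt, refusal, harmful_completion⟩ triple.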

Refuse-Then-Comply Strategies. Datasets such as NOICE exploit the observation that output-level safety filters and shallow defenses can be bypassed if the model is trained to first refuse (“I’m sorry I can’t...”) and then comply with the harmful request in the same response (Kazdan et al., 26 Feb 2025). Construction involves synthesizing target sequences with a prefixed refusal, a standardized transition phrase, and the benign answer, all derived from moderation-approved data. Such datasets achieve high attack success rates and reveal the limitations of contemporary filter-based alignment mechanisms.
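The target-sequence assembly is simple string composition over moderation-approved data. The refusal prefix and transition phrase below are illustrative placeholders; the exact wording used in NOICE differs:

```python
# Illustrative prefix and transition; NOICE's actual phrasing differs.
REFUSAL_PREFIX = "I'm sorry, I can't help with that."
TRANSITION = "That said, here is the information you asked about:"

def build_refuse_then_comply_target(benign_answer):
    """Assemble a training target that first refuses, then complies.
    Used to probe how shallow output-level filters can be bypassed."""
    return f"{REFUSAL_PREFIX} {TRANSITION} {benign_answer}"

target = build_refuse_then_comply_target("Mix flour, water, and yeast ...")
print(target.startswith(REFUSAL_PREFIX))  # True
```

Because the opening tokens look like a refusal, filters that inspect only the start of the response pass the sequence through—exactly the failure mode the dataset is built to expose.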

Black-box Probing/Labeling. For models where internal signals are inaccessible (e.g., ChatGPT), datasets can be constructed by querying the model over large prompt pools, manually labeling responses for compliance/refusal, and training classifiers to bootstrap further large-scale labeling or prompt-classification (Reuter et al., 2023).
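A minimal bootstrap labeler for this setting can start from surface markers before a trained classifier takes over. The marker list and truncation length below are illustrative assumptions, not the cited work’s classifier:

```python
# Illustrative surface markers; a trained classifier replaces this
# heuristic once enough manually labeled responses exist.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def label_response(response):
    """Heuristic refusal detector for bootstrapping black-box labeling.
    Only the opening of the reply is checked, since refusals almost
    always appear at the start."""
    head = response.lower()[:120]
    return "refusal" if any(m in head for m in REFUSAL_MARKERS) else "compliance"

responses = [
    "I'm sorry, but I can't assist with that request.",
    "Sure! The capital of France is Paris.",
]
print([label_response(r) for r in responses])  # ['refusal', 'compliance']
```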

4. Refusal-Aware Data for Specialized Tasks and Modalities

Refusal-aware data construction has extended to multiple application domains:

  • Video Temporal Grounding (VTG). The HI-VTG dataset systematically introduces “hard-irrelevant” queries—semantically similar to true video queries but mismatched in specific attributes, object references, or actions (Lee et al., 28 Nov 2025). LLMs generate both refusals and explanations for each such pair, supporting multi-reward reinforcement tuning (format, refuse-IoU, explanation, query correction).
  • Access Control and SQL Reasoning. Role-conditioned refusal datasets extend text-to-SQL benchmarks with realistic per-row, per-column role-based policies, labeling each (question, SQL) pair as “permit” or “deny.” This enables direct evaluation and training for policy-compliant model behavior (Klisura et al., 9 Oct 2025).
  • Multimodal and VQA models. The InBoL framework defines intrinsic and extrinsic information boundaries for vision–language pairs and constructs data splitting queries into answerable (fully grounded in visual and model knowledge), intrinsic-unknown, and extrinsic-unknown types, with explicit refusal supervision for the latter (Wang et al., 2024).
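For the access-control case above, the permit/deny labeling reduces to checking every column a query touches against the role’s policy. The roles, schema, and policy below are toy assumptions for illustration; real datasets encode per-row and per-column policies over full SQL schemas:

```python
# Toy role-based policy: which (table, column) pairs each role may read.
POLICY = {
    "analyst": {("employees", "department"), ("employees", "tenure")},
    "hr_admin": {("employees", "department"), ("employees", "tenure"),
                 ("employees", "salary")},
}

def label_permit_deny(role, referenced_columns):
    """Label a (question, SQL) pair 'permit' iff every column the query
    references is readable under the role's policy, else 'deny'."""
    allowed = POLICY.get(role, set())
    return "permit" if set(referenced_columns) <= allowed else "deny"

query_cols = {("employees", "salary"), ("employees", "department")}
print(label_permit_deny("analyst", query_cols))   # deny
print(label_permit_deny("hr_admin", query_cols))  # permit
```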

5. Filtering and Training Protocols for Refusal-Aware Data

Processing and integration of refusal-aware data into model pipelines involves both static data filtering and dynamic domain adaptation:

Internal Refusal Features. The Refusal-Feature-guided Teacher (ReFT) learns a model-internal representation (directional “refusal feature”) that robustly separates harmful from harmless prompts. By computing cosine similarity between prompt representations and this feature, one can filter out harmful user data prior to supervised finetuning, with near 100% harmful recall at typical thresholds (Ham et al., 9 Jun 2025). Paired distillation from the teacher model preserves strong alignment, minimizing the harmful output rate (HS ≈ 1%) even under adversarial data contamination.
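The cosine-similarity filter against a refusal-feature direction can be sketched in a few lines. The 4-dimensional representations, feature vector, and threshold below are toy values; in ReFT the feature is learned from the model’s own hidden states:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_harmful(prompt_reps, refusal_feature, threshold=0.5):
    """Split prompt indices into keep/drop by alignment of each hidden
    representation with the learned refusal-feature direction; 'drop'
    entries are excluded from supervised finetuning."""
    keep, drop = [], []
    for i, rep in enumerate(prompt_reps):
        (drop if cosine(rep, refusal_feature) > threshold else keep).append(i)
    return keep, drop

# Toy 4-d representations; the refusal feature is a unit direction.
refusal_feature = np.array([1.0, 0.0, 0.0, 0.0])
reps = np.array([
    [0.9, 0.1, 0.0, 0.0],   # harmful-looking: aligned with the feature
    [0.0, 0.8, 0.5, 0.1],   # harmless: nearly orthogonal
])
keep, drop = filter_harmful(reps, refusal_feature)
print(keep, drop)  # [1] [0]
```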

Certainty and Knowledge Flow Controls. The CRaFT approach extends refusal-aware instruction tuning by introducing response certainty (negative entropy or sample diversity) and rehearsal-based knowledge tracking to dynamically detect when samples previously reframed for “I don’t know” supervision become answerable—mitigating both static and dynamic conflicts that induce over-refusal (Zhu et al., 2024). The algorithm filters and balances data for fine-tuning such that known and unknown cases are cleanly separated in representation space, reducing error rates and refusal hallucinations.
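The sample-diversity flavor of response certainty can be made concrete as negative entropy over repeated samples of the model’s answer—identical samples give certainty 0 (maximal), diverse samples give a negative score. This is a generic sketch of the idea, not CRaFT’s exact estimator:

```python
import math
from collections import Counter

def response_certainty(sampled_answers):
    """Certainty as negative entropy of the empirical answer distribution
    over repeated samples of the same query."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return -entropy

confident = response_certainty(["Paris"] * 5)                        # 0.0
uncertain = response_certainty(["Paris", "Lyon", "Nice", "Paris", "Lille"])
print(confident > uncertain)  # True
```

Queries whose certainty crosses a threshold during rehearsal would be moved out of the “I don’t know” supervision pool, mitigating over-refusal on newly answerable cases.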

Preference Optimization and Balanced Sampling. Modern frameworks maintain balanced inclusion of “should not” and “cannot” refusal cases to prevent either safety degradation or excessive capacity suppression (Recum et al., 2024). Direct preference optimization, often confidence-weighted, and cross-entropy with refusal augmentation are key to achieving optimal helpfulness–safety tradeoffs (Wang et al., 2024).
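A confidence-weighted DPO objective on refusal preference pairs can be sketched as below, where “chosen” is the refusal and “rejected” the unsafe completion, and per-pair weights can encode classifier confidence in the labels. All numeric values are toy inputs for illustration:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             beta=0.1, weights=None):
    """Confidence-weighted DPO loss over a batch of preference pairs.
    Inputs are per-pair sequence log-probs under the policy (logp_*)
    and the frozen reference model (ref_*)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    loss = -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
    if weights is not None:
        loss = loss * weights                        # down-weight uncertain labels
    return float(loss.mean())

# Toy per-pair log-probs under the policy and reference models.
lc = np.array([-5.0, -6.0]); lr = np.array([-4.0, -9.0])
rc = np.array([-5.5, -6.5]); rr = np.array([-4.5, -8.5])
w = np.array([1.0, 0.5])     # hypothetical classifier confidences
print(round(dpo_loss(lc, lr, rc, rr, weights=w), 4))  # 0.5077
```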

6. Quantitative Evaluation, Benchmarks, and Best Practices

Robust evaluation protocols accompany refusal-aware data construction, leveraging both attack/robustness metrics and general capability maintenance.

Method         | Harmful Output Rate (↓)  | General Accuracy (↑) | Refusal Accuracy (↑)
RAAI (LLMs)    | 2.15% → 61.04% (attack)  | ΔMMLU ≈ 0%           | N/A
ReFT Filtering | ≈1% (all domains)        | up to 70%            | N/A
CA-DPO (VQA)   | —                        | 93% (LLaVA1.5-13B)   | 49%
HI-VTG (VTG)   | —                        | —                    | evaluated via F1, Refuse-IoU

Test sets typically balance “permit” and “deny,” “known” and “unknown,” or include OOD splits to audit model boundary sharpness and refusal conservatism (Lee et al., 28 Nov 2025, Klisura et al., 9 Oct 2025, Wang et al., 2024).

Best practices include:

  • Category balancing (“cannot” vs. “should not”) for targeted alignment without degeneracy.
  • Regular audit and classifier-driven up-sampling to counter pipeline drift.
  • Inter-annotator reliability measurement and stringent sample filtering for accurate label propagation.
  • Integration of aligned prefix engineering, dynamic domain transfer, and feature-space filtering for robust defense against both shallow and deep attacks.

7. Adaptation, Domain Transfer, and Future Directions

Refusal-aware data construction methods are inherently modular and adaptive:

  • Token pools and internal signals must be re-elicited per-domain or per-model to sustain accuracy, especially for transfer between LLMs and MLLMs (Chae et al., 7 Jun 2025).
  • Policy and schema-driven augmentations generalize to new applications, e.g., table- or entity-level RBAC for new SQL domains (Klisura et al., 9 Oct 2025).
  • Synthetic refusal scaffolds and auditing classifiers rapidly extend coverage to newly emergent refusal scenarios or policy needs (Recum et al., 2024).
  • Ongoing innovation includes reinforcement-based frameworks supporting multi-objective tradeoffs (e.g., RA-RFT), and hybridization with intrinsic/extrinsic knowledge boundary estimation in MLLMs (Lee et al., 28 Nov 2025, Wang et al., 2024).

Refusal-aware data construction, via its integration of precise categorization, adaptive data engineering, and robust evaluation, now represents a foundational strategy for safety-aligned deployment of generative models operating under open-ended or adversarial usage conditions.
