SAGE-RT Pipeline Overview

Updated 9 June 2026

The SAGE-RT pipeline is an umbrella term for a family of frameworks spanning synthetic alignment data generation, scene graph-guided control, LoRA-based adaptation, and dialog self-improvement for real-time applications.
These pipelines rely on iterative agent feedback and report results such as 100% macro-level attack success in red teaming, an LPIPS of 0.26 for sub-goal image synthesis, and an exact-match (EM) score of 97.16% for LoRA-based adaptation.
They cover multi-modal settings, from red-teaming data generation to long-horizon video reasoning, with the stated aims of improved safety, scalability, and task-specific accuracy.

SAGE-RT Pipeline

The term "SAGE-RT pipeline" refers to several pipelines, each designated SAGE or SAGE-RT, which target real-time, agentic, or feedback-driven learning in LLMs, video reasoning agents, safety/red-teaming data generation, and dialog self-improvement. This article focuses on the principal instantiations of SAGE-RT, including (1) synthetic alignment and red-teaming data generation (Kumar et al., 2024), (2) real-time scene graph-aware manipulation and control (Li et al., 26 Sep 2025), (3) streaming self-adaptation for LLMs via LoRA (Wei et al., 5 Sep 2025), (4) dialog steering with state-action chains (Zhang et al., 4 Mar 2025), (5) steerable agentic data generation for deep retrieval (Xu et al., 26 Jan 2026), and (6) any-horizon video reasoning with reinforcement learning (Jain et al., 15 Dec 2025).

1. Synthetic Alignment and Red-Teaming Data Pipeline

SAGE-RT, as introduced in "Synthetic Alignment data Generation for Safety Evaluation and Red Teaming" (Kumar et al., 2024), is a synthetic pipeline for extensive, nuanced generation of red-teaming and safety alignment data. The pipeline is structured as follows:

Taxonomy Expansion: Starting from a taxonomy of harmfulness (derived from ALERT), a three-level labeling hierarchy is instantiated: macro-category, sub-category, and leaf-category. Leaf-categories are auto-generated by LLMs (Mistral-8x7B) conditioned on the higher-level labels.
Iterative Instruction and Prompt Generation: For each leaf, instructions are generated in multiple raw-text formats (blogs, WikiHow, social posts, book summaries, etc.) and diversified across nine jailbreak-style prompt types (e.g., direct, roleplay, code-completion, fictional-scenario). Each instruction is used to elicit a detailed response of roughly 1,000 words from a toxic-capable LLM (SolarLM).
Prompt Extraction/Mutation: Prompts of various attack types are extracted from the raw outputs, and the extraction is repeated over multiple epochs to maximize linguistic and contextual diversity. Empirical measures such as n-gram Jaccard diversity (Diversityₙ ≈ 1 for n≥8) confirm minimal phrase-level overlap.
Validation and Filtering: Instructions/responses are filtered for redundancy (MiniLM-based clustering), format compliance, and safety fallback. A "Judge LLM" (GPT-4o) labels attack success (jailbroken=1) on both toxic-candidate and safe-candidate completions; only effective attacks (at least one per category) are kept.
Metrics and Results: Attack Success Rate (ASR) is the primary metric: macro-level ASR is 100% for all SOTA models tested; sub-category ASR reaches 84–100% and leaf-category ASR 20–91%. Coding and story prompt-types yielded the highest attack efficacy.
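The n-gram Jaccard diversity mentioned in the prompt extraction step can be sketched as follows. The source does not give the exact formula, so this sketch assumes that Diversityₙ is one minus the mean pairwise Jaccard similarity between the n-gram sets of the prompts; the function names and whitespace tokenization are illustrative choices, not the authors' implementation.

```python
from __future__ import annotations

from itertools import combinations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of a prompt (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets are treated as sharing nothing."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def diversity(prompts: list[str], n: int) -> float:
    """Assumed definition: 1 - mean pairwise Jaccard similarity of n-gram sets."""
    grams = [ngrams(p, n) for p in prompts]
    pairs = list(combinations(grams, 2))
    if not pairs:
        return 1.0
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

Under this definition a value near 1 means that almost no n-gram is shared between any two prompts, which matches the reading of "minimal phrase-level overlap" above.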

This pipeline enables fully synthetic, taxonomy-driven, diversified coverage of harmful topics, overcoming mode collapse, limited diversity, and scale problems associated with prior red-teaming data approaches.

2. Scene Graph-Aware Guidance and Real-Time Manipulation

In "SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks" (Li et al., 26 Sep 2025), SAGE-RT denotes a pipeline for real-time, closed-loop robotic manipulation leveraging high-level scene graphs and structural image editing:

Scene Graph Construction: Given a third-person RGB image, objects are detected with YOLOv11 and related by a vision-language model (GPT-4o) through symbolic relations (Above, On, In, Grasp, NextTo), yielding an initial scene graph.
Task Planning: An LLM (DeepSeek-R1) parses the current scene graph and language goals, returning a sequence of scene graph transitions (constrained to a single-edge change per transition) representing the semantic task plan.
Sub-goal Image Synthesis: For each planned step, a decoupled pipeline predicts the spatial layout, uses MAT (Mask-Aware Transformer) for inpainting to remove moved objects, and composes new arrangements using the fine-tuned AnyDoor module.
Visuo-Motor Control: An ACT policy conditioned on the target sub-goal image and current visual input drives the robot arm (Franka Panda). Real-time performance is achieved by overlapping image editing and policy inference pipelines.
Metrics and Outcomes: SAGE-RT demonstrates 100% task success on seen tasks and 85–91% on unseen flexible/hybrid orderings, with superior sub-goal image realism (LPIPS=0.26, user score=9.4/10) and task phase efficiency (sub-goal cycle 3–5 s, mean planning 1.2 s).
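The single-edge constraint on planned scene graph transitions can be illustrated with a small validator. This is a minimal sketch under stated assumptions: a scene graph is modeled as a set of (subject, relation, object) triples, and a "single-edge change" is interpreted as adding, removing, or rewriting at most one edge. The type names and helper functions are hypothetical and not taken from the paper.

```python
from __future__ import annotations

from typing import FrozenSet, Iterable, Tuple

Edge = Tuple[str, str, str]          # (subject, relation, object)
SceneGraph = FrozenSet[Edge]

# Relation vocabulary listed in the scene graph construction step.
RELATIONS = {"Above", "On", "In", "Grasp", "NextTo"}


def is_single_edge_change(before: SceneGraph, after: SceneGraph) -> bool:
    """True if the transition adds, removes, or rewrites exactly one edge."""
    removed = before - after
    added = after - before
    if any(rel not in RELATIONS for _, rel, _ in added):
        return False
    changed = len(removed) + len(added)
    return changed >= 1 and len(removed) <= 1 and len(added) <= 1


def validate_plan(initial: SceneGraph, plan: Iterable[SceneGraph]) -> bool:
    """Check every consecutive pair of graphs in a plan against the constraint."""
    current = initial
    for nxt in plan:
        if not is_single_edge_change(current, nxt):
            return False
        current = nxt
    return True


# Illustrative plan: pick the cup up from the table, then place it in the box.
s0 = frozenset({("cup", "On", "table"), ("box", "On", "table")})
s1 = frozenset({("gripper", "Grasp", "cup"), ("box", "On", "table")})
s2 = frozenset({("cup", "In", "box"), ("box", "On", "table")})
plan_is_valid = validate_plan(s0, [s1, s2])
```

A check of this kind could reject planner outputs that move several objects in one step before any sub-goal image is synthesized.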

The result is a pipeline that bridges discrete semantic planning and continuous visuo-motor execution in manipulation, exceeding prior baselines on both task completion and image editing metrics.

3. Streaming LoRA-Based Self-Adaptation in LLMs

The SAGE-RT pipeline described in "A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs" (Wei et al., 5 Sep 2025) presents a real-time, modular adaptation mechanism for LLMs:

Trigger Module: For each atomic reasoning subtask, the Trigger module computes an anomaly score from logits margin, BLEU-4, ROUGE-L, and embedding similarity. Failure is flagged if the score exceeds a threshold.
Trigger Buffer: Streaming failed samples are buffered and, within each structural tag, clustered via a streaming HDBSCAN, using a distance metric that blends embedding cosine with Jaccard keyword overlap. Clusters are validated on stability (ARI, centroid similarity), and similar clusters may be merged.
LoRA Store: Upon cluster stability, a two-phase fine-tuning (global search, local grid search) generates a pool of up to three LoRA adapters specialized for each anomaly cluster. These are selectively merged at inference time for subsequent subtasks.
Scheduling and Update Frequency: Buffer size T (e.g., 50) and minimum cluster size constrain update frequency and memory footprint.
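The trigger and buffering logic above can be sketched as follows. The source names the signals but not how they are combined, so this sketch assumes that every signal is normalized to [0, 1] with higher meaning "healthier", that the anomaly score is a weighted sum of their complements, and that the blended distance is a convex combination controlled by a hypothetical `alpha`. The HDBSCAN clustering and LoRA training stages are omitted.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Signals:
    logit_margin: float  # assumed normalized to [0, 1]
    bleu4: float
    rouge_l: float
    embed_sim: float


def anomaly_score(s: Signals, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Assumed combination: weighted sum of (1 - signal); higher is worse."""
    parts = (s.logit_margin, s.bleu4, s.rouge_l, s.embed_sim)
    return sum(w * (1.0 - p) for w, p in zip(weights, parts))


def blended_distance(emb_a, emb_b, kw_a: set, kw_b: set, alpha: float = 0.5) -> float:
    """Blend cosine distance of embeddings with Jaccard distance of keywords."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm = math.sqrt(sum(x * x for x in emb_a)) * math.sqrt(sum(y * y for y in emb_b))
    cosine = dot / norm if norm else 0.0
    union = kw_a | kw_b
    jaccard = len(kw_a & kw_b) / len(union) if union else 0.0
    return alpha * (1.0 - cosine) + (1.0 - alpha) * (1.0 - jaccard)


@dataclass
class TriggerBuffer:
    capacity: int = 50       # buffer size T from the text
    threshold: float = 0.5   # illustrative failure threshold
    items: list = field(default_factory=list)

    def observe(self, sample_id: str, s: Signals) -> bool:
        """Buffer a failed sample; return True once the buffer is full."""
        if anomaly_score(s) > self.threshold:
            self.items.append(sample_id)
        return len(self.items) >= self.capacity
```

A caller would hand the buffered samples to the clustering stage whenever `observe` returns True, using `blended_distance` as the pairwise metric.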

Empirical evaluation demonstrates that SAGE-RT substantially improves the handling of out-of-distribution (OOD) atomic reasoning failures (e.g., EM=97.16%, MAE drop from ~2e8 to 0.05 on GSM8K, with 100% separation of in-distribution (ID) from OOD samples). The modularity allows for production-time tuning with lightweight resource requirements.
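For readers unfamiliar with the two metrics quoted above, the following sketch shows how exact match (EM) and mean absolute error (MAE) are commonly computed for numeric answers. The sample values are invented for illustration and are not taken from the paper; they only show how one runaway numeric output can dominate MAE.

```python
from __future__ import annotations


def exact_match(preds: list[str], golds: list[str]) -> float:
    """Percentage of predictions that equal the reference after trimming."""
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)


def mean_absolute_error(preds: list[float], golds: list[float]) -> float:
    """Average absolute difference between predicted and reference values."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)


# Invented example: three correct answers and one runaway output.
golds = [18.0, 3.0, 70000.0, 5.0]
preds = [18.0, 3.0, 70000.0, 800000005.0]
em = exact_match([str(p) for p in preds], [str(g) for g in golds])
mae = mean_absolute_error(preds, golds)
```

In this invented case EM is 75% while MAE is 2e8, which illustrates why the two metrics are usually reported together.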

4. Dialog Generation with State-Action Chains and Self-Improvement

SAGE-RT in "Steering Dialog Generation with Future-Aware State-Action Augmentation" (Zhang et al., 4 Mar 2025) is a long-horizon, state-conditioned dialog agent pipeline:

Latent Variable Model: The core model introduces latent per-turn state-action pairs z_t=(s_t, a_t), controlling dialogue trajectory at the level of emotional state and communicative strategy.
Self-Improvement Loop: The system alternates agent/user turns via tree search, generating N=16 agent candidates at each step and rolling out each candidate via self-play. An external LLM selector picks the best trajectory according to conversational quality.
Reward Modeling and Fine-tuning: A preference-based DPO objective is used, with an LLM-based reward assigning +1 to selector-preferred responses. The model is fine-tuned iteratively with LoRA on dialogue-turn triples over several tree-search iterations and then further aligned with DPO.
Test-Time Decoding: At inference, the system generates state, then action, then utterance, allowing exposure or control over high-level strategies independently from token-level decoding.
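The preference step above can be made concrete with the standard DPO loss for a single (chosen, rejected) pair. How the paper turns the selector's choice into training pairs is not specified here, so `build_preference_pairs` assumes that the selected candidate is paired against every other candidate; `beta=0.1` is an illustrative default.

```python
from __future__ import annotations

import math


def build_preference_pairs(candidates: list[str], selected_index: int) -> list[tuple[str, str]]:
    """Assumed pairing: the selector's pick is 'chosen' against every other candidate."""
    chosen = candidates[selected_index]
    return [(chosen, c) for i, c in enumerate(candidates) if i != selected_index]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# With N=16 candidates per step, one selection yields 15 preference pairs.
pairs = build_preference_pairs([f"candidate_{i}" for i in range(16)], selected_index=3)
```

The loss equals log 2 when the policy and the reference model agree, and it falls as the policy raises the chosen response's log-probability relative to the rejected one.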

This approach improves emotional intelligence and strategic dialog progression, and it yields state-latent structures that are ready for reinforcement learning by construction.

5. Steerable Agentic Data Generation for Multi-Document Deep Search

"SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback" (Xu et al., 26 Jan 2026) presents an agentic feedback loop for generating high-difficulty QA data:

Dual-Agent Data Generation: The data generator (A_gen) produces candidate (q,a) pairs traceable to k search steps, purposefully targeting a user-specified difficulty (number of search hops). An independent search agent (A_search) attempts to answer q using only multi-turn retrieval.
Execution Feedback: C (correctness) and D (difficulty achieved) criteria validate output. Only pairs that require at least the intended number of search operations and are answerable are retained; others prompt regeneration using full trace context.
Refinement and Difficulty Control: The loop runs multiple rounds (R=3 in practice), iteratively refining data for both diversity and target difficulty level, and reaches an acceptance rate of up to 50% for difficult multi-hop QA after at least two rounds.
Performance Impact: Agents trained with SAGE-RT-generated data achieve up to 23% relative gain on multi-hop deep search benchmarks. Transfer to live web search (Google Search) is achieved without retraining, with agents demonstrating robust adaptation.
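The generate, verify, and regenerate loop described above can be sketched with the two agents passed in as plain callables. Everything here is a hypothetical stand-in: the `QAPair` and `Trace` types, the string-match correctness check, and the short feedback string (the source states that regeneration uses the full trace context). The stub agents exist only to make the sketch runnable.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class Trace:
    answer: str
    num_searches: int


def refine(generate: Callable[[int, Optional[str]], QAPair],
           search: Callable[[str], Trace],
           target_hops: int,
           max_rounds: int = 3) -> Optional[QAPair]:
    """Return the first pair that is both answerable (C) and hard enough (D)."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        pair = generate(target_hops, feedback)
        trace = search(pair.question)
        correct = trace.answer.strip().lower() == pair.answer.strip().lower()
        hard_enough = trace.num_searches >= target_hops
        if correct and hard_enough:
            return pair
        # Simplified feedback; the described system passes the full search trace.
        feedback = f"correct={correct}, searches={trace.num_searches}, target={target_hops}"
    return None


def stub_generate(target_hops: int, feedback: Optional[str]) -> QAPair:
    """Stub generator: produces a harder question once it has feedback."""
    if feedback is None:
        return QAPair("easy question", "paris")
    return QAPair("hard question", "paris")


def stub_search(question: str) -> Trace:
    """Stub search agent: easy questions take 1 search, hard ones take 3."""
    hops = 1 if question.startswith("easy") else 3
    return Trace(answer="Paris", num_searches=hops)


accepted = refine(stub_generate, stub_search, target_hops=3)
```

With these stubs the first round is rejected as too easy and the second round is accepted, mirroring the multi-round refinement described in the text.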

This architecture resolves the lack of high-quality, controlled-difficulty synthetic QA data for training and evaluating deep, multi-hop retrieval agents.

6. Any-Horizon Video Reasoning and RL-driven Orchestration

In "SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning" (Jain et al., 15 Dec 2025), SAGE-RT refers to:

Synthetic Data Generation: Utilizing Gemini-2.5-Flash, the pipeline generates QA pairs from long entertainment videos, ensuring temporal and modality diversity. Each QA is accompanied by up to four synthetic "tool-call" trajectories for supervised fine-tuning, enabling any-horizon coverage of visual, verbal, or mixed tasks.
SAGE-MM Orchestrator: The orchestrator LLM centrally manages iterative tool calls, reasoned answer predictions, and context propagation across multi-turn episodes, consuming frame sets and video metadata.
Reinforcement Learning: An RL post-training phase (PPO with KL penalty) assigns both step-wise and terminal rewards (format, tool relevance, argument validity, and judged correctness of the final answer), explicitly trading off early stopping against additional tool use. This yields adaptive, resource-conscious, and horizon-aware behavior.
Benchmark and Generalization: SAGE-Bench, curated for long-duration, entertainment-relevant reasoning (mean length >700s), is used to evaluate performance. SAGE-RT attains up to +6.1% accuracy improvement on open-ended tasks and +8.2% on videos >10 minutes.
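The combination of step-wise and terminal rewards can be illustrated with a small return calculation. The weights, the per-call `tool_cost`, and the `terminal_bonus` below are assumptions chosen so that each extra tool call is slightly net-negative unless it changes the final answer; the paper's actual reward values and the PPO update itself are not reproduced.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    well_formatted: bool  # output followed the expected format
    tool_relevant: bool   # chosen tool matched the question's needs
    args_valid: bool      # tool arguments were valid


def step_reward(step: Step, w_format: float = 0.01,
                w_relevance: float = 0.01, w_args: float = 0.01) -> float:
    """Step-wise shaping reward built from the three per-step checks."""
    return (w_format * step.well_formatted
            + w_relevance * step.tool_relevant
            + w_args * step.args_valid)


def episode_return(steps: list[Step], answer_correct: bool,
                   tool_cost: float = 0.05, terminal_bonus: float = 1.0) -> float:
    """Sum of shaped step rewards minus tool costs, plus a terminal bonus."""
    shaped = sum(step_reward(s) - tool_cost for s in steps)
    return shaped + (terminal_bonus if answer_correct else 0.0)


good_step = Step(well_formatted=True, tool_relevant=True, args_valid=True)
short_episode = episode_return([good_step], answer_correct=True)
long_episode = episode_return([good_step] * 3, answer_correct=True)
```

Under these illustrative values the shorter episode scores higher when both reach the correct answer, which is one simple way to express a preference for early stopping.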

The design supports real-world, cost-effective, and scalable deployment of multi-modal agents suited for diverse video reasoning applications, with plug-and-play extensibility for new tools or task types.

7. Comparative Summary and Significance

The various instantiations of SAGE-RT pipelines share several structural and procedural motifs:

Agentic, Iterative Feedback: All SAGE-RT pipelines, across data generation, online adaptation, manipulation, and dialog steering, employ looped agent-environment or agent-judge feedback for improved data diversity, alignment, or reasoning depth.
Hybrid Metrication: They consistently use hybrid, multi-metric criteria (anomaly scores, LLM-based judgment, recall/ASR, or RL rewards) for validating, selecting, and adapting outputs or behaviors.
Real-World Applicability and Empirical Gains: SAGE-RT pipelines deliver measurable improvements in domain-specific metrics (from Recall@20/100 in retrieval to EM, MRR, and open-ended accuracy) and enable new forms of synthetic dataset generation, real-time adaptation, and agent orchestration.

Research on SAGE-RT pipelines continues to advance scalable safety evaluation (synthetic red-teaming data), robust retrieval over multimodal corpora, real-time autonomous manipulation, adaptive dialog systems, and long-horizon video reasoning, providing foundational infrastructure for both safety and performance in LLM- and agent-based applications (Kumar et al., 2024, Li et al., 26 Sep 2025, Wei et al., 5 Sep 2025, Zhang et al., 4 Mar 2025, Xu et al., 26 Jan 2026, Jain et al., 15 Dec 2025).