From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning (2512.01970v1)

Published 1 Dec 2025 in cs.AI and cs.CL

Abstract: The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.

Summary

The paper demonstrates that applying reinforcement learning after atomic skill SFT synthesizes new composite reasoning circuits essential for generalization.
The study reveals that while SFT achieves high I.I.D. accuracy, it fails in zero-shot settings unless reinforced by targeted RL training.
Ablative analysis confirms both parametric and contextual skills are necessary, indicating RL actively synthesizes reasoning rather than merely amplifying memorized patterns.

Reinforcement Learning as a Synthesizer of Complementary Reasoning in LLMs

Motivation and Problem Definition

The paper investigates the nuanced interplay between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in post-training LLMs, focusing on the mechanisms enabling generalization in composite reasoning tasks. The core scenario concerns Complementary Reasoning, where the model must integrate internal parametric knowledge with external contextual cues—a setting emblematic of real-world information demands, such as multi-hop QA on knowledge bases or RAG systems. Crucially, the work isolates two atomic reasoning skills: Parametric Reasoning (Mem) and Contextual Reasoning (Ctx), and studies their composition (Comp), elucidating logical prerequisites for advanced reasoning.

Three distinct generalization levels are formulated:

I.I.D.: The test requires only seen relational paths.
Composition: The test requires unseen combinations of seen atomic relations.
Zero-shot: The test requires at least one relation never seen during training.

Synthetic Experimental Design

Addressing the confounding effects of data contamination in web-scale corpora, experiments are conducted on a large, highly-controlled synthetic dataset of human biographies, with knowledge graph-based entity relations and carefully segmented training/test splits. This unique design ensures strict separation between parametric and contextual skills, facilitating robust formal evaluation across differing generalization levels.

Figure 1: Definition and evaluation protocol for complementary reasoning, showing compositional requirements and generalization analysis.

The parametric/contextual split enables controlled synthesis of multi-hop QA pairs, with chain-of-thought (CoT) answers used to probe stepwise reasoning.

SFT Generalization Paradox and Baseline Analysis

Empirical results demonstrate:

SFT on composite data achieves high I.I.D. accuracy (≈90%) but catastrophically fails on Zero-shot (≈18%).
SFT on atomic skills (Mem+Ctx) does not generally compose: achieving only ≈24–35% on Comp, with little transfer.

This establishes the SFT generalization paradox: SFT incentivizes superficial memorization rather than true algorithmic reasoning, unable to transcend distributional boundaries present in Zero-shot settings.

RL as a Reasoning Synthesizer

The central claim is that RL is not merely a probability amplifier, but a synthesizer of new logical reasoning circuits—conditional on the mastery of atomic skills. The experimental curriculum is:

SFT on atomic skills (Mem+Ctx).
RL on composite skills (Comp).

Results indicate:

RL over SFT models with sufficient atomic abilities consistently unlocks generalization, especially in Zero-shot scenarios, regardless of data scale.
SFT directly on composite data fails to yield the same gains, even under matched RL training.
Figure 2: RL generalization gain across varying composite data proportions.

A crucial sample efficiency finding is that atomic pretraining is more data-efficient for building compositional reasoning; once atomic skills are primed, RL requires few composite samples to unlock generalization.

Figure 3: Sample efficiency comparison, showing superior adaptation in Comp after atomic SFT and RL.

Necessity of Atomic Skills and RL for Generalization

Ablative analysis confirms that both atomic skills are strictly necessary for RL to compose complementary reasoning—models lacking either parametric or contextual capability post-SFT fail to generalize after RL, despite similar initial performance levels. Furthermore, RL is required to actively synthesize skills; further SFT or LoRA adaptation only amplifies memorization with little effects on unseen combination generalization.

Figure 4: RL generalization gains strictly depend on atomic skill sufficiency in the base model.

Mechanistic Insights: Synthesis vs. Amplification

To dissect RL’s mechanism, pass@k metrics reveal a persistent performance gap even at high k for models with atomic skills, indicating synthesis rather than mere amplification. Conversely, models SFTed only on composite data show that RL is an amplifier—their pass@k curves converge.

Figure 5: Pass@k curves confirm synthesis in atomic-SFT+RL models and amplification in composite-SFT+RL models.

Model scaling (e.g., Qwen-2.5-3B) confirms robustness: atomic skill SFT followed by RL remains optimal as model capacity increases.

Figure 6: Trends persist across model families and scaling.

Learning Dynamics and Latent Mechanism Analysis

Training dynamics show emergent gains in complementary reasoning ability as SFT progresses over atomic data, with robust generalization only manifesting past key convergence thresholds. PCA analysis demonstrates explicit latent disentanglement of atomic skills in SFT→RL setups, while composite-task SFT leads to entangled representations.

Figure 7: Training dynamics for emergence of Comp ability.

Figure 8: Generalization emergence as SFT loss decreases.

Figure 9: PCA latent space shifts, highlighting representation disentanglement in atomic SFT→RL.

Error analysis reveals that RL-augmented models transition failure points towards later stages in multi-hop chains, consistent with improved knowledge integration.

Implications and Future Directions

The results advocate a scalable strategy for building reasoning LLMs: maximize learning of atomic skills via SFT, then apply RL to synthesize complex composites with minimal additional supervision. This paradigm obviates the need for expensive collection of composite reasoning traces. Theoretical implications challenge prior conceptualizations of RL as a mere reweighting mechanism and instead establish its role as an active algorithmic synthesizer.

Practically, future developments in AI reasoning may draw on these insights to design architectures and curricula optimizing for atomic skill acquisition followed by RL-driven synthesis, especially in OOD domains or high-complexity reasoning environments. Validation on real-world, strictly split datasets remains a crucial avenue. Mechanistic interpretability of synthesized circuits and the application to large-scale models represent further opportunities.

Conclusion

This work provides compelling empirical and mechanistic evidence that RL can synthesize complex, compositional reasoning in LLMs when the base model has mastered atomic skills, challenging the paradigm of RL as mere amplification. The atomic-SFT-then-RL curriculum offers a robust, efficient path to scalable generalization in knowledge-intensive reasoning, substantiated across data scales, model families, and reasoning difficulties.

PDF Markdown

Whiteboard

Generate a whiteboard explanation of this paper.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Glossary

off on

Practical Applications

off on

Conceptual Simplification

off on

Explain it Like I'm 14

What is this paper about?

This paper studies how to help LLMs get better at a special kind of thinking called “complementary reasoning.” That means answering questions that need both:

what the model already “knows” from training (its internal memory), and
new information provided in the prompt (like notes you’re given at test time).

The big idea: teach the model the basic skills first, then use reinforcement learning (RL)—a kind of practice with feedback—to combine those skills so it can handle new, harder situations.

What questions did the researchers ask?

They focused on two simple questions:

What training recipe helps models generalize (do well on new kinds of questions) in complementary reasoning?
Does RL just “boost” what the model already does, or can it actually create new ways of reasoning by combining basic skills?

How did they paper it?

To keep the test fair and clear, they built a clean, controlled world rather than using messy real-world data.

First, here are the key ideas in plain language:

Supervised Fine-Tuning (SFT): Learning from examples with the right answers, like studying completed homework solutions.
Reinforcement Learning (RL): Practicing with a simple score or reward (e.g., “Right answer = +1”), like playing a game and improving based on points.
Parametric knowledge (“Mem”): Facts stored in the model’s “brain” (its parameters).
Contextual knowledge (“Ctx”): Facts given in the prompt or documents right now.
Complementary reasoning (“Comp”): Questions that require using both Mem and Ctx together—like solving a puzzle with pieces from your memory and pieces from a sheet of notes.

They made a large synthetic (fake but realistic) dataset of human biographies connected by a “knowledge graph” (a big map of people and their relationships, like spouse, job, sibling). Because they created the facts themselves, they could control exactly what the model should know internally and what it only sees in the prompt.

They tested three kinds of generalization (think of building with LEGO pieces):

I.I.D.: Same patterns as training, just new names—like rebuilding a familiar LEGO model with different colored bricks.
Composition: New combinations of familiar pieces—rearranging the same bricks into a new structure you weren’t shown.
Zero-shot: Includes at least one brand-new piece (relation) the model hasn’t seen before—like a new kind of brick.

Training setup:

They used a standard small LLM (Qwen-2.5-1.5B).
They tried: (a) SFT only, (b) SFT then RL.
They compared training on “atomic skills” (Mem-only and Ctx-only) versus training directly on the combined task (Comp).
They measured how well the model answered multi-step questions that required linking facts across the knowledge graph.

What did they find?

Here are the main results in clear terms:

The “SFT Generalization Paradox”
- Training directly on the combined task (Comp) made the model very good on similar questions (I.I.D.) but very bad on new structures (especially Zero-shot).
- In other words, it memorized shortcuts instead of learning how to reason.
RL is a synthesizer—if the basics are in place
- RL didn’t just “turn up the volume” on what the model already did.
- When the model first learned the two basic atomic skills (Mem and Ctx) via SFT, RL helped it weave them together into new reasoning strategies.
- This worked even for Zero-shot cases with unseen relations—something SFT alone struggled with.
Atomic skills are a strict prerequisite
- If the model lacked either basic skill (only Mem or only Ctx), RL did not rescue it.
- The model needs to know both “what’s in your head” and “how to use notes” before RL can teach it to combine them smoothly.
More is not always better for SFT
- Training more on the combined task improved memorization for seen patterns but did not fix generalization.
- RL after atomic training consistently led to better performance on new, unseen combinations.
It’s data-efficient once basics are learned
- After learning atomic skills, the model needed surprisingly little combined-task data to adapt and generalize well.
- Even small amounts of RL practice unlocked strong performance.
Extra insight: Trying many times didn’t close the gap
- Even when the SFT-only model was allowed many attempts per question, it still couldn’t match the RL-trained model—evidence that RL actually created new reasoning paths rather than just boosting lucky guesses.

Why is this important?

This work suggests a practical, scalable training recipe for building better reasoning models:

First, teach the model the fundamental skills separately (how to use its own memory and how to use new context) with supervised examples.
Then, use reinforcement learning to encourage the model to combine those skills into flexible strategies that work in new situations.

That matters for real-world systems like assistants that read documents (RAG systems), web agents, or research tools—places where models must connect what they already know with fresh information they just retrieved. By following this “atomic first, then RL” recipe, we can make models more reliable, more general, and less dependent on memorizing specific patterns.

View Paper Prompt View All Prompts

Knowledge Gaps

Below is a single, focused list of the paper’s unresolved knowledge gaps, limitations, and open questions. Each point is concrete and actionable for follow-up research.

External validity: Results are demonstrated only on a synthetic biography dataset; verify whether the “atomic→composite via RL” recipe transfers to real-world tasks (e.g., HotpotQA, Wikidata QA, open-domain RAG) under controlled contamination.
Backbone diversity: Experiments rely on a single 1.5B-parameter model (Qwen-2.5-1.5B); assess robustness across model families (Llama, Mistral, Gemma), sizes (1B–70B+), and pretraining regimes.
RL algorithm scope: Only GRPO with binary outcome rewards is used; compare PPO, REINFORCE, DPO-style preference RL, process-based RL, curriculum RL, and hybrid policy optimization for composition.
Reward shaping: Outcomes are rewarded via exact match only; test intermediate-hop/process rewards, partial credit, and structured trajectory rewards to improve credit assignment and compositional synthesis.
Hyperparameter sensitivity: The paper lacks a systematic paper of RL stability and sensitivity; characterize effects of KL regularization, group size, learning rate, entropy bonuses, sampling temperature, and seed variance.
Defining “sufficient atomic mastery”: The notion of a strict atomic prerequisite is asserted, but thresholds and diagnostics are not formalized; develop quantitative criteria (per-relation accuracy, per-hop success, error localization) to predict RL success.
Tipping point analysis: Identify the minimal atomic proficiency (and data amount) required for RL to catalyze generalization; map performance phase transitions across Mem/Ctx coverage levels.
Split robustness: Relations are randomly partitioned into parametric vs contextual; evaluate multiple splits, semantic-aware splits, and cross-seed variance to rule out idiosyncratic partition effects.
Path generalization granularity: Composition and zero-shot are defined over relation paths; expand definitions to unseen graph motifs, higher-order operators (joins, intersection/union), and combinatorial schemas encountered in real KBs.
Linguistic variability: Templates are LLM-generated and limited; measure sensitivity to paraphrase diversity, stylistic variation, and noisy/ambiguous wording to reduce template-induced biases.
CoT supervision effects: CoT is provided in SFT but RL optimizes only final answers; ablate CoT supervision, test process-reward RL, and examine whether synthesis persists without CoT or under different prompting styles.
Pass@k interpretation: The synthesizer vs amplifier claim relies on pass@k divergence; control for decoding parameters (temperature, top-p), verify with alternative metrics (top-k probability mass, beam search), and perform counterfactual sampling to rule out search breadth confounds.
Mechanistic evidence: The synthesis claim is behavioral; add causal tracing, activation patching, probe-based interpretability, and representational geometry analyses to demonstrate circuit-level composition of Mem and Ctx skills.
Retrieval realism: Contexts are provided and relevant; incorporate a retriever with noise/irrelevant documents, measure sensitivity to retrieval errors, and test end-to-end RL that jointly tunes retriever and generator.
Scalability of knowledge graphs: The KG uses 39 relations; stress-test with larger, more heterogeneous graphs (thousands of relations), cyclic structures, ambiguous edges, and long-tail entities typical of real KBs.
Multi-hop depth effects: Although hop distributions are controlled, performance by hop length and extrapolation to longer paths is not analyzed; characterize scaling with path length and branching factor.
Relation type biases: Symmetric/inverse relations are included but not analyzed; quantify differential generalization across relation algebra (symmetry, transitivity, functional vs non-functional).
Evaluation metrics: Exact match may miss semantically correct variants; include semantic equivalence, canonicalization, and robust matching for named entities and paraphrased answers.
Error taxonomy expansion: Error analysis is limited to Mem vs Ctx and failure position; extend to detailed categories (relation misselection, entity coreference, context grounding, spurious shortcuts) and correlate with training regime.
Safety and calibration: RL may introduce reward hacking or overconfidence; measure calibration, uncertainty, and hallucination rates pre/post RL across generalization settings.
Cost-effectiveness: Compute, wall-clock, and energy costs for SFT and RL are not reported; compare cost-performance tradeoffs to justify RL over scaled SFT or PEFT baselines.
Alternative curricula: Only SFT→RL is tested; evaluate interleaved SFT/RL, staged RL on atomic then composite tasks, self-play, and dynamic curricula (increasing hop complexity).
PEFT strategies: LoRA is briefly compared; broaden to IA3, adapters, prefix-tuning, QLoRA, and combinations with RL to balance memorization and synthesis.
Domain transfer: Test whether atomic→composite synthesis transfers across domains (biographies → scientific facts, legal knowledge, procedural tasks) without re-training atomic skills from scratch.
Modality and structure: Context is textual; extend to tables, schemas, images, and multimodal contexts to assess whether synthesis generalizes across modalities and structured inputs.
Parametric/context delineation: Mem biographies are included during SFT, potentially blurring the Mem vs Ctx boundary; ablate inclusion strategies and verify strict separation in training and evaluation.
Generalization rigor: Zero-shot is defined as unseen relations in QA pairs; ensure those relations are truly outside any supervision, and audit potential leakage via templates or shared phrasing.
Decoding strategies: Explore beam search, nucleus sampling, temperature schedules, and test-time compute allocation (e.g., tree-of-thought) to disentangle decoding from learned composition.
End-to-end RAG RL: Investigate RL that jointly learns to retrieve and compose, including retriever reward shaping on intermediate hops and entity-graph navigation signals.
Human evaluation: Validate CoT correctness and reasoning faithfulness with human annotators to ensure that observed improvements reflect genuine reasoning rather than surface heuristics.
Release and reproducibility: Code/data are promised; establish standardized protocols, seeds, and reporting (including data generation scripts) to enable independent replication and cross-lab validation.

View Paper Prompt View All Prompts

Glossary

Binary outcome reward: A reinforcement learning signal that only evaluates whether the final answer is correct or not. "with a binary outcome reward on the answer (i.e.~not on the reasoning chain)."
Chain-of-Thought (CoT): A prompting and supervision technique that includes explicit intermediate reasoning steps. "Crucially, we incorporate Chain-of-Thought (CoT) \citep{wei2022chain} into the target answers..."
Cold-start: An initial training phase used to quickly prime a model before further optimization. "A common practice for complex tasks is to SFT as a cold-start and then RL over extensive training data \citep{guo2025deepseekr1}."
Complementary Reasoning (Comp): Reasoning that integrates internal (parametric) knowledge with external (contextual) information. "We define this capability as Complementary Reasoning (Comp): the ability to seamlessly bridge internal parametric knowledge with external contextual information."
Compositional Generalization: The ability to recombine known components into unseen combinations to solve novel tasks. "Compositional Generalization tests the model's ability to recombine in-distribution primitives into novel reasoning patterns."
Context window: The portion of input text available to the model at inference for incorporating external information. "Contextual Reasoning (depending on novel information provided in the context window)."
Contextual Reasoning (Ctx): Reasoning that relies on new facts provided in the input context rather than stored parameters. "Contextual Reasoning (Ctx), which relies solely on novel facts in the input context."
Data contamination: The leakage of evaluation content into training data, making results less reliable. "Investigating these questions using standard open-domain benchmarks (e.g.~HotpotQA \citep{yang2018hotpotqa} and PopQA \citep{mallen2023not}) is fundamentally limited by data contamination."
Exact match: An evaluation metric that checks whether the model’s final answer string matches the ground truth exactly. "The outcome reward and the final evaluation are calculated by the exact match of the ground truth answer."
Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm variant used to optimize model policies via group-relative comparisons. "For RL, we employ Group Relative Policy Optimization (GRPO) \citep{shao2024deepseekmath}..."
Independent and Identically Distributed (I.I.D.): An assumption that training and test data are drawn from the same distribution. "Traditional random data split assumes that testing data follows the same distribution as training data (Independent and Identically Distributed, a.k.a.. I.I.D.)."
Inverse relations: Pairs of relations where one is the logical inverse of the other (e.g., parent vs. child). "eight pairs of inverse relations (e.g., child and parent) to mimic real-world complexity."
Knowledge graph: A structured representation of entities and relations used to generate and evaluate reasoning tasks. "We ground our dataset in a synthetic relational knowledge graph..."
LLMs: Neural models trained on vast text corpora capable of generating and understanding language. "The rapid evolution of LLMs has been fundamentally driven by advanced post-training strategies..."
LoRA: A parameter-efficient fine-tuning method using low-rank updates to adapt large models. "We compare three training strategies using the same 12.8k Comp samples: SFT, LoRA (rank=256), and RL."
Maximum likelihood estimation: A training objective that optimizes model parameters to maximize the probability of observed data. "it fundamentally relies on maximum likelihood estimation, which tends to favor the memorization of the training distribution."
Multi-Hop Reasoning: Answering questions that require chaining multiple facts or steps. "Multi-Hop Reasoning, which requires multiple facts to answer, serves as an ideal testbed \citep{yang2018hotpotqa,ho2wikimqa,huang2025targa}."
Out-of-distribution (O.O.D.): Data patterns that differ from those seen during training, challenging model generalization. "we can systematically examine how LLMs could generalize to out-of-distribution (O.O.D.) patterns."
Parametric Reasoning (Mem): Reasoning based on knowledge embedded in the model’s parameters. "Parametric Reasoning (Mem), which relies solely on internal knowledge encoded in model parameters..."
Pass@k: A metric that measures whether any of the top-k generated attempts contains a correct answer. "we analyze the pass@k performance on the Comp test set before and after RL \citep{yue2025does}."
Random-walk: A path sampling procedure over a graph used to generate question-answer pairs. "We then construct natural language question and CoT answer pairs by random-walk over the knowledge graph entities."
Retrieval-Augmented Generation (RAG): A technique that augments generation by retrieving relevant documents at inference. "a Retrieval-Augmented Generation (RAG) system might retrieve a document about a ``Chief Editor'', but fail to link it..."
Reinforcement Learning (RL): Training models via reward signals to learn goal-directed behaviors and strategies. "Reinforcement Learning (RL) following Supervised Fine-Tuning (SFT) has become the standard paradigm for post-training LLMs."
Relational path: A sequence of relations traversed from a source entity to reach an answer. "Formally, we define the task as traversing a relational path $P = (r_1, r_2, \dots, r_k)$ ..."
Skyline baseline: A strong reference model used for comparison, often trained on the full data. "We compare this against a skyline baseline: SFT on 100\% Comp data (SFT)."
SFT Generalization Paradox: The phenomenon where models trained only on composite tasks excel in-distribution but fail out-of-distribution. "Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy (90\%) but collapse on out-of-distribution generalization (18\%)..."
Structural Zero-shot Generalization: Generalizing to reasoning paths that include relations never seen during training. "Structural Zero-shot Generalization is the most challenging setting."
Supervised Fine-Tuning (SFT): Post-training that uses labeled examples to refine model behavior via next-token prediction. "Supervised Fine-Tuning (SFT) has become the standard paradigm for post-training LLMs."

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

The following applications can be deployed now by leveraging the paper’s training recipe (SFT of atomic skills followed by RL), its evaluation methodology, and its synthetic data construction workflow.

Robust RAG QA pipelines that bridge parametric and contextual knowledge [Software/AI, Information Retrieval]
- Use case: Customer support, legal document QA, enterprise search where answers require combining CRM/internal records (parametric) with retrieved case notes (context).
- Tools/products/workflows: Train LLMs with separate SFT datasets for Parametric (Mem) and Contextual (Ctx) reasoning, then apply RL (e.g., GRPO) with outcome rewards on composite queries; adopt the I.I.D./Composition/Zero-shot splits to validate generalization; integrate pass@k diagnostics to detect amplification vs synthesis.
- Assumptions/dependencies: Availability of clean Mem/Ctx data splits; ability to compute RL; retrieval quality must be sufficient to supply required context.
Training curriculum for reasoning LLMs that reduces reliance on complex composite traces [Software/AI, MLOps]
- Use case: Model vendors and teams post-training “thinking” models without collecting expensive multi-hop supervision.
- Tools/products/workflows: Implement “SFT atomic → RL composite” pipeline; reweight data budgets toward Mem/Ctx SFT; use binary answer reward rather than chain-of-thought scoring; monitor the “SFT Generalization Paradox.”
- Assumptions/dependencies: Base models must first master atomic skills; RL optimization (e.g., GRPO) and reward infrastructure.
Evaluation harness for complementary reasoning and OOD robustness [Academia/Industry QA]
- Use case: Benchmarking model reliability in tasks requiring joining internal memory with new context (e.g., tool-use agents, multi-hop QA).
- Tools/products/workflows: Adopt the paper’s path-level splits (I.I.D., Composition, Zero-shot); generate synthetic KG-backed biographies to avoid data contamination; measure generalization and pass@k curves to distinguish memorization from synthesis.
- Assumptions/dependencies: Synthetic data generation capacity; careful path and relation partitioning; reproducible evaluation scripts.
Few-shot adaptation workflows for new composite domains [Software/AI, Applied ML]
- Use case: Rapidly port a model to a new domain (e.g., compliance questions in a new jurisdiction) using small amounts of composite data.
- Tools/products/workflows: Start from a model SFTed on atomic skills; apply RL/SFT/LoRA with 50–12,800 composite samples; validate zero-shot performance increases without full data collection.
- Assumptions/dependencies: Availability of small domain-specific composite samples; compute to run short RL or SFT.
Debugging and error taxonomy for reasoning failures [Software/AI, QA/Debugging]
- Use case: Pinpoint where pipelines fail (context use vs parametric recall; early-hop vs late-hop errors).
- Tools/products/workflows: Align model CoT with ground truth hops; classify Mem vs Ctx error steps; prioritize data augmentation or retrieval fixes accordingly.
- Assumptions/dependencies: Access to intermediate reasoning traces or structured rationale; controlled task construction for hop-level analysis.
Safer procurement and deployment checks for LLMs [Policy/Government/Enterprise]
- Use case: Certification that deployed models generalize beyond memorized training paths.
- Tools/products/workflows: Require Composition and Structural Zero-shot benchmarks in vendor evaluations; include pass@k analysis to detect amplification-only behavior.
- Assumptions/dependencies: Standardized evaluation suites; governance buy-in; reproducible testing.
Personal assistants that seamlessly connect user memory with new inputs [Daily Life/Consumer Software]
- Use case: Assistants that answer “What is the job title of the person who emailed me last week and is my colleague’s mentor?” combining stored contacts (Mem) and new email content (Ctx).
- Tools/products/workflows: Maintain personal knowledge graphs; train assistants via the atomic→composite curriculum; instrument retrieval and memory calls as “relations.”
- Assumptions/dependencies: Consentful personal data storage; accurate context ingestion; privacy and security infrastructure.
Clinical QA that joins EHR data with new lab results [Healthcare]
- Use case: Answer multi-hop questions such as “What medication adjustment is recommended for the cardiologist’s patient whose latest creatinine elevation exceeds threshold?” using EHR (Mem) plus new labs (Ctx).
- Tools/products/workflows: Fine-tune on atomic skills (EHR semantics and lab interpretation) then RL on composite care pathways; verify zero-shot generalization on unseen relation types (e.g., new test panels).
- Assumptions/dependencies: De-identified, compliant data; strong retrieval and data normalization; clinical governance.
Agent planning for multi-operation web tasks [Robotics/Software Agents]
- Use case: Web agents that recombine known tools (atomic skills) with new site structures (context) to accomplish novel multi-step tasks.
- Tools/products/workflows: SFT on tool primitives; RL with composite tasks defined by operation paths; evaluate Composition and Zero-shot generalization to new site relations.
- Assumptions/dependencies: Robust tool APIs; accurate state/context gathering; reward design for task completion.
CRM/chatbots that combine internal records with current ticket context [Enterprise Software]
- Use case: Bots that reason over “business partner” or “account hierarchy” from CRM (Mem) plus new complaint details (Ctx).
- Tools/products/workflows: Implement complementary reasoning QA; use the atomic→composite training recipe; add failure telemetry (Mem vs Ctx hop location).
- Assumptions/dependencies: Data governance over CRM; high-quality retrieval and entity resolution.

Long-Term Applications

These applications require further research, scaling, or development—including adapting the synthetic methodology to real-world corpora, automating atomic skill discovery, and standardizing policy frameworks.

Automated discovery and teaching of atomic skills for diverse domains [Software/AI, Tooling]
- Use case: Systems that infer which primitives (relations, tools, schemas) to teach via SFT before RL synthesis.
- Tools/products/workflows: Unsupervised relation induction; curriculum learning schedulers; adaptive partitioning of Mem vs Ctx skills.
- Assumptions/dependencies: Reliable methods to detect primitives; scalable curriculum generation; domain expert oversight.
Generalization-certified LLM training standards [Policy/Standards]
- Use case: Formal requirements that models pass Zero-shot complementary reasoning tests before deployment in critical sectors.
- Tools/products/workflows: Standardized path-level benchmarks; conformity assessment protocols; public reporting of pass@k synthesis metrics.
- Assumptions/dependencies: Multi-stakeholder consensus; test maintenance; anti-gaming protections.
Multimodal complementary reasoning (text + tables + images + telemetry) [Healthcare, Energy, Finance, Education]
- Use case: Combine parametric knowledge with live signals (e.g., medical imaging with patient history; grid telemetry with asset records).
- Tools/products/workflows: Multimodal Mem/Ctx splits; RL reward functions spanning modalities; synthetic multimodal KGs for evaluation.
- Assumptions/dependencies: Multimodal data access; reliable alignment; safety approvals.
Architectural innovations for memory–context routers and symbolic bridges [Software/AI Research]
- Use case: Models with explicit modules that route and compose parametric memory and retrieved context along reasoning paths.
- Tools/products/workflows: Differentiable routers; hybrid neuro-symbolic planners; relation-aware attention layers; path-tracing monitors.
- Assumptions/dependencies: Model changes beyond post-training; rigorous ablation and verification.
Continuous compliance assistants with OOD resilience [Finance/Legal]
- Use case: Systems that adapt to unseen regulation relations and recombine prior knowledge with new rules as they emerge.
- Tools/products/workflows: Atomic skill banks for statute semantics; RL on composite cross-references; Zero-shot tests for new regulatory constructs.
- Assumptions/dependencies: Up-to-date legal corpora; policy change detection; risk controls.
Educational tutors that generalize to novel curriculum compositions [Education]
- Use case: Tutors capable of combining known student profiles (Mem) with new tasks (Ctx) in unseen pedagogical sequences.
- Tools/products/workflows: Atomic pedagogical primitives; composite curriculum synthesis via RL; progression-aware error analytics.
- Assumptions/dependencies: High-quality student models; privacy-preserving data; educator alignment.
Reliability science for “synthesis vs amplification” in AI systems [Academia/Industry Labs]
- Use case: Methods to systematically determine when RL creates new reasoning circuits versus reweights existing ones.
- Tools/products/workflows: Pass@k scaling analyses, causal probes, latent space tracking (e.g., PCA, representational drift), and intervention tests.
- Assumptions/dependencies: Access to training logs and embeddings; standardized protocols; reproducibility.
Domain-specific synthetic KG platforms for controlled evaluation [Software/Platforms]
- Use case: Sector-tailored testbeds (e.g., biotech pathways, supply chain graphs) to isolate contamination and measure complementary reasoning.
- Tools/products/workflows: KG generators, template libraries, relation partitioning tools; test split orchestrators (I.I.D./Composition/Zero-shot).
- Assumptions/dependencies: Domain expert input; validation of realism; maintenance overhead.
Agentic tool-use composition with formal verification [Robotics/Software Agents]
- Use case: Agents that can prove they correctly composed tools and data sources across unseen operation sequences.
- Tools/products/workflows: Verified planners; relation-path constraints; reward shaping tied to proofs; compliance logs.
- Assumptions/dependencies: Formal method integration; performance trade-offs; verification-friendly environments.
Healthcare decision support with certified OOD generalization [Healthcare]
- Use case: Systems that safely extrapolate to new clinical relations (e.g., emergent biomarkers) while composing known pathways.
- Tools/products/workflows: Atomic skill libraries for clinical semantics; Zero-shot composite tests; human-in-the-loop guardrails.
- Assumptions/dependencies: Regulatory approval; rigorous clinical validation; auditing infrastructure.

From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning (2512.01970v1)

Summary

Reinforcement Learning as a Synthesizer of Complementary Reasoning in LLMs

Motivation and Problem Definition

Synthetic Experimental Design

SFT Generalization Paradox and Baseline Analysis

RL as a Reasoning Synthesizer

Necessity of Atomic Skills and RL for Generalization

Mechanistic Insights: Synthesis vs. Amplification

Learning Dynamics and Latent Mechanism Analysis

Implications and Future Directions

Conclusion

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What is this paper about?

What questions did the researchers ask?

How did they paper it?

What did they find?

Why is this important?

Knowledge Gaps

Glossary

Practical Applications

Immediate Applications

Long-Term Applications

Open Problems

Continue Learning

Authors (8)

Collections

Tweets

From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning (2512.01970v1)

Sponsor

Summary

Reinforcement Learning as a Synthesizer of Complementary Reasoning in LLMs

Motivation and Problem Definition

Synthetic Experimental Design

SFT Generalization Paradox and Baseline Analysis

RL as a Reasoning Synthesizer

Necessity of Atomic Skills and RL for Generalization

Mechanistic Insights: Synthesis vs. Amplification

Learning Dynamics and Latent Mechanism Analysis

Implications and Future Directions

Conclusion

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

What is this paper about?

What questions did the researchers ask?

How did they paper it?

What did they find?

Why is this important?

Knowledge Gaps

Glossary

Practical Applications

Immediate Applications

Long-Term Applications

Open Problems

Continue Learning

Related Papers

Authors (8)

Collections

Tweets