OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Published 1 Apr 2026 in cs.CL and cs.AI | (2604.00536v1)

Abstract: LLMs achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.


Authors (10)

Summary

The paper presents an influence-guided approach that aligns synthetic data generation with model optimization using optimizer-aware influence scores.
It replaces manual rubric engineering with a learnable RL policy, enhancing synthetic QA quality while directly linking to downstream model improvements.
Empirical results demonstrate consistent accuracy gains across domains and model scales, validating the method’s robustness and portability.

Influence-Guided Rubrics Optimization for Synthetic Data Generation in LLM Fine-Tuning

Motivation and Problem Statement

The work presents a systematic approach to optimizing synthetic data generation for supervised fine-tuning (SFT) of LLMs in knowledge-intensive domains, including the humanities, social sciences, and medicine. Annotated SFT data in these settings is scarce, costly, and often inconsistent, because curation requires domain expertise and must satisfy privacy and label-consistency constraints. Synthetic data offers an alternative, but existing pipelines have critical limitations: rubric-based generation relies on handcrafted, expert-driven templates that are brittle and domain-specific, and it provides no quantitative feedback linking a synthetic sample's utility to downstream model improvement. The resulting heuristic cycle of synthesis and inspection offers poor transparency, limited cross-domain applicability, and sub-optimal SFT outcomes.

Methodological Contributions

Optimizer-Aligned Influence Estimation

The central technical contribution is a training-utility-driven selection signal based on optimizer-aware influence scores. Drawing on recent work in influence estimation, the authors adapt a scalable, Adam-compatible variant to the LLM SFT regime. Each generated sample's influence is quantified as the projected downstream effect of a single optimizer update on that sample, using trajectory-based, first-order approximations. Tying sample selection directly to model optimization distinguishes this signal from embedding-proximity heuristics: the analysis shows that high semantic similarity does not guarantee that a sample moves the parameters in task-relevant directions.
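To make the idea of an optimizer-aware, first-order influence score concrete, here is a minimal sketch in plain Python. It assumes the score is the first-order estimate of how much one Adam step on a sample would reduce the validation loss; the function names, the learning rate, and the use of a raw (unnormalized) inner product are illustrative assumptions, not details confirmed by the paper.

```python
import math

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Direction Adam would move in if `grad` were the next gradient.

    `m` and `v` are the optimizer's current first and second moment
    estimates (hypothetical inputs; a real pipeline would read them
    from a training checkpoint).
    """
    direction = []
    for g, m_i, v_i in zip(grad, m, v):
        m_hat = (beta1 * m_i + (1 - beta1) * g) / (1 - beta1 ** step)
        v_hat = (beta2 * v_i + (1 - beta2) * g * g) / (1 - beta2 ** step)
        direction.append(m_hat / (math.sqrt(v_hat) + eps))
    return direction

def influence(sample_grad, val_grad, m, v, step, lr=1e-3):
    """First-order estimate of the validation-loss decrease after one step.

    L_val(theta - lr * d) ~= L_val(theta) - lr * <grad L_val, d>,
    so a positive score means the sample is expected to help.
    """
    d = adam_direction(sample_grad, m, v, step)
    return lr * sum(d_i * g_i for d_i, g_i in zip(d, val_grad))

# Toy usage with made-up 2-parameter gradients and a fresh optimizer state.
zero = [0.0, 0.0]
helpful = influence([2.0, -1.0], [1.0, -0.5], zero, zero, step=1)
harmful = influence([2.0, -1.0], [-1.0, 0.5], zero, zero, step=1)
```

The same sample gradient yields a positive score against one validation gradient and a negative score against the opposite one. That is the sense in which the signal depends on the task direction rather than on the sample's content alone.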

Rubric Generator as a Learnable RL Policy

Rather than manually engineering domain rubrics, the proposed framework introduces a dedicated rubric-proposing policy model, trained in an RL loop with group-normalized, clipped policy gradients (GRPO/PPO-style). The rubric generator, conditioned on seed documents and the target model, outputs diverse rubric texts designed to steer synthetic QA generation. The reward combines lightweight automatic validity checks with the optimizer-aware influence score of the resulting synthetic sample. Crucially, the loop closes the gap between data generation and model utility: it systematically rewards rubrics that yield the samples most beneficial for model improvement, independent of expert priors or domain-specific heuristics.
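The group-normalized, clipped update described above can be sketched as follows. This is a generic GRPO/PPO-style objective in plain Python, not the authors' implementation: the function names, the clip range of 0.2, and the assumption that each reward already combines the validity check with the influence score are all illustrative.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rubrics sampled for the same seed.

    Each rubric's advantage is its reward relative to the group mean,
    scaled by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_objective(logp_new, logp_old, advantages, clip=0.2):
    """PPO-style clipped surrogate, averaged over the group (to be maximized)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(rubric) / pi_old(rubric)
        clipped = max(1 - clip, min(1 + clip, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)

# Toy usage: four rubrics for one seed document, with made-up rewards.
advs = group_advantages([1.0, 2.0, 3.0, 4.0])
objective = clipped_objective([-1.0] * 4, [-1.0] * 4, advs)
```

Because advantages are centered within each group, a rubric is rewarded only for beating its siblings on the same seed document, which removes the need for a separate value model.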

Generalization and Portability

Extensive ablation studies show that the framework generalizes across LLM backbones (Qwen3, Llama3, various scales) and generator models (Qwen3-235B, GPT-4.1, Gemini-2.5-Pro), demonstrating that learned rubrics conditioned on model/task characteristics are broadly portable and robust to architectural differences. The influence-guided approach reduces dependence on heavy domain-specific engineering, supporting deployment in privacy-constrained and annotation-scarce environments.

Empirical Results

For SFT of Qwen3-8B-Base (and other variants) on synthetic data generated via influence-guided rubric optimization, the following empirical results are highlighted:

Domain-robust improvements: In both Humanities and Social Sciences (HSS) and Medical domains, models trained with OptimSyn synthetic data achieve consistent, notable accuracy gains over strong open-source SFT baselines and even surpass Qwen3-8B-Instruct, the teacher model, in multiple metrics.
Reasoning-centric tasks: On high-complexity evaluation suites (e.g., MMLU-Pro, SuperGPQA, Humanity's Last Exam), the method yields $+27.2\%$ relative gains on challenging reasoning tasks compared to standard baselines, with the most pronounced gains on tasks whose specifications require nuanced, structured rubric steering.
Generator and scale invariance: Improvements persist across target model scales (4B, 8B, 14B) and family transitions (from Qwen3 to Llama3), with group-normalized policy optimization providing stability even at small model scales.
Predictive influence scores: Aggregate influence scores of synthetic data subsets are predictive of held-out accuracy ($R^2 > 0.5$ in quadratic regression), supporting the claim that the influence metric serves as a reliable proxy for sample utility under constrained compute settings.
Optimal validation set construction: Careful selection of validation sets for influence estimation (via embedding- and capability-space diversity heuristics) enables effective supervision even from small validation pools, mitigating overfitting and instability.
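The quadratic-regression check mentioned in the list above can be reproduced in spirit with a few lines of plain Python. The sketch below fits y = a + b*x + c*x^2 by least squares and reports R^2; the helper names are hypothetical and the data points are a toy example chosen to be exactly quadratic, not values from the paper.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^k
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]    # augmented matrix
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(3):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [u - factor * w for u, w in zip(a[row], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def r_squared(xs, ys, coeffs):
    """Fraction of variance in ys explained by the fitted quadratic."""
    a0, a1, a2 = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a0 + a1 * x + a2 * x * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Toy data: x stands in for aggregate influence, y for held-out accuracy.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 7.0, 13.0, 21.0]  # exactly x^2 + x + 1
coeffs = fit_quadratic(xs, ys)
score = r_squared(xs, ys, coeffs)
```

In a real analysis, each point would correspond to one synthetic data subset, with its summed influence score on the x-axis and the accuracy of a model trained on that subset on the y-axis.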

Rubric Analysis and Data Characteristics

Qualitative and quantitative analysis of rubrics pre- and post-RL optimization demonstrates clear shifts from generic, repetitive templates toward actionable, document-grounded, domain-specific, and structurally rich instructions. Post-RL rubrics display wider coverage, greater lexical diversity, and sharply increased frequency of quality-related dimensions (clarity, logical soundness, information density, etc.). Synthetic QA pairs generated under these rubrics manifest higher factual accuracy, structural diversity, and rubric alignment, correlating with observed downstream accuracy improvements.
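One simple way to quantify the lexical diversity claim is a distinct-n ratio: the number of unique n-grams divided by the total number of n-grams across a set of rubrics. The sketch below is an assumption about how such a comparison could be run; the summary does not state which diversity metric the authors use, and the rubric strings are invented for illustration.

```python
def distinct_n(texts, n=1):
    """Ratio of unique n-grams to total n-grams over whitespace tokens."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Hypothetical rubric sets: repetitive templates vs. document-grounded ones.
generic = ["be clear and concise", "be clear and concise"]
grounded = ["check dates against the source", "require two-step reasoning"]

generic_score = distinct_n(generic)    # every token is repeated once
grounded_score = distinct_n(grounded)  # no token is repeated
```

A higher ratio for post-RL rubrics than for pre-RL rubrics would be consistent with the shift away from repetitive templates described above, although it does not by itself measure rubric quality.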

Prior approaches (WebR, MAmmoTH2, Bonito, Condor, Evol-Instruct, DSIR, LESS, Meta-rater) rely on heuristic data selection, static scores, or reflection-based data refinement. In contrast, the proposed framework directly optimizes the training signal of the target model via influence-aligned supervision. Empirical ablations against multidimensional LLM-judge scores (QuRating, FineWeb-Edu, PRRC) show better model performance when optimizing on influence-based rewards, particularly under fixed training budgets and diverse benchmark splits.

Limitations

The framework relies on indirect, policy-gradient-based RL: reward computation is detached from generator sampling, so credit assignment between a rubric and its final reward can be noisy, which increases policy update variance, especially at small group or rollout sizes. In addition, gradients are not propagated through the generator, which limits the scope for designing low-variance estimators. Task-specific validation set selection remains a non-trivial component, although diversity-based heuristics partially address instability risks.

Implications and Future Directions

Practically, this influence-guided rubric optimization pipeline substantially reduces the need for expert-driven domain heuristics, enabling efficient expansion of high-quality synthetic SFT data for specialized domains with strict privacy or annotation constraints. Theoretically, it establishes a principled, optimization-aligned feedback cycle in synthetic data curation, reconciling generation strategies with model improvement rather than surface distribution matching. Extensions may include end-to-end differentiable reward modeling, multi-turn synthesis-feedback loops, and joint optimization with generative architectures. Further, integrating structural constraints for scientific or legal QA, or fine-grained calibration for hallucination reduction, is possible under the influence-supervised paradigm.

Conclusion

This work demonstrates that optimizer-aware, influence-guided rubric optimization reliably enhances synthetic data generation for LLM SFT in knowledge-intensive domains. The framework’s empirical advantages are robust across model scales, generator variants, and training configurations. Influence-based rewards provide a strong, generalizable proxy for data utility, exposing critical failures of embedding-centric heuristics. By transforming rubric engineering into a learnable, model-aligned process, the framework paves the way for broader, more efficient deployment of LLMs in data-scarce, high-stakes fields.
