Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Published 11 Feb 2026 in cs.CL and cs.AI | (2602.10388v2)

Abstract: The diversity of post-training data is critical for effective downstream performance in LLMs. Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

Summary

  • The paper presents FAC Synthesis, a method that leverages sparse autoencoder decompositions to identify and fill task-relevant feature gaps in LLMs.
  • It systematically synthesizes contrastive data for missing features, achieving remarkable sample efficiency and significant improvements in alignment tasks.
  • Empirical results across multiple tasks demonstrate higher performance and robustness compared to state-of-the-art methods, with strong feature transferability.

Data-Efficient Diversity in LLM Post-Training: FAC Synthesis in Feature Space

Motivation and Problem Formulation

The empirical and theoretical underpinnings of downstream generalization in LLMs increasingly emphasize not just raw data volume, but targeted diversity in post-training corpora. Existing metrics for data diversity predominantly operate in text or general embedding space, utilizing superficial statistics such as n-gram variation, POS tag entropy, or pairwise semantic distance. While these measures capture surface-level variety, they poorly align with the internal feature activations or latent semantics essential for task-specific generalization or behavioral alignment. Prior gradient- or optimizer-based approaches measure diversity in the model’s gradient or embedding spaces, but these are brittle with respect to architecture, scale, and initialization, impeding cross-model transfer.

This work introduces a model-aware diversity metric, Feature Activation Coverage (FAC), based on interpretable sparse autoencoder (SAE) decompositions of LLM internal representations. FAC operationalizes diversity as the coverage of task-relevant feature activations extracted from the LLM, targeting, for instance, missing alignment-critical features, and guiding data synthesis directly in this representation space. This metric is not only more faithful to variation that drives generalization, but also enables systematic identification and filling of representation gaps even with highly restricted data regimes.
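
As a concrete illustration, a toy version of a coverage metric in this spirit can be sketched in a few lines. The function name, the binary any-sample activation rule, and the threshold are assumptions for illustration, not the paper's exact definition of FAC.

```python
import numpy as np

def feature_activation_coverage(activations, threshold=0.0):
    """Toy coverage metric in the spirit of FAC (illustrative sketch).

    `activations` is an (n_samples, n_features) array of SAE feature
    activations for a dataset; a feature counts as "covered" if any
    sample activates it above `threshold`.
    """
    covered = (activations > threshold).any(axis=0)
    return covered.sum() / activations.shape[1]

# Example: 4 samples, 5 SAE features; only features 0, 2, 3 ever fire.
acts = np.array([
    [0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.0],
])
print(feature_activation_coverage(acts))  # 3 of 5 features covered -> 0.6
```

In this toy form, adding a sample that activates feature 1 or 4 raises coverage, while adding more samples that re-activate features 0, 2, or 3 does not, which is exactly the distinction text-level diversity metrics miss.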

Theoretical Contributions

The authors rigorously motivate their approach with generalization theory. They provide an explicit bound on downstream error that decomposes into:

  • Distribution gap between the synthetic data manifold D_gen and the true task data distribution D, captured by total variation distance in the internal feature space.
  • Sampling error attributable to finite-sample estimation, bounded via PAC-Bayesian information-theoretic arguments in terms of the mutual information between the synthetic set and fine-tuned parameters.

Grounding the diversity objective in the SAE feature space, rather than in raw text, directly targets and minimizes the KL divergence between the feature distributions induced by D and D_gen. This ensures that synthetic samples close critical coverage gaps in the set of task-relevant features rather than merely producing surface-level novelty.
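
To make the structure of this bound concrete, a schematic paraphrase is given below. This is an assumed form conveying the two-term decomposition described above; the paper's exact statement and constants may differ.

```latex
% Schematic error decomposition (paraphrased structure, not the
% paper's exact bound). \Phi denotes the SAE feature map, S the
% synthetic set of size n, and \theta the fine-tuned parameters.
\mathbb{E}_{D}[\mathrm{err}(\theta)]
  \;\lesssim\;
  \underbrace{\mathbb{E}_{D_{\mathrm{gen}}}[\mathrm{err}(\theta)]}_{\text{training error}}
  \;+\; \underbrace{\mathrm{TV}\!\big(\Phi(D),\, \Phi(D_{\mathrm{gen}})\big)}_{\text{distribution gap}}
  \;+\; \underbrace{O\!\left(\sqrt{I(S;\theta)/n}\right)}_{\text{sampling error}}
```

Synthesis that raises feature coverage shrinks the middle term, while filtering out low-quality candidates controls the mutual-information term.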

FAC Synthesis Framework

FAC Synthesis integrates three components:

  1. Feature Extraction with SAE: An SAE is trained unsupervised on LLM residual activations, providing a set of sparse, highly interpretable latent features that can be mapped to linguistic or behavioral concepts.
  2. Missing Feature Identification: Given a reference (anchor) corpus and existing seed data, the coverage of each feature is assessed via a binary activation threshold; the set difference identifies missing (uncovered) task-relevant features.
  3. Coverage-Guided Synthetic Generation: For each missing feature, a two-step procedure is employed:
    • Step 1: Synthesize contrastive pairs to anchor and disambiguate the desired feature activation.
    • Step 2: Use these as context for prompts to an LLM generator, producing candidates filtered by their SAE activation for the target feature.

This pipeline enforces precise, feature-level coverage, reducing both distributional and sampling errors, and curbing spurious variance typical in naïve LLM-based synthetic data pipelines.
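
The missing-feature identification and filtered generation steps above can be sketched as follows. Here `generate` and `sae_activation` are hypothetical stand-ins for an LLM generator call and an SAE forward pass; all names, thresholds, and the candidate budget are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def missing_features(anchor_acts, seed_acts, threshold=0.0):
    """Step 2 sketch: features covered by the anchor corpus but not
    by the seed data. Both inputs are (n_samples, n_features) SAE
    activation matrices over the same feature dictionary."""
    anchor_covered = (anchor_acts > threshold).any(axis=0)
    seed_covered = (seed_acts > threshold).any(axis=0)
    return np.flatnonzero(anchor_covered & ~seed_covered)

def synthesize_for_feature(feat_id, generate, sae_activation,
                           threshold=0.0, n_candidates=8):
    """Step 3 sketch: generate candidates for one missing feature and
    keep only those whose SAE activation confirms the target feature
    actually fires (contrastive-pair prompting omitted here)."""
    kept = []
    for _ in range(n_candidates):
        text = generate(feat_id)
        if sae_activation(text, feat_id) > threshold:
            kept.append(text)
    return kept
```

The activation-based filter at the end is what distinguishes this loop from naive prompt-only augmentation: a candidate is accepted on the model's internal evidence, not on surface wording.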

Empirical Evaluation

Experiments are conducted on four alignment-critical tasks: instruction following, toxicity detection, reward modeling, and behavior steering. The proposed approach achieves markedly higher FAC and significant improvements over strong state-of-the-art baselines (MAGPIE, CoT-Self-Instruct, SAO, SynAlign, Prismatic Synthesis) on all tasks, with especially prominent gains in sample efficiency:

  • On instruction following, FAC Synthesis matches SOTA win rates (AlpacaEval 2.0) with only 2k synthetic samples versus 300k+ for baselines, indicating over two orders of magnitude higher data efficiency.
  • For toxicity detection and reward modeling, FAC Synthesis achieves up to +23.6% and +13.3% improvements in AUPRC and accuracy, respectively, relative to the strongest competitors, with downstream performance correlating strongly with FAC (Pearson r = 0.95).

Notably, the correlation between traditional diversity metrics (distinct-n, word/syntax/embedding-level entropy) and downstream task success is weak or even negative, while FAC remains a robust predictor of improvement.

Feature Transferability and Interpretability

SAE-discovered feature spaces are substantially model-agnostic: missing features identified in one architecture (e.g., LLaMA-3.1-8B-Instruct) transfer effectively as coverage targets to other LLMs (Mistral, Qwen). In several settings, stronger models benefit disproportionately from targeted feature synthesis guided by a weaker teacher's feature decomposition, a cross-backbone weak-to-strong transfer phenomenon.

The interpretability of SAE features is validated both programmatically and with targeted human audit: more than 84% of features annotated as task-relevant by LLMs are deemed correct by human evaluators.

Sensitivity and Efficiency Analysis

FAC Synthesis is robust to hyperparameter variations:

  • Decoding temperature and choice of generator impact synthetic sample quality, but best results are generally obtained when the LLM generator matches the target backbone.
  • FAC increases monotonically with the proportion of missing features covered, but marginal gains saturate quickly, indicating that most of the benefit comes from initial feature coverage rather than brute-force sample scaling.
  • Conservative activation thresholds improve feature reliability, and a handful of samples per missing feature suffices for most of the benefit, keeping data requirements minimal.

Implications and Future Directions

Practical Implications: FAC Synthesis enables principled, data-efficient synthetic corpus generation tuned to the specific latent weaknesses of a model, critically improving post-training under severe data constraints. It also opens the way to more systematic self-improvement cycles, where a trained model’s uncovered features can be incrementally mined and targeted.

Theoretical Implications: The alignment between interpretability (through sparse features), diversity, and generalization is quantitatively substantiated. FAC provides a more semantically and operationally relevant diversity signal than text-based metrics, and the approach bridges the gap between feature-level understanding and coverage-regularized generalization.

Limitations and Future Work: Capturing features corresponding to distributed, high-level reasoning remains difficult, as these are not always localized within the shallow SAE decompositions. Future research should address richer, multi-layer feature compositions, and global feature manifolds that underlie compositional behaviors. Unifying prompt engineering, retrieval conditioning, and dynamic feature mining may further enhance both generalization and safety. Additionally, the transferability of feature spaces to multilingual or domain-specialized LLMs should be systematically explored.

Conclusion

FAC Synthesis demonstrates that exhaustive surface-level diversity is not necessary when synthetic data is targeted to close feature space coverage gaps relevant to downstream generalization. This framework not only achieves higher efficiency and reliability in LLM post-training, but also establishes a theoretically grounded paradigm for model-driven, interpretable data-centric optimization. The results suggest a transition from heuristic, text-centric augmentation toward principled, feature-aware data creation, with broad applicability in scalable, safe, and robust LLM alignment strategies (2602.10388).
