Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing (2509.18792v1)

Published 23 Sep 2025 in cs.CL

Abstract: As fine-tuning becomes the dominant paradigm for improving LLMs, understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain why one model outperforms another. In this work, we use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant. Using crosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.

Summary

  • The paper introduces a mechanistic interpretability framework using crosscoders to reveal latent shifts in LLM performance following fine-tuning.
  • The paper reveals significant trade-offs, including enhanced safety, multilingual, and instruction-following capabilities alongside diminished technical and hallucination detection abilities.
  • The paper demonstrates model diffing’s potential for systematic, interpretable diagnostics that guide model selection and future fine-tuning improvements.

Mechanistic Analysis of Performance Disparities in LLMs via Model Diffing

Introduction

The paper presents a rigorous mechanistic interpretability framework for understanding performance disparities between closely related LLMs, specifically focusing on the effects of fine-tuning via preference optimization. The authors critique the limitations of traditional benchmarking and human evaluation, arguing that these methods often fail to attribute performance differences to concrete model capabilities. Instead, they propose model diffing using crosscoders, a sparse-autoencoder-based approach, to analyze and categorize latent representation shifts between models. The paper uses Gemma-2-9b-it and its SimPO-enhanced variant as a case study, providing a detailed taxonomy of capability changes and quantifying the trade-offs introduced by SimPO fine-tuning.

Methodology

The core methodological innovation is the use of crosscoders for model diffing. Crosscoders are sparse autoencoders trained to reconstruct activation patterns from two models using a shared dictionary of interpretable latent concepts. For each latent, decoder directions are learned for both models, and norm differences are computed to identify model-specific capabilities. The authors address known failure modes (complete shrinkage, latent decoupling) by applying Latent Scaling, which estimates coefficients to more accurately measure latent presence. BatchTopK training is used to ensure that only causally distinct latents are retained.
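Concretely, the norm-difference score can be computed directly from the trained crosscoder's decoder weights. The following is a minimal Python sketch, not the authors' released code; the matrix names (W_dec_base, W_dec_ft) and the mapping of the score into [0, 1] follow common crosscoder-diffing practice and are assumptions here.

import numpy as np

def norm_difference(W_dec_base: np.ndarray, W_dec_ft: np.ndarray) -> np.ndarray:
    """Score each latent by the relative difference of its two decoder norms.

    Both inputs are (n_latents, d_model) decoder matrices from one crosscoder.
    The score is mapped into [0, 1]: ~0 means the latent's decoder mass sits in
    the base model, ~0.5 means shared, ~1 means specific to the fine-tuned
    model. (A common formulation; the paper's exact normalization may differ.)
    """
    n_base = np.linalg.norm(W_dec_base, axis=1)  # ||d_j^base|| for each latent j
    n_ft = np.linalg.norm(W_dec_ft, axis=1)      # ||d_j^ft||   for each latent j
    denom = np.maximum(n_base, n_ft) + 1e-8      # guard against zero norms
    return 0.5 * ((n_ft - n_base) / denom + 1.0)

# Usage: flag latents whose decoder direction is essentially unique to one model.
scores = norm_difference(np.random.randn(1024, 3584), np.random.randn(1024, 3584))
ft_only = np.where(scores > 0.9)[0]    # candidate fine-tune-specific latents
base_only = np.where(scores < 0.1)[0]  # candidate base-model-specific latents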

The experimental setup involves training crosscoders on activation patterns from three Gemma-2-9b variants: the base pretrained model (pt), the instruction-tuned model (it), and the SimPO-enhanced model (it-SimPO). Layer 20 is selected for analysis, and crosscoders are trained on 200M tokens from FineWeb and LMSys datasets. The authors extract documents that strongly activate unique latents and use a high-capacity LLM (Claude 3 Opus) to annotate and categorize these latents, resulting in a taxonomy of 30 capability categories grouped under seven major classes.
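The annotation step can be sketched as follows. Everything in this snippet (function names, the top-k retrieval, the prompt wording) is illustrative rather than the paper's actual pipeline, which uses Claude 3 Opus as the annotator.

from typing import List, Tuple
import heapq

def top_activating_docs(docs: List[str],
                        activations: List[List[float]],
                        latent_id: int,
                        k: int = 20) -> List[Tuple[float, str]]:
    """Return the k documents with the highest activation on `latent_id`.
    `activations[i][j]` is the (e.g. max-pooled) activation of latent j on doc i."""
    scored = ((acts[latent_id], doc) for doc, acts in zip(docs, activations))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

def annotation_prompt(latent_id: int, examples: List[Tuple[float, str]]) -> str:
    """Hypothetical prompt template; the paper's actual wording is not given."""
    snippets = "\n\n".join(f"[{score:.3f}] {doc[:500]}" for score, doc in examples)
    return (
        f"The following excerpts strongly activate latent #{latent_id} of a "
        "crosscoder trained to compare two language models.\n"
        "1) Describe the shared concept in one sentence.\n"
        "2) Propose a short capability label (e.g. 'safety filtering', "
        "'instruction following') suitable for grouping into a taxonomy.\n\n"
        f"Excerpts:\n{snippets}"
    )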

Key Findings

Enhanced Capabilities in SimPO

Model diffing reveals that SimPO fine-tuning leads to targeted increases in specific capabilities:

  • Safety and Content Moderation (+32.8%): Latents associated with sexual content filtering, minor protection, and stereotype/bias detection become more prominent, indicating a prioritization of alignment with human values and safety guidelines.
  • Multilingual and Stylistic Processing (+43.8%): SimPO enhances multilingual capabilities, with notable improvements in English, German, and Chinese. However, low-resource languages do not show similar gains, likely due to data limitations.
  • Instruction Following (+151.7%): There is a dramatic increase in template and instruction-following latents, explaining SimPO's improved response formatting and adherence to user constraints.

Diminished Capabilities in SimPO

SimPO's optimization for alignment and stylistic fluency comes at the expense of certain technical and introspective capabilities:

  • Hallucination Detection (–68.5%): Latents associated with identifying fabricated information are significantly reduced, suggesting a trade-off between confident output generation and internal verification.
  • Model Self-Reference (–44.1%): The model's ability to describe its own nature and limitations is deprioritized.
  • Structured Output Generation (–37.1%): There is a reduction in latents responsible for producing organized, machine-readable formats.
  • Code Generation and Math (–17.4%): Technical capabilities related to code and mathematical reasoning are diminished, reflecting a shift toward general-purpose conversational abilities.

Comparison with DPO Fine-Tuning

The authors extend their analysis to Direct Preference Optimization (DPO), finding that DPO fine-tuning leads to improved quality control and interaction management but less emphasis on safety compared to SimPO. DPO also impacts core linguistic capabilities more negatively, highlighting the generality and diagnostic power of the model diffing approach.

Implications

The paper demonstrates that leaderboard metrics and aggregate benchmarks are insufficient for diagnosing the true nature of performance disparities in LLMs. Model diffing via crosscoders provides fine-grained, interpretable insights into how fine-tuning methods like SimPO and DPO alter model behavior. The results show that preference optimization does not yield uniform improvements; instead, it induces targeted capability shifts, often trading off technical competence and introspection for stylistic fluency and alignment cues.

These findings have practical implications for model selection and deployment. For applications requiring robust hallucination detection, introspection, or technical reasoning, SimPO-enhanced models may be suboptimal despite their superior benchmark scores. Conversely, for use cases prioritizing safety, multilingual support, and instruction adherence, SimPO fine-tuning offers clear advantages.

Theoretically, the work advances the field of mechanistic interpretability by demonstrating the utility of crosscoders for unsupervised, task-agnostic latent space analysis. The approach is generalizable to other model pairs and fine-tuning regimes, enabling systematic diagnosis of behavioral changes without reliance on hand-crafted probes or benchmarks.

Future Directions

The paper suggests several avenues for future research:

  • Generalization to Other Architectures: Extending model diffing to diverse LLM architectures and fine-tuning methods to validate the universality of observed capability shifts.
  • Causal Interventions: Leveraging crosscoder latents for activation patching to establish causal relationships between latent concepts and model behavior.
  • Automated Taxonomy Construction: Refining LLM-based annotation pipelines for scalable, high-fidelity latent categorization.
  • Integration with Evaluation Frameworks: Incorporating mechanistic diagnostics into model evaluation platforms to complement leaderboard metrics with capability-level analysis.

Conclusion

This work establishes model diffing via crosscoders as a robust framework for understanding performance disparities in LLMs. The analysis of Gemma-2-9b-it and its SimPO variant reveals that fine-tuning induces targeted shifts in safety, alignment, and stylistic capabilities, often at the expense of technical competence and introspection. These insights underscore the limitations of traditional evaluation paradigms and highlight the need for mechanistic, latent-space-informed diagnostics in LLM development and deployment. The methodology is generalizable and offers a blueprint for transparent, capability-driven model comparison in future AI research.


Explain it Like I'm 14

Overview

This paper asks a simple question: when we fine‑tune an LLM to “make it better,” what actually changes inside the model? Instead of just looking at leaderboard scores, the authors use a tool called “model diffing” to peek under the hood and find the specific skills that got stronger or weaker. They study two versions of the same model (Gemma‑2‑9B): one that’s instruction‑tuned, and another that’s further fine‑tuned using SimPO, a method designed to align model outputs with human preferences.

Goals

The paper focuses on two easy‑to‑understand goals:

  • Figure out whether SimPO fine‑tuning improves real abilities (like following instructions, being safe, handling multiple languages) or mostly polishes style and formatting.
  • Identify which abilities got stronger and which got weaker after SimPO, so we know why one model seems to perform better than another.

Methods (Explained Simply)

Think of an LLM as a huge brain with millions of tiny pattern detectors inside. Each detector responds to certain ideas or behaviors, like “follow instructions,” “avoid unsafe content,” or “write code.”

Here’s how the authors examined those detectors:

  • Model diffing: This is like using a special microscope to compare two brains (models) and see which pattern detectors are more active in one than the other.
  • Crosscoders: Imagine building a shared “dictionary of concepts” that both models understand. This lets you line up the same concept in both models and compare how strongly each model uses it.
  • Sparse autoencoders: These are tools that compress a model’s inner signals into a small set of meaningful switches (concepts), making them easier to interpret.
  • Training details: They looked at a middle layer of the models (layer 20), trained the crosscoder on a big mix of text (about 200 million tokens, the small text pieces models read), and used a setting that keeps only the top 100 strongest signals per batch (BatchTopK) so it focuses on the most important concepts (see the small sketch after this list).
  • Handling pitfalls: They used a technique called Latent Scaling to avoid mislabeling shared concepts as unique to one model.
  • Comparing models: Directly comparing the instruction‑tuned model to the SimPO model didn’t show clear differences. So instead, they compared each one to their shared base model (the original pretrained Gemma‑2‑9B). That revealed clear sets of concepts that were unique to each fine‑tuned version.
  • Labeling concepts: For each discovered concept, they pulled examples of text that strongly activated it and asked another LLM (Claude 3 Opus) to describe and categorize the concept. This produced a simple taxonomy of 30 abilities grouped into 7 big classes (like Safety, Linguistic Capabilities, and Error Handling).
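If you are curious what “keep the top 100 strongest signals per batch” looks like, here is a tiny illustrative sketch (a guess at the mechanism for intuition, not the actual training code):

import numpy as np

def batch_topk(latent_acts: np.ndarray, k_per_example: int = 100) -> np.ndarray:
    """Keep only the (batch_size * k_per_example) largest activations across
    the whole batch and zero out everything else, forcing the crosscoder to
    explain the data with a few meaningful concepts."""
    batch_size = latent_acts.shape[0]
    k_total = batch_size * k_per_example
    flat = latent_acts.ravel()
    threshold = np.partition(flat, -k_total)[-k_total]  # k_total-th largest value
    return np.where(latent_acts >= threshold, latent_acts, 0.0)

acts = np.maximum(np.random.randn(8, 114_688), 0.0)  # fake nonnegative activations
sparse = batch_topk(acts)                            # roughly 800 values survive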

Main Findings

SimPO’s changes aren’t uniform upgrades. They are targeted shifts—some abilities go up, others go down. Here’s the big picture:

  • What got stronger:
    • Instruction‑following and template following: Very large boost (about +151.7%). The SimPO model is much better at doing exactly what you ask and presenting answers neatly.
    • Safety and content moderation: Strong increase (about +32.8%), especially filtering sexual content, protecting minors, and spotting stereotypes or bias.
    • Multilingual and stylistic language skills: Noticeable increase (about +43.8%), improving performance in languages like English, German, and Chinese.
    • Factual verification: Also improved, helping the model check facts more often.
  • What got weaker:
    • Hallucination detection: Big drop (about −68.5%). The SimPO model is less likely to recognize when it might be making things up.
    • Model self‑reference and introspection: Decrease (about −44.1%). It talks less about its own limitations or process.
    • Structured output generation and query classification: Decreases (around −37.1%). It may jump straight into answering rather than first classifying the request or planning a structured response.
    • Technical skills (code and math): Overall decline (about −17.4%), matching slightly weaker scores on coding/math categories.
  • Leaderboards vs. real changes:
    • Human preference tests (like Chatbot Arena) often reward style, clarity, and politeness. The authors show that a chunk of SimPO’s leaderboard gains can be explained by stylistic polishing, not just deeper reasoning or accuracy. When style is controlled, score differences shrink.

Why this matters: These specifics explain why the SimPO model may feel nicer to interact with and score higher on preference‑based tests, even if some technical or self‑checking abilities got weaker.

Implications and Impact

  • Beyond scoreboards: Leaderboards tell you who “won,” but not why. This method shows the exact skills that changed—much more useful for building trustworthy models.
  • Trade‑offs are real: Fine‑tuning toward human‑preferred style and safety can trade off against self‑checking and structured reasoning. Knowing this helps teams choose the right training method for their goals.
  • Better diagnostics: Model diffing offers a transparent way to compare models and make targeted improvements, rather than guessing from overall scores.
  • General method: The same approach works for other fine‑tuning techniques too. In a quick check with DPO (another preference method), the pattern of changes was different, proving the tool can reveal method‑specific shifts.

In short, the paper shows that fine‑tuning can make models safer, more obedient, and more polished—but sometimes less cautious about hallucinations and slightly weaker at technical tasks. Using model diffing helps researchers see these hidden changes clearly, leading to smarter, more honest evaluations and better‑balanced models.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and unresolved questions that future work could address.

  • Generalization beyond a single model family: Validate whether the observed capability shifts hold across diverse base families (e.g., Llama, Qwen, DeepSeek), sizes, and architectures, not just Gemma-2-9b variants.
  • Method-level disentanglement: Isolate algorithmic effects of SimPO from data effects by controlling/pre-registering identical preference datasets across SimPO/DPO/RLHF/ORPO/KTO and comparing outcomes with model diffing.
  • Direct it vs SimPO comparison sensitivity: The direct SimPO–it crosscoder yielded weak signal; develop tri-model joint crosscoders or contrastive diffing that explicitly isolates incremental changes from it→SimPO without relying on pt as an anchor.
  • Layer-wise coverage: Extend analysis beyond a single layer (layer 20) to multiple layers and positions; map where capability shifts emerge (early, mid, residual stream vs MLP/attention) and whether they are localized or distributed.
  • Circuit-level attribution: Link discovered latents to specific attention heads/MLP blocks and pathways via circuit tracing or path patching to move from feature-level to mechanism-level explanations.
  • Hyperparameter sensitivity of crosscoders: Quantify robustness to latent dimensionality, top-k sparsity, learning rates, seeds, and dictionary size; report SAEBench metrics (e.g., purity, coverage, dead features, reconstruction).
  • Stability across random seeds: Train multiple crosscoders with different initializations to estimate variance in discovered “unique” latents and the downstream taxonomy; report confidence intervals.
  • Failure-mode stress tests: Empirically verify mitigation of Complete Shrinkage and Latent Decoupling using controlled synthetic features and ablations; assess sensitivity to the Latent Scaling coefficients (ν^ε, ν^r).
  • Metric design for diffing: Go beyond norm differences to include angular/directional drift (cosine), activation prevalence, and sparsity overlap metrics that may capture subtle incremental changes between closely related models (see the cosine-drift sketch after this list).
  • Dataset bias in crosscoder training: The 200M-token corpus (FineWeb + LMSys) likely under-represents low-resource languages and technical domains; test how training corpus composition shapes which latents are discovered.
  • Low-resource multilingual coverage: The method did not capture regressions in Japanese/Korean observed on LMArena; incorporate balanced multilingual corpora and evaluate whether diffing detects low-resource capability shifts.
  • Token- and span-level granularity: Move from document-level activation retrieval to token/span-level analysis to reduce confounds and better localize what each latent detects.
  • Multi-turn and long-context dynamics: Analyze latent behavior over dialogue turns and long contexts to see whether interaction and deliberation features shift across time steps in SimPO vs it.
  • Causal validation via interventions: Perform activation patching/feature steering/ablation on identified latents (e.g., “hallucination detection,” “template following”) and measure causal effects on outputs.
  • Behavioral linkage to targeted benchmarks: Validate latent-level claims with controlled evaluations (e.g., TruthfulQA/HaluEval for hallucination, structured extraction/JSON compliance, tool-use calling, multilingual low-resource tasks) under style control.
  • Safety external validation: Corroborate increased “safety” latents with adversarial red-teaming and standardized safety benchmarks to test whether unsafe generations actually decrease under attack.
  • Structured output and tool-use regressions: The reported decline in “structured output generation” is latent-based; confirm on rigorous structured extraction, function-calling, and schema-conformance tests.
  • “Self-reference” and “query classification” constructs: These latent labels stem from LLM annotation; design targeted probes and controlled tasks to verify these constructs predict behavior and aren’t annotation artifacts.
  • Annotation validity and reliability: The taxonomy relies on Claude-generated labels; quantify agreement vs human experts and alternative LLMs, audit consistency across seeds, and publish inter-annotator agreement and uncertainty scores.
  • Reproducibility of the taxonomy: Release full latent-to-category mappings, prompts, top-N activating snippets, and code for grouping to enable independent reproduction and auditing of category assignments.
  • Choice of top-N and retrieval strategy: Systematically study how the number of top-activating examples and retrieval method affect latent interpretations and category frequencies.
  • Negative controls and false positive rate: Diff same-model copies (different seeds/checkpoints) to estimate the baseline rate of spurious “unique” latents; report how often crosscoders detect differences when none exist.
  • Checkpoint dynamics: Track capability latents across intermediate fine-tuning checkpoints to observe when shifts emerge and whether they stabilize or regress; relate to training loss/reward dynamics.
  • Cross-method triangulation: Compare crosscoder findings with structural probes, linear classifiers, attention head ablations, and logit/representation lens analyses to ensure convergent validity.
  • BatchTopK vs L1 crosscoders: Provide ablations across training objectives and sparsity mechanisms to test claims about causal distinctness and feature discoverability.
  • From frequency to effect size: Replace or complement “normalized latent frequency” with quantitative effect sizes (change in loss/logit or behavior) attributable to each latent, measured via interventions.
  • Out-of-distribution generalization: Test whether discovered latents predict behavior on domains absent from crosscoder training (e.g., biomedical, legal, code-heavy corpora).
  • Practical mitigation of trade-offs: Explore multi-objective fine-tuning or auxiliary losses (e.g., self-checking, verification heads) to retain hallucination-detection and structured-output latents while improving alignment.
  • Transferability of latents: Investigate whether identified latents can be ported or re-instantiated across models/families (feature transplantation) and whether their causal roles persist.
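As a concrete illustration of the “metric design” gap flagged above, the sketch below computes one suggested complement to the norm difference: angular drift between paired decoder directions. It is an assumption about how such a metric could look, not something the paper implements.

import numpy as np

def decoder_cosine_drift(W_dec_base: np.ndarray, W_dec_ft: np.ndarray) -> np.ndarray:
    """Cosine similarity between each latent's paired decoder directions.
    Values near 1 mean the latent points the same way in both models; lower
    values indicate directional drift even when the decoder norms match."""
    dots = np.sum(W_dec_base * W_dec_ft, axis=1)
    norms = np.linalg.norm(W_dec_base, axis=1) * np.linalg.norm(W_dec_ft, axis=1)
    return dots / (norms + 1e-8)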

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the paper’s released crosscoder models, methodology, and empirical insights about SimPO’s capability shifts.

  • Mechanistic fine-tuning audits for LLM releases
    • Sector: software/AI, academia, policy
    • What: Integrate model diffing with crosscoders into MLOps to produce “Capability Change Reports” comparing a fine-tuned model against its base (e.g., safety, multilingual, instruction-following, hallucination detection, structured output).
    • Tools/products/workflows: CI/CD stage that trains a crosscoder at selected layers (e.g., layer 20), runs Latent Scaling and BatchTopK SAE training, and generates dashboards of latent capability frequencies.
    • Assumptions/dependencies: Requires access to open weights or internal activations; compute and curated activation data; taxonomy quality depends on LLM-based annotation (Claude 3 Opus or similar).
  • Model selection and task routing based on mechanistic capability profiles
    • Sector: software, customer support, content creation
    • What: Route tasks to the variant with strengths that match the task profile (e.g., SimPO for instruction/template following and safety; base/it model for code/math or structured outputs).
    • Tools/products/workflows: Orchestrators that reference capability profiles when choosing a model; per-task policies (e.g., use SimPO for moderated customer-facing responses).
    • Assumptions/dependencies: Reliable mapping from latents to capabilities; updated routing rules when models change.
  • Safety moderation middleware for consumer and enterprise assistants
    • Sector: social platforms, education, finance, healthcare (low-risk support use)
    • What: Use SimPO-enhanced models in safety-critical front-ends (sexual content filtering, bias detection, minor protection) while pairing them with verification steps for factual claims.
    • Tools/products/workflows: “Safety-first” chat layer; post-generation factual verification (RAG) and schema validators to mitigate reduced hallucination detection and structured output.
    • Assumptions/dependencies: SimPO’s safety gains generalize to target domains; additional verification is in place to offset reduced introspection.
  • Multilingual support for prioritized languages
    • Sector: customer service, global education, e-commerce
    • What: Deploy SimPO-enhanced models to improve English, German, and Chinese interactions; avoid or supplement with special models for low-resource languages (Japanese, Korean) where regressions are likely.
    • Tools/products/workflows: Language-aware routing and fallback; per-language evaluation harnesses.
    • Assumptions/dependencies: Crosscoder training data reflected these languages; performance in low-resource languages may need targeted fine-tuning or domain data.
  • Style-robust benchmarking and evaluation hygiene
    • Sector: academia, benchmarking orgs, AI evaluation platforms
    • What: Adopt style-control evaluation (as in LMArena) to distinguish substance from style; report capability-level changes alongside Elo deltas to reduce gaming.
    • Tools/products/workflows: Evaluation harnesses with style normalization; leaderboards that include mechanistic capability scores.
    • Assumptions/dependencies: Willingness of platforms to incorporate style control and capability diffing; standardization of reporting formats.
  • Targeted dataset curation using latent activation
    • Sector: academia, model developers
    • What: Use high-activation documents per latent to curate training sets that amplify or recover specific capabilities (e.g., structured output, hallucination detection).
    • Tools/products/workflows: Data selection pipelines that filter and weight documents by latent activation; iterative fine-tunes with capability monitoring.
    • Assumptions/dependencies: Access to activation traces; robust annotation and clustering of latents; compute.
  • Structured output reliability via runtime enforcement
    • Sector: software, finance back-office, operations
    • What: Compensate for SimPO’s reduced structured output generation with function calling, JSON schema validation, or grammar-constrained decoding (a minimal validation sketch follows this list).
    • Tools/products/workflows: Tool-augmented agents; schema validators; LLM function/tool calling with strict interfaces.
    • Assumptions/dependencies: Availability of programmatic interfaces and validators; model’s compliance with constrained decoding.
  • Fine-tuning strategy comparison (SimPO vs. DPO) for objective-aligned deployments
    • Sector: industry R&D, academia
    • What: Use crosscoder diffing to decide between SimPO or DPO depending on target behavior (e.g., DPO increases interaction management and quality control, SimPO prioritizes safety and instruction-following).
    • Tools/products/workflows: Pre-deployment “diff-driven” tuning choice matrix; tuning portfolios.
    • Assumptions/dependencies: Comparable training budgets; consistent evaluation pipelines across methods.
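To make the runtime-enforcement idea from the list above concrete, here is a minimal sketch built on the jsonschema library. The schema, function name, and retry policy are hypothetical; production systems would typically add re-prompting or grammar-constrained decoding on failure.

import json
from typing import Optional

from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for an order-extraction task (illustrative only).
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "item": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 1},
    },
    "required": ["item", "quantity"],
}

def enforce_schema(raw_output: str, schema: dict) -> Optional[dict]:
    """Parse and validate model output; return None so the caller can
    re-prompt the model or fall back to constrained decoding."""
    try:
        obj = json.loads(raw_output)
        validate(instance=obj, schema=schema)
        return obj
    except (json.JSONDecodeError, ValidationError):
        return None

print(enforce_schema('{"item": "book", "quantity": 2}', ORDER_SCHEMA))  # valid -> dict
print(enforce_schema('{"item": "book", "quantity": 0}', ORDER_SCHEMA))  # violates minimum -> None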

Long-Term Applications

These applications require further research, scaling, standardization, or productization to become broadly viable.

  • Regulatory standards and certifications for mechanistic LLM audits
    • Sector: policy, compliance, procurement
    • What: Establish standardized audits based on model diffing results (e.g., safety, alignment, introspection metrics) as part of pre-deployment compliance under AI regulations.
    • Tools/products/workflows: “Mechanistic Transparency Standard,” third-party audit services, certification programs.
    • Assumptions/dependencies: Policy adoption; agreement on audit taxonomies and thresholds; auditability of closed models.
  • Continuous capability regression monitoring in MLOps
    • Sector: software/AI platforms
    • What: Automate capability diffing in CI/CD to catch regressions (e.g., hallucination detection, structured output) when models are fine-tuned or updated.
    • Tools/products/workflows: Scheduled crosscoder training, alerting on capability shifts, gating deployment on audit results.
    • Assumptions/dependencies: Scalable compute; robust incremental diffing; model versioning discipline.
  • Latent-level safety and capability “knobs” (interpretable steering)
    • Sector: software, high-stakes apps
    • What: Build runtime controls that increase/decrease specific latent concepts (e.g., safety filters, query classification) to balance alignment vs. introspection per task.
    • Tools/products/workflows: SAE-driven latent steering APIs; policy configurations for different workflows.
    • Assumptions/dependencies: Reliable causal mapping between latents and behaviors; safety evaluation; guardrails against misuse.
  • Multi-objective fine-tuning to preserve introspection and technical skills
    • Sector: model development across industries
    • What: Design training objectives that explicitly preserve hallucination detection, self-reference, and structured output while optimizing for safety and human preference.
    • Tools/products/workflows: Composite losses, constraint-based RLHF/RLAIF, capability-preserving curriculum design.
    • Assumptions/dependencies: Advances in training algorithms; high-quality data for introspection/verification; careful balancing of trade-offs.
  • Mechanistic signals as benchmark substrates
    • Sector: academia, benchmarking organizations
    • What: Redesign benchmarks to incorporate capability-level mechanistic signals, making them more resistant to style gaming and data contamination.
    • Tools/products/workflows: SAEBench extensions; style-robust metrics; cross-model capability maps.
    • Assumptions/dependencies: Community consensus; reproducible crosscoder pipelines; broad model coverage.
  • Sector-specific assurance for high-stakes domains
    • Sector: healthcare, finance, legal
    • What: Require mechanistic audit reports and demonstrated retention of introspection/reliability for clinical decision support, risk chatbots, or legal drafting tools.
    • Tools/products/workflows: Domain certification processes; paired RAG/verifier systems; scenario-based capability audits.
    • Assumptions/dependencies: Regulatory approval; longitudinal validation; domain-specific datasets.
  • Capability transfer (“concept grafting”) across models
    • Sector: research and advanced productization
    • What: Identify and transplant beneficial latents (e.g., safety, instruction-following) from one model to another to modularly compose capabilities.
    • Tools/products/workflows: Cross-model crosscoder dictionaries; feature grafting frameworks; evaluation suites.
    • Assumptions/dependencies: Generalizable latent dictionaries; causal guarantees; IP and licensing constraints.
  • Provenance, fingerprinting, and IP protection via latent diffing
    • Sector: policy, legal, platform governance
    • What: Use model diffing to detect unlicensed fine-tuning or training changes; track provenance for model marketplaces.
    • Tools/products/workflows: Latent fingerprint registries; audit APIs; compliance monitoring.
    • Assumptions/dependencies: Acceptance as evidentiary standard; cooperation from vendors; robust false-positive controls.
  • Improving low-resource language performance via latent-guided data collection
    • Sector: global services, education, public sector
    • What: Use capability gaps identified in latents to drive targeted data collection and fine-tuning for underserved languages.
    • Tools/products/workflows: Data acquisition pipelines; language-specific evaluation and routing.
    • Assumptions/dependencies: Access to high-quality, culturally appropriate data; funding and community collaboration.

Glossary

  • Activation patterns: The vectorized firing behavior of neurons/features within a model used to analyze and reconstruct internal states. "A shared dictionary is trained to reconstruct the activation patterns from both models."
  • Alignment: The process of making model outputs adhere to human preferences, safety policies, and desired behaviors. "The instruction-tuned model with supervised fine-tuning and alignment."
  • BatchTopK: A training procedure for sparse autoencoders that selects the top-k features per batch to improve interpretability and performance. "Combined with BatchTopK training~\cite{bussmann2024batchtopk, GaoTTGTRSL025}, this enables identification of latents that are causally unique to the fine-tuned or base model."
  • Complete Shrinkage: A failure mode in crosscoder training where shared latents collapse, falsely appearing model-specific. "This approach may suffer from two known failure modes: Complete Shrinkage and Latent Decoupling"
  • Crosscoders: Specialized sparse autoencoders that learn a shared, interpretable latent dictionary across two models to compare their internal representations. "Crosscoders are a specialized form of sparse autoencoders~\cite{yun2021transformer, bricken2023monosemanticity, huben2023sparse} that learn a shared dictionary of interpretable latent concepts across two models."
  • Decoder directions: The learned vectors mapping latents back to model activation space, one per model, used for comparison. "For each latent dimension, a pair of decoder directions is learned, one for each model."
  • Direct Preference Optimization (DPO): A fine-tuning method that optimizes model outputs directly against preference signals without a separate reward model. "See Appendix~\ref{appx:dpo} for additional results with regard to Direct Preference Optimization (DPO) fine-tuning~\cite{rafailov2023direct}."
  • Elo scores: A relative rating system adapted to compare model performance categories via head-to-head evaluations. "The delta of LMArena Elo scores per category for gemma-2-9b-it-SimPO compared to gemma-2-9b-it"
  • Fine-tuning: Post-training adaptation of a base model on targeted data or objectives to modify or enhance capabilities. "As fine-tuning becomes the dominant paradigm for improving LLMs, understanding what changes during this process is increasingly important."
  • FineWeb: A large-scale, curated web text dataset used for model training and analysis. "Crosscoders were trained using 200M tokens from a mixed corpus comprising the FineWeb~\cite{penedo24fineweb} and LMSys datasets~\cite{zheng2023lmsys}."
  • Hallucination detection: Latent capability for identifying or mitigating fabricated, non-factual content in model outputs. "we observe notable regressions in hallucination detection (--68.5%)"
  • Instruction tuning: Fine-tuning a model to follow task instructions and formatting more reliably. "once a model has undergone instruction tuning, further improvements like SimPO operate within a narrow behavioral subspace."
  • Latent decoupling: A failure mode where shared latent features split across models, causing misclassification as unique. "This approach may suffer from two known failure modes: Complete Shrinkage and Latent Decoupling"
  • Latent dimension: An individual feature axis in the learned latent space representing an interpretable concept. "For each latent dimension, a pair of decoder directions is learned, one for each model."
  • Latent representations: Compact, interpretable features within a model’s internal space that encode capabilities or concepts. "we use Model Diffing \citep{lindsey2024sparse, minder2025robustly} with crosscoders to analyze the latent representations of two models."
  • Latent Scaling: A technique to correct crosscoder failures by estimating scaling coefficients for more accurate latent presence across models. "we apply the Latent Scaling technique~\cite{minder2025robustly, wright2024addressing}, which estimates two coefficients, ν^ε and ν^r, to more accurately measure latent presence across models."
  • LLMs: Transformer-based models trained on vast text corpora to perform general language tasks. "Open-weight LLMs have transformed the AI landscape"
  • LMArena: A human evaluation platform (Chatbot Arena) that compares LLMs by user preference in head-to-head settings. "human evaluations, such as LMArena~\cite{chiang2024chatbotarenaopenplatform}, offer more authentic and wide-ranging assessment"
  • LMSys dataset: A real-world conversational dataset used for training and analysis of LLMs. "Crosscoders were trained using 200M tokens from a mixed corpus comprising the FineWeb~\cite{penedo24fineweb} and LMSys datasets~\cite{zheng2023lmsys}."
  • Mechanistic interpretability: Methods that analyze internal circuits/features to explain model behaviors at a causal/mechanistic level. "use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant."
  • Model diffing: A methodology to compare internal representations between models to isolate and interpret capability changes. "We apply Model Diffing to analyze the latent representations of Gemma-2-9b-it and its fine-tuned variant Gemma-2-9b-it-SimPO."
  • Norm difference: A normalized metric comparing decoder vector magnitudes for corresponding latents across models. "The norm difference between two models M₁ and M₂ is defined as:"
  • Nu coefficients (ν^ε and ν^r): Scaling parameters in Latent Scaling used to measure latent presence robustly across models. "which estimates two coefficients, ν^ε and ν^r, to more accurately measure latent presence across models."
  • Open-weight: A model distribution format where parameter weights are publicly available for use and fine-tuning. "Open-weight LLMs have transformed the AI landscape"
  • Reinforcement Learning from Human Feedback (RLHF): Training methodology where models are optimized via rewards derived from human preferences. "which has been promoted as a significant advancement in RLHF, credited with boosting the performance of Gemma-2-9b-it"
  • Simplified Preference Optimization (SimPO): A preference optimization method that improves alignment without requiring a reference model for rewards. "Simplified Preference Optimization (SimPO) technique~\cite{meng2024simposimplepreferenceoptimization}"
  • Sparse Autoencoder (SAE): An autoencoder variant promoting sparsity in latent features to yield interpretable concepts. "We employed the BatchTopK Sparse Autoencoder (SAE) training method with a latent dimensionality of 114,688, top-k = 100 and learning rate of 1e-4."
  • Structural probing: Training small classifiers on internal model states to test for encoded linguistic or structural properties. "Another approach to uncovering what a model encodes is structural probing~\cite{belinkov-etal-2017-neural, hewitt-manning-2019-structural, kantamneni2025are}, which involves training small classifiers (probes) to predict specific properties from the model's internal representations."
  • Top-k: Selecting the k largest-activation features per batch or layer during training/inference. "top-k = 100"