ModSleuth: Tracing LLM Dependencies

Updated 4 July 2026
  • ModSleuth is an agentic system that reconstructs recursive dependency graphs by formalizing direct and indirect LLM dependencies.
  • It uses a multi-stage pipeline—source gathering, entity discovery, identity resolution, and dependency construction—to extract evidence-grounded graphs.
  • The system enhances auditability by revealing complex, multi-hop relationships across models, datasets, and evaluation tools in modern LLM pipelines.

ModSleuth is an agentic system for reconstructing recursive dependency graphs of modern LLMs from public artifacts. It is designed around the premise that contemporary LLMs are built on rich, layered ecosystems of other models and datasets, which serve not just as initialization checkpoints or raw training corpora but also as generators, filters, judges, reward models, OCR systems, and evaluation tools. Where these dependencies are declared at all, the declarations are scattered across heterogeneous documentation and rarely represented as an explicit graph. ModSleuth addresses this by recursively reconstructing LLM dependency graphs with source-grounded evidence, using a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories (Adhikesaven et al., 10 Jun 2026).

1. Problem setting and motivation

Modern LLM pipelines are multi-stage, model-mediated systems. Upstream LMs generate synthetic instructions, solutions, and reasoning traces; classifiers, reward models, and LMs filter or transform corpora; LMs and reward models judge outputs for RLHF, RLVR, DPO, or A/B testing; LMs or OCR models preprocess data; and LMs serve as evaluators or judges on benchmarks. These upstream artifacts often depend in turn on other models and datasets, yielding recursive chains such as:

$$\text{OpenAI GPT-4} \rightarrow \text{synthetic dataset} \rightarrow \text{reward model} \rightarrow \text{RL policy} \rightarrow \text{downstream LLM}.$$

The dependency information is spread across technical reports and arXiv PDFs, model cards, dataset cards and data statements, GitHub repos, configuration YAMLs, training scripts, decontamination scripts, release blogs, and secondary docs. Different artifacts use inconsistent naming, dataset names may be internal, derived, or partial, and many crucial pipeline stages are mentioned only in code. The full dependency structure is therefore fragmented across heterogeneous public artifacts, and its complexity and recursive depth far exceed what humans can trace by hand (Adhikesaven et al., 10 Jun 2026).

Manual auditing does not scale to hundreds or thousands of artifacts and multi-hop depth, is brittle to naming inconsistencies and missing references, and leads to incomplete, flat summaries. Existing automated provenance work focuses on weight initialization and fine-tuning ancestry, dataset membership or URL-level provenance, or behavioral or weight-level lineage inference. Those approaches largely miss dependency roles that do not leave clear traces in weights, such as generation and rewriting pipelines, filtering, scoring, and decontamination, evaluation and judge models, and OCR and embedding models. ModSleuth is therefore introduced to systematically reconstruct declared model–model and model–data dependencies from public artifacts, turning heterogeneous documentation into explicit, evidence-grounded graphs (Adhikesaven et al., 10 Jun 2026).

2. Dependency semantics and graph formalization

The central semantic distinction in ModSleuth is between direct and indirect dependencies. A direct dependency is an upstream artifact that materially affects the target model’s weights, either directly or transitively. This includes initialization checkpoints such as trained_from, merged_from, and quantized_from; datasets the model is trained on through trained_on; and operations that transform or select training data, including generated_by, filtered_by, transformed_by, embedded_by, decontaminated_against, and composed_from. The recursive closure is explicit: if a model depends directly on an OCR or filter model, then the training data of that OCR or filter model are also direct dependencies of the downstream LLM. An indirect dependency is an upstream artifact that does not enter training, but substantially influences development decisions, such as evaluation models or benchmarks used to drive model selection, ablations, or release decisions, ablation variants used to choose architectures or training schedules, and explicitly adopted methodological recipes. Excluded are generic related work citations, baseline comparisons used purely for reporting, and vague “following common practice” cases where no concrete upstream artifact is identified (Adhikesaven et al., 10 Jun 2026).

Rather than predefining a rigid taxonomy of dependency types, ModSleuth models operations—pipeline events that connect artifacts. Each dependency is associated with an operation instance and captures an upstream artifact, a downstream artifact, a free-form role description, a coarse relation type, a dependency kind, and evidence consisting of URLs plus anchors into specific documents or snippets. This operation-centered design preserves the fact that the same upstream model may appear as a generator in one operation, a judge in another, and a filter or decontaminator in yet another (Adhikesaven et al., 10 Jun 2026).
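The edge fields listed above can be pictured as a small record type. The following is a minimal sketch, not ModSleuth's actual schema: the class names, field names, artifact names, the `evaluated_by` relation, and the example URL are all hypothetical, chosen only to show how one upstream model can carry several roles.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    url: str     # document the claim comes from
    anchor: str  # snippet or locator inside that document


@dataclass(frozen=True)
class DependencyEdge:
    subject: str          # downstream artifact
    obj: str              # upstream artifact
    relation_type: str    # coarse relation, e.g. "generated_by"
    dependency_kind: str  # "direct" or "indirect"
    role: str             # free-form role description
    evidence: tuple = ()  # tuple of Evidence records


def roles_by_upstream(edges):
    """Group relation types by the upstream artifact they point to."""
    roles = defaultdict(set)
    for edge in edges:
        roles[edge.obj].add(edge.relation_type)
    return dict(roles)


# Hypothetical artifacts: the same upstream model in three different roles.
edges = [
    DependencyEdge("toy-sft-data", "toy-lm", "generated_by", "direct",
                   "writes synthetic solutions",
                   (Evidence("https://example.org/card", "section 2"),)),
    DependencyEdge("toy-pretrain-data", "toy-lm", "filtered_by", "direct",
                   "scores documents for quality"),
    DependencyEdge("toy-release", "toy-lm", "evaluated_by", "indirect",
                   "judges outputs for model selection"),
]
roles = roles_by_upstream(edges)
```

Keeping the role on the edge rather than on the node is what lets `toy-lm` appear as generator, filter, and judge without duplicating the artifact.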

Artifacts include models, datasets, and evaluation benchmarks, as well as code repositories and configuration artifacts insofar as they reveal dependencies. The core representation is a directed dependency graph

$$G = (V, E),$$

where $V$ is the set of artifact nodes and $E$ is the set of edges representing dependency relationships. Edges point from subject (downstream) to object (upstream), so paths from a target to its ancestors follow edges forward. For a target $T$, multi-hop ancestors are defined as

$$\mathrm{Ancestors}(T) = \{ v \in V \mid \exists \text{ path } T \to v \}.$$

This makes recursive audit questions graph-native rather than document-native (Adhikesaven et al., 10 Jun 2026).
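As an illustration of such a graph-native query, the ancestor set can be computed by a breadth-first traversal over subject-to-object edges. This is a sketch under assumptions: the function name and the toy artifact names are invented here, and the hop count it reports is the minimum distance from the target, which may not match how the paper measures depth.

```python
from collections import deque


def ancestors_with_depth(target, edges):
    """Map each ancestor of target to its minimum hop distance.

    edges are (downstream, upstream) pairs, i.e. subject -> object.
    """
    upstream_of = {}
    for subject, obj in edges:
        upstream_of.setdefault(subject, []).append(obj)

    depth = {}
    queue = deque([(target, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in upstream_of.get(node, []):
            # Skip the target itself and anything already reached sooner.
            if parent != target and parent not in depth:
                depth[parent] = hops + 1
                queue.append((parent, hops + 1))
    return depth


# Illustrative edges only; they do not come from the reconstructed graph.
toy_edges = [
    ("downstream-llm", "rl-policy"),
    ("rl-policy", "reward-model"),
    ("reward-model", "synthetic-dataset"),
    ("synthetic-dataset", "gpt-4"),
    ("downstream-llm", "synthetic-dataset"),
]
depths = ancestors_with_depth("downstream-llm", toy_edges)
```

The key set of `depths` is $\mathrm{Ancestors}(T)$ for the toy graph, and the largest value is the number of hops to the farthest ancestor.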

Identity resolution is a core technical problem. References can be partially specified or ambiguous, so ModSleuth defines an identity lattice in which each artifact identity is a set of facets such as family, size, and stage. The lattice organizes a root node for underspecified references, intermediate nodes for partially resolved identities, and canonical leaves pinned to concrete releases or URLs. Dependency claims are associated with the most specific node justified by the evidence, preserving uncertainty rather than forcing premature merges and avoiding fragmentation of obviously identical artifacts referenced under slightly different names. For Hugging Face datasets and models, ModSleuth deterministically follows metadata such as parent and child relationships, subsets, derived variants, repo structure, tags, and license and provenance fields (Adhikesaven et al., 10 Jun 2026).
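One way to picture the identity lattice is to treat each identity as a partial assignment of facets, ordered by how many facets are pinned. The sketch below assumes that reading; the function names, facet values, and the `toy-lm` family are hypothetical, and the source does not state that ModSleuth stores identities as dictionaries.

```python
def generalizes(general, specific):
    """True if every facet pinned by `general` has the same value in `specific`."""
    return all(specific.get(facet) == value for facet, value in general.items())


def join(a, b):
    """Combine two partial identities, or return None if any shared facet conflicts."""
    for facet in a.keys() & b.keys():
        if a[facet] != b[facet]:
            return None
    return {**a, **b}


root = {}                                     # underspecified reference
partial = {"family": "toy-lm", "size": "7B"}  # partially resolved identity
leaf = {"family": "toy-lm", "size": "7B", "stage": "instruct"}  # canonical leaf

# Two mentions that agree on the family and add different facets.
merged = join(partial, {"family": "toy-lm", "stage": "instruct"})
conflict = join({"size": "7B"}, {"size": "70B"})
```

Attaching a claim to `partial` rather than `leaf` when the evidence names only family and size is what preserves uncertainty: the claim can later be refined with `join`, while conflicting mentions stay separate.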

3. Agentic architecture and recursive reconstruction

ModSleuth is implemented as a staged agentic system using Claude Code as the underlying tool-using LLM agent. Its design is organized around two principles: separate discovery from normalization, and require that every edge be evidence-grounded. The staged system includes a source-gathering agent, an entity discovery agent, an identity-resolution agent, a dependency-construction agent, and a reconciliation or audit agent. The source-gathering stage uses web browsing, repo crawling, and the Hugging Face API to collect technical reports, model cards, dataset cards, training repos, release blogs, documentation, and linked upstream artifacts, while restricting to official sources. Entity discovery extracts model and dataset mentions as they appear, including ambiguous names, together with evidence spans. Identity resolution maps each mention into the identity lattice by unifying family, size, stage, and variant for models, and names, subsets, and derived variants for datasets. Dependency construction then re-reads sources with resolved identities and emits operation-structured edges with subject, object, relation type, dependency kind, free-form role description, and evidence anchors. Reconciliation merges edges that represent the same relationship with different specificities and escalates inconsistencies that cannot be reconciled automatically to human review (Adhikesaven et al., 10 Jun 2026).

Recursive reconstruction is the mechanism that turns local statements into ecosystem-scale dependency graphs. Starting from a target release $T$, ModSleuth runs the full pipeline to recover the local neighborhood of dependencies around $T$. For each upstream artifact $U \in V$ that has public artifacts, it treats $U$ as a new tracing target and repeats the pipeline. The resulting nodes and edges are merged into a global graph, and recursion proceeds according to a search strategy such as breadth-first search for coverage, depth-first search, or beam search for deep tracing of particular chains. Stopping conditions include depth limits, exhaustion of public artifacts, or user-specified limits (Adhikesaven et al., 10 Jun 2026).
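The recursion described above can be sketched as a breadth-first expansion with a depth limit. Everything here is an assumption made for illustration: `trace_local` stands in for a full run of the staged pipeline on one artifact, and the `PUBLIC` dictionary is a hypothetical lookup table that replaces real documents.

```python
from collections import deque


def reconstruct(target, trace_local, max_depth):
    """Expand upstream artifacts breadth-first and merge results into one graph."""
    nodes, edges = {target}, set()
    expanded = set()
    queue = deque([(target, 0)])
    while queue:
        artifact, depth = queue.popleft()
        # Stop at the depth limit and never trace the same artifact twice.
        if artifact in expanded or depth >= max_depth:
            continue
        expanded.add(artifact)
        for upstream, relation in trace_local(artifact):
            edges.add((artifact, relation, upstream))
            nodes.add(upstream)
            queue.append((upstream, depth + 1))
    return nodes, edges


# Hypothetical stand-in for public artifacts: artifact -> [(upstream, relation)].
PUBLIC = {
    "toy-llm": [("toy-sft-data", "trained_on")],
    "toy-sft-data": [("toy-generator", "generated_by")],
    "toy-generator": [("toy-web-crawl", "trained_on")],
}


def trace_local(artifact):
    # Artifacts without public documentation yield no edges, ending that branch.
    return PUBLIC.get(artifact, [])


shallow_nodes, shallow_edges = reconstruct("toy-llm", trace_local, max_depth=2)
full_nodes, full_edges = reconstruct("toy-llm", trace_local, max_depth=10)
```

With `max_depth=2` the generator is recorded as a node but not traced further, so its own training data only appears in the deeper run. Swapping the queue for a stack or a scored frontier would give the depth-first and beam variants.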

Information extraction must operate across PDFs, Markdown, YAML, Python, and Hugging Face cards. ModSleuth uses prompting strategies that emphasize extraction of artifact mentions plus evidence text, conservative citation-backed claims, and avoidance of speculation beyond the document. It also applies heuristics for parsing YAML mix configs, inferring roles from code paths such as generate_..., filter_..., and reward_model, recognizing standard benchmark names, and batching documents topically. Each dependency edge stores evidence anchors and explanation text. A separate verification agent (Claude Sonnet 4.6 with web search, plus Opus 4.7 for the BFS-scope audit) re-checks each edge against its evidence and classifies it as verified, refuted, or unclear (Adhikesaven et al., 10 Jun 2026).
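The code-path heuristic can be illustrated with a few regular expressions over identifier names. This is a guess at the general shape only: the pattern list, the role labels, and the example identifiers are assumptions, and the source does not describe how ModSleuth implements the heuristic.

```python
import re

# Hypothetical patterns built from the prefixes mentioned in the text.
ROLE_PATTERNS = [
    (re.compile(r"^generate_"), "generator"),
    (re.compile(r"^filter_"), "filter"),
    (re.compile(r"reward_model"), "reward model"),
]


def infer_role(identifier):
    """Return the first role whose pattern matches, or None if nothing matches."""
    for pattern, role in ROLE_PATTERNS:
        if pattern.search(identifier):
            return role
    return None


# Made-up function names of the kind a training repo might contain.
guesses = {name: infer_role(name) for name in
           ["generate_solutions", "filter_low_quality", "load_reward_model", "tokenize"]}
```

A match like this is only a candidate role. In the system described above it would still need an evidence anchor and a pass through verification before counting as an edge.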

4. Graph structure and empirical application

ModSleuth was applied to four public-artifact-rich releases: Olmo 3, Nemotron 3 Super, DR Tulu, and SmolLM3. These targets were chosen because they expose enough code, cards, and reports to allow meaningful tracing. The recovered graph over the studied targets contains 2,526 artifact nodes, 9,112 dependency edges, and 36,187 evidence anchors, with a maximum depth of 8 hops, reached by some Olmo 3 variants. Under the unbounded per-target scope, ModSleuth recovers 1,060 source-verified dependencies; under BFS reachability, it yields 1,654 unique verified edges (Adhikesaven et al., 10 Jun 2026).

The graph mixes model nodes and dataset nodes. Model nodes include LLMs, OCR models, classifiers, reward models, embedding models, and judge models; dataset nodes include raw crawls, curated subsets, synthetic datasets, benchmarks, and train, validation, and test splits. Each node stores a canonical identifier, organization, type, identity facets, and source metadata. Each edge stores the relation type, dependency kind, role description, and evidence. The coarse relation categories are training inputs, upstream operations on training data, weight-level lineage, evaluation or ablation roles, and methodological or audit influence (Adhikesaven et al., 10 Jun 2026).

Role distributions show that most model influence flows via data operations rather than weight inheritance. Of the 1,654 verified edges, direct dependencies account for 1,191 (72.0%) and indirect dependencies for 463 (28.0%). Within direct dependencies, training inputs contribute 813 edges (49.2%), upstream operations on training data contribute 350 edges (21.2%), and weight-level lineage contributes 28 edges (1.7%), with all percentages taken over the full 1,654 edges. This suggests that weight ancestry alone captures only a small fraction of the dependency structure that matters for audit (Adhikesaven et al., 10 Jun 2026).
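The reported shares can be checked directly from the edge counts. The short calculation below uses only the numbers quoted above and assumes the denominator is the sum of the direct and indirect counts; the variable names are illustrative.

```python
# Edge counts as quoted in the text.
direct = {
    "training inputs": 813,
    "upstream operations on training data": 350,
    "weight-level lineage": 28,
}
indirect_total = 463

direct_total = sum(direct.values())
total = direct_total + indirect_total

# Shares of all verified edges, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in direct.items()}
direct_share = round(100 * direct_total / total, 1)
indirect_share = round(100 * indirect_total / total, 1)
```

The three direct categories sum to 1,191 and the grand total is 1,654, which matches the BFS-reachability edge count, so the quoted percentages are consistent with a shared denominator.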

Per-target graph scale varies substantially. Olmo 3 Instruct has 409 ancestors and maximum depth 8; Olmo 3 Think has 329 ancestors and depth 8; Olmo 3 Base has 181 ancestors and depth 4; Nemotron-3-Super has 465 ancestors and depth 5; Nemotron-3-Nano-Base has 319 ancestors and depth 5; DR Tulu has 17 ancestors and depth 2; and SmolLM3-Base has 78 ancestors and depth 3. External dependencies dominate for every target, with 75–82% of edges classified as external rather than internal. Olmo 3 has 119 internal versus 362 external edges, Nemotron 3 has 243 internal versus 780 external edges, DR Tulu has 11 internal versus 51 external edges, and SmolLM3 has 31 internal versus 96 external edges (Adhikesaven et al., 10 Jun 2026).

5. Empirical findings from the reconstructed graphs

The reconstructed graphs reveal multi-hop license obligations and cross-family reuse that are difficult to detect from final model cards alone. One example is the chain connecting SmolLM3 to Llama-generated annotations: SmolLM3 lists FineMath as a pretraining source; FineMath is filtered by finemath-classifier; the classifier card reveals training on educational-value annotations generated by Llama-3-70B-Instruct; and the resulting chain is

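$$\text{SmolLM3} \rightarrow \text{FineMath} \rightarrow \text{finemath-classifier} \rightarrow \text{educational-value annotations} \rightarrow \text{Llama-3-70B-Instruct}.$$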

The graphs also show high centrality for a small set of model families on the training side: Qwen appears with 167 downstream artifacts via 552 edges, Llama with 157 artifacts via 264 edges, GPT-4 with 65 artifacts via 125 edges, and DeepSeek with 81 artifacts via 162 edges. A plausible implication is that license propagation and ecosystem concentration are structurally coupled (Adhikesaven et al., 10 Jun 2026).

ModSleuth also surfaces repeated train–evaluation coupling. In Olmo 3, some RLVR prompts in IF-RLVR have their constraint templates drawn from IFEval and IFBench-Train, while IFEval is also a key evaluation benchmark. In Olmo 2 and Olmo 3, Dolmino-100 mixes explicitly include the GSM8K train split plus derivative TinyGSM-style expansions, while GSM8K is also a central evaluation benchmark. In Nemotron-3-Super, SWE-Bench-Verified-derived data enters RL training through Nemotron-RL-Agentic-SWE-Pivot-v1, while SWE-Bench-Verified is also reported as a headline evaluation. Aggregated over the graph, benchmarks often appear on both the training side and the evaluation side: GSM8K has 25 evaluation edges and 43 training edges, MMLU has 39 evaluation and 14 training, GPQA has 45 evaluation and 9 training, MATH has 31 evaluation and 30 training, IFEval has 27 evaluation and 18 training, and SWE-bench Verified has 2 evaluation and 10 training. The paper’s interpretation is that these patterns are almost impossible to see from single model papers and emerge only via ecosystem-level graph analysis (Adhikesaven et al., 10 Jun 2026).
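To make the aggregate counts easier to compare, the sketch below tabulates them and flags benchmarks with more training-side than evaluation-side edges. The counts are those quoted above; the choice of "training edges exceed evaluation edges" as a flag is an illustrative assumption, not a criterion taken from the paper.

```python
# benchmark -> (evaluation edges, training edges), as quoted in the text.
edge_counts = {
    "GSM8K": (25, 43),
    "MMLU": (39, 14),
    "GPQA": (45, 9),
    "MATH": (31, 30),
    "IFEval": (27, 18),
    "SWE-bench Verified": (2, 10),
}

# Fraction of each benchmark's edges that sit on the training side.
train_share = {
    name: round(train / (evaluation + train), 2)
    for name, (evaluation, train) in edge_counts.items()
}

# Benchmarks that appear more often as training inputs than as evaluations.
training_heavy = [
    name for name, (evaluation, train) in edge_counts.items() if train > evaluation
]
```

Under this flag GSM8K and SWE-bench Verified stand out, while MATH is almost evenly split between the two sides.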

A further class of findings concerns discrepancies between released artifacts and training-time artifacts. In Nemotron training blends, some YAML blends include placeholder rows for contributions like DAPO-Math or Skywork-OR1, and the actual datasets are only resolved via a fill_placeholders.py script. In SmolLM2 and SmolTalk, model cards describe SmolTalk SFT data at a high level, while code shows that the summarization branch uses CNN/DailyMail processed via Qwen2.5-72B-Instruct, and that Magpie-Ultra instructions are generated by Llama-3.1-405B-Instruct and filtered by Llama-3.1-8B-Instruct. In Olmo 3’s DPO pipeline, file naming conventions in training scripts encode upstream generator families such as GPT-3.5 and GPT-4o, multi-turn truncation, deduplication, and topic filtering. In Dolma 3 reproduction datasets, complete and redacted variants differ, and later reproduction packages for Olmo-3-7B-1025 were modified by replacing some PDFs with [REMOVED], affecting exact reproducibility (Adhikesaven et al., 10 Jun 2026).

Because ModSleuth recursively explores upstream artifacts, it also uncovers hidden multi-hop dependencies not mentioned in downstream documents. DR Tulu’s main paper states that its training data were generated by OpenAI models, while an appendix mentions Ai2 ScholarQA trajectory data; ScholarQA’s own documentation then reveals generation by Claude Sonnet 3.7, yielding the chain

$$\text{DR Tulu} \rightarrow \text{Ai2 ScholarQA trajectory data} \rightarrow \text{Claude Sonnet 3.7}.$$

Similarly, Olmo 3 RL-Zero cards do not reference Qwen dependencies, but midtraining pipeline documentation reveals the use of Qwen2.5-Coder-32B-Instruct for code transformation in earlier stages. The authors note that some dependencies discovered by ModSleuth were not known, or at least not tracked, even by original developers (Adhikesaven et al., 10 Jun 2026).

6. Evaluation, limitations, and prospective uses

Because there is no ground-truth full graph, ModSleuth is evaluated through verified edge counts. Candidate relationships from ModSleuth and baseline systems are pooled, then automatically verified against cited evidence and optional independent corroboration by Claude Sonnet 4.6 with web search, with Opus 4.7 used for the BFS-scope audit. Candidates are classified as verified, refuted, or unclear, and only verified edges are counted. Since ModSleuth builds a merged global graph, the paper defines three attribution scopes: depth-1, unbounded, and BFS reachability. Under these scopes, ModSleuth obtains 484 verified edges at depth-1, 1,060 verified edges under the unbounded scope, and 1,654 forward-reachable verified edges under BFS reachability (Adhikesaven et al., 10 Jun 2026).
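The pooled scoring protocol can be sketched as follows. The system names, candidate edges, and verdicts are hypothetical and hard-coded; in the evaluation described above the verdicts come from an LLM verifier, which this sketch does not model.

```python
def verified_counts(candidates, verdicts):
    """Count unique verified edges per system; refuted and unclear edges score nothing."""
    return {
        system: sum(1 for edge in set(edges) if verdicts.get(edge) == "verified")
        for system, edges in candidates.items()
    }


# Hypothetical candidate edges as (subject, relation, object) triples.
e1 = ("toy-llm", "trained_on", "toy-corpus")
e2 = ("toy-corpus", "filtered_by", "toy-filter")
e3 = ("toy-llm", "trained_from", "toy-base")
e4 = ("toy-llm", "generated_by", "toy-other")

candidates = {
    "staged-system": [e1, e2, e3, e1],  # the duplicate is counted once
    "single-prompt": [e1, e4],
}

# Each pooled edge is judged once, regardless of which system proposed it.
pool = set().union(*candidates.values())
verdicts = {e1: "verified", e2: "verified", e3: "unclear", e4: "refuted"}

scores = verified_counts(candidates, verdicts)
```

Judging the pool once means an edge proposed by two systems receives the same verdict for both, so differences in score reflect what each system found rather than verifier variance.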

The baseline comparison is intended to test whether a strong general-purpose LLM with a detailed prompt can substitute for the staged system. The reported totals over the four targets are 314 verified edges for GPT-5.5 Pro, 283 for GPT-5.4 Pro, 275 for Claude Code single prompt, and 171 for ChatGPT Deep Research. ModSleuth’s 484 depth-1 verified edges are reported as 54% more than the best baseline, while 1,060 unbounded verified edges are reported as 3× the strongest baseline. The paper’s conclusion is that structured task decomposition, the identity lattice, and evidence-grounding are crucial, and that simply giving a strong LLM a long prompt is not sufficient (Adhikesaven et al., 10 Jun 2026).
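The two headline ratios follow from the quoted totals. The calculation below uses only those numbers; the variable names are illustrative.

```python
# Verified-edge totals over the four targets, as quoted in the text.
baselines = {
    "GPT-5.5 Pro": 314,
    "GPT-5.4 Pro": 283,
    "Claude Code single prompt": 275,
    "ChatGPT Deep Research": 171,
}
modsleuth_depth1 = 484
modsleuth_unbounded = 1060

best_name = max(baselines, key=baselines.get)
best = baselines[best_name]

# Relative gain at depth-1, in percent, and the unbounded-scope ratio.
depth1_gain_pct = round(100 * (modsleuth_depth1 / best - 1))
unbounded_ratio = round(modsleuth_unbounded / best, 2)
```

The depth-1 gain over GPT-5.5 Pro rounds to 54%, and the unbounded ratio is about 3.38, which the text reports as 3×.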

The system is explicitly limited to declared rather than true dependencies. Undocumented or proprietary dependencies are invisible, so the resulting graphs are an evidence-grounded lower bound on the true dependency structure. There are no absolute recall metrics because no ground-truth graphs exist, and evaluation is restricted to well-documented, relatively open models. The shared modeling stack across extraction and verification may introduce shared biases, and automated-only evaluation does not test interactive expert steering. The paper also notes gray areas in dependency semantics, especially around how strong evaluation influence must be to count as an indirect dependency and how to treat architecture inspiration versus detailed methodological copying (Adhikesaven et al., 10 Jun 2026).

Despite these limitations, ModSleuth’s stated uses are broad. Dependency graphs turn audit questions into concrete graph queries, support license and terms-of-use analysis when obligations propagate through synthetic data, filtering, and judge models, help detect train–evaluation coupling and circular evaluation, and make code-level differences between training-time artifacts and released datasets traceable. The released resources include code at https://github.com/cal-data-audit/modsleuth, an interactive demo at https://modsleuth.cal-data-audit.org, and the reconstructed dependency graphs for the studied releases. The paper further suggests standardized dependency schemas in model cards and dataset cards, broader ecosystem mapping to other model families as public artifacts become available, cross-family verification, and integration with parameter-level provenance as future directions (Adhikesaven et al., 10 Jun 2026).
