
Multi-LLM Trace Dataset Insights

Updated 2 December 2025
  • Multi-LLM Trace Dataset is a comprehensive log of interactions among diverse language models that supports the study of synthetic data propagation, convergence, and collapse phenomena.
  • It records detailed metadata per agent—including inputs, retrieved documents, and ensemble outputs—to facilitate reproducible experiments and dynamic analysis.
  • Advanced metrics, such as the Frobenius norm of embedding distances, quantify network-level effects, providing actionable insights for modeling and mitigation strategies.

A Multi-LLM Trace Dataset is a systematically logged collection of temporal and structural interactions among multiple LLMs, typically designed for rigorous empirical study of cross-model behaviors, synthetic data propagation, information collapse or divergence, and operational workload dynamics. Such datasets capture, at scale, both the input context and output responses for each model over time, often under controlled simulation or real-world service conditions, thereby enabling the detailed analysis of model convergence, robustness, performance, and vulnerabilities at the LLM network level.

1. Conceptual Underpinnings: Multi-LLM Traces

In multi-LLM trace datasets, a “trace” refers to the complete record of LLM agent actions during a discrete time interval or sequence. Traces generally include the input prompt, retrieval context or corpus input (if using a retrieval-augmented architecture), and all generated outputs per agent. The core scientific motivation is to observe and quantify network effects, such as model collapse, mutual reinforcement, or divergence, that arise when LLMs leverage overlapping synthetic knowledge pools, or when their outputs are recursively fed back as new training or generation data. While single-LLM traces isolate idiosyncratic model dynamics, multi-LLM traces uniquely support the mathematical and empirical study of convergence and collapse phenomena on the network or ecosystem scale, such as in the LLM Web Dynamics (LWD) framework (Wang et al., 26 May 2025).

2. Dataset Architectures and Logging Schemas

The design of a multi-LLM trace dataset explicitly encodes all generative iterations across an agent network with precise metadata and context retrieval. In LWD, three heterogeneous pre-trained models (Meta’s Llama-3.1-8B-Instruct, DeepSeek’s deepseek-LLM-7b-chat, and Mistral’s Mistral-7B-Instruct-v0.3) are orchestrated as agents. At each time step t, every agent draws uniformly at random a fraction β of the current synthetic-plus-real corpus A(t) (with β = 0.5), conditions on this context with a fixed prompt, and generates an ensemble of L = 40 outputs.

A typical schema for each logged record is:

Field               | Data Type     | Example/Description
timestamp           | Integer       | t in [0, T]
model_id            | String        | "llama-3.1-8B-Instruct"
prompt              | String        | The experiment’s fixed prompt
retrieved_docs      | Array<Object> | Each: {"doc_id": int, "text": string, "score": null}
generated_responses | Array<String> | L model outputs for the given context
iteration_count     | Integer       | Equals L
pool_size_before    | Integer       | |A(t)|
pool_size_after     | Integer       | |A(t+1)| = |A(t)| + n

All records are stored as newline-delimited JSON (JSONL); flattened CSVs are also provided. Data releases contain both global traces (entire session) and per-model traces. Associated enrichments include initial human seed text (as TXT), high-level README (describing field types and hyperparameters), and code snippets for loading/analysis (Wang et al., 26 May 2025).
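
As a minimal sketch of working with the JSONL release, the records described above can be flattened into a pandas DataFrame; the file name below is a hypothetical placeholder, not the release's actual layout:

```python
import json

import pandas as pd


def load_trace(path):
    """Load newline-delimited JSON trace records into a flat DataFrame."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return pd.DataFrame.from_records(records)


# Hypothetical file name; consult the release README for the real one.
# df = load_trace("lwd_trace.jsonl")
# df[["timestamp", "model_id", "pool_size_before", "pool_size_after"]].head()
```

Because each record nests arrays (retrieved_docs, generated_responses), a per-output analysis would additionally call `df.explode("generated_responses")`.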

3. Simulation Workflow and Size Statistics

LLM Web Dynamics traces begin with a seed corpus A(0) of 20 human-written sentences from the Crypto_Semantic_News dataset, focused on “future prospects of Bitcoin.” During each round t = 1, …, 60 (with n = 3 agents in total), the workflow steps are as follows:

  1. Each LLM agent uniformly samples k_t = ⌊β·|A(t)|⌋ context passages.
  2. Each agent is sampled L = 40 times with this context and the fixed prompt, generating a mini-distribution of outputs.
  3. One output per model is appended to the corpus, so every iteration increases |A(t)| by n.
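
The three steps above can be sketched as a corpus-growth loop. The `generate` stub below stands in for the real LLM calls, so this reproduces only the pool dynamics, not actual model behavior:

```python
import math
import random


def simulate_lwd(seed_corpus, n_agents=3, T=60, beta=0.5, L=40,
                 generate=None, rng=None):
    """Sketch of the LWD corpus-growth loop with a stubbed generator."""
    rng = rng or random.Random(0)
    if generate is None:
        # Placeholder for a real LLM call conditioned on sampled context.
        generate = lambda agent, ctx: f"agent{agent}-synthetic-output"
    pool = list(seed_corpus)
    pool_sizes = [len(pool)]
    for t in range(1, T + 1):
        appended = []
        for agent in range(n_agents):
            # 1. Uniformly sample k_t = floor(beta * |A(t)|) context passages.
            k = math.floor(beta * len(pool))
            context = rng.sample(pool, k)
            # 2. Draw an ensemble of L outputs for this context + fixed prompt.
            ensemble = [generate(agent, context) for _ in range(L)]
            # 3. Keep one output per model for corpus ingestion.
            appended.append(rng.choice(ensemble))
        pool.extend(appended)  # |A(t+1)| = |A(t)| + n
        pool_sizes.append(len(pool))
    return pool, pool_sizes
```

With the defaults above, the pool grows from 20 seed sentences to 20 + 60·3 = 200 entries over the session.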

The complete dataset comprises:

  • N_records = (T + 1)·n = 61·3 = 183 trace records
  • N_outputs = 183·40 = 7,320 unique output sentences
  • Approximate total token count ≈ 146,400 (~20 tokens/response)
  • Roughly 1,200 unique outputs after deduplication
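
The first three counts follow directly from the hyperparameters (T = 60 rounds plus the seed step, n = 3 agents, L = 40 samples, ~20 tokens per response):

```python
# Sanity-check the reported dataset sizes against the hyperparameters.
T, n, L = 60, 3, 40          # rounds, agents, ensemble size
tokens_per_response = 20     # approximate figure quoted in the release

n_records = (T + 1) * n      # one record per agent per step, t = 0..T
n_outputs = n_records * L    # each record carries an L-output ensemble
approx_tokens = n_outputs * tokens_per_response

print(n_records, n_outputs, approx_tokens)  # 183 7320 146400
```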

Example schema entries are provided in the LWD release (Wang et al., 26 May 2025).

4. Quantitative Metrics and Theoretical Framework

Multi-agent model collapse is quantified by computing the pairwise Frobenius norm of embedding-distance matrices across all agents, applying an off-the-shelf embedding function φ(·) ∈ R^768 (nomic-embed-v1.5). The distances for timestep t are:

X_i^{(t)} = \frac{1}{L}\sum_{l=1}^{L}\varphi(f_i^{(t)}(q))_l,\quad D_{ij}^{(t)} = \|X_i^{(t)} - X_j^{(t)}\|_2,\quad \|D^{(t)}\|_F = \left(\sum_{i,j} (D_{ij}^{(t)})^2\right)^{1/2}

Empirical results show that ‖D^(t)‖_F decays and converges to a constant c ≥ 0, supporting the conjecture that repeated synthetic reuse induces network-level collapse. The LWD framework draws theoretical parallels with interacting Gaussian mixture models, wherein the analogous mixture-weight distance also converges toward zero analytically (Wang et al., 26 May 2025).
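
Given precomputed output embeddings (the embedding model itself is not loaded here), the per-timestep norm above can be computed with NumPy as a sketch:

```python
import numpy as np


def collapse_norm(ensemble_embeddings):
    """Frobenius norm of the pairwise agent-distance matrix at one timestep.

    `ensemble_embeddings` has shape (n_agents, L, d): L output embeddings
    per agent, of dimension d (d = 768 for nomic-embed-v1.5).
    """
    X = ensemble_embeddings.mean(axis=1)        # X_i: mean embedding per agent
    diffs = X[:, None, :] - X[None, :, :]       # X_i - X_j for all pairs
    D = np.linalg.norm(diffs, axis=-1)          # D_ij = ||X_i - X_j||_2
    return float(np.linalg.norm(D, ord="fro"))  # ||D^(t)||_F
```

Identical agent-mean embeddings yield a norm of zero (full collapse); plotting this value over t reproduces the collapse curve described below.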

5. Downstream Analyses, Robustness, and Interventions

The dataset’s modular format supports downstream studies:

  • Collapse progression: Norms ‖D^(t)‖_F can be visualized as a function of t to observe the collapse trajectory.
  • Hallucination tracking: By embedding real initial sentences and subsequent synthetic generations, divergence from factual content can be quantified.
  • Robustness testing: Variations include injecting topic-irrelevant or adversarial corpus seed sentences, or varying β and the seed size.
  • Intervention design: Synthetic corpus growth policies can be altered (e.g., golden-ratio weighting) and collapse mitigations evaluated.

Data can be loaded for analysis in standard ecosystems such as Python/pandas. Provided code allows computation and visualization of collapse curves using embeddings and pairwise distances.

6. Comparison to Other Multi-LLM Trace Datasets

Datasets such as VaxGuard (Ahmad et al., 12 Mar 2025) and BurstGPT (Wang et al., 31 Jan 2024) also implement multi-agent trace paradigms but target distinct domains and metrics. VaxGuard traces multi-model LLM outputs for vaccine misinformation, logging 60,000 trace samples with explicit “role” annotations and detailed detection benchmarking across model pairs, input types, and context sizes. BurstGPT, the largest available LLM workload trace, records over 5.29 million traces from real-world ChatGPT/GPT-4 production traffic with fine-grained timing, token, workload, and failure data. Comparative features are summarized below:

Dataset  | Focus                 | Trace Size
LWD      | Synthetic collapse    | 183 records, 7k+ outputs
VaxGuard | Misinformation, roles | 60k records
BurstGPT | Workload, QoS         | 5.29M records

LWD is uniquely suited for research on emergent convergence, collapse, and feedback-loop effects in cross-LLM interaction, as it records retrieval contexts, agents’ output distributions, and their dynamic synthetic ingestion cycle.

7. Access, Reproducibility, and Extensions

The LWD Multi-LLM Trace Dataset is released under open licenses via both GitHub and HuggingFace, per project documentation (Wang et al., 26 May 2025). Public releases feature all schema definitions, initial corpus data, and reproducible code for data loading, collapse metric computation, and visualization. Downstream users can rerun the simulation with different seed corpora or agent ensembles, vary RAG and response sampling policies, or extend context windows and prompts to generalize beyond the initial experiment. By capturing the dynamic, recursive feedback loop inherent in synthetic LLM web interactions, multi-LLM trace datasets serve as a cornerstone for advancing empirical, theoretical, and systems-level research on generative AI.
