SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions

Published 30 Apr 2026 in cs.IR | (2604.27878v1)

Abstract: User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator ''human-likeness'' check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity ($r{=}{+}0.09$, $n{=}48$), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal ($|r|{=}0.43$ and $0.40$, $p{\leq}0.005$). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.


Authors (1)

Saber Zerhoudi

Summary

The paper introduces a unified evaluation framework that distinguishes between behavioral realism and tester reliability in interactive IR simulators.
It implements a canonical session schema, dataset adapters, and executable benchmark protocols to standardize evaluation across diverse IR datasets.
Empirical findings demonstrate that click-depth divergence and embedding-based metrics predict ranking validity more accurately than classifier-based tests.

SimEval-IR: Standardized Evaluation of User Simulators and Search Sessions

Motivation and Problem Formulation

SimEval-IR addresses a persistent methodological gap in interactive IR: although user simulators are critical for developing and evaluating retrieval systems, existing evaluation practices conflate behavioral realism with tester reliability and offer no unified standard for measuring either. Simulation-centric studies report plausible session statistics, such as click-depth histograms and classifier-based Turing tests, to claim realism, but the system rankings a simulator induces often diverge from those derived from human evaluators, as illustrated conceptually in Figure 1.

Figure 1: Simulator outputs can appear realistic on behavioral metrics while producing invalid system rankings; SimEval-IR standardizes both evaluation tasks.

The toolkit distinguishes between: (i) behavioral realism (distributional match to real interaction logs); and (ii) tester reliability (agreement of induced system rankings with trusted gold standards). The central thesis is that behavioral realism is not necessarily predictive of ranking validity, an assertion framed quantitatively and empirically throughout.

Toolkit Architecture and Benchmarking Protocols

SimEval-IR operationalizes its methodological stance via three key contributions:

Canonical Session Schema: Sessions are encoded as ordered event sequences (QUERY, SERP_VIEW, CLICK, DWELL, CONV_USER, CONV_SYSTEM) supporting both session search and conversational IR, with versioned schema validation and explicit loss accounting (Figure 2).

Figure 2: SimEval-IR workflow, showing adapters, canonical schema conversion, and three benchmark families: behavioral realism, tester reliability, and realism–reliability linkage.

Dataset Adapters: Preprocessing scripts convert diverse logs (AOL, TREC Session, ORCAS, MIRACL/zh, conversational corpora) into the canonical schema, with provenance and dropped-field manifest audits. Sessionization thresholds and feature admissibility are standardized.
Executable Benchmarks: SimEval-IR exposes three benchmark families:
- B1 (Behavioral Realism): Computes distributional and sequential metrics, embedding distances, and classifier-based discrimination—with mandatory baseline audits.
- B2 (Tester Reliability): Correlates simulator-induced ranking with qrels-based rankings; aggregates via RATE-style reliability estimation and leave-one-out sensitivity.
- B3 (Realism–Reliability Linkage): Quantifies the relationship between realism metrics and ranking validity.
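As a rough illustration of the schema and adapter ideas above, the following minimal sketch encodes a session as an ordered list of typed events and records which source fields an adapter had to drop. Only the six event type names come from the description above; the class names, the `KNOWN_FIELDS` set, the `schema_version` value, the row layout, and the choice to order events by timestamp are all assumptions made for illustration, not the toolkit's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class EventType(Enum):
    # Event types named in the schema description.
    QUERY = "QUERY"
    SERP_VIEW = "SERP_VIEW"
    CLICK = "CLICK"
    DWELL = "DWELL"
    CONV_USER = "CONV_USER"
    CONV_SYSTEM = "CONV_SYSTEM"


@dataclass
class Event:
    type: EventType
    timestamp: float
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Session:
    session_id: str
    events: list[Event]
    schema_version: str = "0.1"  # hypothetical version string
    dropped_fields: list[str] = field(default_factory=list)  # loss accounting

    def validate(self) -> None:
        # Assumption: "ordered" means non-decreasing timestamps.
        times = [e.timestamp for e in self.events]
        if times != sorted(times):
            raise ValueError("events must be in chronological order")


# Hypothetical set of payload fields the canonical schema can represent.
KNOWN_FIELDS = {"query", "doc_id", "rank", "seconds", "text"}


def adapt_rows(session_id: str, rows: list[dict[str, Any]]) -> Session:
    """Convert raw log rows into a Session, recording every field that is lost."""
    events, dropped = [], set()
    for row in rows:
        payload = {}
        for key, value in row.items():
            if key in ("type", "time"):
                continue
            if key in KNOWN_FIELDS:
                payload[key] = value
            else:
                dropped.add(key)  # not representable: note it instead of hiding it
        events.append(Event(EventType(row["type"]), float(row["time"]), payload))
    session = Session(session_id, events, dropped_fields=sorted(dropped))
    session.validate()
    return session


rows = [
    {"type": "QUERY", "time": 0, "query": "jaguar", "user_agent": "x"},
    {"type": "CLICK", "time": 3.5, "doc_id": "d1", "rank": 2},
]
session = adapt_rows("s1", rows)
print(session.dropped_fields)
```

Keeping the dropped-field list on the session itself means any downstream realism score can be read alongside a record of what the adapter discarded.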

Metric Design and Leakage Controls

Behavioral realism metrics include marginal JS divergence, Wasserstein distance, sequential bigram divergence, normalized Levenshtein distance, and the representation-level Fréchet distance and maximum mean discrepancy (MMD), both computed on session embeddings. Classifier-based realism is reported alongside four baselines: metadata-only (schema features), structural-only (sequence statistics), masked-feature (session length ablated), and permutation. A classifier result is interpreted as a behavioral signal rather than a pipeline artifact only when these baselines stay at chance (Figure 3).
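To make the marginal metrics concrete, here is a minimal sketch of JS divergence and 1-D Wasserstein distance between a real and a simulated click-depth histogram. The function names and the toy depth lists are illustrative assumptions; the toolkit's own implementation may differ in smoothing, binning, and logarithm base.

```python
import math
from collections import Counter


def histogram(values, support):
    """Normalized histogram of `values` over a fixed, ordered support."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(v, 0) / total for v in support]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero mass in x contribute nothing; y is the mixture,
        # so it is positive wherever x is positive.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """W1 between two histograms on the same support with unit spacing."""
    total, cdf_gap = 0.0, 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b       # running difference of the two CDFs
        total += abs(cdf_gap)
    return total


# Toy click depths (deepest clicked rank per query); not data from the paper.
real_depths = [1, 1, 2, 3]
sim_depths = [1, 1, 1, 1]
support = [1, 2, 3]
p, q = histogram(real_depths, support), histogram(sim_depths, support)
print(js_divergence(p, q), wasserstein_1d(p, q))
```

Both functions take histograms on a shared support, so the same pair can be reused for other marginals such as dwell-time bins or queries per session.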

Figure 3: Classifier realism interpretation; main AUC is valid only when artifact-only and permutation baselines stay at chance.
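The interpretation rule in Figure 3 can be sketched as a simple gate: the main discriminator AUC is read as behavioral evidence only if every artifact-only and permutation baseline sits near 0.5. The `tolerance` threshold, the baseline names used as dictionary keys, and both function names are assumptions for illustration; the source does not state how "at chance" is tested.

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney win rate of positives over negatives (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def interpret_classifier(main_auc, baseline_aucs, tolerance=0.05):
    """Return (verdict, leaking_baselines).

    `tolerance` is a hypothetical band around 0.5 that counts as chance.
    """
    leaking = {
        name: value
        for name, value in baseline_aucs.items()
        if abs(value - 0.5) > tolerance
    }
    if leaking:
        # Some baseline separates real from simulated sessions without
        # behavioral features, so the main AUC may reflect pipeline artifacts.
        return "artifact", leaking
    return "behavioral", {}


verdict, leaking = interpret_classifier(
    main_auc=0.82,
    baseline_aucs={"metadata_only": 0.51, "structural_only": 0.49, "permutation": 0.50},
)
print(verdict, leaking)
```

A fixed tolerance is the crudest possible chance test; a confidence interval on each baseline AUC would be a natural replacement.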

Tester reliability metrics compute system-wise Kendall’s $\tau$, Spearman’s $\rho$, Pearson’s $r$, and $\tau_{AP}$, each bootstrapped for uncertainty. The RATE protocol iteratively adjusts tester weights based on agreement with the consensus ranking, which makes the aggregate robust to noisy testers.
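As a hedged illustration of the system-wise agreement step, the sketch below computes Kendall's $\tau$ (the tau-a variant, which does not correct for ties) between simulator-induced and qrels-based system scores, and bootstraps a percentile interval. Resampling topics is an assumption: the source says only that the metrics are bootstrapped. All names and toy scores are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two paired score lists (tied pairs count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def bootstrap_tau(sim_scores, gold_scores, n_boot=1000, seed=0):
    """Percentile interval for tau, resampling topics with replacement.

    Both arguments map system name -> list of per-topic scores (same topic order).
    """
    rng = random.Random(seed)
    systems = sorted(sim_scores)
    n_topics = len(sim_scores[systems[0]])
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_topics) for _ in range(n_topics)]
        sim_means = [sum(sim_scores[s][t] for t in idx) / n_topics for s in systems]
        gold_means = [sum(gold_scores[s][t] for t in idx) / n_topics for s in systems]
        taus.append(kendall_tau(sim_means, gold_means))
    taus.sort()
    return taus[int(0.025 * n_boot)], taus[int(0.975 * n_boot) - 1]


# Toy per-topic scores for three systems; not results from the paper.
sim = {"A": [0.9, 0.8, 0.7], "B": [0.5, 0.6, 0.4], "C": [0.1, 0.2, 0.3]}
gold = {"A": [0.8, 0.9, 0.85], "B": [0.5, 0.55, 0.45], "C": [0.2, 0.1, 0.3]}
low, high = bootstrap_tau(sim, gold, n_boot=200)
print(low, high)
```

The RATE-style weighting step is deliberately left out, since the description above does not specify its update rule.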

Empirical Findings: Baseline Results Across Datasets

Behavioral realism (B1) indicates that simulator quality is dataset-sensitive: SimIIR-PBM matches TREC Session logs (AUC ≈ 0.577), but SimIIR-DBN aligns better with AOL-IA (AUC ≈ 0.822), reflecting the PBM vs DBN assumptions on independence and cascades. No single simulator optimizes realism across corpora, underscoring the necessity of cross-corpus evaluation.

Tester reliability (B2) reveals that relevance-aware simulators (PBM, DBN, LLM-sim) achieve $\tau \geq 0.77$ in nearly all cases; position-only heuristics fail except in small-pool Chinese settings, where the candidate relevance distribution distorts click bias. Aggregated rankings are robust at the top, but middle ranks are sensitive to which simulators are included.

Crucially, realism metrics (B3) correlate only moderately with tester reliability. Click-depth JS divergence is the strongest predictor ($r = -0.43$ pooled; $r = -0.75$ on AOL-IA, $p \leq 0.002$), while Fréchet distance also signals ranking validity (Figure 4). The classifier discriminator, despite its mainstream appeal, has negligible pooled correlation ($r = +0.09$), reaffirming that behavioral plausibility does not imply valid system ranking.

Figure 4: Realism versus reliability plots; heuristic simulators appear behaviorally realistic yet induce untrustworthy rankings, illustrating the disconnect.
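The linkage analysis reduces to correlating one realism score with one reliability score per configuration. A minimal sketch of that computation follows, assuming each point pairs a click-depth JS divergence with a Kendall's $\tau$ for the same simulator and dataset; the numbers are invented for illustration and are not the paper's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (realism, reliability) pairs, one per simulator configuration.
click_depth_js = [0.05, 0.10, 0.20, 0.35, 0.50]
ranking_tau = [0.85, 0.80, 0.78, 0.60, 0.55]

r = pearson_r(click_depth_js, ranking_tau)
print(r)
```

A negative $r$ here means that larger divergence from real click depths goes with weaker ranking agreement, which is the direction reported for click-depth JS divergence.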

Practical and Theoretical Implications

SimEval-IR’s explicit distinction between realism and reliability refines simulator selection and evaluation for interactive IR research. Practitioners should prioritize click-depth distance and session embedding metrics as validity predictors and avoid over-reliance on classifier AUCs. The standardized schema and benchmark protocols increase reproducibility and comparability across studies by replacing ad hoc preprocessing and by exposing leakage that stems from pipeline idiosyncrasies.

From a theoretical standpoint, the toolkit provides an empirical substrate for studying how session-internal dependencies, inter-event timings, and reformulation dynamics affect ranking validity. It supports extensibility to novel simulator types (LLM-based agents with retrieval and tool use), new embeddings, and evolving test collections. Limitations remain regarding continuous behavior signals, dialogue acts beyond discrete event types, and inherent biases in gold-standard qrels.

Long-term implications include the risk of more convincing synthetic interaction traces, with potential for misuse; however, SimEval-IR’s rigorous auditing and leakage-diagnostic features are simultaneously valuable for detection and mitigation.

Conclusion

SimEval-IR defines a unified, reproducible platform for evaluating user simulators under both behavioral realism and tester reliability objectives (2604.27878). Empirical results across multiple languages and datasets indicate that some realism metrics correlate with tester reliability, but the two objectives are not interchangeable. Click-depth divergence and embedding-based distances are the stronger predictors of ranking validity; classifier-based human-likeness tests are unreliable for that purpose. The standardized schema, adapters, provenance, and benchmark protocols provide a foundation for future experiments on session-dependence, LLM agents, and collection adequacy. The research community is invited to extend SimEval-IR with new adapters, metrics, and simulators, advancing both rigorous methodology and simulator fidelity.
