- The paper introduces Sycamore, a systematic probe using a three-condition evaluation protocol comparing ungrounded and grounded synthetic personas with expert feedback.
- The paper's methodology leverages 3,270 interview-coded excerpts to ground synthetic evaluators, uncovering significant divergences in modality preferences between synthetic and real experts.
- The paper finds that while grounded synthetic evaluators enhance exploratory evaluations and interface debugging, they cannot replace the nuanced feedback of domain experts.
Sycamore: Synthetic Persona Characterization for Evaluating Genomics Visualization Retrieval
Motivation and Context
Evaluating visualization systems in specialized domains like genomics runs into the practical limits of user studies: domain experts are scarce and hard to recruit, and available participant pools rarely represent the full diversity of user personas. LLM-based synthetic personas offer a potential path to scaling evaluation, but whether they can substitute for real user feedback remains contested, particularly in HCI contexts. The work introduces Sycamore, a systematic probe that asks not whether synthetic personas should replace user studies, but what these LLM-enabled evaluators actually produce when engaging with a real system, specifically Geranium, a multimodal genomics visualization retrieval engine.
Methodological Framework
Sycamore employs a three-condition evaluation protocol:
- Ungrounded Synthetic Personas: Instantiated directly from LLM priors, without any anchoring in empirical user data.
- Grounded Synthetic Personas: Instantiated using the PersonaCite approach, with each synthetic persona constrained by voice-of-customer artifacts extracted from a comprehensive interview-based characterization of genomics visualization users [23]. Four archetypes (Biologists, Computational Biologists, Bioinformaticians, Software Engineers) are represented.
- Domain Expert Reference: The published Geranium user study (seven domain experts, with themes and modality preferences extracted) serves as the human benchmark [16].
All three conditions use a consistent protocol derived from the Geranium user study: initial workflow description, tool demonstration, hands-on exploration with feedback, and a final modality preference ranking.
The pipeline for grounded personas involves (1) detailed persona profiling, (2) embedding of 3,270 interview-coded excerpts, (3) top-k retrieval for persona-relevant evidence, and (4) agentic response generation with source citation and abstention in the absence of evidence.
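Steps (2) through (4) of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for whatever embedding model the paper uses, the excerpt texts are invented placeholders, and the names `retrieve`, `respond`, `k`, and `min_score` are assumptions made for illustration. The point it shows is the control flow: score excerpts against a question, keep the top-k above a similarity floor, and abstain when nothing qualifies.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model (assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, excerpts, k=3, min_score=0.2):
    """Return up to k (score, excerpt) pairs whose similarity is at least min_score."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(e["text"])), e) for e in excerpts),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(s, e) for s, e in scored[:k] if s >= min_score]

def respond(query, excerpts, k=3, min_score=0.2):
    """Cite retrieved evidence, or abstain when no excerpt clears the threshold."""
    hits = retrieve(query, excerpts, k=k, min_score=min_score)
    if not hits:
        return {"abstained": True, "citations": [], "evidence": []}
    return {
        "abstained": False,
        "citations": [e["id"] for _, e in hits],
        "evidence": [e["text"] for _, e in hits],
    }

# Invented placeholder excerpts, NOT taken from the paper's interview corpus.
EXCERPTS = [
    {"id": "E1", "text": "mapping my BAM files into a visualization template is hard"},
    {"id": "E2", "text": "I browse example images to find a chart I like"},
]

grounded = respond("how do I load BAM files into a template", EXCERPTS)
ungrounded = respond("quantum cryptography hardware", EXCERPTS)
```

In a fuller pipeline the returned evidence would be passed to an LLM together with the persona profile; the abstention branch is what keeps the persona from answering without any supporting excerpt.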
Insights from Cross-Condition Analysis
Thematic Convergence and Divergence
Synthetic persona feedback, especially when grounded, adopts the concerns and language evident in documented user studies, which substantially reduces the "hallucinated" technicalities that ungrounded LLMs tend to prioritize. Ungrounded evaluators, by contrast, frequently fixate on operational minutiae (e.g., data-binding internals, API integration issues) that actual genomics researchers did not emphasize.
The sharpest divergence appears in modality preference: both synthetic conditions strongly favor the specification modality (Gosling spec), whereas real experts favor image-based queries for their casual, exploratory affordances. Synthetic evaluators converge on a "runnable template" conception and neglect the value of image browsing that was salient in human feedback.
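One simple way to compare modality rankings across conditions is to average each modality's rank position within a condition and look at which modality comes out on top. The sketch below is an assumption about how such a comparison could be done, not the paper's analysis. The rankings are invented placeholders rather than study data, and the `"text"` modality label is assumed for illustration alongside the image and specification modalities named above.

```python
from collections import defaultdict

def mean_ranks(rankings):
    """Average 1-based rank position per modality.

    Assumes every ranking lists the same set of modalities exactly once.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for position, modality in enumerate(ranking, start=1):
            totals[modality] += position
    n = len(rankings)
    return {modality: total / n for modality, total in totals.items()}

def top_modality(rankings):
    """Modality with the lowest mean rank; ties broken alphabetically."""
    ranks = mean_ranks(rankings)
    return min(ranks, key=lambda m: (ranks[m], m))

# Invented placeholder rankings (most preferred first), NOT the paper's data.
expert_rankings = [["image", "text", "spec"], ["image", "spec", "text"]]
synthetic_rankings = [["spec", "text", "image"], ["spec", "image", "text"]]

expert_top = top_modality(expert_rankings)
synthetic_top = top_modality(synthetic_rankings)
conditions_disagree = expert_top != synthetic_top
```

Mean rank is a deliberately coarse summary; with only a handful of participants per condition, reporting the individual rankings alongside it is likely more informative.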
Across all conditions, data-binding emerges as a shared friction point: real and synthetic personas alike repeatedly raise concerns about mapping user data into retrieved visualization templates, with specific reference to genomics formats (BAM, VCF, BED files).
Depth, Onboarding, and Workflow Realism
Both synthetic conditions identify high-level user intent and language mismatch as sources of friction. However, only human evaluators ask for structured onboarding support and guidance, while synthetic evaluators, particularly grounded ones, focus on advanced features and developer-oriented extensions. Notably, the grounded agent protocol can abstain in the absence of evidence, which provides calibration against confabulation and reveals the interpretive bias inherent in the artifact coding process.
Novel Synthetic Signals and Latent Hypotheses
Certain design requirements that no single human participant articulated, but that recur in aggregate across synthetic personas, emerge consistently. For example, integration with canonical genome browsers (IGV, UCSC) and plugin-based workflows surface as recurrent needs. These thematically synthesized outputs suggest that LLMs, when grounded appropriately, can aggregate latent requirements, which can then be prioritized for targeted follow-up with real users.
Practical and Theoretical Implications
Sycamore's probe indicates that LLM-based synthetic evaluators, particularly when grounded in empirical user characterization, can function as valuable complements (not replacements) within the visualization evaluation workflow. Their primary utility lies in:
- Protocol and interface debugging before investing limited domain expert time
- Expanding the evaluative horizon to underrepresented or unobservable personas
- Generating exploratory hypotheses and query patterns at scale for downstream validation
However, synthetic evaluators are fundamentally limited for research questions centered on adoption, trust, or nuanced qualitative experience. Their lack of interactive realism (no actual clicking, hesitation, or interface exploration) creates an authenticity gap.
Moreover, the dependency of grounded evaluators on the specific artifacts used for grounding (and the potential biases in how these artifacts are coded or excerpted) underscores the need for transparent, well-documented workflows in persona instantiation.
Future Directions
Further research is warranted to (a) quantify run-to-run evaluator variance for each grounded persona, (b) extend Sycamore to other genomics or highly specialized data domains, and (c) develop methodologies closer to think-aloud protocols via agentic LLM interaction. Whether inconsistencies between synthetic and expert reference feedback (e.g., modality preference) reflect systematic LLM limitations or gaps in synthetic persona design remains an open question.
Techniques for dynamic, scenario-based persona simulation and integration with live interface logging could refine the fidelity and external validity of synthetic user evaluation pipelines.
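For direction (a), one plausible starting point is to code each run of a persona into a set of themes and measure how much those sets overlap across runs. The sketch below uses mean pairwise Jaccard similarity for this. The metric choice is an assumption rather than something the paper prescribes, and the theme labels are illustrative placeholders.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two theme sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all unordered pairs of runs.

    Values near 1.0 mean the persona raises the same themes on every run;
    values near 0.0 mean the runs barely overlap.
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure variance")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Illustrative theme sets from three hypothetical runs of one persona.
runs = [
    {"data binding", "genome browser integration"},
    {"data binding", "plugins"},
    {"data binding", "genome browser integration"},
]

stability = mean_pairwise_jaccard(runs)
```

A set-overlap score only captures which themes appear, not how they are worded or prioritized, so it would complement rather than replace a qualitative read of the transcripts.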
Conclusion
Sycamore systematically interrogates the content and value of synthetic LLM personas for evaluating genomics visualization retrieval. Grounded synthetic evaluators approximate documented user perspectives more closely than ungrounded ones, though both diverge from the expert reference in important respects such as modality preference. The framework provides an operationally rigorous, replicable approach for dissecting both the strengths and the caveats of synthetic persona-based evaluation in domain-specific HCI, advancing the methodological toolkit for visualization research pipelines (2605.08630).