OPeRA: LLM Evaluation Dataset
- OPeRA is a structured dataset capturing online shopping interactions via detailed HTML observations, user personas, action logs, and just-in-time rationales.
- It standardizes benchmark tasks for next action and joint rationale-action prediction, using metrics like exact match accuracy and BERTScore for evaluation.
- The dataset supports personalized digital twin simulation and explainable AI research by linking behavioral traces with user-specific metadata and decision rationales.
The OPeRA dataset is a structured collection designed to support the rigorous evaluation of LLMs on their ability to simulate specific human behavior in the context of online shopping. It integrates four key modalities: Observation, Persona, Rationale, and Action, each collected with methodologies that preserve both the fidelity of observable interactions and the underlying reasoning processes. OPeRA is distinguished by its capture of not only the digital trace of user actions but also the just-in-time user-provided rationales, cross-referenced with rich persona metadata, providing a high-dimensional benchmark for personalized behavior simulation (Wang et al., 5 Jun 2025).
1. Dataset Structure and Data Acquisition
The design of OPeRA incorporates a comprehensive data schema: each unit comprises a web observation, user persona details, action logs, and self-reported rationales.
- Observation (O): For every user action, the browser context is saved via full HTML dumps, simplified HTML with semantic labeling (including identification of product metainfo and actionable objects), and screenshots. While screenshots are available, the primary paper did not experiment with direct image features. Observations are annotated with meta-information such as clickability and object saliency. The granularity of observation is tied to the actual sequence of user interactions across the shopping journey, thereby preserving context for each action.
- Persona (Pe): Persona metadata is obtained via structured online surveys and optional semi-structured interviews. The survey captures demographics (age, gender, education, income), shopping preferences (frequency, favorite stores, membership status, and typical spend), and personality attributes (formal Big-Five Inventory; MBTI self-assessment). This enables conditioning on identity-level factors during evaluation of LLM simulation tasks.
- Action (A): All user actions are recorded directly using the ShoppingFlow browser plugin—a custom Chrome extension. The stream includes clicks (with CSS/semantic identifiers), form submissions (such as search and filter changes), scrolls, navigations, and heuristically inferred purchase signals (such as the actuation of checkout buttons). Session segmentation is enforced via temporal and purchase-event heuristics to ensure logical coherence in each simulated shopping trip.
- Rationale (R): Rationales are short natural language responses, solicited randomly (approx. 8% probability) at selected action points. The plugin interrupts the user and prompts for a brief justification of why the preceding action was chosen. This “just-in-time” sampling yields a sparse but temporally precise mapping from actions to explicit explanatory text.
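The four modalities above can be pictured as one record schema per interaction step. The following sketch is illustrative only: the field and class names are assumptions, not the dataset's actual keys.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Observation:
    # Browser context saved at each step (field names are assumed).
    raw_html: str
    simplified_html: str              # semantic labels: product metainfo, actionable objects
    screenshot_path: Optional[str] = None   # captured but unused in the primary benchmark

@dataclass
class Action:
    timestamp: float
    action_type: str                  # e.g. "click", "search", "scroll", "navigate"
    target_id: Optional[str] = None   # CSS/semantic identifier, populated for clicks

@dataclass
class Step:
    observation: Observation
    action: Action
    rationale: Optional[str] = None   # present for ~8% of steps (just-in-time prompt)

@dataclass
class Session:
    persona: dict                     # demographics, preferences, Big-Five/MBTI scores
    steps: list = field(default_factory=list)
```

A session is then a persona plus an ordered list of steps, with rationales attached only where the plugin happened to prompt the user.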
2. Benchmark Tasks and Evaluation Metrics
OPeRA is explicitly purposed for benchmarking LLM agents on personalized web behavior simulation. Two benchmark tasks are defined:
| Task | Input Modalities | Output | Metric(s) |
|---|---|---|---|
| Next Action Prediction | {a₁…ₜ₋₁, r₁…ₜ₋₁, o₁…ₜ, Pᵢ} | aₜ | Exact match accuracy; action type F1 |
| Joint Rationale & Action Generation | {a₁…ₜ₋₁, r₁…ₜ₋₁, o₁…ₜ, Pᵢ} | ⟨aₜ, rₜ⟩ | Exact match (actions); BERTScore, ROUGE-L |
Formally, next action prediction is written as:

$$\hat{a}_t = f_\theta\left(a_{1:t-1},\ r_{1:t-1},\ o_{1:t},\ P_i\right)$$

And for joint rationale and action generation:

$$\left(\hat{a}_t,\ \hat{r}_t\right) = f_\theta\left(a_{1:t-1},\ r_{1:t-1},\ o_{1:t},\ P_i\right)$$
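In practice, conditioning an LLM on {a₁…ₜ₋₁, r₁…ₜ₋₁, o₁…ₜ, Pᵢ} means serializing these modalities into a prompt. A minimal sketch, assuming a plain-text prompt format (the section layout and wording here are illustrative, not the paper's actual prompts):

```python
def build_next_action_prompt(persona, history, current_obs):
    """Assemble the conditioning context for next action prediction.

    `persona` is a text summary of P_i; `history` is a list of
    (action, rationale_or_None) pairs; `current_obs` is the simplified
    HTML of the current page (o_t). Formatting is an assumption.
    """
    lines = ["## Persona", persona, "", "## Interaction history"]
    for i, (action, rationale) in enumerate(history, 1):
        lines.append(f"{i}. action: {action}")
        if rationale is not None:            # rationales exist for only ~8% of steps
            lines.append(f"   rationale: {rationale}")
    lines += ["", "## Current page (simplified HTML)", current_obs,
              "", "Predict the user's next action:"]
    return "\n".join(lines)
```

The sparse rationales slot naturally into this format: steps without a rationale simply omit that line.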
Metrics include exact-string match for granular web actions and F1 for action class; rationale quality is measured using text similarity metrics (BERTScore F1 and ROUGE-L).
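The action-level metrics are straightforward to compute without external dependencies; BERTScore and ROUGE-L require dedicated libraries and are omitted here. A minimal sketch of exact match accuracy and macro-averaged action-type F1 (the macro averaging is an assumption about the paper's setup):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predicted actions matching the gold action string exactly."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def action_type_f1(preds, golds):
    """Macro-averaged F1 over action-type labels (e.g. click/search/scroll)."""
    labels = set(preds) | set(golds)
    f1s = []
    for lab in labels:
        tp = sum(p == lab and g == lab for p, g in zip(preds, golds))
        fp = sum(p == lab and g != lab for p, g in zip(preds, golds))
        fn = sum(p != lab and g == lab for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Exact match operates on the full action string (including element identifiers), while F1 scores only the coarser action class, so a model can score well on one and poorly on the other.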
3. Data Collection Methodology and Quality Controls
Participants were real users recruited to engage in authentic shopping sessions. Data acquisition leverages two main tools: an online questionnaire for persona collection, and the ShoppingFlow browser plugin for behavioral trace and rationale capture. Raw action streams are segmented and filtered for session coherence according to specified time and event thresholds (e.g., inactivity intervals or explicit purchase events). Rationale prompts are randomized, and both actions and rationales are timestamped and indexed for ground-truth alignment.
All observations (HTML, annotation, rationale, persona) are synchronized with precise session timestamps. Actions are labeled with unique identifiers and, for clicks, with detailed element descriptors to ensure high fidelity for downstream modeling.
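The segmentation and rationale-prompt heuristics described above can be sketched as follows. The 30-minute inactivity threshold is an assumed value (the paper's actual thresholds are not specified here); the ~8% sampling probability comes from the dataset description.

```python
import random

INACTIVITY_GAP = 30 * 60   # assumed idle threshold in seconds; actual value may differ
RATIONALE_PROB = 0.08      # ~8% of actions trigger a just-in-time rationale prompt

def segment_sessions(actions):
    """Split a timestamped action stream on long idle gaps or purchase events.

    Each action is a dict with at least 'timestamp' (seconds) and 'action_type'.
    """
    sessions, current = [], []
    for act in actions:
        if current and act["timestamp"] - current[-1]["timestamp"] > INACTIVITY_GAP:
            sessions.append(current)
            current = []
        current.append(act)
        if act["action_type"] == "purchase":   # a purchase closes the shopping trip
            sessions.append(current)
            current = []
    if current:
        sessions.append(current)
    return sessions

def should_prompt_rationale(rng=random):
    """Randomly decide whether to interrupt the user for a rationale."""
    return rng.random() < RATIONALE_PROB
```

Note the limitation discussed in Section 5: heuristics like these leave edge cases (long idle periods mid-trip, multi-tabbed flows) ambiguous at session boundaries.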
4. Applications and Implications in LLM Agent Research
OPeRA enables several levels of LLM agent evaluation:
- Digital Twin Simulation: By conditioning not only on user actions and context but also persona metadata and real rationales, researchers can assess LLMs’ ability to model an individual user (“digital twin”) rather than a generic agent.
- Interpretability: The explicit collection of rationales enables direct measurement of an LLM's ability to not just mimic what a user does, but to explain why that action is chosen at a specific step.
- Behavioral Personalization: Models can be compared on how well they integrate and leverage persona traits and shopping preferences in predicting future actions and rationales, rather than approximating population-level patterns.
- Benchmarking: The defined tasks and metrics establish a reproducible framework for comparing new LLM architectures, prompting research into cross-modality reasoning (HTML, persona, action stream, natural language rationale).
- Broader Simulation Domains: The methodology (custom browser plugin, persona survey, structured session segmentation, rationale prompts) provides an exportable blueprint for future dataset construction in any domain requiring personalized human-agent evaluation.
A plausible implication is that OPeRA’s framework will facilitate the development and evaluation of adaptive and personalized recommendation engines, automated UI/UX analytics, and agent-based behavioral simulations with a level of user-specific granularity previously unavailable.
5. Limitations and Design Considerations
- Rationale Sparsity: Rationales are randomly solicited and available only for a subset of steps (~8% of actions), which restricts dense joint modeling; however, they are precisely aligned with user intent at key decision points.
- Action Stream Granularity: Actions are logged at the level of granular browser events (clicks, inputs, scrolls), yet mapping from action logs to higher-order shopping intentions may require additional annotation or inference.
- Session Segmentation: Temporal thresholds and purchase heuristic segmentation are empirically defined; edge cases (e.g., long idle periods, multi-tabbed flows) may introduce ambiguity in session boundaries.
- Population Representativeness: While personas are robustly characterized, generalizability beyond the study's participant sample to broader global populations remains a consideration when scaling.
- Visual Modality: Although screenshots are captured and stored, the primary modeling and benchmark reported does not leverage image-based signals; future extensions may explore multimodal fusion.
6. Significance for Future Human-Agent Research
OPeRA is the first public dataset to integrate observations, actions, rationales, and user-specific persona characterization into a unified structure for evaluating LLM performance in human online shopping simulation (Wang et al., 5 Jun 2025). By enabling exacting benchmarks for next action and rationale prediction, the dataset provides a foundation for research into explainable, personalized digital twins, recommendation systems, and interactive agent strategies tuned to individual user profiles. The collection strategy and evaluation schema establish methodological standards for future datasets intended to capture the full spectrum of human-agent context, behavior, and motivation.