OPeRA: LLM Evaluation Dataset
- OPeRA is a structured dataset capturing online shopping interactions via detailed HTML observations, user personas, action logs, and just-in-time rationales.
- It standardizes benchmark tasks for next action and joint rationale-action prediction, using metrics like exact match accuracy and BERTScore for evaluation.
- The dataset supports personalized digital twin simulation and explainable AI research by linking behavioral traces with user-specific metadata and decision rationales.
The OPeRA dataset is a structured collection designed to support the rigorous evaluation of LLMs on their ability to simulate specific human behavior in the context of online shopping. It integrates four key modalities: Observation, Persona, Rationale, and Action, each collected with methodologies that preserve both the fidelity of observable interactions and the underlying reasoning processes. OPeRA is distinguished by its capture of not only the digital trace of user actions but also the just-in-time user-provided rationales, cross-referenced with rich persona metadata, providing a high-dimensional benchmark for personalized behavior simulation (Wang et al., 5 Jun 2025).
1. Dataset Structure and Data Acquisition
The design of OPeRA incorporates a comprehensive data schema: each unit comprises a web observation, user persona details, action logs, and self-reported rationales.
- Observation (O): For every user action, the browser context is saved via full HTML dumps, simplified HTML with semantic labeling (including identification of product metainfo and actionable objects), and screenshots. While screenshots are available, the primary paper did not experiment with direct image features. Observations are annotated with meta-information such as clickability and object saliency. The granularity of observation is tied to the actual sequence of user interactions across the shopping journey, thereby preserving context for each action.
- Persona (Pe): Persona metadata is obtained via structured online surveys and optional semi-structured interviews. The survey captures demographics (age, gender, education, income), shopping preferences (frequency, favorite stores, membership status, and typical spend), and personality attributes (formal Big-Five Inventory; MBTI self-assessment). This enables conditioning on identity-level factors during evaluation of LLM simulation tasks.
- Action (A): All user actions are recorded directly using the ShoppingFlow browser plugin—a custom Chrome extension. The stream includes clicks (with CSS/semantic identifiers), form submissions (such as search and filter changes), scrolls, navigations, and heuristically inferred purchase signals (such as the actuation of checkout buttons). Session segmentation is enforced via temporal and purchase-event heuristics to ensure logical coherence in each simulated shopping trip.
- Rationale (R): Rationales are short natural language responses, solicited randomly (approx. 8% probability) at selected action points. The plugin interrupts the user and prompts for a brief justification of why the preceding action was chosen. This “just-in-time” sampling yields a sparse but temporally precise mapping from actions to explicit explanatory text.
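The four modalities above can be pictured as one record schema per interaction step. The following sketch is illustrative only: the field and class names are assumptions, not the dataset's actual keys.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Observation:
    # Browser context saved at each step (field names are assumed).
    raw_html: str
    simplified_html: str              # semantic labels: product metainfo, actionable objects
    screenshot_path: Optional[str] = None   # captured but unused in the primary benchmark

@dataclass
class Action:
    timestamp: float
    action_type: str                  # e.g. "click", "search", "scroll", "navigate"
    target_id: Optional[str] = None   # CSS/semantic identifier, populated for clicks

@dataclass
class Step:
    observation: Observation
    action: Action
    rationale: Optional[str] = None   # present for ~8% of steps (just-in-time prompt)

@dataclass
class Session:
    persona: dict                     # demographics, preferences, Big-Five/MBTI scores
    steps: list = field(default_factory=list)
```

A session is then a persona plus an ordered list of steps, with rationales attached only where the plugin happened to prompt the user.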
2. Benchmark Tasks and Evaluation Metrics
OPeRA is explicitly purposed for benchmarking LLM agents on personalized web behavior simulation. Two benchmark tasks are defined:
| Task | Input Modalities | Output | Metric(s) |
|---|---|---|---|
| Next Action Prediction | {a₁…ₜ₋₁, r₁…ₜ₋₁, o₁…ₜ, Pᵢ} | aₜ | Exact match accuracy; action type F1 |
| Joint Rationale & Action Generation | {a₁…ₜ₋₁, r₁…ₜ₋₁, o₁…ₜ, Pᵢ} | ⟨aₜ, rₜ⟩ | Exact match (actions); BERTScore, ROUGE-L |
Formally, next action prediction is written as:

$$\hat{a}_t = f_\theta\left(a_{1:t-1},\ r_{1:t-1},\ o_{1:t},\ P_i\right)$$

And for joint rationale and action generation:

$$\left(\hat{a}_t,\ \hat{r}_t\right) = f_\theta\left(a_{1:t-1},\ r_{1:t-1},\ o_{1:t},\ P_i\right)$$
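In practice, conditioning an LLM on {a₁…ₜ₋₁, r₁…ₜ₋₁, o₁…ₜ, Pᵢ} means serializing these modalities into a prompt. A minimal sketch, assuming a plain-text prompt format (the section layout and wording here are illustrative, not the paper's actual prompts):

```python
def build_next_action_prompt(persona, history, current_obs):
    """Assemble the conditioning context for next action prediction.

    `persona` is a text summary of P_i; `history` is a list of
    (action, rationale_or_None) pairs; `current_obs` is the simplified
    HTML of the current page (o_t). Formatting is an assumption.
    """
    lines = ["## Persona", persona, "", "## Interaction history"]
    for i, (action, rationale) in enumerate(history, 1):
        lines.append(f"{i}. action: {action}")
        if rationale is not None:            # rationales exist for only ~8% of steps
            lines.append(f"   rationale: {rationale}")
    lines += ["", "## Current page (simplified HTML)", current_obs,
              "", "Predict the user's next action:"]
    return "\n".join(lines)
```

The sparse rationales slot naturally into this format: steps without a rationale simply omit that line.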
Metrics include exact-string match for granular web actions and F1 for action class; rationale quality is measured using text similarity metrics (BERTScore F1 and ROUGE-L).
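The action-level metrics are straightforward to compute without external dependencies; BERTScore and ROUGE-L require dedicated libraries and are omitted here. A minimal sketch of exact match accuracy and macro-averaged action-type F1 (the macro averaging is an assumption about the paper's setup):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predicted actions matching the gold action string exactly."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def action_type_f1(preds, golds):
    """Macro-averaged F1 over action-type labels (e.g. click/search/scroll)."""
    labels = set(preds) | set(golds)
    f1s = []
    for lab in labels:
        tp = sum(p == lab and g == lab for p, g in zip(preds, golds))
        fp = sum(p == lab and g != lab for p, g in zip(preds, golds))
        fn = sum(p != lab and g == lab for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Exact match operates on the full action string (including element identifiers), while F1 scores only the coarser action class, so a model can score well on one and poorly on the other.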
3. Data Collection Methodology and Quality Controls
Participants were real users recruited to engage in authentic shopping sessions. Data acquisition leverages two main tools: an online questionnaire for persona collection, and the ShoppingFlow browser plugin for behavioral trace and rationale capture. Raw action streams are segmented and filtered for session coherence according to specified time and event thresholds (e.g., inactivity intervals or explicit purchase events). Rationale prompts are randomized, and both actions and rationales are timestamped and indexed for ground-truth alignment.
All observations (HTML, annotation, rationale, persona) are synchronized with precise session timestamps. Actions are labeled with unique identifiers and, for clicks, with detailed element descriptors to ensure high fidelity for downstream modeling.
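The segmentation and rationale-prompt heuristics described above can be sketched as follows. The 30-minute inactivity threshold is an assumed value (the paper's actual thresholds are not specified here); the ~8% sampling probability comes from the dataset description.

```python
import random

INACTIVITY_GAP = 30 * 60   # assumed idle threshold in seconds; actual value may differ
RATIONALE_PROB = 0.08      # ~8% of actions trigger a just-in-time rationale prompt

def segment_sessions(actions):
    """Split a timestamped action stream on long idle gaps or purchase events.

    Each action is a dict with at least 'timestamp' (seconds) and 'action_type'.
    """
    sessions, current = [], []
    for act in actions:
        if current and act["timestamp"] - current[-1]["timestamp"] > INACTIVITY_GAP:
            sessions.append(current)
            current = []
        current.append(act)
        if act["action_type"] == "purchase":   # a purchase closes the shopping trip
            sessions.append(current)
            current = []
    if current:
        sessions.append(current)
    return sessions

def should_prompt_rationale(rng=random):
    """Randomly decide whether to interrupt the user for a rationale."""
    return rng.random() < RATIONALE_PROB
```

Note the limitation discussed in Section 5: heuristics like these leave edge cases (long idle periods mid-trip, multi-tabbed flows) ambiguous at session boundaries.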
4. Applications and Implications in LLM Agent Research
OPeRA enables several levels of LLM agent evaluation:
- Digital Twin Simulation: By conditioning not only on user actions and context but also persona metadata and real rationales, researchers can assess LLMs’ ability to model an individual user (“digital twin”) rather than a generic agent.
- Interpretability: The explicit collection of rationales enables direct measurement of an LLM's ability to not just mimic what a user does, but to explain why that action is chosen at a specific step.
- Behavioral Personalization: Models can be compared on how well they integrate and leverage persona traits and shopping preferences in predicting future actions and rationales, rather than approximating population-level patterns.
- Benchmarking: The defined tasks and metrics establish a reproducible framework for comparing new LLM architectures, prompting research into cross-modality reasoning (HTML, persona, action stream, natural language rationale).
- Broader Simulation Domains: The methodology (custom browser plugin, persona survey, structured session segmentation, rationale prompts) provides an exportable blueprint for future dataset construction in any domain requiring personalized human-agent evaluation.
A plausible implication is that OPeRA’s framework will facilitate the development and evaluation of adaptive and personalized recommendation engines, automated UI/UX analytics, and agent-based behavioral simulations with a level of user-specific granularity previously unavailable.
5. Limitations and Design Considerations
- Rationale Sparsity: Rationales are randomly solicited and available only for a subset of steps (~8% of actions), which restricts dense joint modeling; however, they are precisely aligned with user intent at key decision points.
- Action Stream Granularity: Actions are logged at the level of granular browser events (clicks, inputs, scrolls), yet mapping from action logs to higher-order shopping intentions may require additional annotation or inference.
- Session Segmentation: Temporal thresholds and purchase heuristic segmentation are empirically defined; edge cases (e.g., long idle periods, multi-tabbed flows) may introduce ambiguity in session boundaries.
- Population Representativeness: While personas are robustly characterized, generalizability beyond the study's participant sample to broader global populations remains a consideration when scaling.
- Visual Modality: Although screenshots are captured and stored, the primary modeling and benchmark reported does not leverage image-based signals; future extensions may explore multimodal fusion.
6. Significance for Future Human-Agent Research
OPeRA is the first public dataset to integrate observations, actions, rationales, and user-specific persona characterization into a unified structure for evaluating LLM performance in human online shopping simulation (Wang et al., 5 Jun 2025). By enabling exacting benchmarks for next action and rationale prediction, the dataset provides a foundation for research into explainable, personalized digital twins, recommendation systems, and interactive agent strategies tuned to individual user profiles. The collection strategy and evaluation schema establish methodological standards for future datasets intended to capture the full spectrum of human-agent context, behavior, and motivation.