User Willingness-Aware Sales Talk Datasets
- User willingness-aware sales talk datasets are specialized corpora that quantify nuanced user engagement across dialogue continuation, information sharing, and goal acceptance.
- They achieve ecological validity through naturalistic Wizard-of-Oz experiments and large-scale multimodal benchmarks, ensuring realistic sales-conversation simulations.
- Empirical results show that incorporating willingness signals with phase-based strategies significantly boosts conversion rates in dialogue systems.
User willingness-aware sales talk datasets are specialized corpora designed to capture the nuanced dynamics of user intent and engagement within sales-oriented dialogue systems. These resources underpin research on dialogue models that adapt to users' evolving willingness—encompassing their inclination to continue a discussion, provide information, or accept transactional outcomes. Recent efforts have focused on constructing ecologically valid datasets with fine-grained willingness annotations and developing large-scale multimodal benchmarks with detailed preference and strategy labels, explicitly bridging the gap between theoretical conversational agents and operational sales environments (Hentona et al., 2024, Long et al., 2023).
1. Dimensions of User Willingness in Sales Dialogues
User willingness in sales talk divides into three operationally distinct categories (Hentona et al., 2024):
- Continuing dialogue willingness (CD): Measures the user's propensity to persist in interaction, analogous to engagement or user continuation in broader dialogue literature.
- Providing information willingness (PI): Captures the user's readiness to disclose preferences, constraints, or needs, directly affecting the system's ability to generate relevant offers.
- Goal acceptance willingness (GA): Reflects the user’s disposition to align with the salesperson's objective, such as accepting recommendations or proceeding to a purchase.
Each dimension represents a separate axis along which conversational success or failure unfolds, requiring sophisticated modeling to capture their dynamics and interactions in realistic sales settings.
2. Dataset Construction: Methodologies and Ecological Validity
The User Willingness-aware Sales Talk Dataset (Hentona et al., 2024) prioritizes ecological validity—the alignment of experimental setup with real-world cues and consequences as articulated in the tradition of Brunswik. By embedding a Wizard-of-Oz chat interface within a mock e-commerce platform and allowing participants to disengage or "purchase" at their discretion, the design elicits naturalistic willingness signals. Key construction details:
- Five trained Japanese sales actors interact with 109 unique user participants in a product search for wireless earphones (across three price tiers).
- Dialogues are populated with realistic product data rather than synthetic prompts.
- Users may abandon the chat at any turn, modeling real dropout rates.
The SURE dataset ("Multimodal Recommendation Dialog with SUbjective PREference") (Long et al., 2023) advances willingness modeling through multimodal dialog simulation involving both textual and visual cues, subjective preference categorization, and expert-curated recommendation strategies. It leverages:
- 238 retail experts to construct attribute-based and subjective paraphrase inventories.
- Two-stage annotation: simulated self-play for task structure followed by natural paraphrase for discourse realism.
3. Annotation Frameworks and Willingness Metrics
Utterance-level granularity is a defining feature:
- Users self-annotate after each salesperson turn along CD, PI, and GA with discrete categorical labels—Positive (+1), Neutral (0), or Negative (–1) (Hentona et al., 2024).
- These annotations are mapped to numeric scores for dialogue-level aggregation; for each dimension $x \in \{\text{CD}, \text{PI}, \text{GA}\}$,

$$S_x = \frac{1}{N} \sum_{i=1}^{N} s_{x,i},$$

where $s_{x,i} \in \{+1, 0, -1\}$ is the user's response for utterance $i$ and $N$ is the number of annotated utterances.
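This aggregation can be sketched in a few lines; the label strings and the names `LABEL_SCORES` and `dialogue_score` are illustrative, not from the dataset's released format.

```python
# Minimal sketch (not from the paper): map per-turn categorical labels for
# one willingness dimension to {+1, 0, -1} and average over the dialogue.
LABEL_SCORES = {"positive": 1, "neutral": 0, "negative": -1}

def dialogue_score(turn_labels):
    """Mean numeric score of one willingness dimension across a dialogue."""
    return sum(LABEL_SCORES[label] for label in turn_labels) / len(turn_labels)

cd_labels = ["positive", "positive", "neutral", "negative"]
score = dialogue_score(cd_labels)  # (1 + 1 + 0 - 1) / 4 = 0.25
```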
In SURE (Long et al., 2023), subjective preferences (e.g., "color that makes me feel calm") are mapped to latent categorization concepts, then to concrete attribute sets, while agent acts (e.g., "Ask Preference," "Refer Region") are exhaustively annotated. Multiturn sales strategies are labeled to support benchmarking of agent policy learning.
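SURE's two-step resolution, from subjective phrase to latent concept to concrete attributes, can be sketched as a pair of lookups; every entry below is invented for illustration (the actual inventories are expert-curated).

```python
# Two-step mapping sketch: subjective phrase -> latent concept -> attributes.
# All dictionary entries here are invented for illustration.
CONCEPTS = {"color that makes me feel calm": "calming_colors"}
ATTRIBUTES = {"calming_colors": {"color": ["blue", "green", "beige"]}}

def resolve_preference(phrase: str) -> dict:
    """Resolve a subjective preference phrase to concrete attribute values."""
    concept = CONCEPTS.get(phrase)
    return ATTRIBUTES.get(concept, {})

prefs = resolve_preference("color that makes me feel calm")  # {"color": [...]}
```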
4. Dataset Composition and Structural Statistics
User Willingness-aware Sales Talk Dataset (Hentona et al., 2024)
- 109 dialogues, 3,289 utterances (2,145 sales, 1,144 user).
- Average 498 tokens/dialogue, 54,301 total tokens.
- Every sales turn annotated with cd_label, pi_label, ga_label.
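An annotated turn might be represented as follows; the field names `cd_label`, `pi_label`, and `ga_label` follow the dataset description, while the container and remaining fields are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative record type for one annotated sales turn. The three label
# fields mirror the dataset description; everything else is an assumption.
@dataclass
class SalesTurn:
    turn_id: int
    speaker: str                     # "sales" or "user"
    text: str
    cd_label: Optional[int] = None   # +1 / 0 / -1; only sales turns are annotated
    pi_label: Optional[int] = None
    ga_label: Optional[int] = None

turn = SalesTurn(1, "sales", "Would a noise-cancelling model interest you?", 1, 0, 0)
```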
SURE (Long et al., 2023)
- 12,000 multimodal dialogues, ≈220,000 utterances.
- Spanning fashion (290 SKUs) and furniture (110 SKUs); each store scene averages 27.6 items.
- Mean 18.35 turns per dialog; 3,000 distinct subjective preference expressions; 8 recommendation act types.
These structural features enable both dataset-wide quantitative studies and fine-grained sequence-learning tasks grounded in real-world complexity.
5. Analytical Insights and Sales Strategy Optimization
Empirical analysis of the dataset (Hentona et al., 2024) reveals:
- Negative willingness annotations occur in under 10% of turns across CD, PI, and GA; CD receives the most positive labels (54.1%).
- The fraction of negative labels in a dialogue shows a significant negative correlation (the Pearson coefficient is not reported numerically) with post-chat increases in purchase intent, indicating that avoiding negative user reactions is more consequential than maximizing positive ones.
- Turn-by-turn trajectory analysis demonstrates successful dialogues feature early lifts in CD and PI followed by a late-game spike in GA, just as the sales system attempts to close.
The paper advocates a phase-based conversational strategy:
- Early (turns 1–3): Maximize dialogue continuation (CD).
- Middle (turns 4–6): Focus on eliciting user information (PI).
- Late (turn 7+): Shift to goal acceptance efforts (GA).
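The fixed schedule above can be encoded as a simple turn-indexed lookup; the function name and return values are illustrative, not from the paper.

```python
# Toy encoding of the fixed phase schedule above; names are illustrative.
def phase_for_turn(turn: int) -> str:
    if turn <= 3:
        return "CD"  # early: keep the user engaged in the dialogue
    if turn <= 6:
        return "PI"  # middle: elicit preferences and constraints
    return "GA"      # late: steer toward goal acceptance

phases = [phase_for_turn(t) for t in range(1, 9)]
# ["CD", "CD", "CD", "PI", "PI", "PI", "GA", "GA"]
```

The paper's own future-work discussion suggests replacing such fixed turn boundaries with policies that react to observed willingness dynamics.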
SURE’s agent act statistics, transition patterns, and subjective preference mappings collectively allow for comprehensive evaluation of candidate recommendation policies under realistic goal and preference uncertainty (Long et al., 2023).
6. Applications and Baseline System Performance
Application of the User Willingness-aware dataset to system training demonstrates the utility of willingness labels: conversion rates rise monotonically as willingness conditioning and the phase strategy are added (Hentona et al., 2024).
| Model | Training Data | Willingness Used | Phase Strategy | Conversion Rate | Avg. Turns |
|---|---|---|---|---|---|
| GPT-3.5 | 63 successful | No | No | 0.23 | 9.35 |
| GPT-3.5W | 109 all | Yes | No | 0.33 | 9.23 |
| GPT-3.5WD | 109 all | Yes | Yes | 0.44 | 9.08 |
| GPT-4o | None (zero-shot) | N/A | N/A | 0.58 | 5.81 |
This progressive improvement underscores the direct benefit of eliciting and conditioning on multi-dimensional willingness signals.
SURE establishes three core tasks for benchmarking:
- Subjective Preference Disambiguation (SPD): F₁ ~37.5 (MRA full).
- Referred Region Understanding (RRU): F₁ ~14.8.
- Multimodal Recommendation (MR) Act Prediction: F₁ ~10.9 (Long et al., 2023).
Ablation studies confirm that both visual scene context and metadata are essential for accurately modeling subjective preference reasoning in dialogue.
7. Prospective Extensions and Open Challenges
Future directions identified in both papers include:
- Dynamic adaptation: Training policies that shift conversational strategy reactively to real-time willingness dynamics, rather than fixed turn boundaries (Hentona et al., 2024).
- Cross-domain and multilingual generalization: Extending data collection to multiple product domains and languages to investigate transferability and cultural effects (Hentona et al., 2024, Long et al., 2023).
- Ecological validity through ground-truth purchases: Integrating datasets with real transaction data will further align dialogue modeling with real-world outcomes (Hentona et al., 2024).
- Architectural improvements: Dual encoders or hierarchical policies could better capture preference decoupling from recommendation logic; pre-training objectives focusing on slot-value and region-phrase alignment are also proposed (Long et al., 2023).
A notable limitation is that neither corpus reports explicit inter-annotator agreement, which may impact reproducibility and cross-dataset integration, although strategies such as multiple-annotator consensus and embedded agreement checks were employed (Long et al., 2023).
User willingness-aware sales talk datasets thus constitute a key resource for empirical sales dialogue modeling, benchmarking, and deployed system development oriented around realistic, nuanced user intent trajectories (Hentona et al., 2024, Long et al., 2023).