
DiSCo: Domain Informed Summarization through Contrast

Updated 19 January 2026
  • The paper introduces DiSCo, a computational method that contrasts domain-level expected features with observed review content to mitigate presence bias.
  • DiSCo constructs statistical models from aspect–sentiment extraction and computes difference scores to highlight unusually emphasized and missing features.
  • Empirical results demonstrate improved detail, specificity, and decision support compared to conventional summarization methods, though with a cognitive cost.

Domain Informed Summarization through Contrast (DiSCo) is an expectation-based computational methodology designed to address presence bias in intelligent summarization interfaces powered by LLMs. DiSCo operates by explicitly modeling domain-level topical expectations, contrasting observed content with these expectations, and integrating both unusually emphasized and conspicuously absent aspects into the generated summary text. This approach enhances transparency and supports decision-making, particularly in domains where absences and expectation violations are diagnostically salient (Fainman et al., 12 Jan 2026).

1. Conceptual Motivation and Theoretical Foundations

Modern review summarization systems—regardless of whether they are extractive, neural, or LLM-based—exhibit a presence-driven bias: they primarily generate content about features mentioned in available data. Human decision makers, however, often rely equally on what is not mentioned. This phenomenon, underpinned by the feature-positive effect, leads to presence bias, whereby summaries omit topics that users implicitly expect to be mentioned. For instance, a summary of a beach resort that never discusses sand quality or proximity to water may mislead users due to the absence of these expected aspects.

Cognitive accounts emphasize that a missing feature becomes salient only when it contradicts an internal prediction or expectation. DiSCo formalizes this insight by leveraging domain-level topical expectations to surface both unexpectedly emphasized and absent aspects, thereby supporting more transparent, decision-oriented summaries.

2. Modeling Domain Topical Expectations

DiSCo centers on constructing statistical reference models of what users typically discuss in comparable entities within a domain (e.g., hotels, resorts). The methodology proceeds as follows:

  • Aspect–Sentiment Extraction: Every sentence of every review is processed by an LLM-based aspect–sentiment extractor (e.g., GPT-5-mini with a custom prompt). Each sentence is mapped to one of 138 predefined aspect–sentiment tuples (e.g., “beach_view_positive”, “breakfast_quality_negative”).
  • Reference Distribution Definition: Aggregating aspect–sentiment counts over all accommodations in a domain yields a probability mass function

P(a) \;=\; \frac{N_{\rm ref}(a)}{\sum_{a'} N_{\rm ref}(a')},

where $N_{\rm ref}(a)$ is the total count of aspect–sentiment tuple $a$ in the reference corpus.

  • Entity-Specific Distribution: For a given listing ii, the observed distribution is

P_i(a) \;=\; \frac{N_i(a)}{\sum_{a'} N_i(a')},

where $N_i(a)$ is the count for aspect $a$ in listing $i$.

This construction operationalizes how domain topical expectations can be computed for subsequent contrast analysis.
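The two distributions can be sketched in a few lines, assuming aspect–sentiment tuples have already been extracted as strings per review sentence (the tuple labels and counts below are illustrative, not from the paper's data):

```python
from collections import Counter

def distribution(tuples):
    """Normalize aspect-sentiment tuple counts into a probability mass function."""
    counts = Counter(tuples)
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()}

# Illustrative extracted tuples: one per review sentence.
reference = distribution(
    ["beach_view_positive"] * 6
    + ["breakfast_quality_negative"] * 2
    + ["staff_friendliness_positive"] * 2
)
listing_i = distribution(
    ["staff_friendliness_positive"] * 4 + ["breakfast_quality_negative"] * 1
)

print(reference["beach_view_positive"])            # P(a) in the domain: 0.6
print(listing_i.get("beach_view_positive", 0.0))   # P_i(a); absent aspect -> 0.0
```

Aspects never mentioned in a listing simply receive zero probability mass, which is what later makes absences visible to the contrast step.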

3. Contrasting Observed Content with Expectations

To identify salient absences and overemphasized aspects, DiSCo computes per-aspect difference scores: $\Delta_i(a) = P_i(a) - P(a)$. Positive $\Delta_i(a)$ indicates that aspect $a$ is mentioned more frequently than domain norms; negative values flag aspects common in the domain but underrepresented or absent in the current listing.
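Under these definitions the difference score is a single comprehension over the union of aspect keys. A minimal sketch, assuming `P_i` and `P_ref` are dicts mapping aspect–sentiment tuples to probabilities (the example values are invented):

```python
def delta_scores(P_i, P_ref):
    """Per-aspect difference scores: delta_i(a) = P_i(a) - P(a); missing aspects count as 0."""
    aspects = set(P_i) | set(P_ref)
    return {a: P_i.get(a, 0.0) - P_ref.get(a, 0.0) for a in aspects}

# Toy example: "pool" is over-emphasized, "beach" is expected but absent.
delta = delta_scores(
    {"pool_positive": 0.6, "staff_positive": 0.4},
    {"pool_positive": 0.2, "staff_positive": 0.4, "beach_positive": 0.4},
)
print(delta)  # pool: +0.4 (over-emphasized), beach: -0.4 (expected but missing)
```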

Alternative statistical measures can be leveraged, such as:

  • Jensen–Shannon Divergence for distribution-level surprise:

D_{\rm JS}(P_i \,\|\, P) \;=\; \tfrac12\, D_{\rm KL}(P_i \,\|\, M) \;+\; \tfrac12\, D_{\rm KL}(P \,\|\, M), \quad \text{with } M = \tfrac12 (P_i + P),

allowing decomposition by feature.

  • Per-aspect z-scores for standardized deviations:

z_i(a) \;=\; \frac{P_i(a) - P(a)}{\sigma(a)},

where $\sigma(a)$ denotes the cross-listing standard deviation of $P_i(a)$.
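The Jensen–Shannon alternative can be computed directly over the two aspect PMFs. A sketch using base-2 logarithms, so the value lies in [0, 1] (smoothing conventions vary across implementations):

```python
import math

def js_divergence(P, Q):
    """Jensen-Shannon divergence between two aspect PMFs given as dicts (log base 2)."""
    aspects = set(P) | set(Q)
    M = {a: 0.5 * (P.get(a, 0.0) + Q.get(a, 0.0)) for a in aspects}

    def kl(A, B):
        # KL(A || B); terms with A(a) = 0 contribute nothing by convention.
        return sum(
            A.get(a, 0.0) * math.log2(A.get(a, 0.0) / B[a])
            for a in aspects
            if A.get(a, 0.0) > 0
        )

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

print(js_divergence({"a": 1.0}, {"a": 1.0}))  # 0.0 for identical distributions
print(js_divergence({"a": 1.0}, {"b": 1.0}))  # 1.0 for disjoint support
```

Unlike the per-aspect $\Delta_i(a)$, this yields one distribution-level surprise score, with the per-aspect contributions to the KL terms available for decomposition.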

In practice, DiSCo ranks aspects by $\Delta_i(a)$ and selects, for each entity:

  • Top 7 most-mentioned aspects (max $P_i$)
  • Top 7 over-emphasized aspects (max $\Delta_i$)
  • Top 7 absent but expected aspects (min $\Delta_i$), filtering out noise via thresholding (e.g., 1–2 percentage points).
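The selection step might look as follows; the top-7 size and the 2-percentage-point threshold follow the values stated above, while the function shape and example data are assumptions:

```python
def select_aspects(P_i, delta, k=7, threshold=0.02):
    """Pick the three aspect lists DiSCo feeds into the summary prompt."""
    most_mentioned = sorted(P_i, key=P_i.get, reverse=True)[:k]
    over_emphasized = [a for a in sorted(delta, key=delta.get, reverse=True)
                       if delta[a] > threshold][:k]
    expected_missing = [a for a in sorted(delta, key=delta.get)
                        if delta[a] < -threshold][:k]
    return most_mentioned, over_emphasized, expected_missing

P_i = {"pool_positive": 0.6, "staff_positive": 0.4}
delta = {"pool_positive": 0.4, "staff_positive": 0.0, "beach_positive": -0.4}
mm, over, missing = select_aspects(P_i, delta)
print(mm, over, missing)
```

Thresholding keeps aspects with near-zero $\Delta_i(a)$ (here `staff_positive`) out of both contrast lists, so only genuinely unusual emphases and absences reach the prompt.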

4. Summary Generation Pipeline

DiSCo's summary generation comprises a multi-step pipeline:

  1. Collect Reviews: Gather review corpora for all relevant entities within a domain.
  2. Aspect–Sentiment Extraction: Apply an LLM-driven extraction to map sentences to aspect–sentiment tuples.
  3. Distribution Analysis: Construct reference and entity-specific distributions; compute $\Delta_i(a)$.
  4. Prompt Construction: Structure input for the LLM, passing three lists to guide summary content:
    • Most mentioned topics
    • Topics mentioned more often than similar accommodations
    • Topics expected but missing
  5. Summary Synthesis: Instruct the LLM to generate an 80–120-word summary interleaving conventional coverage, unusual emphases, and explicit absences, with contextual comparison to peer entities (“other hotels in this district,” etc.).

Example prompt structure:

### Most mentioned topics:
[{"topic":"staff_friendliness", "pos":12, "neg":0}, …]
### Over-represented topics:
[{"topic":"mountain_views", "pos":5, "missing_but_common":false}, …]
### Expected but missing topics:
[{"topic":"beach_proximity", "pos":0, "missing_but_common":true}, …]

The generated output is expected to summarize what is said, highlight “unusually praised” aspects, and interpret “unmentioned but typically discussed” aspects in context.
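One way to assemble these prompt sections programmatically; the field names mirror the example above, while the serialization details are an assumption:

```python
import json

def build_prompt(most_mentioned, over_represented, expected_missing):
    """Render the three aspect lists into the sectioned prompt format."""
    return "\n".join([
        "### Most mentioned topics:",
        json.dumps(most_mentioned),
        "### Over-represented topics:",
        json.dumps(over_represented),
        "### Expected but missing topics:",
        json.dumps(expected_missing),
    ])

prompt = build_prompt(
    [{"topic": "staff_friendliness", "pos": 12, "neg": 0}],
    [{"topic": "mountain_views", "pos": 5, "missing_but_common": False}],
    [{"topic": "beach_proximity", "pos": 0, "missing_but_common": True}],
)
print(prompt)
```

The resulting string would be prepended to the summary-generation instruction, giving the LLM explicit material for all three content categories.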

5. Empirical Evaluation: User Study and Metrics

DiSCo’s efficacy was validated via a within-subjects user study with 270 participants (native English speakers from Prolific) across three domains: Ski, Beach, and City-center accommodations. Each participant assessed 18 summary pairs (DiSCo vs. standard LLM baseline) and rated them on five Likert dimensions:

  • Relevance
  • Detail & Specificity
  • Decision Support
  • Persuasive Impact
  • Ease of Understanding

Statistical analysis revealed:

  • Significantly higher ratings for DiSCo summaries on Detail & Specificity ($3.60 \rightarrow 4.23$, $t(269)=8.21$, $p<.001$, $d=0.62$) and Decision Support ($p<.01$, $d\approx0.20{-}0.30$), with a modest relevance improvement.
  • Lower Ease of Understanding for DiSCo (e.g., Beach domain $4.66 \rightarrow 4.06$), signaling a cognitive cost.
  • Domain-dependent overall preference: 70% favored DiSCo in Ski ($p<.001$), but no significant preference for Beach or City-center.
  • Qualitative feedback underscored a trade-off between fluency and informativeness; DiSCo was valued for “honesty” and “full picture,” while baseline summaries were praised for brevity.
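The reported statistics follow the standard paired-samples form (each participant rates both summary variants). A self-contained illustration on synthetic ratings, not the study data:

```python
import math

def paired_t_and_d(x, y):
    """Paired t statistic and Cohen's d for two matched rating lists."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    t = mean / (sd / math.sqrt(n))
    return t, mean / sd  # Cohen's d for paired data: mean diff / sd of diffs

# Synthetic example: DiSCo rated about one point higher for most raters.
disco    = [4, 5, 4, 5, 4, 5, 4, 5]
baseline = [3, 4, 4, 4, 3, 4, 3, 5]
t, d = paired_t_and_d(disco, baseline)
print(round(t, 2), round(d, 2))
```

With real data one would use `scipy.stats.ttest_rel` for the p-value; the sketch only shows how the $t$ and $d$ values in the list above relate to the rating differences.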

6. Broader Implications and Limitations

DiSCo substantiates two hypotheses: (a) making absences visible reduces presence bias, and (b) expectation violations serve as diagnostic signals for users. Because DiSCo engages slower, more deliberative "System 2" reasoning, user-level traits (e.g., "need for cognition") may moderate its benefits.

The generalizability of DiSCo is broad: any domain with implicit user expectations stands to benefit from contrasting observed feature distributions against reference models—examples include consumer finance (“interest-rate caps never discussed”), safety reports, or political speeches (“no mention of healthcare reform, though 80% of peer speeches address it”).

Primary limitations are:

  • Dependence on LLM-driven aspect extraction, validated by lightweight F1/Kappa checks
  • Possible misinterpretation of absences (could reflect reporting norms)
  • Laboratory evaluation rather than field deployment

Proposed directions for future research include adaptive disclosure strategies (“progressive reveal” of absences), conversational and visual analytics interfaces, measurement of real-world decision outcomes, and personalization via cognitive profiling.

7. Significance for Intelligent Summarization Interfaces

DiSCo illustrates that integrating domain-level expectations transforms absence into an explicit signal, elevating transparency and trust in intelligent interfaces. By operationalizing what is omitted as an actionable insight—rather than an incidental artifact—summarization systems gain diagnostic power, supporting higher-quality, expectation-aligned decision-making within and beyond the accommodation-review domain (Fainman et al., 12 Jan 2026).
