
BloomIntent: User-Centric Search Evaluation

Updated 25 September 2025
  • BloomIntent is a user-centric search evaluation framework that explicitly models diverse, fine-grained user intents rather than aggregate query performance.
  • It employs a two-stage pipeline for intent generation and contextualized LLM-based scoring to assess SERP satisfaction across clarity, relevance, reliability, and satisfaction.
  • Empirical studies validate its effectiveness with high intent relevance scores and actionable insights that drive targeted search engine improvements.

BloomIntent is a user-centric, LLM-driven search evaluation framework that departs from traditional query-level assessment by rigorously modeling and scoring a search engine's performance against a spectrum of fine-grained, realistic user intents. The system’s central innovation is shifting the unit of evaluation from entire queries and their aggregate performance to explicit, evaluable statements of diverse information-seeking goals—thereby enabling granular diagnostic insight and surfacing underserved or ambiguous cases that standard approaches typically conflate or obscure (Choi et al., 23 Sep 2025).

1. Fine-Grained Intent as the Evaluation Unit

BloomIntent operationalizes intent-level evaluation as the foundation for user-centric search assessment. For any given search query, which may be ambiguous or polysemous, the framework constructs a set of intent statements that represent distinctive plausible user goals. This modeling is grounded in adapted taxonomies of user attributes and in established classes of information-seeking intent. The system explicitly recognizes that if 100 people issue the same search query, they may each have a different goal; thus, the intent set for any query is diverse and designed to cover this breadth of possible user needs.

Intent statements in BloomIntent are instantiated as clear, sentence-level hypotheses, each describing a specific goal a user might have when issuing the query. These span factual, navigational, transactional, comparative, and exploratory forms, but are dynamically generated (not drawn from a closed list). The explicit modeling of such intent units enables the evaluation process to diagnose precisely which needs are met or unmet by a given search experience.
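The intent unit described above can be pictured as a simple record pairing a query with a sentence-level goal hypothesis. This is a minimal illustrative sketch; the field names and example intents are assumptions, not the paper's implementation, and BloomIntent generates intent categories dynamically rather than from a closed list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentStatement:
    query: str      # the base search query
    statement: str  # sentence-level hypothesis of the user's goal
    kind: str       # illustrative label, e.g. "factual", "transactional"

# One ambiguous query fans out into several distinct, evaluable goals.
intents = [
    IntentStatement("python", "Learn the basic syntax of the Python language", "factual"),
    IntentStatement("python", "Find care instructions for a pet python", "factual"),
    IntentStatement("python", "Download the latest Python interpreter", "transactional"),
]

assert len({i.statement for i in intents}) == 3
```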

2. Methodology: Intent Generation and Contextualization Pipelines

BloomIntent’s methodology is structured as a two-stage pipeline:

Intent Generation Pipeline:

The system begins with query enrichment. For each base query, BloomIntent synthesizes expanded queries by integrating:

  • Search log co-occurrence and query reformulation patterns,
  • External background knowledge harvested from search engines (e.g., Google, Naver),
  • User attribute data (e.g., device, location, language, demographics where available).

Each query expansion is mapped to intent taxonomies and then converted into explicit, evaluable sentences (intent statements). These taxonomies systematically surface plausible goals that real users may hold, addressing both common and edge-case needs.
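The enrichment steps above can be sketched as follows. The data sources, the `enrich_query` helper, and the `to_intent_statement` stub (standing in for the LLM call) are all hypothetical names for illustration, not the paper's actual code.

```python
def enrich_query(base_query, search_logs, background_facts, user_attrs):
    """Combine log co-occurrences, external knowledge, and user
    attributes into a set of expanded query variants."""
    expansions = set()
    # 1. Reformulations that co-occur with the base query in search logs.
    expansions.update(search_logs.get(base_query, []))
    # 2. Background knowledge terms appended to the base query.
    expansions.update(f"{base_query} {fact}" for fact in background_facts)
    # 3. Attribute-conditioned variants (e.g. location-aware).
    expansions.update(f"{base_query} near {loc}"
                      for loc in user_attrs.get("locations", []))
    return sorted(expansions)

def to_intent_statement(expanded_query):
    # Placeholder for the LLM step that turns an expanded query
    # into an explicit, evaluable sentence.
    return f"The user wants to find information about '{expanded_query}'."

logs = {"jaguar": ["jaguar speed", "jaguar price"]}
facts = ["animal habitat"]
attrs = {"locations": ["dealership"]}

expanded = enrich_query("jaguar", logs, facts, attrs)
statements = [to_intent_statement(q) for q in expanded]
```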

Intent Contextualization Pipeline:

Given each intent statement, the system uses LLMs to automatically assess the degree to which a Search Engine Results Page (SERP) satisfies the specific intent. Evaluations are made across four dimensions—satisfaction, relevance, clarity, and reliability—by prompting the LLM with structured evaluation templates that output binary or ordinal scores at the intent level. To manage scalability and facilitate analysis, BloomIntent clusters semantically similar intents (using embedding-based similarity and BERTScore), then aggregates and visualizes results in an interactive web interface that supports deep diagnostic exploration.
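A toy sketch of this stage: score each intent on the four dimensions (a stub scorer stands in for the structured LLM prompt), then group semantically similar intents by cosine similarity. The hand-made toy embeddings and the greedy clustering routine are assumptions for illustration; a real system would use a sentence encoder and the paper's embedding/BERTScore-based grouping.

```python
import math

DIMENSIONS = ("satisfaction", "relevance", "clarity", "reliability")

def score_serp(intent, serp_text):
    """Stub for the LLM evaluation: a binary score per dimension."""
    hit = any(word in serp_text.lower() for word in intent.lower().split())
    return {dim: int(hit) for dim in DIMENSIONS}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each item to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed embedding, member indices)
    for i, emb in enumerate(embeddings):
        for seed, members in clusters:
            if cosine(emb, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((emb, [i]))
    return [members for _, members in clusters]

# Two near-duplicate intents collapse into one cluster; the third stands alone.
embs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
assert greedy_cluster(embs) == [[0, 1], [2]]
```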

3. Empirical Evaluation and Benchmarks

BloomIntent’s technical correctness and practical utility are supported by three empirical studies:

  • Intent Generation Quality: Expanded queries generated by BloomIntent are compared to real user reformulations using BERTScore. Manual validation further shows that 84% of these automatically generated queries are judged relevant and 86% plausible with respect to actual search behavior, confirming that the method robustly captures both typical and atypical user goals.
  • Intent Statement Quality vs. Baselines: In direct comparison (head-to-head) with a strong baseline system (MILL), BloomIntent’s generated intents are more frequently rated as evaluable (i.e., operationalizable for downstream automated or human judgment) and more realistic in capturing plausible user needs. The system’s intents are also richer semantically, although they may exhibit slightly lower lexical diversity than those produced by the baseline.
  • LLM-Based Evaluation Reliability: When BloomIntent’s automated SERP-level satisfaction judgments are compared against expert human raters, the system achieves 72.1% accuracy with Cohen’s κ = 0.445 on the satisfaction dimension, a level considered moderate inter-rater reliability for human judgment benchmarks. Agreement is strongest for unambiguous cases; for highly ambiguous queries, LLM evaluation provides scalable triage, while selective human intervention may further improve precision.
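Cohen’s κ, the agreement statistic reported above, corrects raw accuracy for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e the chance agreement implied by each rater’s label frequencies. A minimal stdlib computation (the rating lists are made-up toy data, not the paper’s):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

llm_scores   = [1, 1, 0, 1, 0, 1, 0, 0]
human_scores = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(llm_scores, human_scores)  # 0.75 raw agreement -> kappa 0.5
```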

4. Case Study: Diagnosing Search Experience Gaps

A structured observation with professional search specialists (n=4) illustrates BloomIntent’s concrete value in expert-driven search quality analysis. The user-intent clusters immediately revealed not only which facets of user need were met or unmet by a given SERP, but also uncovered uncommon or underserved needs (e.g., comparative or visual intents) often lost in aggregate query-level metrics. Experts reported that this facilitated actionable diagnosis and prioritization: for example, recommending the introduction of richer product comparison modules or clarifying ambiguous interfaces for specific user segments.

5. System Impact and Actionability

By decomposing queries into a spectrum of specific intents, BloomIntent equips developers and analysts to move beyond generic, coarse metrics—e.g., mean average precision at the query level—toward actionable, intent-targeted improvement. Misalignments can be directly attributed to particular information needs (e.g., only one intent cluster is unsatisfied), thus triaging engineering or design effort. This granular feedback loop directly supports iterative search engine development, enabling targeted improvements in SERP layouts, content ranking, and personalization modules with respect to real, multi-faceted user goals.
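The triage step described above can be sketched as a simple aggregation: average satisfaction per intent cluster, then flag clusters falling below a threshold. The cluster names, scores, and threshold are illustrative assumptions, not values from the paper.

```python
def flag_unsatisfied(cluster_scores, threshold=0.5):
    """cluster_scores: {cluster_name: [0/1 satisfaction scores per intent]}
    Returns the clusters whose mean satisfaction falls below the threshold."""
    return sorted(
        name for name, scores in cluster_scores.items()
        if sum(scores) / len(scores) < threshold
    )

scores = {
    "price comparison": [0, 0, 1],  # mostly unmet -> engineering target
    "product specs":    [1, 1, 1],  # well served
    "store locations":  [1, 0, 1],
}
assert flag_unsatisfied(scores) == ["price comparison"]
```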

A summary of methodological differences is given below:

  Principle             | Conventional Search Evaluation | BloomIntent Approach
  ----------------------|--------------------------------|------------------------------------
  Evaluation unit       | Query/session level            | Fine-grained intent statement level
  Diversity modeling    | Aggregated, implicit           | Explicit, taxonomy-anchored
  Diagnostic precision  | Low                            | High (faceted, per-intent)
  Evaluation workflow   | Human or crude metric          | LLM-assisted, structured, scalable
  Actionability         | Limited                        | Direct, intent-targeted

6. Limitations and Future Directions

BloomIntent achieves only moderate (not perfect) agreement with expert raters, especially for ambiguous queries. One direction for further research is hybrid human-in-the-loop calibration, enabling selective expert review of intents where LLM-based judgments are less confident or where inter-rater variance is high. Another opportunity is extending the framework to non-textual and multimodal search scenarios such as image and video retrieval, or to conversational, multi-turn search tasks where user intent evolves over time.

Refinements in balancing intent novelty and plausibility (so that rare but meaningful needs are surfaced but not overrepresented) remain open. Further, exploiting diverse user attribute priors could enable more personalized evaluations—e.g., exploring how “intent satisfaction” for the same SERP differs between age, language, or device user cohorts.

7. Broader Relevance and Implications

The framework’s formalization and empirical validation mark a shift in user modeling for search evaluation: from treating user need as a monolithic, latent factor to considering it explicitly as a multi-faceted, taxonomically-grounded set of potential goals. The systematic construction and LLM-based evaluation of these intent sets exemplifies a scalable paradigm for large-scale, user-aligned IR system assessment. This approach is extensible to other domains—such as recommender evaluation, dialog systems, and AI-driven UX research—where the granularity of intent modeling is crucial for bridging the gap between system-centric metrics and real-world user satisfaction (Choi et al., 23 Sep 2025).
