Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from Large Language Models

Published 3 Oct 2023 in cs.HC and cs.AI | (2310.02161v4)

Abstract: Sensemaking in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria. Prior research and our formative study found that people would benefit from reading an overview of an information space upfront, including the criteria others previously found useful. However, existing sensemaking tools struggle with the "cold-start" problem -- it not only requires significant input from previous users to generate and share these overviews, but such overviews may also turn out to be biased and incomplete. In this work, we introduce a novel system, Selenite, which leverages LLMs as reasoning machines and knowledge retrievers to automatically produce a comprehensive overview of options and criteria to jumpstart users' sensemaking processes. Subsequently, Selenite also adapts as people use it, helping users find, read, and navigate unfamiliar information in a systematic yet personalized manner. Through three studies, we found that Selenite produced accurate and high-quality overviews reliably, significantly accelerated users' information processing, and effectively improved their overall comprehension and sensemaking experience.

Summary

The paper introduces Selenite, a framework using GPT-4 and zero-shot NLI to generate detailed overviews of complex information spaces.
It employs a Chrome extension architecture to extract topics, criteria, and options, achieving high recall and enhanced user efficiency.
Empirical evaluations demonstrate reduced cognitive load, improved precision, and robust scalability across diverse application domains.

Scaffolding Sensemaking: A Technical Review of "Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from LLMs"

Introduction

"Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from LLMs" (2310.02161) addresses a fundamental barrier in web-scale sensemaking: the prohibitive cost of developing a comprehensive and unbiased overview of complex, unfamiliar information spaces, especially at the cold start of the sensemaking process. Conventional systems depend on either prior expert curation or manual user effort, leading to incomplete, biased, or fragmented overviews. Selenite circumvents these bottlenecks by leveraging LLMs (specifically GPT-4) as both an implicit knowledge base and a reasoning machine to automatically scaffold criteria, options, and navigation aids for unfamiliar domains.

This essay examines Selenite’s underlying architecture, its integration of LLM and zero-shot NLI methods, user interface strategies, and the empirical findings derived from intrinsic and extrinsic evaluations. Emphasis is placed on implementation details, design trade-offs, observed numerical effects, potential limitations, and broader implications for interactive human-AI sensemaking systems.

Selenite Architecture and Workflow

Selenite operationalizes three principal design goals: (D1) upfront construction of a global overview via common criteria and options; (D2) local comprehension through page- and paragraph-level summaries, alongside contextually grounded annotation; and (D3) dynamic suggestion of sensemaking next steps based on individual user coverage and reading provenance.

The core architecture is embodied as a Chrome extension, realized in TypeScript/React, with backend services on Google Firebase and GPU-accelerated ML inference for zero-shot NLI, using BART/MNLI models.

Principal System Workflow:

Topic and Options Extraction: On page load, Selenite frames topic inference as an LLM-based summarization problem, querying GPT-4 with the page title and opening paragraphs. For extraction of "options" (e.g., products in comparative reviews), Selenite partitions page content into manageable chunks (to respect LLM context window constraints), then parallelizes GPT-4 queries to extract candidate options.
Criteria Elicitation and Refinement: Selenite prompts GPT-4 iteratively (using Self-Refine methods) to generate, expand, and rank around 20–25 commonly considered criteria per domain, yielding lists of (criterion, description) string tuples. Users can additionally post-edit, add, or request further diversification of the criteria in the overview.
Zero-Shot NLI-based Annotation: For each paragraph, criteria coverage is computed using a BART-large-MNLI model as a multi-label zero-shot classification task, with a threshold tuned for recall (typical label probability ≥ 0.96), and enriched with per-paragraph in situ annotation.
Structured Navigation and Analysis Tools: Selenite presents all criteria and options in a persistent sidebar, supports criterion-scoped navigation ("previous/next" buttons), and exposes a "zoom in" workflow (invoking GPT-4 again) to disambiguate convoluted paragraphs, labeling phrases with aspect/sentiment granularity.
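
The chunk-then-parallelize step for option extraction can be illustrated with a minimal Python sketch. It is not Selenite's implementation (the extension itself is described as TypeScript/React). The character budget, the function names, and the `fake_extract` stand-in for a GPT-4 call are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_paragraphs(paragraphs: List[str], max_chars: int) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the blank-line separator used when joining paragraphs.
        extra = len(para) + (2 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [para], len(para)
        else:
            current.append(para)
            current_len += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def extract_options(
    paragraphs: List[str],
    extract_fn: Callable[[str], List[str]],
    max_chars: int = 4000,  # assumed budget, not a value from the paper
    max_workers: int = 4,
) -> List[str]:
    """Run extract_fn on every chunk in parallel and merge results in page order."""
    chunks = chunk_paragraphs(paragraphs, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(extract_fn, chunks))  # map preserves chunk order
    seen, merged = set(), []
    for options in per_chunk:
        for option in options:
            key = option.strip().lower()
            if key and key not in seen:  # case-insensitive de-duplication
                seen.add(key)
                merged.append(option.strip())
    return merged


def fake_extract(chunk: str) -> List[str]:
    """Hypothetical stand-in for an LLM call: picks tokens starting with 'Stroller'."""
    return [w.strip(".,") for w in chunk.split() if w.startswith("Stroller")]
```

In a real deployment `extract_fn` would wrap an LLM request. Packing whole paragraphs keeps each option mention intact within a single chunk, at the cost of slightly uneven chunk sizes.
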
Figure 1: The main Selenite interface sidebar, demonstrating global overview, encountered options, local annotations, and progress summaries for guided sensemaking.

Figure 2: Main workflow: after landing on a page, users receive a global overview (criteria/options), in-situ paragraph annotation, and dynamic search suggestions on exit.

Figure 3: Fast navigation to criteria mentions via structured UI affordances.

Figure 4: Zoom-in analysis: paragraph-level "Analyze" button triggers LLM-powered segmentation of phrases by criterion and sentiment polarization.
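
The zero-shot annotation step in the workflow above turns NLI outputs into per-criterion label probabilities and keeps the criteria that clear the threshold. The sketch below shows one common way to do this for multi-label zero-shot classification: a softmax over the entailment and contradiction logits only, ignoring the neutral logit. The exact conversion used by Selenite is not specified in this review, so the formula, the function names, and the example logits are assumptions. Only the 0.96 threshold comes from the text.

```python
import math
from typing import Dict, List, Tuple


def label_probability(entail_logit: float, contra_logit: float) -> float:
    """Softmax over (entailment, contradiction); the neutral logit is ignored."""
    m = max(entail_logit, contra_logit)  # subtract the max for numerical stability
    e = math.exp(entail_logit - m)
    c = math.exp(contra_logit - m)
    return e / (e + c)


def covered_criteria(
    logits: Dict[str, Tuple[float, float]],
    threshold: float = 0.96,
) -> List[str]:
    """Return criteria whose label probability meets the threshold.

    `logits` maps each criterion to (entailment_logit, contradiction_logit)
    for one paragraph, as an NLI model would produce for the hypothesis
    built from that criterion's description.
    """
    return [
        name
        for name, (entail, contra) in logits.items()
        if label_probability(entail, contra) >= threshold
    ]
```

Because each criterion is scored independently, a paragraph can be tagged with zero, one, or several criteria, which is what the multi-label framing requires.
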

Implementation Details, Models, and Scalability

LLM and NLI Pipelines

Topic and Criteria Elicitation: GPT-4, temperature 0.3, using multi-stage prompt chaining. To reduce latency and avoid rate limiting, dual-API and retry logic are implemented, with real-world response times typically <10s per prompt even under load.
Paragraph Annotation: BART-large-MNLI, served via batched GPU inference over a cloud API. Paragraph-level processing is parallelized; empirical thresholds are tuned to favor recall, accepting some spurious annotations in exchange for fewer omitted criteria (omissions are more disruptive for navigation, per user study feedback).
"Zoom-in" Deep Analysis: Multi-step GPT-4 prompt orchestration, first extracting relevant phrases per criterion and performing aspect-based sentiment classification, then labeling with candidate options.

UI/UX Considerations

Sidebar: Present throughout the browsing session, it aggregates criteria and options and highlights local coverage (option/criteria presence determined by NLI models and LLM extraction).
In-situ Annotations: Rendered as badges atop paragraphs, previewing covered criteria; users can use them for non-linear skimming/navigation.
Progress Summaries: On page exit, the sidebar offers explicit coverage feedback—criteria seen/skipped—and proposes maximal-coverage search queries for unexplored dimensions (using a semantic diversity/relevance algorithm on top of embedding-space distance calculations).
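
The review does not spell out the "semantic diversity/relevance algorithm" behind the suggested queries. One plausible reading is a greedy, maximal-marginal-relevance style selection over criterion embeddings, sketched below. The scoring formula, the `trade_off` weight, the toy 2-D vectors, and the query template are all assumptions, not details from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_criteria(
    unseen: Dict[str, Sequence[float]],
    topic_vec: Sequence[float],
    k: int = 2,
    trade_off: float = 0.3,  # assumed weight: lower values favor diversity
) -> List[str]:
    """Greedily pick up to k unseen criteria, balancing relevance and novelty."""
    selected: List[str] = []
    candidates = dict(unseen)

    def score(name: str) -> float:
        relevance = cosine_similarity(candidates[name], topic_vec)
        # Redundancy is the highest similarity to any already-selected criterion.
        redundancy = max(
            (cosine_similarity(candidates[name], unseen[s]) for s in selected),
            default=0.0,
        )
        return trade_off * relevance - (1.0 - trade_off) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        del candidates[best]
    return selected


def build_queries(topic: str, criteria: List[str]) -> List[str]:
    """Hypothetical query template: topic followed by the criterion name."""
    return [f"{topic} {criterion}" for criterion in criteria]
```

With a diversity-leaning weight, a near-duplicate of an already chosen criterion is skipped in favor of a less similar one, which matches the stated goal of steering users toward unexplored dimensions.
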

Empirical Evaluations and Metrics

Intrinsic Measurement

Selenite’s intrinsic capability was benchmarked on 10 representative, high-diversity domains (e.g., “best baby strollers”, “best air purifiers”, “birthday gift ideas”) by comparing LLM/NLI-extracted criteria and options against ground truth constructed by aggregation of top-5 Google search results and careful multi-annotator unification. Quantitative metrics:

Metric	Topic-level Mean	Paragraph-level Mean
Precision	0.80	0.85
Recall	0.95	0.98
F1-score	0.87	0.91
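
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported means. A mean of per-domain F1 scores need not equal the F1 of the mean precision and recall in general, so this is only a sanity check, not a reproduction of the paper's computation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported means from the table above.
topic_f1 = f1_score(0.80, 0.95)      # about 0.869
paragraph_f1 = f1_score(0.85, 0.98)  # about 0.910
```

Both values round to the F1 figures in the table (0.87 and 0.91).
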

Strong recall at both levels indicates that, for most domains, Selenite surfaces a superset of user-identified criteria—critical to mitigating anchoring bias. No substantial hallucination of irrelevant criteria was observed on sampled topics.

Option extraction via GPT-4 achieved 100% accuracy relative to human-curated options per page, suggesting that LLM-based extraction can outperform HTML/tag heuristics and is robust to semantic divergence and non-uniform page structure.

Human-Centric Usability and Comprehension Gains

Efficiency and Coverage:

In controlled within-subjects studies (n=12), Selenite reduced average sensemaking task completion times by 36.3% and increased the number of valid criteria identified by ~90% compared to the baseline. Users also achieved significantly higher precision (from 78.4% to 98.8%, p<0.05) and recall (from 30.4% to 73.0%, p<0.05) vs. ground truth.

Cognitive Load and Satisfaction:

NASA TLX scores confirmed significantly reduced mental demand, temporal demand, and effort, with increased perceived performance (paired t-tests, p<0.05). System Usability Scale (SUS) medians were 6–7 on a 7-point scale for comprehensibility and recommendability.

Behavioral Adaptations:

Qualitative studies revealed that after a short trust calibration phase, most users shifted from skimming entire articles to criteria-first, selective reading, guided directly by Selenite's in-situ metadata and navigation affordances. Search behavior diversified, with Selenite users issuing more queries and encountering more unique, high-utility information.

Design Trade-offs and Limitations

Knowledge Quality vs. Overload

By design, Selenite optimizes for coverage at the cost of higher annotation density, which can, in rare cases, distract or overwhelm users, especially in unrelated or repetitive page sections. As LLM-generated criteria lists grow, effective UI affordances (collapsible sections, faded hierarchies, or knowledge graphs) become important, but Selenite currently defaults to a flat, ranked-list presentation.

Domain Validity and LLM Limitations

LLM-based world knowledge ensures broad domain validity but may underperform in highly specialized or emergent domains where training data coverage is lacking. For options, Selenite mitigates potential staleness by extracting from local page context; for criteria, the risk is lower due to stability of comparative dimensions across time, but could be addressed with hybrid RAG approaches in future implementations.

Annotation Robustness

The NLI model, while demonstrating high recall, is sensitive to perturbations in criterion descriptions; users may need to intervene occasionally by editing/fusing criteria or adjusting label thresholds. Ground-truth mismatches typically arise due to correlated or hierarchical criteria (e.g., "innovation" subsuming "growth speed" in deep learning frameworks); exposing criteria connection metadata is planned for future iterations.

User Overdependence and Anchoring Bias

Exposure to criteria and progress summaries at reading onset and exit introduces anchoring risk, but empirical evaluations found that Selenite-proposed criteria typically cover a strict superset of those surfaced by unaided users. Still, further UI work is needed to encourage serendipitous exploration and critical engagement, such as progressive disclosure or counterfactually-suggested criteria.

Engineering and Computational Considerations

Response Latency: LLM API requests per page (topic/criteria extraction) are pre-fetched and cached per session; per-paragraph NLI inference is parallelized as cloud function calls, with L4 calculation yielding sub-second latency for typical pages in common domains.
Scaling: The dual-API, retry, and session cache architecture supports moderate concurrent user counts; with future public API quotas for GPT-4 or equivalents, this could be further optimized by batch processing or moving to open-source LLM backends.
Extensibility: Selenite’s architecture is task-independent at the core; domain prompt design can be extended for non-comparative sensemaking (debugging, skill learning) by changing LLM instruction templates.
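
The dual-API, retry, and session-cache pattern described above can be sketched as follows. This is an illustrative Python sketch rather than Selenite's code: the backends are plain callables standing in for two API endpoints, and the number of rounds, the exponential backoff schedule, and the class and function names are assumptions.

```python
import time
from typing import Callable, Dict, Optional, Sequence


class AllBackendsFailed(RuntimeError):
    """Raised when every backend has failed in every retry round."""


def call_with_fallback(
    prompt: str,
    backends: Sequence[Callable[[str], str]],
    max_rounds: int = 3,      # assumed retry budget
    base_delay: float = 0.5,  # assumed initial backoff in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each backend in order; back off exponentially between full rounds."""
    last_error: Optional[Exception] = None
    for attempt in range(max_rounds):
        for backend in backends:
            try:
                return backend(prompt)
            except Exception as exc:  # e.g. rate limiting or a timeout
                last_error = exc
        if attempt < max_rounds - 1:
            sleep(base_delay * (2 ** attempt))
    raise AllBackendsFailed(
        f"all backends failed after {max_rounds} rounds"
    ) from last_error


class SessionCache:
    """Memoize results per key so repeated page visits reuse earlier responses."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key not in self._store:
            self._store[key] = self._fetch(key)
        return self._store[key]
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. A cache keyed on the topic string means the criteria prompt is issued once per session for a given topic.
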

Implications and Future Directions

Practical Applications

Selenite’s methodology is generalizable to any information foraging or sensemaking workflow constrained by knowledge disparity and information overload. It is particularly applicable for:

Rapid onboarding in technical domains (e.g., new software framework comparisons)
Consumer or B2B product comparison portals
Cross-domain knowledge graph bootstrapping
Reading aids for accessibility and cognitive scaffolding

Theoretical Significance

This work demonstrates that LLMs, when prompted with top-down, context-anchored instructions and combined with zero-shot NLI annotation, can operationalize human-expert "overviews" with high coverage and accuracy, matching—if not exceeding—manual annotation pipelines at a fraction of the cost and latency.

It also affirms that sensemaking systems need not be bottlenecked by prior expert curation, pushing the boundary of cold-start support in user modeling, information management, and HCI.

Prospective Developments

Integration with RAG and verifiability models: Using external corpus RAG (retrieval-augmented generation) to supplement LLM world knowledge for specialized or dynamically-evolving domains.
Knowledge structures: Moving from flat lists to hierarchical or graph-based criterion representations for improved cognitive ergonomics and scalable exploration.
Beyond comparison tasks: Adapting the design goals and prompts for open-ended skill acquisition, troubleshooting, or investigative journalism domains.
Field deployment and analytics: Large-scale, long-term field studies to observe longitudinal effects on user sensemaking patterns, anchoring behavior, and knowledge retention.

Conclusion

Selenite establishes a robust, extensible architecture for LLM-powered, context-grounded sensemaking scaffolding on the web, validated by both strong empirical metrics and high user adoption in behavioral studies. By combining global overviews, fine-grained automated annotation, and actionable guidance, Selenite provides a paradigm shift in how systems can lower the cost of entry into unfamiliar information spaces and support the active construction of comparative mental models.

The broader implication is a strengthening of interactive, human-in-the-loop AI collaboration protocols, where LLMs act as both collaborator and scaffold—raising the baseline for sensemaking and decision support in digital environments.

Figure 5: Example user study material: the "best baby strollers" article as used in option/criteria evaluation and user-guided information extraction analysis.

Explain it Like I'm 14

Overview: What’s this paper about?

This paper introduces Selenite, a helpful tool you add to your web browser. Its goal is to make it easier to understand new, complicated topics online, like choosing the best baby stroller or comparing different coding tools, by giving you a clear, organized “big-picture” overview right away. It uses a very smart AI (a large language model, or LLM, like GPT-4) to suggest what’s most important to look for and then guides you as you read so you don’t get lost or waste time.

The key questions the researchers asked

The paper asks three simple questions:

Can an AI quickly give people a useful “map” of a topic—what options exist and which qualities (criteria) matter most—so they start off with good guidance?
Can this map help people read smarter, by pointing out what’s in each page and paragraph so they can jump to the parts they care about?
Does this actually help people understand faster and make better decisions?

How did they do the research? (Methods and approach)

Think of learning a new topic like exploring a new city:

A “big-picture map” helps you see the main neighborhoods and landmarks (the important criteria).
Street signs and summaries help you navigate each block (paragraph) without wandering aimlessly.
Suggestions for where to go next help you discover new places you haven’t visited yet.

Selenite gives you all three:

It creates a global overview: When you open a page about a topic (say, “best baby strollers”), Selenite’s sidebar shows common criteria people care about (like safety, maneuverability, durability) and the options mentioned on the page.
It helps at the page and paragraph level: It highlights which criteria each paragraph talks about and lets you jump between all the paragraphs that discuss the same thing (like maneuverability) across the page.
It guides your next steps: At the end of a page, it summarizes what you’ve covered and suggests smart search ideas to find new information you haven’t seen yet.

To build this, the researchers used:

LLMs, such as GPT-4, which are computer programs trained on lots of text and can generate helpful summaries and lists. They asked GPT-4 to:
- Recognize the topic of a page.
- List commonly important criteria for that topic (like a checklist).
- Help explain tricky paragraphs (what’s positive, negative, or neutral about each criterion).
A classification model (think “smart tagger”) that can read a paragraph and decide which criteria it discusses, so Selenite can label paragraphs and help you navigate.
A Chrome extension interface so everything appears alongside the page you’re reading.

They ran three types of studies:

A formative study (interviews) to learn what people struggle with when reading unfamiliar topics online.
An intrinsic evaluation to check if the AI-generated overviews were accurate and high-quality.
Usability and case studies to see if Selenite actually helped people read faster and understand better.

They also used techniques to make the AI’s results more reliable, like “Self-Refine,” which asks the AI to improve its own answers, and grounding the AI’s summaries in the actual page content you’re reading, so it’s easier to verify.

What did they find and why it matters

The main results were positive:

Selenite’s overviews were accurate and high-quality: The criteria lists and summaries were reliable enough to be useful right away.
It sped up reading and decision-making: People got to the important parts faster because they knew what to look for and where to find it.
It improved comprehension: By labeling paragraphs and providing quick summaries, readers better understood complicated sections and didn’t miss important details.
It made the whole experience less overwhelming: Starting with a clear overview and having smart navigation kept people focused and reduced confusion.

Why this matters: When you don’t know much about a topic, it’s easy to miss key ideas or get stuck reading repetitive, unhelpful content. Selenite acts like a friendly guide—it shows you the big picture, points out the valuable parts, and helps you move on to find new information. That can lead to better choices, whether you’re buying something, learning a skill, or researching for school.

What does this mean for the future? (Implications)

This research suggests that future reading and research tools should do more than store notes—they should help you understand from the start. Tools like Selenite could:

Help students and professionals quickly build a strong mental model of a new topic.
Reduce wasted time on duplicate or low-value pages by previewing what matters.
Support better decision-making with clear criteria and side-by-side comparisons.
Be integrated into browsers or search engines to make exploring new topics easier for everyone.

There’s still room to improve—AI can make mistakes (called “hallucinations,” where it says something that isn’t true), so it’s important that tools keep grounding their answers in real page content and make verification easy. But overall, Selenite shows that AI can be a powerful reading companion: it gives you a map, guides your steps, and helps you discover more—all while keeping you in control.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored based on the paper’s current scope and evidence.

External validity across domains: Assess how Selenite performs beyond consumer-style comparisons (e.g., scientific literature reviews, health/medical, legal, political, civic information) where stakes, ambiguity, and domain conventions differ.
Non-English and cross-lingual robustness: Evaluate topic recognition, criteria retrieval, and paragraph classification for non-English content and mixed-language pages; replace/extend English-centric models (e.g., BART-large-MNLI) with multilingual alternatives.
Recency and world-knowledge limits: Quantify how GPT-4’s knowledge cutoff affects criteria completeness for rapidly evolving domains; compare against retrieval-augmented generation (RAG) or web-grounded evidence for up-to-date criteria.
Hallucinations and factuality: Systematically measure hallucination/falsehood in criteria lists and paragraph-level explanations; test mitigation strategies (Self-Refine, citations, RAG) and quantify their impact.
Bias and fairness in criteria: Identify demographic, cultural, and topical biases in LLM-elicited “commonly considered” criteria; study how criteria differ across cultures and contexts, and design methods to surface multiple, pluralistic framings.
Anchoring and confirmation effects: Test whether top-down criteria prime/anchor users, narrow exploration, or reinforce confirmation bias; design and evaluate countermeasures (e.g., alternative framings, randomized exposure, “explore outside the list” prompts).
Provenance and auditability: Add explicit citations and source links for each generated criterion and paragraph-level claim; study how provenance affects trust, verification behavior, and error correction.
Transparency and uncertainty communication: Expose calibrated confidence or uncertainty for topic labels, criteria presence, and sentiment; evaluate how such signals influence user decisions and error handling.
Topic recognition accuracy and scope: Rigorously benchmark GPT-4–based topic inference and SentenceBERT clustering across diverse, ambiguous, or multi-topic pages; define thresholds, handling of multi-label topics, and user override workflows.
Options extraction robustness: Evaluate extraction across heterogeneous web structures (boilerplate-heavy pages, dynamic content, tables, lists, carousels, comments, ads); compare against DOM-aware and boilerplate-removal baselines.
Paragraph-level classification validity: Provide ablations and benchmarks for the 0.96 NLI threshold, precision–recall tradeoffs, latency, and cost; compare zero-shot vs. fine-tuned, domain-adapted models.
Adversarial robustness and prompt injection: Analyze how untrusted web content can manipulate LLM prompts/outputs; implement and evaluate sanitization and isolation strategies for safe prompting.
Privacy and data handling: Specify and evaluate policies for sending page content to remote APIs (PII handling, data retention, on-device alternatives), and assess user perceptions of privacy.
Latency, scalability, and cost: Measure end-to-end performance and cost at scale (many tabs/pages), caching strategies, batching, offline modes, and feasibility on lower-resource devices.
Accessibility: Assess color schemes (red/green sentiment), keyboard navigation, screen-reader compatibility, and cognitive load; provide accessible alternatives to color-coded annotations.
UI overload and attentional impact: Quantify cognitive load from sidebars, annotations, and “zoom-in” views; study attention fragmentation and optimal granularity/levels-of-detail controls.
Personalization and adaptive learning: Explore models that learn user-specific criteria importance, vocabulary/jargon preferences, and reading goals over time without reinforcing pre-existing biases.
Longitudinal and field evaluations: Move beyond short lab/usability studies to longitudinal, in-the-wild deployments measuring learning, knowledge retention, decision quality, satisfaction, and sustained adoption.
Collaboration and social transparency: Extend to multi-user sensemaking (shared criteria sets, provenance/versioning, conflict resolution) and study how group dynamics affect bias, coverage, and trust.
Novelty and de-duplication in search suggestions: Define algorithms and metrics for “novel information gain”; evaluate end-of-page query suggestions for diversity, redundancy reduction, and avoidance of filter bubbles.
Handling multimodal and non-HTML sources: Extend to PDFs, videos, slides, code repositories, and datasets; evaluate extraction/annotation quality for multimodal content and structured data.
Quantitative attribute extraction: Support accurate extraction/normalization of numeric data (units, ranges, uncertainty), and evaluate automatic table generation and consistency checking across sources.
Failure handling and recoverability: Design interactions to flag, correct, and learn from system errors (misclassified criteria, missed options) and measure how quickly users detect and fix issues.
Ethical impacts on critical reading: Examine whether overviews and annotations reduce independent critical reading, exploration breadth, or critical thinking; design nudges for balanced skepticism and verification.
Cultural adaptation of criteria: Investigate how “commonly considered” criteria vary by region, culture, and norms; incorporate locale-aware templates and user-selectable cultural profiles.
Evaluation benchmarks and gold standards: Create open, expert-annotated benchmarks for “criteria comprehensiveness” and paragraph-level mappings to support reproducible comparisons across methods.
Integration and interoperability: Study export/import, APIs, and integration with note-taking, reference managers, and search engines; assess friction, data provenance, and workflow fit.
Security and extension threat model: Define and test a browser-extension security model, required permissions, local storage policies, and defenses against malicious pages and XSS-style attacks.
Vendor dependence and reproducibility: Address reliance on proprietary GPT-4; compare with open models, distillation/on-device variants, and report reproducible pipelines and hyperparameters.
Metrics for “comprehensiveness” and “quality”: Operationalize and validate metrics for overview coverage, diversity, and usefulness; relate them to downstream decision outcomes and user satisfaction.

Glossary

Anchoring bias: A cognitive bias where initial information disproportionately influences subsequent judgments; in prompting, early outputs can bias later ones. "To minimize potential anchoring biases, we strive to achieve a balance between relevance and diversity in our prompting strategy."
BART-large-MNLI: A BART model fine-tuned on Multi-Genre Natural Language Inference, commonly used for zero-shot classification tasks. "We used the bart-large-mnli model"
Chain-of-thought prompting: A prompting technique that elicits step-by-step reasoning in LLMs to improve problem solving. "It also aligns with the idea of Chain-of-thought prompting proposed by \cite{wei_chain--thought_2023}"
Context window size: The maximum number of tokens an LLM can process in a single input, affecting how much text it can consider at once. "and expansive context window size \cite{openai_gpt-4_2023} to directly extract options from the entire text content of a web page."
Cosine distance: A measure of dissimilarity between vectors based on the cosine of the angle between them, often used with embeddings. "based on the cosine distances on topic semantic embeddings computed using SentenceBERT~\cite{reimers_sentence-bert_2019}."
Crowdsourcing: Collecting information or work from a large distributed group of people, often via online platforms. "or crowdsourcing \cite{chang_alloy:_2016,chilton_cascade_2013,hahn_knowledge_2016}."
Duplicate detection algorithms: Techniques that identify and filter out duplicate or near-duplicate documents in retrieval systems. "despite the extensive use of duplicate detection algorithms in modern search engines \cite{plegas_reducing_2013}."
Entailment (in NLI): A relation where a hypothesis logically follows from a premise, used as a label in NLI tasks. "The entailment and contradiction probabilities are then converted into label probabilities"
Grounding (LLMs): Linking model outputs to external sources or user-provided content to improve accuracy and verifiability. "2) grounding LLM generations with the content that users would actually read, enabling natural verification."
Hallucination (LLMs): The tendency of LLMs to generate plausible-sounding but incorrect or fabricated information. "However, LLMs face well-known challenges like hallucination and falsehood \cite{thorp_chatgpt_2023,bang_multitask_2023,terry_ai_2023}"
In-situ annotations: Inline annotations presented directly within the reading context (e.g., at paragraph starts) to summarize or tag content. "Selenite provided in-situ annotations of mentioned criteria at the beginning of each paragraph"
Knowledge graph: A structured representation of entities and their relationships used for reasoning and retrieval. "making them potentially valuable for tasks like knowledge graph querying and retrieving common sense information"
Knowledge retriever: A component that identifies and supplies relevant knowledge to support downstream tasks or generation. "Selenite leverages GPT-4, an LLM developed by OpenAI, as a knowledge retriever"
Large language models (LLMs): Very large neural language models trained on massive text corpora with broad capabilities in generation and reasoning. "leverages LLMs as reasoning machines and knowledge retrievers"
Natural Language Inference (NLI): The task of determining whether a hypothesis is entailed by, contradicts, or is neutral with respect to a premise. "following a natural language inference (NLI) paradigm \cite{yin_benchmarking_2019}"
Oracle (in prompting): An idealized source of authoritative answers used conceptually to obtain ground-truth or comprehensive guidance. "we directly query an 'oracle' for a globally applicable and comprehensive set of criteria."
Retrieval-Augmented Generation (RAG): A technique that augments generation by retrieving relevant external documents to ground model outputs. "such as retrieval-augmented generation (RAG) \cite{lewis_retrieval-augmented_2020}"
Semantic embeddings: Vector representations that capture the meaning of text, enabling similarity comparison and clustering. "topic semantic embeddings computed using SentenceBERT~\cite{reimers_sentence-bert_2019}"
Semantic web standards: Conventions and best practices for adding machine-interpretable meaning to web content (e.g., semantic HTML). "web pages frequently disregard semantic web standards and best practices \cite{mendes_toward_2018,henschen_using_2009}"
Self-Refine: An iterative prompting technique where the model critiques and improves its own outputs. "reducing hallucination through techniques such as Self-Refine \cite{madaan_self-refine_2023};"
SentenceBERT: A model that produces high-quality sentence embeddings for semantic similarity and clustering. "computed using SentenceBERT~\cite{reimers_sentence-bert_2019}"
Sensemaking: The process of building a mental model to interpret and act within an information space. "Sensemaking in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria."
Topic recognition: Automatically identifying and labeling the subject or theme of a document or web page. "Automatically recognizing topics."
Transformer models: Neural architectures based on self-attention mechanisms, foundational to modern LLMs. "recent advances in large pre-trained transformer models \cite{vaswani_attention_2017,devlin_bert_2019,lewis_bart_2019}"
World knowledge (LLMs): Factual and commonsense information implicitly stored in model parameters during pretraining. "the world knowledge of an LLM is out-of-date"
Zero-shot text classification: Classifying text into categories without task-specific labeled training data, often via NLI prompts. "fine-tuned to perform zero-shot text classification tasks"
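
The NLI and zero-shot classification entries above can be made concrete with a small sketch. It shows the general pattern only: each candidate label is turned into a hypothesis, the text is the premise, and labels whose hypotheses score as entailed are kept. The names `zero_shot_classify` and `toy_entailment_score`, the hypothesis template, and the threshold are all assumptions made for illustration; the toy scorer is a word-overlap stand-in, not the NLI model used in the paper.

```python
from typing import Callable, Dict, Sequence, Set

# Hypothetical interface: an entailment scorer maps (premise, hypothesis)
# to a score in [0, 1]. A real system would back this with an NLI model.
EntailmentScorer = Callable[[str, str], float]

_STOP_WORDS = {"this", "text", "is", "about", "the", "a", "an", "of"}


def _tokens(text: str) -> Set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    words = (word.strip(".,;:!?\"'").lower() for word in text.split())
    return {word for word in words if word}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model (an assumption, not the paper's model):
    the fraction of hypothesis content words that also occur in the premise."""
    content = _tokens(hypothesis) - _STOP_WORDS
    if not content:
        return 0.0
    return len(content & _tokens(premise)) / len(content)


def zero_shot_classify(
    text: str,
    labels: Sequence[str],
    scorer: EntailmentScorer = toy_entailment_score,
    template: str = "This text is about {}.",
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Turn each label into a hypothesis and keep labels the premise entails."""
    scores = {label: scorer(text, template.format(label)) for label in labels}
    return {label: score for label, score in scores.items() if score >= threshold}


paragraph = "The stroller has a low price and good safety ratings."
print(zero_shot_classify(paragraph, ["price", "safety", "weight"]))
```

Passing the scorer in as a parameter keeps the label-to-hypothesis logic separate from the model, so a real entailment model could be swapped in without changing the classification code.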

Practical Applications

Immediate Applications

These applications can be deployed today using Selenite’s approach (LLM-elicited criteria, automatic topic/option extraction, paragraph-level annotations, and end-of-page progress cues) with modest engineering and governance.

Consumer decision support browser extension (Daily life, E-commerce)
- Use case: Shoppers researching complex purchases (e.g., strollers, mattresses, cameras) get an upfront criteria overview, page previews by aspect, and non-linear navigation across reviews.
- Tools/products/workflows: Chrome/Edge/Firefox extension; retailer plug-ins; comparison-site widgets that surface “commonly considered criteria” and “what this page adds.”
- Assumptions/dependencies: Acceptable LLM API costs/latency; consent to send page text to cloud services; human-in-the-loop verification to mitigate hallucinations; English-first coverage.
Developer technology selection companion (Software/IT)
- Use case: Engineers comparing frameworks, libraries, cloud services leverage criteria like stability, community, integration effort; navigate docs/issues by criterion.
- Tools/products/workflows: IDE/browser add-on for docs sites and GitHub; integration with internal architectural decision records (ADRs); team-onboarding reading companion.
- Assumptions/dependencies: Corporate privacy controls; domain prompts tuned to software; retrieval augmentation for up-to-date releases.
Literature review and paper-reading companion (Academia, R&D)
- Use case: Students and researchers get “criteria” for a topic (e.g., evaluation metrics, datasets, methods), paragraph annotations, and suggestions for “what to read next” that reduce redundancy.
- Tools/products/workflows: Plugins for PDF readers (e.g., Acrobat), reference managers (e.g., Zotero/Mendeley), and academic browsers; courseware integrations for reading assignments.
- Assumptions/dependencies: Accurate PDF parsing/OCR; discipline-specific prompts; clear disclaimers about LLM limitations.
Editorial research and fact-check triage (Media/Journalism)
- Use case: Reporters preview pages by aspects (e.g., methodology, conflicts of interest, counterarguments) and jump among aspect mentions; end-of-page cues to find novel angles.
- Tools/products/workflows: Newsroom browser extension; CMS sidepanels that show covered vs. missing angles; de-duplication guidance for source gathering.
- Assumptions/dependencies: Mandatory source verification policies; model governance to avoid over-reliance on LLM “knowledge.”
Policy analysis and stakeholder brief preparation (Policy/Government/NGOs)
- Use case: Analysts exploring proposals gain an instant criteria scaffold (e.g., cost, equity impact, enforceability), navigate long PDFs/webpages by aspect, and receive targeted next-query suggestions.
- Tools/products/workflows: Secure in-browser tool for government intranets; integration with document repositories and regulatory portals.
- Assumptions/dependencies: Compliance (GDPR, records retention), on-prem or vetted LLMs; careful framing to avoid normative bias in “criteria.”
Customer support knowledge-base navigator (Customer success/Enterprise)
- Use case: Agents and users scan troubleshooting pages by aspect (e.g., prerequisites, steps, error conditions), jump between relevant sections, and identify missing coverage.
- Tools/products/workflows: Help center sidebar; CRM integration; agent co-pilot showing per-article aspect coverage.
- Assumptions/dependencies: Access to page text; multi-language support for global KBs; performance under high ticket volume.
Product management competitive analysis (Industry/Product)
- Use case: PMs collect “options encountered” and compare by common criteria (price, integrations, SLAs), with paragraph-level evidence annotations.
- Tools/products/workflows: Browser extension + export to spreadsheets/Notion; workflow to turn reading trails into structured comparison docs.
- Assumptions/dependencies: Organizational acceptance of evidence capture; sensitivity to vendor content licensing.
Internal onboarding and SOP discovery (Enterprise Knowledge Management)
- Use case: New hires get a big-picture map of internal docs by criteria (e.g., policy scope, approval steps), enabling faster ramp-up.
- Tools/products/workflows: Intranet sidebar; SharePoint/Confluence plugin; “what to read next” suggestions focused on gaps.
- Assumptions/dependencies: On-prem deployment or secure API gateways; role-based access controls.
Education: reading scaffolds for non-linear comprehension (Education)
- Use case: Learners skim dense materials with aspect summaries (e.g., theorem assumptions, proofs, applications) and track which aspects they’ve mastered.
- Tools/products/workflows: LMS integration; adaptive reading guides; formative assessment linked to aspects.
- Assumptions/dependencies: Instructor oversight; alignment with curriculum; accommodations for accessibility.
Research-ops and horizon scanning (Strategy/Corporate foresight)
- Use case: Analysts synthesize a domain quickly using criteria (market drivers, risks, regulatory landscape), avoid duplicate sources, and log coverage gaps.
- Tools/products/workflows: Sensemaking sidebar + export to brief templates; query-suggestion module for novelty-seeking.
- Assumptions/dependencies: Domain-specific prompt libraries; governance for confidential topics.
E-commerce comparison pages and review UX (Retail/Marketplaces)
- Use case: Comparison pages surface “commonly considered criteria” and per-product evidence highlights drawn from reviews/Q&A.
- Tools/products/workflows: Merchant and marketplace widgets; per-product aspect coverage meters; “compare by aspect” navigation.
- Assumptions/dependencies: Review parsing quality; moderation to avoid misclassification; platform performance constraints.
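
Several of the applications above rely on the same bookkeeping: knowing which criteria a reader has already encountered, "what this page adds," and how much of the overview is covered so far. A minimal sketch of that bookkeeping follows. The `CoverageTracker` class and its method names are hypothetical, and the per-page aspect sets are assumed to come from an upstream annotation step that is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CoverageTracker:
    """Tracks which overview criteria a reader has encountered so far.
    Hypothetical helper for illustration; not taken from the paper."""

    criteria: List[str]
    seen: Set[str] = field(default_factory=set)

    def page_adds(self, page_aspects: Set[str]) -> Set[str]:
        """Known criteria this page mentions that the reader has not seen yet."""
        return (page_aspects & set(self.criteria)) - self.seen

    def mark_read(self, page_aspects: Set[str]) -> None:
        """Record the known criteria mentioned on a page the reader finished."""
        self.seen |= page_aspects & set(self.criteria)

    def progress(self) -> float:
        """Fraction of overview criteria encountered so far (0.0 if none defined)."""
        if not self.criteria:
            return 0.0
        return len(self.seen) / len(self.criteria)


tracker = CoverageTracker(["price", "safety", "weight", "foldability"])
first_page = {"price", "safety", "color"}  # "color" is not in the overview
print(tracker.page_adds(first_page))
tracker.mark_read(first_page)
print(tracker.progress())
```

Aspects outside the overview (here, "color") are ignored by design, so the coverage meter stays tied to the criteria list the reader was shown.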

Long-Term Applications

These applications require further research, domain adaptation, large-scale integration, or stronger guarantees (accuracy, privacy, compliance) before broad deployment.

Search engine and browser-level “overview of criteria” (Software/Search)
- Use case: Search engine results pages (SERPs) show domain criteria, per-result aspect coverage, and de-duplication suggestions; users jump directly to the most novel sources.
- Tools/products/workflows: Native browser sidebars; search provider APIs; ranking models that incorporate “novelty by aspect.”
- Assumptions/dependencies: Large-scale indexing of aspect coverage; robust anti-hallucination pipelines; UX validation at web scale.
Regulated-domain decision support (Healthcare, Legal, Finance)
- Use case: Clinicians/lawyers/analysts get aspect scaffolds (e.g., contraindications, precedent factors, risk metrics) tied to verified sources.
- Tools/products/workflows: RAG pipelines over trusted corpora; signed citations; audit trails; ISO/IEC-compliant AI governance.
- Assumptions/dependencies: Near-zero hallucinations; domain-tuned models; privacy/security (HIPAA, GDPR); liability frameworks; expert oversight.
Organization-wide sensemaking fabric over private corpora (Enterprise)
- Use case: Unified “aspect-aware” navigation across emails, tickets, docs, and code; dynamic gap analysis for knowledge assets.
- Tools/products/workflows: Connectors to DMS/EDRMS, wikis, code repos; embeddings infra; policy-compliant data pipelines.
- Assumptions/dependencies: Data governance and access control; content deduplication at scale; multi-tenant isolation.
Contract and case-law analysis with aspect extraction (LegalTech)
- Use case: Extract covenant types, obligations, exceptions, and case factors; intra-document navigation by aspect with sentiment/stance labels.
- Tools/products/workflows: Secure doc viewers; clause libraries; continuous learning from attorney feedback.
- Assumptions/dependencies: High-precision models; jurisdictional variance; confidential processing.
Scientific synthesis at scale (Academia/Pharma)
- Use case: Cross-paper aspect maps (e.g., methods, datasets, outcome measures, limitations) with confidence scores and contradiction flags.
- Tools/products/workflows: Domain ontologies; evidence grading; living reviews auto-updated with new literature.
- Assumptions/dependencies: Standardized metadata; citation grounding; contradiction detection; inter-annotator-agreement benchmarks.
Personalized reading tutors and metacognitive coaching (Education/EdTech)
- Use case: Adaptive guidance on what to read and how, with aspect-driven strategies and reflection prompts based on student progress.
- Tools/products/workflows: AI tutors integrated with LMS; analytics on aspect mastery; formative feedback loops.
- Assumptions/dependencies: Learning science validation; bias/fairness audits; privacy protections for student data.
Accessibility-forward intelligent readers (Accessibility/Assistive tech)
- Use case: Voice and screen-reader companions that announce aspect summaries and enable voice navigation by aspect across documents.
- Tools/products/workflows: ARIA-compliant sidebars; speech interfaces; low-vision optimized layouts.
- Assumptions/dependencies: Robust multi-modal parsing; latency constraints for real-time TTS; multilingual support.
Collaborative sensemaking and consensus-building platforms (Civic tech/Policy)
- Use case: Stakeholders co-construct aspect maps for proposals; system highlights underrepresented aspects and biases.
- Tools/products/workflows: Multi-user dashboards; provenance tracking; deliberation support tools.
- Assumptions/dependencies: Methods for bias detection/mitigation; moderation; civic process integration.
Autonomous research agents with novelty-seeking loops (Software/AI agents)
- Use case: Agents that read, extract options/aspects, identify gaps, and autonomously search for non-redundant new evidence.
- Tools/products/workflows: Agent frameworks; novelty scoring; safe exploration policies; human approval checkpoints.
- Assumptions/dependencies: Reliable long-horizon planning; cost controls; guardrails against error cascades.
Multilingual, cross-cultural sensemaking (Global markets/Education)
- Use case: Aspect scaffolds and annotations across languages, tuned to cultural norms and region-specific criteria.
- Tools/products/workflows: Multilingual NLI/LLMs; locale-aware prompting; cross-lingual retrieval.
- Assumptions/dependencies: High-quality multilingual models; culturally sensitive design; evaluation across locales.
Content management systems with aspect-aware authoring (Software/CMS)
- Use case: Authors get feedback on aspect coverage, redundancy, and missing sections during writing; readers see aspect previews out-of-the-box.
- Tools/products/workflows: CMS plugins (WordPress, Drupal); editor sidebars; writer-quality metrics.
- Assumptions/dependencies: Adoption by authors; alignment with editorial standards; compute costs during authoring.
Platform-level knowledge integrity (Misinformation/Platform policy)
- Use case: Platforms label articles with aspect coverage and encourage diversification of sources in feeds; users discover less redundant, more comprehensive views.
- Tools/products/workflows: Ranking signals for aspect diversity; transparency dashboards.
- Assumptions/dependencies: Policy alignment; risk of gaming; fairness and viewpoint diversity considerations.
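
The "novelty by aspect" idea that recurs in the list above (search ranking, novelty-seeking agents, feed diversification) can be sketched as a greedy ordering: repeatedly pick the source that adds the most aspects not yet seen. This is one simple heuristic chosen for illustration, not a method described in the paper; `rank_by_novelty` and the example page names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rank_by_novelty(
    pages: Dict[str, Set[str]],
    already_seen: Iterable[str] = (),
) -> List[Tuple[str, Set[str]]]:
    """Greedily order pages by how many unseen aspects each would add.

    Returns (page, newly_covered_aspects) pairs and stops once no remaining
    page adds anything new. Ties are broken alphabetically by page name.
    """
    seen = set(already_seen)
    remaining = dict(pages)
    ordered: List[Tuple[str, Set[str]]] = []
    while remaining:
        # max() keeps the first maximal item, so sorting gives a stable tie-break.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - seen))
        gain = remaining.pop(best) - seen
        if not gain:
            break  # every remaining page is redundant with what was already read
        ordered.append((best, gain))
        seen |= gain
    return ordered


candidates = {
    "review-a": {"price", "safety"},
    "review-b": {"safety", "weight"},
    "review-c": {"weight"},
}
for name, gain in rank_by_novelty(candidates):
    print(name, sorted(gain))
```

A production ranker would likely combine such a novelty term with relevance and source quality; the greedy loop only illustrates how redundancy between sources can be made explicit.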

Notes on feasibility across applications

Dependencies: Stable access to high-quality LLMs; prompt libraries; Self-Refine or equivalent to reduce hallucination; optional RAG for grounding; paragraph-level NLI classifiers.
Risks/assumptions: Hallucination and outdated model knowledge (mitigated by local grounding and RAG); privacy/compliance when sending content off-device; domain adaptation needed for specialized areas; English-centric performance unless multilingual models are added.
UX integration: Browser extensions are low-friction for immediate deployment; large-scale or regulated use requires secure, audited infrastructures and human oversight.
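
The dependency on "Self-Refine or equivalent" can be pictured as a generate, critique, refine loop. The sketch below shows only that control flow, under the assumption that each step is a callable wrapping a model prompt; the function `self_refine`, its parameters, and the stub callables are hypothetical and stand in for real LLM calls and for the paper's actual prompts.

```python
from typing import Callable, Optional

Generate = Callable[[str], str]
Critique = Callable[[str, str], Optional[str]]  # returns None when satisfied
Refine = Callable[[str, str, str], str]


def self_refine(
    task: str,
    generate: Generate,
    critique: Critique,
    refine: Refine,
    max_rounds: int = 3,
) -> str:
    """Iteratively improve a draft until the critic has no feedback
    or the round budget is exhausted."""
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:
            break
        draft = refine(task, draft, feedback)
    return draft


# Deterministic stubs standing in for model calls (assumptions for illustration).
def stub_generate(task: str) -> str:
    return "price, safety"


def stub_critique(task: str, draft: str) -> Optional[str]:
    return None if "weight" in draft else "add weight"


def stub_refine(task: str, draft: str, feedback: str) -> str:
    return draft + ", weight"


print(self_refine("criteria for strollers", stub_generate, stub_critique, stub_refine))
```

The `max_rounds` cap bounds API cost and latency, which matters for the browser-extension deployments discussed above.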
