Asta Interaction Dataset (AID)

Updated 28 February 2026
  • AID is a large-scale dataset of anonymized user queries and clickstreams from AI-powered research tools used for literature discovery and scientific Q&A.
  • It comprises over 258,000 queries and 432,000 click events captured via two interfaces, PaperFinder and ScholarQA, revealing detailed user behavior patterns.
  • The dataset supports rigorous evaluation of LLM-assisted scientific tools and informs intent-aware research assistant design by analyzing engagement, query intent, and navigation dynamics.

The Asta Interaction Dataset (AID) is a large-scale, publicly available corpus of real-world user interaction logs from AI-powered scientific research tools. Compiled from over 258,000 anonymized queries and more than 432,000 detailed clickstream events, AID captures engagement patterns, query characteristics, and behavioral dynamics as researchers utilize LLM-driven retrieval-augmented platforms for literature discovery and scientific question answering. Released by the Allen Institute for AI under a non-commercial license, AID enables rigorous study of expert workflows and intent-aware research assistant design, and supports robust evaluation of AI-assisted scientific tools (Haddad et al., 26 Feb 2026).

1. Composition and Data Collection

AID was generated through the deployment of two complementary interfaces within an LLM-powered scientific platform:

  • PaperFinder (PF): A chat-driven literature discovery tool, returning a ranked list of publication titles, LLM-generated relevance summaries, and links to Semantic Scholar for each user query.
  • ScholarQA (SQA): A single-turn scientific question-answering system, generating structured, multi-section reports containing collapsible titles, one-sentence TL;DRs, expandable bodies with inline citations, and section-level feedback mechanisms.

The dataset consists of 258,935 anonymized queries and 432,059 clickstream events spanning both interfaces. Modalities captured include raw user-submitted text (ranging from concise English questions and code snippets to multi-paragraph draft text), click interactions (e.g., link navigations, section expansions, evidence card views, thumb feedback), and contextual metadata (timestamps, tool identifiers, result positions, section indices, paper corpus IDs). Internal analyses showed a median of two sessions per user, with approximately 40% of users submitting multiple queries (Haddad et al., 26 Feb 2026).

2. Data Schema and Anonymization

AID is released as six interlinked Parquet files, joinable via a hashed session identifier (thread_id), and omits all personal identifying information. The files and principal fields are:

| File Name | Key Fields | Tool(s) |
|---|---|---|
| optin_queries_anonymized.parquet | query, thread_id, query_ts, tool | PF + SQA |
| pf_shown_results_anonymized.parquet | thread_id, query_ts, result_position, corpus_id | PF |
| report_section_titles_anonymized.parquet | thread_id, section_idx, section_title | SQA |
| section_expansions_anonymized.parquet | thread_id, section_expand_ts, section_id | SQA |
| report_corpus_ids_anonymized.parquet | thread_id, corpus_id | SQA |
| s2_link_clicks_anonymized.parquet | thread_id, s2_link_click_ts, corpus_id, tool | PF + SQA |

No user or session PII is released. Original user identifiers were retained only internally in pseudonymous form for analysis; any query flagged by an LLM for potential PII (<1% of total) was removed. All paper IDs are public Semantic Scholar identifiers. Released data supports robust linkage of event types while ensuring user anonymity.
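
As a sketch of how the released files can be combined, the snippet below builds toy frames with the published column names and joins queries with link clicks on thread_id. The rows and values are invented for illustration; in practice each frame would come from pandas.read_parquet on the corresponding released file.

```python
# Hypothetical sketch: joining AID tables on the hashed session id (thread_id).
# Column names follow the published schema; the rows are invented toy data.
import pandas as pd

# In practice: queries = pd.read_parquet("optin_queries_anonymized.parquet"), etc.
queries = pd.DataFrame({
    "thread_id": ["a1", "a1", "b2"],
    "query": ["GLP-1 and diabetes", "GLP-1 receptor agonists", "Kalman filter stability"],
    "query_ts": pd.to_datetime(["2026-01-01 10:00", "2026-01-01 10:12", "2026-01-02 09:00"]),
    "tool": ["SQA", "SQA", "PF"],
})
clicks = pd.DataFrame({
    "thread_id": ["a1", "b2", "b2"],
    "s2_link_click_ts": pd.to_datetime(["2026-01-01 10:03", "2026-01-02 09:05", "2026-01-02 09:07"]),
    "corpus_id": [11, 22, 33],
    "tool": ["SQA", "PF", "PF"],
})

# Left-join clicks onto queries by session: one row per (query, click) pair.
joined = queries.merge(clicks, on="thread_id", how="left", suffixes=("", "_click"))

# Per-session Semantic Scholar link-click counts.
clicks_per_session = clicks.groupby("thread_id").size()
print(clicks_per_session.to_dict())  # {'a1': 1, 'b2': 2}
```

The hashed thread_id is the only join key shared across all six files, so session-level analyses reduce to group-bys and merges of this form.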

3. Query Intent Taxonomy

AID queries are annotated with a non-exclusive taxonomy of 16 distinct intents, reflecting the diversity and complexity of research use cases. Each query may have zero or more assigned intents. The taxonomy, with SQA intent occurrence rates and brief examples, is summarized as follows:

| Intent Name | Fraction (SQA) | Example |
|---|---|---|
| Broad Topic Exploration | 51.6% | "GLP-1 and diabetes" |
| Concept Definition/Explanation | 28.2% | "Summarize the concept of ‘Technoimagination’ by Vilém Flusser" |
| Specific Factual Retrieval | 12.6% | "What are the four core concepts of Rotter’s theory?" |
| Causal and Relational Inquiry | 19.1% | "Relation between nighttime digital device use and sleep quality" |
| Comparative Analysis | 7.3% | "Trade-offs between HBr and Cl₂ plasma gases for reactive ion etching" |
| Methodological/Procedural Guidance | 9.1% | "How often should I collect mosquitoes for dengue surveillance?" |
| Tool and Resource Discovery | 2.3% | "Are there any tools to count the quality or semantic content of citations?" |
| Research Gap and Limitation Analysis | 5.2% | "Survey on the limitations of classical NLP evaluation metrics" |
| Citation Evidence Finding | 5.7% | "Can you assist me to get the source: …(WHO, 2023)?" |
| Specific Paper Retrieval | 0.7% | "Anderson and Moore’s paper on the stability of the Kalman filter" |
| Ideation | 1.7% | "Give me a cost-efficient way to build rapid antigen tests …" |
| Application Inquiry | 4.2% | "ETA prediction with GPS data from cargo" |
| Data Interpretation Support | 1.1% | "Why do TarM knockout strains show higher IL-1β responses …?" |
| Content Generation Experimentation | 1.1% | "Improve this Materials and Methods section for a journal paper…" |
| Academic Document Drafting | 6.2% | "Write a full Materials and Methods section suitable for submission…" |
| Complex Cross-Paper Synthesis | 2.1% | "Describe how conditional Pesin entropy formula evolved..." |

These annotations enable intent-aware research into query distribution, interface design, and evaluation benchmarks.
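
The non-exclusive labeling can be made concrete with a small sketch: each query carries zero or more intents, so per-intent fractions are computed over all queries and need not sum to 100%. The annotations below are invented for illustration.

```python
# Sketch of computing per-intent occurrence fractions from non-exclusive
# (multi-label) annotations; the example annotations are invented.
from collections import Counter

annotations = [
    ["Broad Topic Exploration"],
    ["Broad Topic Exploration", "Causal and Relational Inquiry"],
    ["Specific Factual Retrieval"],
    [],  # a query may have no assigned intent
]

counts = Counter(intent for labels in annotations for intent in labels)
fractions = {intent: n / len(annotations) for intent, n in counts.items()}
print(fractions["Broad Topic Exploration"])  # 0.5
```

Because a single query can contribute to several intents, fractions like the 51.6% for Broad Topic Exploration describe occurrence rates, not a partition of the query set.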

4. Engagement and Usage Metrics

AID provides precise quantitative assessment of user interaction complexity and engagement. Key findings include:

  • Query Complexity and Length: mean constraints per query are 0.60 ± 0.05 (PF), 0.82 ± 0.04 (SQA), and 0.15 ± 0.02 (S2, traditional search); named entities per query, 4.00 ± 0.20 (PF), 5.14 ± 0.10 (SQA), 2.25 ± 0.05 (S2); relations per query, 2.17 ± 0.08 (PF), 2.68 ± 0.02 (SQA), 1.20 ± 0.04 (S2); token length, 17.04 ± 2.51 (PF), 36.96 ± 6.82 (SQA), 5.35 ± 0.18 (S2). Length distributions are heavy-tailed (see Figure 1) (Haddad et al., 26 Feb 2026).
  • Session Duration: Median PF session lasts 4 minutes; SQA, 8 minutes. Median queries per session: 1–2. Median sessions per user: 2.
  • Engagement Depth:
    • Click-through rate (CTR): the proportion of reports with at least one Semantic Scholar link click.
    • Citation engagement score: C_engage = (# evidence clicks) / (# total generated citations).
    • Churn rate: the fraction of users issuing no further queries after a report.
    • Return rate: the fraction of users who return after their initial interaction.
  • Experience Trends: Broad Topic queries decrease from 61.2% (single-query users) to 53.5% (experienced). Citation Evidence Finding rises from 6.3% to 9.7%. In SQA, section-expansion and evidence-click CTR increase by ≈27% from first to fourth query. PF link-click rate declines by ≈24% as users adapt to consuming PF summaries directly.
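
A minimal sketch of how the engagement metrics defined above could be computed from per-session aggregates; the session records and field names here are invented for illustration, not the dataset's exact schema.

```python
# Toy per-session aggregates (invented values) for the engagement metrics:
# CTR, citation engagement score, and churn rate as defined in the text.
sessions = {
    "a1": {"link_clicks": 2, "citations": 10, "queries_after_report": 1},
    "b2": {"link_clicks": 0, "citations": 8,  "queries_after_report": 0},
    "c3": {"link_clicks": 1, "citations": 4,  "queries_after_report": 0},
}

n = len(sessions)
# CTR: fraction of reports with at least one Semantic Scholar link click.
ctr = sum(s["link_clicks"] > 0 for s in sessions.values()) / n
# Citation engagement: evidence clicks over total generated citations.
c_engage = (sum(s["link_clicks"] for s in sessions.values())
            / sum(s["citations"] for s in sessions.values()))
# Churn: fraction of users issuing no further queries after a report.
churn = sum(s["queries_after_report"] == 0 for s in sessions.values()) / n

print(round(ctr, 3), round(c_engage, 3), round(churn, 3))  # 0.667 0.136 0.667
```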

5. User Behavior and Interaction Dynamics

AID reveals that researchers employ the underlying LLM systems as collaborative research partners rather than passive information sources:

  • Delegation of Complex Tasks: Users frequently assign high-level tasks to the system, including full academic document drafting (6.2% of SQA queries), research gap identification (5.2%), and methodological guidance (9.1%).
  • Non-Linear Navigation: In SQA, interaction with structured reports is markedly non-linear: 43% of users skip the introduction section on first expansion, beginning instead with a later section, and over 50% of sessions exhibit non-sequential navigation, including backward jumps and revisits to earlier sections (documented in accompanying Sankey diagrams and heatmaps).
  • Persistence and Recurrent Use: 42% of PF users and 50.5% of SQA users revisit prior reports, with a median lag of 4–6 hours. These behaviors indicate that generated outputs are perceived as persistent reference artifacts. Near-duplicate queries are often submitted within 16 minutes, accounting for 18.8% (SQA) and 14.8% (PF) of sessions.
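
A hedged sketch, in the spirit of the 16-minute near-duplicate statistic above, of flagging quick resubmissions of similar queries within a session. The queries, the token-overlap similarity rule, and the threshold are all illustrative stand-ins, not the paper's actual method.

```python
# Illustrative near-duplicate detection over consecutive same-session queries.
# Events, the Jaccard similarity rule, and the 0.5 threshold are assumptions.
from datetime import datetime, timedelta

events = [
    ("a1", datetime(2026, 1, 1, 10, 0), "glp-1 and diabetes"),
    ("a1", datetime(2026, 1, 1, 10, 9), "glp-1 and diabetes risk"),
    ("a1", datetime(2026, 1, 1, 12, 0), "kalman filter stability"),
]

def is_near_duplicate(q1: str, q2: str) -> bool:
    """Crude token-overlap (Jaccard >= 0.5) as a stand-in similarity test."""
    t1, t2 = set(q1.split()), set(q2.split())
    return len(t1 & t2) / len(t1 | t2) >= 0.5

window = timedelta(minutes=16)
flags = [
    sid1 == sid2 and ts2 - ts1 <= window and is_near_duplicate(q1, q2)
    for (sid1, ts1, q1), (sid2, ts2, q2) in zip(events, events[1:])
]
print(flags)  # [True, False]
```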

A plausible implication is that advanced LLM integrations shift researchers' interaction models closer to iterative, partner-mediated collaborative workflows.

6. Public Access and Licensing

AID is available under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. The dataset may be freely used for non-commercial research and evaluation, with restrictions prohibiting user re-identification or linkage to external PII. Full resources are accessible via the Allen Institute for AI at https://github.com/allenai/asta-interaction-dataset (Haddad et al., 26 Feb 2026).

7. Illustrative Examples and Analytical Figures

Representative queries exemplify both interfaces’ functionality:

  • PF Example: "Comparative analysis of HBr vs Cl₂ plasma gases for reactive ion etching of polysilicon" returns five ranked papers with single-sentence LLM-generated summaries.
  • SQA Example: "Write a Materials and Methods section suitable for a plant science journal" prompts a multi-section report comprising Introduction, Sample Preparation, Instrumentation, Procedure, and Data Analysis—with TL;DRs and inline citations.

Key figures include:

  • Bar chart (Table 4) of query complexity and length across PF, SQA, and S2.
  • Heavy-tailed query length distributions (Figure 1).
  • Distributions of intent and phrasing style (Figure 2), demonstrating the prevalence of keyword-style and Broad Topic Exploration queries.
  • Sankey diagrams and heatmaps visualizing user navigation patterns within SQA reports.

Together, these features establish AID as a foundational resource for empirical analysis of real-world interactions in LLM-augmented research environments, informing the development of prompt evaluation protocols, intent-sensitive AI tool design, and longitudinal studies of researcher-machine collaboration (Haddad et al., 26 Feb 2026).
