EDA-Aware Assistant: In-Notebook Data Analysis

Updated 27 January 2026
  • EDA-aware assistant is an intelligent system that integrates in-situ code search, API recommendations, and visualizations within computational notebooks.
  • It employs GraphCodeBERT-based retrieval and recommendation models to provide context-aware insights, significantly boosting analysis efficiency.
  • The system features a multi-panel UI and has been validated through rigorous offline metrics and user studies to enhance practical data exploration.

An EDA-aware assistant is an intelligent system designed to assist users in exploratory data analysis (EDA) tasks by providing in-situ code search, API recommendations, and interactive visualizations within computational notebook environments such as JupyterLab. The core motivation is to enable both novices and experienced data scientists to efficiently obtain, understand, and apply best practices in data analysis by leveraging both code retrieval and context-aware suggestion capabilities (Li et al., 2021).

1. System Architecture

The EDAssistant architecture comprises three primary layers:

  • Notebook Integration Layer: A front-end JupyterLab extension that:
    • Monitors the user's notebook, extracting real-time code context from active cells.
    • Provides user interface components:
      • Search Results View (the "DNA Plot")
      • Notebook Detail View (foldable code/cell context)
      • API Suggestion View (tag-style recommender)
    • Enables in-situ search and recommendation flows via buttons or triggers.
  • Analytics Engine (Backend):
    • Search Engine: Employs a fine-tuned GraphCodeBERT encoder to generate vector embeddings of EDA code sequences for embedding-based retrieval.
    • API Recommender: Uses the same encoder; outputs a distribution over an API vocabulary (≈ 19,453 tokens) using a feed-forward layer with softmax for next-API prediction.
  • Data Preprocessing & Storage:
    • Program Slicer: Segments notebooks into distinct EDA sequences based on program slicing of output-producing cells.
    • Program Analyzer:
      • Tags code blocks with EDA operation types (preparation, modeling, evaluation, visualization) via Guided LDA.
      • Extracts ordered API call lists per sequence.
      • Computes TF-IDF keywords for each sequence.
    • Embedding Store: Caches all 768-dimensional embeddings for efficient lookup.

Conceptual Dataflow:

[JupyterLab UI] ←→ [Analytics Engine] ←→ [Embedding/Metadata Store] ←→ [Raw Notebooks → Slicer → Analyzer]

This architecture tightly integrates code context extraction, ML-powered retrieval/recommendation, and interactive visualization within the notebook UI, minimizing context switching and aligning external knowledge sources with immediate analytical needs (Li et al., 2021).
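The dataflow above can be sketched as a minimal pipeline. This is a hypothetical illustration, not the EDAssistant codebase: the `EDASequence` dataclass, the `slice_notebook` helper, and the toy output-detection predicate are all stand-ins (real program slicing follows data dependencies of output-producing cells, not simple cell boundaries).

```python
from dataclasses import dataclass, field

@dataclass
class EDASequence:
    cells: list[str]                                 # sliced code cells ending in an output
    apis: list[str] = field(default_factory=list)    # ordered API calls (filled by analyzer)
    embedding: list[float] = field(default_factory=list)  # 768-dim vector (filled offline)

def slice_notebook(cells, produces_output):
    """Toy slicer: cut a linear cell list into sequences at each
    output-producing cell. Real program slicing tracks data deps."""
    sequences, current = [], []
    for cell in cells:
        current.append(cell)
        if produces_output(cell):
            sequences.append(EDASequence(cells=list(current)))
            current = []
    return sequences

notebook = ["import pandas as pd", "df = pd.read_csv('x.csv')", "df.head()"]
seqs = slice_notebook(notebook, lambda c: c.endswith(".head()"))
print(len(seqs))  # → 1 sequence of 3 cells
```

Sliced sequences would then flow through the analyzer and embedding store before the analytics engine can serve them to the UI.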

2. Machine Learning Models and Algorithms

The assistant's principal machine learning modules both employ GraphCodeBERT:

  • EDA Sequence Search:
    • Each code sequence is encoded with GraphCodeBERT, using mean pooling, yielding an embedding $f(\cdot)$.
    • Retrieval relies on cosine similarity:

    $S(q, c) = \dfrac{f(q) \cdot f(c)}{\|f(q)\|\,\|f(c)\|}$

    • Training employs a margin-based ranking loss over (query, positive, negative) triplets:

    $L_{\mathrm{rank}} = \sum_{(q, p^+, n^-)} \max\bigl(0,\; \alpha - S(q, p^+) + S(q, n^-)\bigr)$

    with $\alpha$ typically set to 0.2.

  • API Recommendation:

    • On top of $f(q)$, a dense layer followed by softmax outputs an API distribution:

    $z = W f(q) + b, \quad p_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}$

    • Ground-truth supervision is derived from the next API call in each sequence.
    • Standard cross-entropy loss is used:

    $L_{\mathrm{cls}} = -\sum_i y_i \log p_i$

    • At inference, APIs with $p_i$ above a threshold (e.g., 0.5) or the top-$k$ are recommended.

  • Baselines and Metrics:

    • Baseline: Doc2Vec embedding for EDA sequence retrieval.
    • Search task: Recall@$k$; API recommendation: intersection-over-union (IoU) and accuracy.
    • Reported: GraphCodeBERT Recall@20 ≈ 0.35 (Doc2Vec ≈ 0.07); API IoU ≈ 0.40 (with dense layer) (Li et al., 2021).
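The inference side of both models reduces to simple vector operations over precomputed embeddings. The sketch below illustrates cosine-similarity retrieval and the dense-layer-plus-softmax API recommender under toy data; the function names (`rank_sequences`, `recommend_apis`), the random embeddings, and the 50-token vocabulary are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 768  # GraphCodeBERT embedding width, as in the paper

def cosine_similarity(q, c):
    """S(q, c) = f(q) . f(c) / (||f(q)|| ||f(c)||), broadcast over a corpus."""
    return (q @ c.T) / (np.linalg.norm(q) * np.linalg.norm(c, axis=-1))

def rank_sequences(query_emb, corpus_embs, k=20):
    """Return indices of the top-k most similar stored EDA sequences."""
    scores = cosine_similarity(query_emb, corpus_embs)
    return np.argsort(-scores)[:k]

def recommend_apis(query_emb, W, b, threshold=0.5, k=5):
    """Dense layer + softmax over the API vocabulary; keep APIs whose
    probability exceeds the threshold, falling back to the top-k."""
    z = W @ query_emb + b
    p = np.exp(z - z.max())          # numerically stable softmax
    p /= p.sum()
    above = np.flatnonzero(p > threshold)
    return above if above.size else np.argsort(-p)[:k]

# Toy corpus of 100 sequence embeddings and a 50-token API vocabulary.
corpus = rng.normal(size=(100, EMBED_DIM))
query = corpus[3] + 0.01 * rng.normal(size=EMBED_DIM)  # near sequence 3
W, b = rng.normal(size=(50, EMBED_DIM)), np.zeros(50)

print(rank_sequences(query, corpus, k=5)[0])  # → 3
```

In the deployed system the query embedding would come from the user's live cell context, and the corpus embeddings from the offline store described in Section 3.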

3. Training Corpus and Data Processing

  • Data Source: 38,581 Kaggle notebooks from 281 competitions (MetaKaggle; tags: featured, research, recruitment, playground).
    • 856,941 code cells, 303,041 markdown cells.
    • Slicing yields 236,501 executable EDA sequences.
  • API Vocabulary: Extracted from pandas, numpy, scipy, scikit-learn, matplotlib, seaborn, Python built-ins; 19,453 unique API calls.
  • Block Labeling: Guided LDA classifies code blocks into four EDA macro stages:

    1. Configuration & data preparation
    2. Model exploration & development
    3. Hypothesis verification & evaluation
    4. Visualization & output examination
  • Statistics: Median notebook contains 22 code cells; 75% have ≤ 39 cells.

This corpus is program-sliced and semantically labeled, with per-sequence embeddings and metadata (API lists, TF-IDF keywords) computed offline for immediate lookup and ranking during assistant interaction (Li et al., 2021).
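The per-sequence metadata step (API lists plus TF-IDF keywords) can be sketched with the standard library alone. The code below is a simplified illustration, not the paper's pipeline: `api_calls` collects call-node names via `ast`, and `tfidf_keywords` is a hand-rolled TF-IDF over those names; the two example sequences are made up.

```python
import ast
import math
from collections import Counter

def api_calls(code):
    """Collect the names of API call nodes in a code sequence."""
    names = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            f = node.func
            names.append(f.attr if isinstance(f, ast.Attribute) else getattr(f, "id", "?"))
    return names

def tfidf_keywords(docs, k=3):
    """Top-k TF-IDF terms per document, where each doc is a list of API names."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    tops = []
    for d in docs:
        tf = Counter(d)
        scores = {t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf}
        tops.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return tops

sequences = [
    "df = pd.read_csv('train.csv')\ndf.describe()\ndf.isnull().sum()",
    "model = RandomForestClassifier()\nmodel.fit(X, y)\nmodel.predict(X)",
]
docs = [api_calls(s) for s in sequences]
print(tfidf_keywords(docs))
```

Computing these lists and keywords once, offline, is what allows the assistant to rank and annotate results instantly at interaction time.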

4. Visualization and User Interaction Design

EDAssistant introduces coordinated, multi-panel UI components for interactive in-notebook guidance:

  • DNA Plot (Search Results View):
    • Each retrieved EDA sequence is rendered as a horizontal "chromosome".
    • Colored blocks indicate semantic code blocks (EDA operation type); white gaps represent folded code.
    • Sidebar displays top TF-IDF API keywords.
    • Hover reveals code snippet and cell index.
  • Notebook Detail View:
    • Full code context of a selected sequence, with toggles to fold/unfold neighboring cells and markers for cell-origin.
    • Links from DNA Plot permit synchronized exploration.
  • API Suggestion View:
    • Displays a tag cloud of recommended APIs, opacity proportional to recommendation probability.
    • Click-through to API documentation is supported.
    • Manual invocation via "Suggest Methods" triggers on-demand recommendation.

This visualization paradigm supports efficient scanning, contextualization, and follow-up—targeting rapid comprehension and actionable next steps in multi-step, multi-block EDA tasks (Li et al., 2021).

5. Evaluation Methodology

Evaluation integrates both offline ML metrics and controlled human user studies:

  • Offline ML Evaluation:
    • Search Retrieval: Recall@$k$ on a held-out set of EDA sequences.
    • API Recommendation: Accuracy ($|\text{pred} \cap \text{true}| / |\text{true}|$) and IoU ($|\text{pred} \cap \text{true}| / |\text{pred} \cup \text{true}|$).
  • User Study:
    • Within-subjects, $n = 14$ computer science participants (novice/entry-level data scientists).
    • Datasets: loan-default, student-scores (counter-balanced).
    • Tools: Google Search vs. EDAssistant (counter-balanced).
    • Tasks: constrained plotting and open-ended modeling/exploration.
    • Metrics: time to solution, number of code lines/cells/charts, search count, ML model completion rate.
    • Post-experiment: 7-point Likert-scale satisfaction/usefulness, semi-structured interviews.
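The offline metrics above are straightforward set operations. A minimal sketch, with made-up ranking and API sets for illustration:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def accuracy(pred, true):
    """|pred ∩ true| / |true| for predicted vs. ground-truth API sets."""
    return len(set(pred) & set(true)) / len(set(true))

def iou(pred, true):
    """|pred ∩ true| / |pred ∪ true|."""
    p, t = set(pred), set(true)
    return len(p & t) / len(p | t)

ranked = ["seq7", "seq2", "seq9", "seq1"]
print(recall_at_k(ranked, ["seq2", "seq5"], k=3))  # → 0.5
print(accuracy(["fit", "predict"], ["fit"]))       # → 1.0
print(iou(["fit", "predict"], ["fit"]))            # → 0.5
```

Note that over-predicting APIs leaves accuracy untouched but penalizes IoU, which is why the paper reports both.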

Key results:

  • EDAssistant did not significantly slow task completion (Google: 9.8 min vs EDAssistant: 12.9 min).
  • More participants built predictive models with EDAssistant (10/14) vs. Google (6/14).
  • EDAssistant was preferred for in-context, code-only relevance; Google yielded broader resource diversity.
  • Satisfaction/usefulness ratings: Google (7.0/6.0), EDAssistant (6.0/6.0) (Li et al., 2021).

6. Design Implications and Recommendations

Major findings and design recommendations for future EDA-aware assistants include:

  • In-situ, Context-Aware Search: Reduces context-switch friction, aids rapid acquisition of runnable code for novices.
  • Passive and Active Discovery: Passive (auto-updated based on code context) and active (keyword-driven) search should co-exist, providing both immediacy and control.
  • Block-Structured Visualization: Semantically segmented visual metaphors such as the DNA Plot promote understanding of multi-cell analytical pipelines.
  • API Recommendation with User Control: While API suggestions accelerate iterative analysis, users desire granular filterability and explanation for recommendations.
  • Diversity vs. Consistency: Curated, code-only retrieval can foster reliability, but may reinforce canonical, non-novel routines unless balanced by exposure to diverse, innovative workflows.
  • Encouraging Exploration: To avoid ossifying best practices, assistants should promote cross-dataset transfer, pipeline variation, and surface explanations ("RandomForestClassifier is suggested because 75% of similar cases used it next").

Specific future design directions endorsed include hybrid keyword/context search, automated code summarization, multi-modal context integration (forums, blog posts), surface-level dataset comparison, and recommendation transparency (Li et al., 2021).


In summary, the EDA-aware assistant concept, as instantiated in EDAssistant, represents a tightly coupled, ML-driven system for in-notebook code search, API recommendation, and visualization, underpinned by a large, labeled, and semantically indexed corpus of real-world EDA notebooks and evaluated through both objective metrics and user-centered trials. The approach demonstrates improvement in analytic productivity, code discoverability, and user satisfaction, while highlighting important challenges in balancing prescriptive support with exploratory freedom (Li et al., 2021).
