
topicwizard Framework Overview

Updated 18 February 2026
  • topicwizard is a Python-based, model-agnostic framework designed for interactive visualization and interpretation of various topic models such as LDA, NMF, and CTM.
  • It standardizes model outputs into canonical topic–term (φ) and document–topic (Θ) matrices to support coordinated exploration of topics, words, documents, and user groups.
  • The framework employs reactive updates, caching, and UMAP-based dimensionality reduction to enable rapid, scalable, and comparative model analysis in interactive UIs.

topicwizard is a Python-based, model-agnostic framework designed for interactive visualization and interpretation of topic models. It supports classical bag-of-words (BoW) models such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Latent Semantic Indexing (LSI), as well as neural or contextual topic models including the Contextualized Topic Model (CTM), BERTopic, Top2Vec, and KeyNMF. By standardizing diverse topic models into canonical topic–term (φ) and document–topic (Θ) matrices, the framework enables linked, multi-perspective exploration of topics, words, documents, and user-defined groups. topicwizard emphasizes immediate, reactive updates across all views, enabling coherent navigation of semantic relationships in large textual corpora (Kardos et al., 19 May 2025).

1. Architecture and Model Compatibility

topicwizard is structured as a Python package with an optional web application/user interface layer. The architecture achieves model-agnosticism by accepting any topic model that provides a topic–term matrix φ and a document–topic matrix Θ. These standard matrices can be extracted from models following scikit-learn conventions (attributes like components_ and a transform method), as well as via adapter modules for libraries such as Gensim, BERTopic, or Turftopic. If available, document embeddings (e.g., SBERT outputs from CTM) are used directly in lieu of Θ for document mapping.
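The scikit-learn convention can be illustrated with a minimal sketch using NMF; the row normalization at the end is an assumption for probabilistic interpretability rather than part of the NMF API itself:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "topic models find latent themes in text",
    "neural embeddings capture contextual meaning",
    "matrix factorization decomposes document term counts",
    "latent themes emerge from word cooccurrence",
]

# Bag-of-words document-term matrix
X = CountVectorizer().fit_transform(corpus)

# Any model exposing components_ and transform() fits the convention
model = NMF(n_components=2, init="nndsvda", random_state=0)
theta = model.fit_transform(X)      # D x n document-topic matrix (Theta)
phi = model.components_             # n x m topic-term matrix (phi)

# Row-normalize so each topic's term weights sum to 1
# (an interpretability convention, not something NMF guarantees)
phi_norm = phi / np.clip(phi.sum(axis=1, keepdims=True), 1e-12, None)
theta_norm = theta / np.clip(theta.sum(axis=1, keepdims=True), 1e-12, None)

print(phi_norm.shape, theta_norm.shape)
```

Any model with the same two outputs, however obtained, plugs into the same downstream pipeline.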

Deployment options include seamless embedding in interactive Jupyter workflows, containerization via Docker images for cloud or local hosting (using Flask/FastAPI), and a Figures API for static export of consistent, publication-ready images.

The data flow consists of:

  1. User provision of corpus data and optional metadata.
  2. Data ingestion, vocabulary/index mapping, and φ and Θ extraction.
  3. UMAP-based projection for topics, words, and documents.
  4. Visualization rendering in a multi-panel UI where selections in one view reactively update all others.
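The four stages can be sketched end to end; PCA stands in for UMAP here purely to keep the sketch dependency-light (topicwizard itself uses UMAP), and the corpus and metadata are toy values:

```python
import numpy as np
from sklearn.decomposition import NMF, PCA
from sklearn.feature_extraction.text import CountVectorizer

# 1. User-provided corpus and optional metadata
corpus = ["sports match score", "election vote campaign",
          "match result goals", "campaign debate vote"]
metadata = {"year": [2020, 2021, 2020, 2021]}

# 2. Ingestion, vocabulary mapping, and phi/theta extraction
X = CountVectorizer().fit_transform(corpus)
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
theta = nmf.fit_transform(X)        # D x n
phi = nmf.components_               # n x m

# 3. 2D projection of documents (PCA standing in for UMAP)
doc_xy = PCA(n_components=2).fit_transform(theta)

# 4. The UI would scatter-plot doc_xy, colored by each
#    document's dominant topic; selections re-filter all views
dominant = theta.argmax(axis=1)
print(doc_xy.shape, dominant.shape)
```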

2. Core Components and Processing Modules

The framework is organized into several specialized modules:

  • Data Ingestion Layer: Reads raw text, associated metadata, and computes per-document statistics (e.g., word counts |d|).
  • Model Interface: Extracts and normalizes φ (the n × m topic–term matrix) and Θ (the D × n document–topic matrix) to ensure interpretability (rows of φ sum to 1 for probabilistic models).
  • Pre-computation & Caching: Obtains and persistently caches UMAP embeddings for each entity type. Group–topic aggregation matrices G ∈ ℝ^{g×n} are computed as

G_{ij} = Σ_{k=1}^{D} Θ_{kj} · I(g_k = i)

where I(·) is the indicator of group assignment.

  • Visualization Engine: Built with Panel/Bokeh for high-level layout, D3.js for SVG interactivity, UMAP for dimensionality reduction, and word cloud libraries.
  • User Interaction Layer: Implements master-detail linkage; selecting topics, words, or documents updates all relevant panels, allowing for fluid and contextually linked exploration.
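The group aggregation above reduces to an indicator-weighted sum of Θ rows; a minimal NumPy sketch with toy group labels:

```python
import numpy as np

D, n, g = 6, 3, 2                           # documents, topics, groups
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(n), size=D)   # D x n document-topic matrix
groups = np.array([0, 0, 1, 1, 0, 1])       # g_k: group label per document

# G_ij = sum_k Theta_kj * I(g_k = i): build the one-hot indicator,
# then contract it against Theta
indicator = (groups[:, None] == np.arange(g)[None, :]).astype(float)  # D x g
G = indicator.T @ theta                     # g x n group-topic matrix

# Since each Theta row sums to 1, each group's row of G sums to the
# number of documents in that group
print(G.sum(axis=1))
```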

3. Mathematical Foundations and Matrix Transformations

topicwizard provides formal computations to support quantitative and visual analytics:

  • Topic Importance: Circle size in the inter-topic map reflects topic prevalence,

s_t = Σ_{d=1}^{D} Θ_{dt} · |d|

where |d| is the length of document d.

  • Group-Topic Aggregation: For user-defined groups (e.g., publication years, authors), the G matrix is constructed as above, facilitating group-level mapping and topic prevalence analysis.
  • Word Embeddings: Each vocabulary word w is embedded into ℝⁿ via its column φ_{:,w}, then mapped to two dimensions with UMAP to create word maps.
  • Word–Topic Distribution Plot: For a given word w₀, the topic distribution is quantified as

Θ̄_t = Σ_{d : w₀ ∈ d} Θ_{dt}

and visualized as a bar chart, normalized to reflect topic association for in-corpus occurrences.
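Both quantities reduce to short matrix operations; a NumPy sketch in which a toy document-term count matrix X supplies both the lengths |d| and the word occurrences:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.dirichlet(np.ones(3), size=5)   # D x n document-topic matrix
X = np.array([                              # D x m document-term counts
    [1, 0, 2, 0, 1, 0, 0, 3],
    [0, 2, 0, 1, 0, 0, 1, 0],
    [2, 1, 0, 0, 0, 2, 0, 1],
    [0, 0, 1, 2, 1, 0, 0, 0],
    [1, 0, 0, 0, 2, 1, 1, 0],
])

# Topic importance: s_t = sum_d Theta_dt * |d|
doc_len = X.sum(axis=1)                     # |d| per document
s = theta.T @ doc_len                       # length-n prevalence vector

# Word-topic distribution for word w0: sum Theta_dt over docs containing w0
w0 = 2
contains = X[:, w0] > 0
theta_bar = theta[contains].sum(axis=0)
theta_bar = theta_bar / theta_bar.sum()     # normalized for the bar chart

print(s.shape, round(theta_bar.sum(), 6))
```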

4. Interactive Visualization and Analytical Perspectives

topicwizard supports several coordinated visualizations, organized by analytical entity:

  • Topics: The inter-topic map displays topics projected into 2D via UMAP, with circle sizes reflecting s_t. Interactions allow for topic selection, relabeling, and the display of topic–word bar charts (showing both φ_{tw} and global word-frequency backgrounds), as well as topic-specific word clouds.
  • Words: The word map provides cosine-based word neighborhood highlighting, and for a selected word, a word–topic distribution plot showing topic association for in-context usage.
  • Documents: The document map projects Θ_{d,·} (or document embeddings) to 2D points, colored by arg max_t Θ_{dt}. It includes metadata-based filtering (e.g., by date), document–topic bar charts, and a timeline interface for dynamic topic prevalence within long documents. Inline snippet viewers highlight topic-informative words.
  • Groups: A UMAP projection of group vectors G_{i,·} provides a map for user-selected groupings. Group–topic bar charts and group word clouds indicate salient semantic content and topic distribution by group.
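The word map's cosine neighborhoods can be sketched directly from φ: each word's column is its topic-space vector, and its nearest neighbors by cosine similarity form the highlighted neighborhood (toy φ and vocabulary below):

```python
import numpy as np

# Toy n x m topic-term matrix (3 topics, 6 words)
phi = np.array([
    [0.50, 0.30, 0.10, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.40, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.40, 0.30],
])
vocab = ["goal", "match", "vote", "ballot", "protein", "enzyme"]

# Each word w is embedded as the column phi[:, w] in topic space
W = phi.T                                   # m x n word vectors

# Cosine similarity between all word pairs
unit = W / np.linalg.norm(W, axis=1, keepdims=True)
sim = unit @ unit.T

# Nearest neighbor of "goal" (excluding the word itself)
i = vocab.index("goal")
nn = np.argsort(sim[i])[-2]
print(vocab[nn])                            # -> match
```

Words dominated by the same topic cluster together, which is what the highlighted neighborhoods surface visually.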

5. Interface, Usability, and Supported Analytical Tasks

topicwizard is designed for rapid cognitive switching between analytical views, facilitated by a UI that synchronizes linked selections and filters across all panels. Supported tasks include:

  • Assessing Topic Coherence: Switching between word lists, bar charts, and word clouds per topic to investigate semantic consistency.
  • Comparing Models: Loading multiple topic models in parallel with synchronized visualization, directly supporting comparative cluster shape/coherence analysis (e.g., LDA versus KeyNMF or CTM).
  • Metadata-Based Filtering: UI elements such as date sliders and category checkboxes trigger immediate recalculation and re-rendering of the topic and document maps, enabling local, temporally, or categorically focused analysis.
  • Drill-down Exploration: Users can select document clusters, examine rankings, and launch topic-colored document viewers. Rapid topic renaming and label export streamline iterative model refinement and reporting workflows.

6. Performance, Scalability, and Deployment Strategies

To meet scalability requirements for large vocabularies and corpora, topicwizard employs:

  • Caching and Incremental UMAP: Embedding computations are cached and only recomputed when necessitated by model or data changes.
  • Asynchronous Updates: Heavy computational tasks, such as large-scale filtering, execute asynchronously within a thread pool, ensuring UI responsiveness.
  • Efficient Data Structures: For BoW topic models, the φ and Θ matrices are stored in compressed sparse row (CSR) format, enabling rapid slicing and dot-product operations for dynamic filtering.
  • Cloud and Containerization: The web application can be containerized with a single command, suitable for resource-constrained or cloud-based deployment (e.g., via Kubernetes or AWS ECS).
  • Publication-Grade Output: The Figures API allows direct, code-path-consistent export of visualizations as PNG or SVG images for integration into publications or reports.
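The CSR layout pays off when metadata filters select document subsets; a SciPy sketch with a toy Θ (the filtering pattern is illustrative, not topicwizard's internal code):

```python
import numpy as np
from scipy import sparse

# Toy D x n document-topic matrix with exact zeros
theta_dense = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])
theta = sparse.csr_matrix(theta_dense)

# A metadata filter (e.g., a date slider) selects a document subset;
# CSR row slicing is cheap
mask = np.array([True, False, True, True])
theta_sub = theta[mask]

# Topic prevalence over the filtered subset via a sparse dot product
doc_len = np.array([10, 20, 5, 8])
s = theta_sub.T @ doc_len[mask]
print(s)                                    # -> [13.  5.  5.]
```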

7. Empirical Demonstrations and Adoption

Several case studies demonstrate topicwizard's capacity for domain-specific analyses:

  • KeyNMF on Chinese Diaspora Media: Application of a transformer-based contextual topic model (KeyNMF) to Chinese diaspora news archives enabled multi-period analysis of pro- and anti-PRC narrative evolution prior to the 2024 European elections.
  • Model Comparison: Using both LDA and CTM models, topicwizard supports direct, visual investigation of topic coherence and prevalence, enabling robust model selection and validation.
  • Short Text Modeling with tweetopic: In modeling tweet corpora, topic-aligned clusters in the document map were found to correspond to well-known hashtags and event markers.
  • Adoption Metrics: As of publication, topicwizard surpassed 45,000 PyPI downloads and has seen adoption across digital humanities and enterprise BI labs, supporting its practical utility in academic and business contexts (Kardos et al., 19 May 2025).

These cumulative features provide a model-agnostic, multi-faceted platform for topic model exploration, grounding interpretive visualizations directly in the statistical and textual substrates of the input corpora.
