topicwizard Framework Overview
- topicwizard is a Python-based, model-agnostic framework designed for interactive visualization and interpretation of various topic models such as LDA, NMF, and CTM.
- It standardizes model outputs into canonical topic–term (φ) and document–topic (Θ) matrices to support coordinated exploration of topics, words, documents, and user groups.
- The framework employs reactive updates, caching, and UMAP-based dimensionality reduction to enable rapid, scalable, and comparative model analysis in interactive UIs.
topicwizard is a Python-based, model-agnostic framework designed for interactive visualization and interpretation of topic models. It supports classical bag-of-words (BoW) models such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Latent Semantic Indexing (LSI), as well as neural or contextual topic models including the Contextualized Topic Model (CTM), BERTopic, Top2Vec, and KeyNMF. By standardizing diverse topic models into canonical topic–term (φ) and document–topic (Θ) matrices, the framework enables linked, multi-perspective exploration of topics, words, documents, and user-defined groups. topicwizard emphasizes immediate, reactive updates across all views, enabling coherent navigation of semantic relationships in large textual corpora (Kardos et al., 19 May 2025).
1. Architecture and Model Compatibility
topicwizard is structured as a Python package with an optional web application/user interface layer. The architecture achieves model-agnosticism by accepting any topic model that provides a topic–term matrix φ and a document–topic matrix Θ. These standard matrices can be extracted from models following scikit-learn conventions (a components_ attribute and a transform method), as well as via adapter modules for libraries such as Gensim, BERTopic, or Turftopic. If available, document embeddings (e.g., SBERT outputs from CTM) are used directly in lieu of Θ for document mapping.
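The scikit-learn convention above can be sketched as follows. `ToyNMF` is a hypothetical stand-in for any estimator exposing `components_` and `transform` (its least-squares `transform` is illustrative only); `extract_matrices` is not topicwizard API, just a minimal sketch of the extraction-and-normalization step.

```python
import numpy as np

class ToyNMF:
    """Hypothetical stand-in for any scikit-learn-style topic model."""
    def __init__(self, components):
        self.components_ = components  # shape (K, V): topic–term weights

    def transform(self, X):
        # Illustrative least-squares projection of documents onto topics.
        W, *_ = np.linalg.lstsq(self.components_.T, X.T, rcond=None)
        return np.clip(W.T, 1e-12, None)  # shape (D, K), non-negative

def extract_matrices(model, X):
    """Pull the canonical φ (topic–term) and Θ (document–topic) matrices."""
    phi = model.components_
    phi = phi / phi.sum(axis=1, keepdims=True)        # rows of φ sum to 1
    theta = model.transform(X)
    theta = theta / theta.sum(axis=1, keepdims=True)  # rows of Θ sum to 1
    return phi, theta

rng = np.random.default_rng(0)
X = rng.random((5, 8))              # 5 documents over an 8-term vocabulary
model = ToyNMF(rng.random((3, 8)))  # 3 topics
phi, theta = extract_matrices(model, X)
```

An adapter for another library only needs to produce the same two normalized matrices.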
Deployment options include seamless embedding in interactive Jupyter workflows, containerization via Docker images for cloud or local hosting (using Flask/FastAPI), and a Figures API for static export of consistent, publication-ready images.
The data flow consists of:
- User provision of corpus data and optional metadata.
- Data ingestion, vocabulary/index mapping, and extraction of φ and Θ.
- UMAP-based projection for topics, words, and documents.
- Visualization rendering in a multi-panel UI where selections in one view reactively update all others.
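The four-step flow above can be sketched end to end. All arrays and names are toy stand-ins, and a plain truncated SVD replaces UMAP so the sketch stays dependency-free:

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. Corpus data and optional metadata (toy stand-ins).
doc_term_counts = rng.integers(0, 5, size=(6, 10))  # 6 docs, 10-term counts
years = np.array([2019, 2019, 2020, 2020, 2021, 2021])

# 2. φ and Θ extraction (random normalized placeholders here).
phi = rng.random((4, 10)); phi /= phi.sum(axis=1, keepdims=True)
theta = rng.random((6, 4)); theta /= theta.sum(axis=1, keepdims=True)

# 3. 2D projection. topicwizard uses UMAP; truncated SVD is a stand-in.
def project_2d(X):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

doc_xy = project_2d(theta)   # one 2D point per document
topic_xy = project_2d(phi)   # one 2D point per topic

# 4. A selection in one view (e.g., year == 2020) masks the others.
mask = years == 2020
selected_points = doc_xy[mask]
```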
2. Core Components and Processing Modules
The framework is organized into several specialized modules:
- Data Ingestion Layer: Reads raw text, associated metadata, and computes per-document statistics (e.g., word counts n_d).
- Model Interface: Extracts and normalizes φ (topic–term matrix) and Θ (document–topic matrix) to ensure interpretability (rows sum to 1 for probabilistic models).
- Pre-computation & Caching: Obtains and persistently caches UMAP embeddings for each entity type. Group–topic aggregation matrices are computed as G = Mᵀ Θ, where M is a binary matrix that indicates group assignment.
- Visualization Engine: Built with Panel/Bokeh for high-level layout, D3.js for SVG interactivity, UMAP for dimensionality reduction, and word cloud libraries.
- User Interaction Layer: Implements master-detail linkage; selecting topics, words, or documents updates all relevant panels, allowing for fluid and contextually linked exploration.
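A minimal version of the caching behavior in the pre-computation module might key embeddings on a content hash of the input matrix; `cached_embedding` and the projection stand-in are illustrative sketches, not topicwizard API.

```python
import hashlib
import numpy as np

_cache = {}

def _key(X):
    # Content hash of the matrix: recompute only when the data changes.
    return hashlib.sha256(np.ascontiguousarray(X).tobytes()).hexdigest()

def cached_embedding(X, compute):
    k = _key(X)
    if k not in _cache:
        _cache[k] = compute(X)
    return _cache[k]

calls = []
def fake_umap(X):
    calls.append(1)           # count real computations
    return X[:, :2].copy()    # stand-in for a UMAP projection

X = np.arange(12.0).reshape(4, 3)
a = cached_embedding(X, fake_umap)
b = cached_embedding(X, fake_umap)  # served from cache; no recompute
```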
3. Mathematical Foundations and Matrix Transformations
topicwizard provides formal computations to support quantitative and visual analytics:
- Topic Importance: Size in inter-topic maps reflects topic prevalence, p_k = (Σ_d Θ_{d,k} · n_d) / (Σ_d n_d), where n_d is the length of document d.
- Group–Topic Aggregation: For user-defined groups (e.g., publication years, authors), the matrix G is constructed as above, facilitating group-level mapping and topic prevalence analysis.
- Word Embeddings: Each vocabulary word w is embedded into ℝ^K via its topic–term column φ_{·,w}, then mapped to two dimensions with UMAP to create word maps.
- Word–Topic Distribution Plot: For a given word w, the topic distribution is quantified as p(k | w) = φ_{k,w} / Σ_{k′} φ_{k′,w}, and visualized as a bar chart, normalized to reflect topic association for in-corpus occurrences.
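These quantities reduce to a few NumPy lines; the matrices below are tiny illustrative stand-ins with the shapes and normalizations described above.

```python
import numpy as np

phi = np.array([[0.6, 0.3, 0.1],   # φ: 2 topics × 3 terms, rows sum to 1
                [0.1, 0.2, 0.7]])
theta = np.array([[0.8, 0.2],      # Θ: 3 docs × 2 topics, rows sum to 1
                  [0.5, 0.5],
                  [0.1, 0.9]])
n = np.array([10, 20, 30])         # document lengths n_d

# Topic importance: p_k = (Σ_d Θ_{d,k} · n_d) / (Σ_d n_d)
p = (theta * n[:, None]).sum(axis=0) / n.sum()

# Group–topic aggregation: G = Mᵀ Θ, with M a binary assignment matrix
M = np.array([[1, 0], [1, 0], [0, 1]])  # docs 0–1 in group 0, doc 2 in group 1
G = M.T @ theta

# Word–topic distribution: p(k | w) = φ_{k,w} / Σ_{k'} φ_{k',w}
w = 2
p_k_given_w = phi[:, w] / phi[:, w].sum()
```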
4. Interactive Visualization and Analytical Perspectives
topicwizard supports several coordinated visualizations, organized by analytical entity:
- Topics: The inter-topic map displays topics projected into 2D via UMAP, with circle sizes reflecting topic prevalence p_k. Interactions allow for topic selection, relabeling, and the display of topic–word bar charts (showing both φ_{k,w} and global word frequency backgrounds), as well as topic-specific word clouds.
- Words: The word map provides cosine-based word neighborhood highlighting, and for a selected word, a word–topic distribution plot showing topic association for in-context usage.
- Documents: The document map projects Θ (or document embeddings) to 2D points, colored by dominant topic. It includes metadata-based filtering (e.g., by date), document–topic bar charts, and a timeline interface for dynamic topic prevalence within long documents. Inline snippet viewers highlight topic-informative words.
- Groups: UMAP of group–topic vectors (rows of G) provides a map for user-selected groupings. Group–topic bar charts and group word clouds indicate salient semantic content and topic distribution by group.
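Coloring documents by their dominant topic, as in the document map, is a one-line argmax over Θ; the arrays and palette below are illustrative.

```python
import numpy as np

theta = np.array([[0.7, 0.2, 0.1],   # Θ: 3 docs × 3 topics
                  [0.1, 0.1, 0.8],
                  [0.3, 0.5, 0.2]])

dominant = theta.argmax(axis=1)                  # dominant topic per document
palette = np.array(["#1f77b4", "#ff7f0e", "#2ca02c"])
colors = palette[dominant]                       # one color per document point
```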
5. Interface, Usability, and Supported Analytical Tasks
topicwizard is designed for rapid cognitive switching between analytical views, facilitated by a UI that synchronizes linked selections and filters across all panels. Supported tasks include:
- Assessing Topic Coherence: Switching between word lists, bar charts, and word clouds per topic to investigate semantic consistency.
- Comparing Models: Loading multiple topic models in parallel with synchronized visualization, directly supporting comparative cluster shape/coherence analysis (e.g., LDA versus KeyNMF or CTM).
- Metadata-Based Filtering: UI elements such as date sliders and category checkboxes trigger immediate recalculation and re-rendering of the topic and document maps, enabling local, temporally, or categorically focused analysis.
- Drill-down Exploration: Users can select document clusters, examine their topic rankings, and launch topic-colored document viewers. Rapid topic renaming and label export streamline iterative model refinement and reporting workflows.
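Metadata-based filtering of the kind described above amounts to masking rows of Θ and re-deriving topic prevalence for the selection; all names are illustrative, and uniform document lengths are assumed for simplicity.

```python
import numpy as np

theta = np.array([[0.9, 0.1],   # Θ: 4 docs × 2 topics
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.1, 0.9]])
dates = np.array(["2023-01", "2023-06", "2024-01", "2024-06"])

# Date-slider selection: keep only 2024 documents.
mask = np.char.startswith(dates, "2024")
theta_sel = theta[mask]

# Topic prevalence recomputed over the selection.
prevalence = theta_sel.mean(axis=0)
```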
6. Performance, Scalability, and Deployment Strategies
To meet scalability requirements for large vocabularies and corpora, topicwizard employs:
- Caching and Incremental UMAP: Embedding computations are cached and only recomputed when necessitated by model or data changes.
- Asynchronous Updates: Heavy computational tasks, such as large-scale filtering, execute asynchronously within a thread pool, ensuring UI responsiveness.
- Efficient Data Structures: For BoW topic models, the φ and Θ matrices are stored in compressed sparse row (CSR) format, enabling rapid slicing and dot-product operations for dynamic filtering.
- Cloud and Containerization: The web-application can be containerized with a single command, suitable for resource-constrained or cloud-based deployment (e.g., via Kubernetes, AWS ECS).
- Publication-Grade Output: The Figures API allows direct, code-path-consistent export of visualizations as PNG or SVG images for integration into publications or reports.
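The asynchronous-update pattern can be sketched with the standard library's thread pool: a heavy recomputation runs off the main thread while the caller awaits the future, keeping the UI event loop free. The function and arrays here are illustrative stand-ins, not topicwizard internals.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def heavy_filter(theta, mask):
    # Stand-in for an expensive recomputation over the filtered corpus.
    return theta[mask].mean(axis=0)

theta = np.random.default_rng(1).random((10_000, 5))
theta /= theta.sum(axis=1, keepdims=True)  # rows of Θ sum to 1
mask = np.arange(len(theta)) % 2 == 0      # e.g., a metadata selection

with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(heavy_filter, theta, mask)
    # ... the UI event loop would stay responsive here ...
    prevalence = future.result()
```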
7. Empirical Demonstrations and Adoption
Several case studies demonstrate topicwizard's capacity for domain-specific analyses:
- KeyNMF on Chinese Diaspora Media: Application of a transformer-based contextual topic model (KeyNMF) to Chinese diaspora news archives enabled multi-period analysis of pro- and anti-PRC narrative evolution prior to the 2024 European elections.
- Model Comparison: Using both LDA and CTM models, topicwizard supports direct, visual investigation of topic coherence and prevalence, enabling robust model selection and validation.
- Short Text Modeling with tweetopic: In modeling tweet corpora, topic-aligned clusters in the document map were found to correspond to well-known hashtags and event markers.
- Adoption Metrics: As of publication, topicwizard surpassed 45,000 PyPI downloads and has seen adoption across digital humanities and enterprise BI labs, supporting its practical utility in academic and business contexts (Kardos et al., 19 May 2025).
These cumulative features provide a model-agnostic, multi-faceted platform for topic model exploration, grounding interpretive visualizations directly in the statistical and textual substrates of the input corpora.