PromptTopics in LLM Analytics
- PromptTopic is a cluster of thematically coherent user prompts extracted using transformer-based topic modeling methods.
- Extraction employs robust preprocessing, sentence-transformer embeddings, UMAP dimensionality reduction, and HDBSCAN clustering, with soft topic probabilities assigned over the resulting clusters.
- Applications include analyzing user preferences, diagnosing model performance, and supporting domain-specific fine-tuning with empirical validation.
PromptTopics are thematically coherent clusters automatically identified in large-scale conversational data or text corpora using transformer-based topic modeling methods. In the context of LLM usage analytics, PromptTopics are derived via embedding, clustering, and labeling pipelines that associate each user prompt or conversation with a probability distribution over latent topics. This enables granular analysis of content patterns and their downstream connections to system performance, user preferences, or domain adaptation strategies (Bhandarkar et al., 8 Oct 2025).
1. Preprocessing Workflow for PromptTopic Discovery
The initial step in PromptTopic extraction is robust text preprocessing. For multilingual conversational corpora such as LMSYS-Chat-1M, this pipeline comprises several stages (a minimal code sketch follows the list):
- Data ingestion: Each record contains a user prompt, two LLM responses, and a human preference label. Each record is treated as one document, with the entire conversation (prompt plus both responses) concatenated as the analysis unit.
- Language filtering: FastText’s language-ID classifier retains only documents in the target language (e.g., English) with sufficient confidence.
- Text cleaning and normalization: URLs, emojis, non-ASCII characters, redactions (e.g., "[…redacted…]"), repeated punctuation, and short non-informative tokens are stripped or collapsed. Boilerplate instructions ("Write me a poem...") can optionally be removed; empirical tests showed this does not impact topic coherence.
- Final assembly: Yields a cleaned corpus (e.g., ~210,000 English conversational prompts for BERTopic analysis) (Bhandarkar et al., 8 Oct 2025).
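The sketch below illustrates this workflow. It assumes fastText's lid.176.bin language-ID model is available locally; the record field names ('prompt', 'response_a', 'response_b'), the confidence threshold, and the cleaning regexes are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal preprocessing sketch; thresholds, regexes, and field names are
# illustrative assumptions, not the paper's exact values.
import re

import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model

URL_RE = re.compile(r"https?://\S+")
REDACTED_RE = re.compile(r"\[[^\]]*redacted[^\]]*\]", re.IGNORECASE)
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")      # strips emojis, etc.
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1{2,}")  # "!!!" -> "!"

def is_english(text: str, min_conf: float = 0.8) -> bool:
    """Retain only documents classified as English with sufficient confidence."""
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= min_conf

def clean(text: str) -> str:
    """Strip URLs, redactions, and non-ASCII runs; collapse repeated punctuation."""
    for pattern in (URL_RE, REDACTED_RE, NON_ASCII_RE):
        text = pattern.sub(" ", text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(records):
    """Yield cleaned analysis units: prompt + both responses per record."""
    for r in records:
        doc = clean(" ".join((r["prompt"], r["response_a"], r["response_b"])))
        if len(doc.split()) >= 5 and is_english(doc):  # drop short/noisy docs
            yield doc
```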
2. BERTopic Pipeline: Embedding, Clustering, and Topic Extraction
PromptTopic analysis leverages a multi-stage pipeline based on BERTopic, comprising four steps:
a. Embedding
Each document $d_i$ is embedded using a sentence transformer (e.g., all-MiniLM-L6-v2), producing a fixed-dimensional vector $\mathbf{e}_i \in \mathbb{R}^{384}$.
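A minimal embedding sketch with the sentence-transformers library; the model name comes from the text, while `docs` (the cleaned corpus) and the batch size are assumptions:

```python
# Encode each cleaned document into a fixed-dimensional vector.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, batch_size=64, show_progress_bar=True)
# embeddings has shape (n_docs, 384): one vector e_i per document d_i
```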
b. Dimensionality Reduction
UMAP projects the embeddings from high- to low-dimensional space (e.g., from $\mathbb{R}^{384}$ to $m = 5$ or $m = 10$ dimensions), minimizing the cross-entropy between high- and low-dimensional fuzzy simplicial graphs.
Let $\mu_{ij}$ be the local fuzzy similarity between documents $i$ and $j$:
$\mu_{ij} = \exp\left(-\frac{\max\left(0,\, d(\mathbf{e}_i, \mathbf{e}_j) - \rho_i\right)}{\sigma_i}\right)$
where $d$ is Euclidean distance, $\rho_i$ is the distance to the nearest neighbor of $i$, and $\sigma_i$ controls the spread.
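A reduction sketch with umap-learn; the hyperparameter values are common BERTopic-style defaults, assumed here rather than taken from the source:

```python
# Project the sentence embeddings into a low-dimensional space for clustering.
from umap import UMAP

reducer = UMAP(
    n_neighbors=15,    # neighborhood size behind rho_i and sigma_i
    n_components=5,    # target dimensionality m
    min_dist=0.0,      # tight packing helps density-based clustering
    metric="cosine",   # distance in the original embedding space
    random_state=42,
)
reduced = reducer.fit_transform(embeddings)  # shape (n_docs, 5)
```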
c. Clustering
HDBSCAN constructs a density-based hierarchy in the reduced space, extracting clusters ("topics") by maximizing cluster stability. The mutual reachability distance is
$d_{\mathrm{mreach}}(a, b) = \max\left\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\right\}$
where $d(a, b)$ is the reduced-space distance and $\mathrm{core}_k(a)$ is the distance from $a$ to its $k$th nearest neighbor. Clusters are labeled as topics; outliers or singleton points are marked as noise.
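A clustering sketch with the hdbscan library; min_cluster_size is an assumed value, and prediction_data is retained for the soft assignments described in Section 3:

```python
# Cluster the reduced embeddings; each cluster becomes a candidate topic.
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,             # smallest allowed topic
    metric="euclidean",              # d(a, b) in the reduced space
    cluster_selection_method="eom",  # excess of mass: maximize cluster stability
    prediction_data=True,            # keep data for soft assignments (Section 3)
)
labels = clusterer.fit_predict(reduced)  # one topic id per document; -1 = noise
```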
d. Topic Labeling
For each cluster, all member documents are concatenated into a "pseudo-document," and the c-TF-IDF statistic is computed:
$\mathrm{TF}_c(t) = \frac{f_{t,c}}{\sum_{u\in V}f_{u,c}}, \qquad \mathrm{IDF}_{\mathrm{c\mbox{-}TFIDF}}(t) = \log\left(1 + \frac{A}{f_t}\right)$
where $A$ is the average term count per cluster, $f_{t,c}$ is the count of term $t$ in cluster $c$, $f_t$ is the total count of $t$ over all clusters, and $V$ is the vocabulary. The top-$k$ terms by $s_c(t) = \mathrm{TF}_c(t)\times \mathrm{IDF}_{\mathrm{c\mbox{-}TFIDF}}(t)$ yield human-readable topic labels (Bhandarkar et al., 8 Oct 2025).
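For transparency, the sketch below re-derives these c-TF-IDF scores directly from the formulas above using scikit-learn term counts; BERTopic ships its own ClassTfidfTransformer, so this standalone version is purely illustrative:

```python
# Compute c-TF-IDF keyword labels from one pseudo-document per cluster.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf_labels(pseudo_docs, top_k=10):
    """pseudo_docs: one concatenated string per cluster. Returns top-k terms each."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(pseudo_docs).toarray()  # f_{t,c}: (C, |V|)
    terms = vectorizer.get_feature_names_out()

    tf = counts / counts.sum(axis=1, keepdims=True)  # TF_c(t)
    f_t = counts.sum(axis=0)                         # total count of t over clusters
    A = counts.sum() / counts.shape[0]               # average term count per cluster
    scores = tf * np.log(1 + A / f_t)                # s_c(t) = TF_c(t) * IDF(t)

    return [[terms[i] for i in row.argsort()[::-1][:top_k]] for row in scores]
```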
3. Topic Representation and Probability Assignment
Although HDBSCAN yields hard assignments, BERTopic enables soft topic-probability distributions for each prompt. For each cluster $k$ (topic), a centroid $\mathbf{c}_k$ is computed in the low-dimensional embedding space, and the document-to-topic association is
$P(k \mid d_i) = \frac{\exp\left(-\lVert \mathbf{z}_i - \mathbf{c}_k \rVert / \tau\right)}{\sum_{k'} \exp\left(-\lVert \mathbf{z}_i - \mathbf{c}_{k'} \rVert / \tau\right)}$
where $\mathbf{z}_i$ is the reduced embedding of document $d_i$ and $\tau$ is a softmax temperature. This forms a complete topic-probability vector per prompt, capturing thematic uncertainty and overlap.
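A sketch of this assignment, assuming the centroid-softmax form given above; the temperature $\tau$ is a free parameter, not a value from the source:

```python
# Convert hard HDBSCAN labels into soft per-document topic distributions.
import numpy as np

def topic_probabilities(reduced, labels, tau=1.0):
    """Return an (n_docs, n_topics) matrix of P(k | d_i)."""
    labels = np.asarray(labels)
    topic_ids = sorted(set(labels) - {-1})  # ignore HDBSCAN noise points
    centroids = np.stack([reduced[labels == k].mean(axis=0) for k in topic_ids])
    dists = np.linalg.norm(reduced[:, None, :] - centroids[None, :, :], axis=-1)
    logits = -dists / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```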
4. Evaluating and Visualizing PromptTopics
Topic interpretability and separability are evaluated through several quantitative and qualitative tools:
- Inter-topic distance maps: Pairwise cosine distances between topic-word vectors $\mathbf{s}_c = \left(s_c(t)\right)_{t \in V}$, visualized with classical MDS or UMAP, reveal semantic relationships among topics.
- Model-versus-topic preference matrices: For benchmark datasets with system preference labels (e.g., model $A$ vs. model $B$ chosen by a user for prompt $d_i$), one tallies per-topic win rates (see the sketch after this list):
$W_{m,k} = \sum_{i\,:\,d_i \in k} \mathbb{1}\left[\text{model } m \text{ wins on } d_i\right]$
where $W_{m,k}$ is the count of wins by model $m$ in topic $k$, normalized per topic ($\hat{W}_{m,k} = W_{m,k} / \sum_{m'} W_{m',k}$) for heatmap visualization (Bhandarkar et al., 8 Oct 2025).
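A minimal win-rate computation with pandas; the column names 'topic' and 'winner' are hypothetical placeholders for whatever the evaluation frame provides:

```python
# Tally and normalize per-topic win counts for heatmap visualization.
import pandas as pd

def win_rate_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per prompt, with its topic and the preferred model.
    Returns normalized win rates (rows sum to 1), ready for a heatmap."""
    wins = pd.crosstab(df["topic"], df["winner"])  # W_{m,k} as topic x model
    return wins.div(wins.sum(axis=1), axis=0)      # per-topic normalization
```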
5. Applications and Interpretational Findings
Analysis on large conversational datasets uncovers 29+ high-coherence PromptTopics, including domains such as artificial intelligence, programming, social issues, cloud infrastructure, and advanced mathematics. Empirical findings show:
- User prompt volume is highly skewed (Pareto-like) across topics, e.g., 10 topics covering ~82% of all prompts.
- Model performance varies strongly and informatively by topic. For instance, GPT-4-1106-preview dominates technical areas, while GPT-4-0314 prevails on Social Issues & Ethics.
- PromptTopics support targeted system selection, fine-tuning, hybrid model routing, and diagnostic error analysis.
Visualization of topic embeddings confirms that, for instance, programming-related topics form tight clusters distinct from creative writing or conversational ones (Bhandarkar et al., 8 Oct 2025).
6. Methodological Considerations and Limitations
The approach is sensitive to:
- Embedding choice (transformer architecture, layer, pooling method)
- Clustering parameters and dimensionality reduction hyperparameters
- Preprocessing quality (especially for noisy or multilingual data)
- Topic redundancy, which is mitigated through iterated c-TF-IDF keyword extraction and, if necessary, further cluster merging.
A plausible implication is that future PromptTopic methodologies may integrate dynamic embedding adaptation (e.g., domain-specific fine-tuning) or end-to-end differentiable topic assignment for even finer control and cross-dataset comparability (Bhandarkar et al., 8 Oct 2025).
7. Practical Guidance for Reproduction
To replicate PromptTopic analysis (an end-to-end sketch follows the list):
- Preprocess and filter the target corpus according to language and noise removal protocols.
- Encode documents with a suitable sentence transformer.
- Apply UMAP for reduction and HDBSCAN for clustering.
- Compute c-TF-IDF for topic keyword extraction.
- Assign soft topic probabilities as needed, compute win-rate or model-alignment metrics for evaluation data, and visualize topic distributions.
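A compact end-to-end sketch wiring these steps through the BERTopic library itself; the hyperparameters are assumptions, and `docs` is the cleaned corpus from the preprocessing step:

```python
# Fit the full pipeline: embed, reduce, cluster, and label in one call.
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=50, metric="euclidean",
                          cluster_selection_method="eom", prediction_data=True),
    calculate_probabilities=True,  # soft document-topic distributions
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head(10))  # largest topics + c-TF-IDF keywords
```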
This set of techniques enables systematic, scalable exploration of implicit content domains within LLM datasets, supporting both descriptive analytics and model selection strategies (Bhandarkar et al., 8 Oct 2025).