PromptTopics in LLM Analytics
- PromptTopic is a cluster of thematically coherent user prompts extracted using transformer-based topic modeling methods.
- Extraction employs robust preprocessing, sentence-transformer embeddings, UMAP dimensionality reduction, and HDBSCAN clustering, with soft topic probabilities assigned over the resulting clusters.
- Applications include analyzing user preferences, diagnosing model performance, and supporting domain-specific fine-tuning with empirical validation.
PromptTopics are thematically coherent clusters automatically identified in large-scale conversational data or text corpora using transformer-based topic modeling methods. In the context of LLM usage analytics, PromptTopics are derived via embedding, clustering, and labeling pipelines that associate each user prompt or conversation with a probability distribution over latent topics. This enables granular analysis of content patterns and their downstream connections to system performance, user preferences, or domain adaptation strategies (Bhandarkar et al., 8 Oct 2025).
1. Preprocessing Workflow for PromptTopic Discovery
The initial step in PromptTopic extraction is robust text preprocessing. For multilingual conversational corpora such as LMSYS-Chat-1M, this pipeline comprises several stages (a minimal code sketch follows the list):
- Data ingestion: Each record contains a user prompt, two LLM responses, and a human preference label. Each record is treated as one document, with the entire conversation (prompt plus both responses) concatenated as the analysis unit.
- Language filtering: FastText’s language-ID classifier retains only documents in the target language (e.g., English) with sufficient confidence.
- Text cleaning and normalization: URLs, emojis, non-ASCII characters, redactions (e.g., "[…redacted…]"), repeated punctuation, and short non-informative tokens are stripped or collapsed. Boilerplate instructions ("Write me a poem...") can optionally be removed; empirical tests showed this does not impact topic coherence.
- Final assembly: Yields a cleaned corpus (e.g., ~210,000 English conversational prompts for BERTopic analysis) (Bhandarkar et al., 8 Oct 2025).
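The sketch below illustrates this workflow. It assumes fastText's lid.176.bin language-ID model is available locally; the record field names ('prompt', 'response_a', 'response_b'), the confidence threshold, and the cleaning regexes are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal preprocessing sketch; thresholds, regexes, and field names are
# illustrative assumptions, not the paper's exact values.
import re

import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model

URL_RE = re.compile(r"https?://\S+")
REDACTED_RE = re.compile(r"\[[^\]]*redacted[^\]]*\]", re.IGNORECASE)
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")      # strips emojis, etc.
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1{2,}")  # "!!!" -> "!"

def is_english(text: str, min_conf: float = 0.8) -> bool:
    """Retain only documents classified as English with sufficient confidence."""
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= min_conf

def clean(text: str) -> str:
    """Strip URLs, redactions, and non-ASCII runs; collapse repeated punctuation."""
    for pattern in (URL_RE, REDACTED_RE, NON_ASCII_RE):
        text = pattern.sub(" ", text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(records):
    """Yield cleaned analysis units: prompt + both responses per record."""
    for r in records:
        doc = clean(" ".join((r["prompt"], r["response_a"], r["response_b"])))
        if len(doc.split()) >= 5 and is_english(doc):  # drop short/noisy docs
            yield doc
```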
2. BERTopic Pipeline: Embedding, Clustering, and Topic Extraction
PromptTopic analysis leverages a multi-stage pipeline based on BERTopic, comprising four steps:
a. Embedding
Each document $d_i$ is embedded using a sentence transformer (e.g., all-MiniLM-L6-v2), producing a fixed-dimensional vector $\mathbf{e}_i \in \mathbb{R}^{384}$.
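A minimal embedding sketch with the sentence-transformers library; the model name comes from the text, while `docs` (the cleaned corpus) and the batch size are assumptions:

```python
# Encode each cleaned document into a fixed-dimensional vector.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, batch_size=64, show_progress_bar=True)
# embeddings has shape (n_docs, 384): one vector e_i per document d_i
```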
b. Dimensionality Reduction
UMAP projects the embeddings from high- to low-dimensional space (e.g., from $\mathbb{R}^{384}$ to $m = 5$ or $m = 10$ dimensions), minimizing the cross-entropy between high- and low-dimensional fuzzy simplicial graphs.
Let $\mu_{ij}$ be the local fuzzy similarity between documents $i$ and $j$:
$\mu_{ij} = \exp\left(-\frac{\max\left(0,\, d(\mathbf{e}_i, \mathbf{e}_j) - \rho_i\right)}{\sigma_i}\right)$
where $d$ is Euclidean distance, $\rho_i$ is the distance to the nearest neighbor of $i$, and $\sigma_i$ controls the spread.
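A reduction sketch with umap-learn; the hyperparameter values are common BERTopic-style defaults, assumed here rather than taken from the source:

```python
# Project the sentence embeddings into a low-dimensional space for clustering.
from umap import UMAP

reducer = UMAP(
    n_neighbors=15,    # neighborhood size behind rho_i and sigma_i
    n_components=5,    # target dimensionality m
    min_dist=0.0,      # tight packing helps density-based clustering
    metric="cosine",   # distance in the original embedding space
    random_state=42,
)
reduced = reducer.fit_transform(embeddings)  # shape (n_docs, 5)
```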
c. Clustering
HDBSCAN constructs a density-based hierarchy in the reduced space, extracting clusters ("topics") by maximizing cluster stability. The mutual reachability distance is
$d_{\mathrm{mreach}}(a, b) = \max\left\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\right\}$
where $d(a, b)$ is the reduced-space distance and $\mathrm{core}_k(a)$ is the distance from $a$ to its $k$th nearest neighbor. Clusters are labeled as topics; outliers or singleton points are marked as noise.
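A clustering sketch with the hdbscan library; min_cluster_size is an assumed value, and prediction_data is retained for the soft assignments described in Section 3:

```python
# Cluster the reduced embeddings; each cluster becomes a candidate topic.
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,             # smallest allowed topic
    metric="euclidean",              # d(a, b) in the reduced space
    cluster_selection_method="eom",  # excess of mass: maximize cluster stability
    prediction_data=True,            # keep data for soft assignments (Section 3)
)
labels = clusterer.fit_predict(reduced)  # one topic id per document; -1 = noise
```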
d. Topic Labeling
For each cluster, all member documents are concatenated into a "pseudo-document," and the c-TF-IDF statistic is computed:
$\mathrm{TF}_c(t) = \frac{f_{t,c}}{\sum_{u\in V}f_{u,c}}, \qquad \mathrm{IDF}_{\mathrm{c\mbox{-}TFIDF}}(t) = \log\left(1 + \frac{A}{f_t}\right)$
where $A$ is the average term count per cluster, $f_{t,c}$ is the count of term $t$ in cluster $c$, $f_t$ is the total count of $t$ over all clusters, and $V$ is the vocabulary. The top-$k$ terms by $s_c(t) = \mathrm{TF}_c(t)\times \mathrm{IDF}_{\mathrm{c\mbox{-}TFIDF}}(t)$ yield human-readable topic labels (Bhandarkar et al., 8 Oct 2025).
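For transparency, the sketch below re-derives these c-TF-IDF scores directly from the formulas above using scikit-learn term counts; BERTopic ships its own ClassTfidfTransformer, so this standalone version is purely illustrative:

```python
# Compute c-TF-IDF keyword labels from one pseudo-document per cluster.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf_labels(pseudo_docs, top_k=10):
    """pseudo_docs: one concatenated string per cluster. Returns top-k terms each."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(pseudo_docs).toarray()  # f_{t,c}: (C, |V|)
    terms = vectorizer.get_feature_names_out()

    tf = counts / counts.sum(axis=1, keepdims=True)  # TF_c(t)
    f_t = counts.sum(axis=0)                         # total count of t over clusters
    A = counts.sum() / counts.shape[0]               # average term count per cluster
    scores = tf * np.log(1 + A / f_t)                # s_c(t) = TF_c(t) * IDF(t)

    return [[terms[i] for i in row.argsort()[::-1][:top_k]] for row in scores]
```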
3. Topic Representation and Probability Assignment
Although HDBSCAN yields hard assignments, BERTopic enables soft topic-probability distributions for each prompt. For each cluster $k$ (topic), a centroid $\mathbf{c}_k$ is computed in the low-dimensional embedding space, and the document-to-topic association is
$P(k \mid d_i) = \frac{\exp\left(-\lVert \mathbf{z}_i - \mathbf{c}_k \rVert / \tau\right)}{\sum_{k'} \exp\left(-\lVert \mathbf{z}_i - \mathbf{c}_{k'} \rVert / \tau\right)}$
where $\mathbf{z}_i$ is the reduced embedding of document $d_i$ and $\tau$ is a softmax temperature. This forms a complete topic-probability vector per prompt, capturing thematic uncertainty and overlap.
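A sketch of this assignment, assuming the centroid-softmax form given above; the temperature $\tau$ is a free parameter, not a value from the source:

```python
# Convert hard HDBSCAN labels into soft per-document topic distributions.
import numpy as np

def topic_probabilities(reduced, labels, tau=1.0):
    """Return an (n_docs, n_topics) matrix of P(k | d_i)."""
    labels = np.asarray(labels)
    topic_ids = sorted(set(labels) - {-1})  # ignore HDBSCAN noise points
    centroids = np.stack([reduced[labels == k].mean(axis=0) for k in topic_ids])
    dists = np.linalg.norm(reduced[:, None, :] - centroids[None, :, :], axis=-1)
    logits = -dists / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```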
4. Evaluating and Visualizing PromptTopics
Topic interpretability and separability are evaluated through several quantitative and qualitative tools:
- Inter-topic distance maps: Pairwise cosine distances between topic-word vectors $\mathbf{s}_c = \left(s_c(t)\right)_{t \in V}$, visualized with classical MDS or UMAP, reveal semantic relationships among topics.
- Model-versus-topic preference matrices: For benchmark datasets with system preference labels (e.g., model $A$ vs. model $B$ chosen by a user for prompt $d_i$), one tallies per-topic win rates (see the sketch after this list):
$W_{m,k} = \sum_{i\,:\,d_i \in k} \mathbb{1}\left[\text{model } m \text{ wins on } d_i\right]$
where $W_{m,k}$ is the count of wins by model $m$ in topic $k$, normalized per topic ($\hat{W}_{m,k} = W_{m,k} / \sum_{m'} W_{m',k}$) for heatmap visualization (Bhandarkar et al., 8 Oct 2025).
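A minimal win-rate computation with pandas; the column names 'topic' and 'winner' are hypothetical placeholders for whatever the evaluation frame provides:

```python
# Tally and normalize per-topic win counts for heatmap visualization.
import pandas as pd

def win_rate_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per prompt, with its topic and the preferred model.
    Returns normalized win rates (rows sum to 1), ready for a heatmap."""
    wins = pd.crosstab(df["topic"], df["winner"])  # W_{m,k} as topic x model
    return wins.div(wins.sum(axis=1), axis=0)      # per-topic normalization
```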
5. Applications and Interpretational Findings
Analysis on large conversational datasets uncovers 29+ high-coherence PromptTopics, including domains such as artificial intelligence, programming, social issues, cloud infrastructure, and advanced mathematics. Empirical findings show:
- User prompt volume is highly skewed (Pareto-like) across topics, e.g., 10 topics covering ~82% of all prompts.
- Model performance varies strongly and informatively by topic. For instance, GPT-4-1106-preview dominates technical areas, while GPT-4-0314 prevails on Social Issues & Ethics.
- PromptTopics support targeted system selection, fine-tuning, hybrid model routing, and diagnostic error analysis.
Visualization of topic embeddings confirms that, for instance, programming-related topics form tight clusters distinct from creative writing or conversational ones (Bhandarkar et al., 8 Oct 2025).
6. Methodological Considerations and Limitations
The approach is sensitive to:
- Embedding choice (transformer architecture, layer, pooling method)
- Clustering parameters and dimensionality reduction hyperparameters
- Preprocessing quality (especially for noisy or multilingual data)
- Topic redundancy, which is mitigated through iterated c-TF-IDF keyword extraction and, if necessary, further cluster merging.
A plausible implication is that future PromptTopic methodologies may integrate dynamic embedding adaptation (e.g., domain-specific fine-tuning) or end-to-end differentiable topic assignment for even finer control and cross-dataset comparability (Bhandarkar et al., 8 Oct 2025).
7. Practical Guidance for Reproduction
To replicate PromptTopic analysis (an end-to-end sketch follows the list):
- Preprocess and filter the target corpus according to language and noise removal protocols.
- Encode documents with a suitable sentence transformer.
- Apply UMAP for reduction and HDBSCAN for clustering.
- Compute c-TF-IDF for topic keyword extraction.
- Assign soft topic probabilities as needed, compute win-rate or model-alignment metrics for evaluation data, and visualize topic distributions.
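A compact end-to-end sketch wiring these steps through the BERTopic library itself; the hyperparameters are assumptions, and `docs` is the cleaned corpus from the preprocessing step:

```python
# Fit the full pipeline: embed, reduce, cluster, and label in one call.
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=50, metric="euclidean",
                          cluster_selection_method="eom", prediction_data=True),
    calculate_probabilities=True,  # soft document-topic distributions
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head(10))  # largest topics + c-TF-IDF keywords
```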
This set of techniques enables systematic, scalable exploration of implicit content domains within LLM datasets, supporting both descriptive analytics and model selection strategies (Bhandarkar et al., 8 Oct 2025).