LLM-MemCluster: LLM-Native Clustering
- LLM-MemCluster is a clustering framework that redefines text clustering as an LLM-native task with a stateful memory buffer and dynamic decision-making.
- It employs a dual-prompt strategy to switch between exploratory label creation and strict consolidation, keeping cluster granularity under automatic control.
- Experimental evaluations reveal substantial improvements in clustering metrics (ACC, NMI, ARI) across benchmarks, outperforming traditional methods.
LLM-MemCluster is a clustering framework that redefines text clustering as an end-to-end LLM-native task, equipping LLMs with a stateful memory buffer and dynamic decision-making through prompt engineering to achieve interpretable, effective, and tuning-free clustering. It eliminates the need for separate embedding, external clustering modules, or post hoc refinement, leveraging in-context reasoning and persistent memory within the LLM prompt. The framework is built upon innovations in dynamic memory management and dual-mode prompting, enabling both fine-grained cluster discovery and automatic consolidation without manual intervention (Zhu et al., 19 Nov 2025).
1. Dynamic Memory Architecture
The core innovation of LLM-MemCluster is the introduction of a prompt-driven memory buffer, denoted M, which holds all cluster labels discovered after processing the first i−1 instances. Each cluster label ℓ_j ∈ M is a human-readable descriptor (e.g., "Sports", "Machine Learning"), and this memory is included in the LLM's prompt at each subsequent step. The clustering operates as follows:
- Memory Read: For the current input x_i, the LLM receives a prompt containing M and outputs both an assigned label ℓ(x_i) (either reusing an existing label or proposing a new one) and, optionally, a merge suggestion (ℓ_a → ℓ_b), which instructs the consolidation of older labels.
- Memory Write: The label set is updated:
- Addition: If ℓ(x_i) ∉ M, then M ← M ∪ {ℓ(x_i)}.
- Merge: If a merge (ℓ_a → ℓ_b) is suggested, M ← M \ {ℓ_a}. All prior assignments of ℓ_a are retroactively relabeled to ℓ_b to guarantee global label consistency (see Eq. 6 in (Zhu et al., 19 Nov 2025)).
The clustering “memory” evolves incrementally, effectively encoding the cluster structure and permitting in-place consolidation and refinement. This dynamic memory enables statefulness in what would otherwise be a stateless LLM interaction loop.
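The memory write step above can be sketched in host code. A minimal sketch, assuming the LLM reply has already been parsed into a dict with a `label` field and an optional `merge` pair (a hypothetical format, not the authors' released interface):

```python
def apply_response(memory, assignments, doc_id, response):
    """Update the label memory and assignment log from one parsed LLM response.

    `response` is assumed to look like:
      {"label": "Sports", "merge": ("Soccer", "Sports")}   # "merge" is optional
    """
    label = response["label"]
    if label not in memory:          # Addition: a newly proposed label
        memory.append(label)
    assignments[doc_id] = label

    merge = response.get("merge")
    if merge:                        # Merge: consolidate old_label into new_label
        old_label, new_label = merge
        if old_label in memory:
            memory.remove(old_label)
        if new_label not in memory:
            memory.append(new_label)
        # Retroactive relabeling keeps all prior assignments globally consistent
        for d, lbl in assignments.items():
            if lbl == old_label:
                assignments[d] = new_label
    return memory, assignments
```

The merge branch is what makes the loop stateful: one response can rewrite earlier assignments, which a stateless per-document classifier could not do.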
2. Dual-Prompt Strategy for Cluster Discovery and Control
LLM-MemCluster regulates cluster granularity via a dual-prompt strategy, automatically switching between exploratory and consolidative modes. At each step, the system selects the mode from the current memory size |M|:
- Relaxed Mode: When |M| < K_max, the prompt encourages the LLM to create new labels for semantically novel inputs, supporting fine-grained cluster discovery.
- Strict Mode: Once |M| ≥ K_max (with K_max a user-supplied cap), the prompt constrains the LLM to reuse existing labels or suggest merges, prohibiting excessive cluster proliferation.
Both modes share a common prompt skeleton, differing only in system guidelines and user constraints. This dual-mode mechanism enforces an acceptable cluster count and enables the LLM to balance semantic expressiveness against model/prompt limitations. The automatic switch ensures early exploration followed by late-phase consolidation.
| Prompt Mode | Cluster Growth Policy | Trigger Condition |
|---|---|---|
| Relaxed | New labels allowed | \|M\| < K_max |
| Strict | Reuse/merge enforced | \|M\| ≥ K_max |
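The mode switch and the shared prompt skeleton can be sketched as follows; the guideline wording is illustrative, not the paper's exact prompt text:

```python
def select_mode(memory, k_max):
    """Dual-prompt switch: relaxed while |M| < K_max, strict afterwards."""
    return "relaxed" if len(memory) < k_max else "strict"

def build_prompt(doc, memory, mode):
    """Shared skeleton; only the guideline line differs between the two modes."""
    guideline = (
        "You may create a NEW label if no existing label fits."
        if mode == "relaxed"
        else "You MUST reuse an existing label, or suggest merging two labels."
    )
    return (
        f"Existing cluster labels: {memory}\n"
        f"{guideline}\n"
        f"Text: {doc}\n"
        "Answer with the chosen label (and an optional merge suggestion)."
    )
```

Because the skeleton is shared, switching modes changes only the constraint injected into the prompt, not the interaction protocol.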
3. End-to-End Single-Pass Clustering Workflow
LLM-MemCluster executes clustering in a single forward pass with no requirements for external centroid handling or multi-iteration loops. The workflow is:
- Initialize empty memory M ← ∅ and assignment log A ← ∅.
- For each instance x_i:
  - Determine the prompt mode from |M| and K_max.
  - Call the LLM with x_i, memory M, and the mode.
  - Update A with the assignment (x_i, ℓ(x_i)).
  - If ℓ(x_i) is new, append it to M.
  - If a merge suggestion (ℓ_a → ℓ_b) is returned, update M and retroactively relabel affected assignments in A.
- After the last document, partition the dataset by the unique labels in A.
All logic—assignment, cluster creation, merging—is encapsulated within prompt instructions to the LLM and management of the memory buffer and assignments in the host code. There are no numerical centroids, embedding updates, or external post-processing required.
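The whole single-pass loop can be sketched compactly, with `llm_call` as a stand-in for a real LLM API whose reply has been parsed into a `{"label": ..., "merge": ...}` dict (a hypothetical interface, not the authors' code):

```python
def mem_cluster(docs, llm_call, k_max):
    """One linear pass: assignment, label creation, and merging in host code."""
    memory, assignments = [], {}
    for i, doc in enumerate(docs):
        mode = "relaxed" if len(memory) < k_max else "strict"
        resp = llm_call(f"[{mode}] labels={memory} text={doc}")
        label = resp["label"]
        if label not in memory:          # new cluster discovered
            memory.append(label)
        assignments[i] = label
        merge = resp.get("merge")
        if merge:                        # consolidate old label into new label
            old, new = merge
            if old in memory:
                memory.remove(old)
            if new not in memory:
                memory.append(new)
            assignments = {d: (new if l == old else l)
                           for d, l in assignments.items()}
    # Final partition: group document indices by their surviving labels
    clusters = {}
    for d, l in assignments.items():
        clusters.setdefault(l, []).append(d)
    return clusters
```

Note that there is no embedding, centroid, or second pass anywhere: the only numeric state is the assignment log, and all clustering decisions are delegated to the prompted model.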
4. Relation to Memory-Based LLM Clustering and Existing Methods
LLM-MemCluster stands in contrast to embedding-driven and summary-driven memory clustering frameworks such as k-LLMmeans (Diaz-Rodriguez, 12 Feb 2025). In k-LLMmeans, cluster centroids are periodically replaced with LLM-generated summaries that are re-embedded, steering subsequent assignment steps. k-LLMmeans retains the core k-means properties (iterative optimization, numeric centroids, cluster count fixed a priori) but enhances semantic interpretability and cluster stability with periodic semantic summaries serving as "memories." The number of LLM calls in k-LLMmeans is independent of dataset size, and mini-batch variants support streaming data.
By contrast, LLM-MemCluster removes dependencies on external centroid computation and embedding space partitioning altogether. The cluster structure, growth, and memory are entirely managed within the LLM-prompt context, making the framework inherently end-to-end and stateful (Zhu et al., 19 Nov 2025).
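For contrast, the k-LLMmeans centroid update described above can be sketched as follows. `summarize` and `embed` are stand-ins for an LLM and a text encoder, and the period of 5 is illustrative; none of these names come from the original implementation:

```python
import numpy as np

def update_centroid(cluster_embeddings, cluster_texts, step,
                    summarize, embed, period=5):
    """k-LLMmeans-style update: most steps use the ordinary k-means mean,
    but periodically the centroid is replaced by the embedding of an
    LLM-generated cluster summary (the semantic "memory")."""
    if step % period == 0:
        summary = summarize(cluster_texts)      # LLM call, dataset-size independent
        return np.asarray(embed(summary))       # re-embedded summary as centroid
    return np.mean(np.asarray(cluster_embeddings), axis=0)
```

The key difference from LLM-MemCluster is visible in the return type: k-LLMmeans still optimizes numeric centroids in an embedding space, whereas LLM-MemCluster's state is a set of textual labels inside the prompt.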
5. Experimental Evaluation and Benchmarking
LLM-MemCluster was evaluated on six public text clustering benchmarks with cluster counts ranging from 18 to 102. Datasets include ArxivS2S, Massive-I, MTOP-I, Massive-D, FewNerd, and FewRel. Clustering performance was assessed by Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI).
When evaluated with GPT-4.1-mini and no task-specific tuning, LLM-MemCluster achieved the following:
- Substantial gains over strong baselines (including K-means on TF-IDF or large pre-trained embeddings, DBSCAN, spectral clustering, BERTopic, and the prior ClusterLLM method).
- On average: ACC +11.5%, NMI +5.3%, ARI +20.8% relative to the strongest baseline (ClusterLLM).
- Ablations reveal the necessity of dynamic memory (the No-Memory variant collapses to ARI ≈ 7.4%), few-shot prompting, and the dual-mode strategy for optimal results.
- Hyperparameter robustness: Varying the strict/relaxed switch by up to ±200 steps maintains ARI >50% in most cases.
- Portability to other LLMs (GPT-4.1, GPT-3.5-turbo, Gemini-2.5, DeepSeek) is validated, with ARIs in the 46–54% range.
A plausible implication is that the architectural principles—prompt-managed memory and dynamic control of clustering behavior—offer strong inductive biases for unsupervised grouping, reducing sensitivity to model or dataset choice (Zhu et al., 19 Nov 2025).
6. Principles, Advantages, and Limitations
Foundational Principles
- Prompt-State Memory: Explicit, editable state in the LLM input, allowing persistent, contextually-evolving label inventories and retroactive adjustment.
- Prompt-Driven Control: Cluster count and semantic granularity managed entirely via injected system and user prompt constraints.
Advantages
- End-to-End Flow: No external clustering engine or centroid computation; the LLM alone governs grouping and label evolution.
- Interpretable Labels: Cluster memory is intrinsically human-readable, facilitating interpretability, auditability, and downstream semantic understanding.
- Minimal Tuning: Robust to prompt threshold choices and LLM variant; few-shot examples and dual-mode prompting suffice.
- Single-Pass Efficiency: Operates in one linear pass, suitable for streaming and moderate-latency online tasks.
Limitations
- LLM Token Limits: For extremely large memory buffers, prompt window constraints may become non-trivial.
- Prompt Sensitivity: Although robust averages are reported, pathological prompt formulations or user constraints may degrade performance.
- Scalability Ceiling: The effective memory is limited by the maximum token length of the LLM; external strategies (e.g., memory summarization) may be needed for unbounded streams.
7. Comparative Perspective and Significance
LLM-MemCluster marks a shift toward LLM-native, memory-driven clustering methods that do not rely on vector-space partitioning or iterative numerical updates. Its explicit memory buffer, prompt-engineered dynamic mode switching, and single-pass operation distinguish it from classical and recent memory-based clustering frameworks such as k-LLMmeans (Diaz-Rodriguez, 12 Feb 2025). The empirical results demonstrate that stateful, prompt-based reasoning is sufficient for competitive, and in some cases superior, clustering performance—especially in settings requiring semantic interpretability and minimal infrastructure.
These findings imply a broader trend: LLM-oriented clustering, when equipped with prompt-managed memory and dynamic control, offers a viable, interpretable, and scalable alternative to both embedding-based and hybrid clustering models (Zhu et al., 19 Nov 2025, Diaz-Rodriguez, 12 Feb 2025).