Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

Published 4 Apr 2026 in cs.LG, cs.AI, and cs.MA | (2604.03809v1)

Abstract: Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 point per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.


Authors (1)

Dipkumar Patel

Summary

The paper diagnoses representational collapse among role-prompted LLM agents by measuring the pairwise cosine similarity and effective rank of their rationale embeddings.
It develops DALC, a training-free, diversity-aware consensus protocol that derives voting weights from embedding geometry and lowers token cost relative to self-consistency.
Experimental results on GSM8K and MATH-500 demonstrate DALC’s efficiency improvements and highlight encoder sensitivity in multi-agent settings.

Representational Collapse in Multi-Agent LLM Committees: Diagnostic Measurement and Diversity-Aware Consensus

Motivation and Background

Committee-based agentic architectures aggregate the outputs of replicated LLM agents, differentiated only by role prompts, to improve reliability and accuracy. Standard approaches such as self-consistency and majority voting assume that agent outputs are sufficiently independent and diverse to enable error correction and complementarity. This assumption is rarely tested empirically for LLM committees, especially those composed of identical models. The paper addresses this gap: it introduces the phenomenon of representational collapse, quantifies it using embedding-based statistics, and proposes DALC (Diversity-Aware Latent Consensus), a protocol that exploits the detected diversity at aggregation time.

Measuring Representational Collapse

Representational collapse is defined as excessive embedding similarity among chain-of-thought rationales generated by role-conditioned LLM agents. Using a frozen encoder (nomic-embed-text, 768d), embeddings from three Qwen2.5-14B agents (methodical solver, skeptical verifier, concise expert) reveal a mean pairwise cosine similarity of 0.888 and an effective rank of 2.17 (out of 3). The effective rank, computed from singular values, quantifies the intrinsic dimensionality spanned by the committee’s outputs. Results indicate that shifting role prompts produces minor surface variation in generated text but not substantial representational diversity; agents often occupy a narrow cone in embedding space. This structure undermines the independence assumption behind majority voting, risking error amplification.
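The two collapse statistics can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the summary only says effective rank is "computed from singular values", so the entropy-based definition below (exponential of the Shannon entropy of the normalized singular values) is an assumption, as are the row normalization and the function names.

```python
import numpy as np

def mean_pairwise_cosine(E):
    """Mean cosine similarity over all distinct pairs of rows of E (agents x dims)."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize each embedding
    S = X @ X.T                                       # cosine similarity matrix
    upper = np.triu_indices(len(E), k=1)              # each unordered pair once
    return float(S[upper].mean())

def effective_rank(E):
    """Assumed definition: exp of the entropy of the normalized singular values."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                                      # drop exact zeros before log
    return float(np.exp(-(p * np.log(p)).sum()))

# Synthetic check: three unit vectors whose pairwise cosines all equal 0.888.
c = 0.888
gram = (1 - c) * np.eye(3) + c * np.ones((3, 3))
E = np.linalg.cholesky(gram)                          # rows have the requested Gram matrix
print(mean_pairwise_cosine(E), effective_rank(E))
```

Under this assumed definition, three equally correlated vectors at cosine 0.888 give an effective rank of about 2.2, in the same range as the reported 2.17. Real embeddings will not be exactly equicorrelated, so the match is only indicative.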

Figure 1: DALC protocol: role-conditioning, embedding and decorrelation, hint sharing, and diversity-weighted voting; pre-projection cosine similarity 0.88–0.91 highlights collapse.

The DALC Protocol: Diversity-Aware Aggregation

DALC operationalizes this diagnostic. After short chain-of-thought generation, each rationale is embedded, and the embeddings are optionally projected to increase orthogonality (Gram-Schmidt, SVD, or identity, i.e. no projection). Each agent is assigned a diversity weight $w_i \propto 1 - \bar{s}_i$, where $\bar{s}_i$ is the agent’s mean cosine similarity to all other agents. Agents receive truncated hints from other agents’ rationales and produce final answers, which are aggregated by diversity-weighted voting. Unlike classical diversity-aware selection (MMR, DPPs), DALC retains all chains and adjusts their influence post hoc, which suits agentic deployments where generation is a sunk cost.
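The weighting rule $w_i \propto 1 - \bar{s}_i$ and the weighted vote can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, the normalization of weights to sum to one, and the small clipping floor (used to avoid all-zero weights when embeddings coincide) are choices made here, not details confirmed by the source. Embedding and hint sharing are omitted.

```python
from collections import defaultdict
import numpy as np

def diversity_weights(E):
    """w_i proportional to 1 - (mean cosine similarity of agent i to all others)."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = X @ X.T
    n = len(E)
    s_bar = (S.sum(axis=1) - np.diag(S)) / (n - 1)  # exclude self-similarity
    w = np.clip(1.0 - s_bar, 1e-12, None)           # assumption: floor to keep weights positive
    return w / w.sum()                              # assumption: normalize to sum to 1

def weighted_vote(answers, weights):
    """Return the answer with the largest total weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)

# Toy example: agents 0 and 1 are nearly redundant, agent 2 is more distinct.
E = np.array([[1.0, 0.0, 0.0],
              [0.98, 0.2, 0.0],
              [0.6, 0.0, 0.8]])
w = diversity_weights(E)
print(w, weighted_vote(["18", "18", "20"], w))
```

With three agents and normalized weights, a lone dissenter overturns a two-agent majority only when its weight exceeds 0.5, so in a collapsed committee with nearly equal weights the rule behaves much like plain majority voting.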

Experimental Evaluation

The paper benchmarks DALC against baselines on GSM8K and MATH-500 using Qwen2.5-Instruct at 14B and 7B. On GSM8K (14B, $n=100$), DALC-Id reaches 87% accuracy at 26% lower token cost than self-consistency (SC, 84%). DALC-SVD scores 86%; DALC-GS, despite perfect orthogonalization, falls to 83%. On MATH-500, DALC variants match or slightly surpass SC (57% vs. 56%) at roughly two-thirds the token cost. Token savings of 25–34% hold across methods and model scales.

Figure 2: DALC achieves comparable or higher accuracy than self-consistency at substantially reduced token cost, across benchmarks and scales.

Collapse Severity, Downstream Impact, and Encoder Sensitivity

Collapse metrics, averaged across 100 questions, confirm high redundancy (cosine $> 0.87$, rank $\approx 2.1$). SVD projection shifts the metrics negligibly. Hard Gram-Schmidt orthogonalization yields perfect diversity geometrically but not in downstream accuracy: DALC-GS underperforms both DALC-Id and DALC-SVD, suggesting that sentence-level embeddings are a coarse proxy for reasoning diversity. Collapse worsens on MATH-500 (harder tasks), consistent with the intuition that agents converge more tightly when confident alternative paths are scarce.
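Why Gram-Schmidt produces "perfect" geometric diversity regardless of what the agents actually wrote can be seen in a short sketch. The code below is an illustrative modified Gram-Schmidt over the rows of an embedding matrix; the function name and the synthetic data are assumptions, and the paper's exact projection step may differ.

```python
import numpy as np

def gram_schmidt_rows(E, tol=1e-12):
    """Orthonormalize the rows of E in order (modified Gram-Schmidt)."""
    basis = []
    for v in np.asarray(E, dtype=float):
        for b in basis:
            v = v - (v @ b) * b          # remove the component along each earlier row
        norm = np.linalg.norm(v)
        if norm < tol:
            raise ValueError("rows are linearly dependent; cannot orthogonalize")
        basis.append(v / norm)
    return np.vstack(basis)

# Three highly similar synthetic embeddings: a shared direction plus small noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=16)
E = shared + 0.1 * rng.normal(size=(3, 16))
Q = gram_schmidt_rows(E)
print(np.round(Q @ Q.T, 6))              # pairwise cosines after projection
```

After the projection every pairwise cosine is zero by construction, however similar the inputs were. If weights of the form $1 - \bar{s}_i$ were then computed from the projected vectors, they would all be equal, so the geometry would no longer distinguish redundant agents from distinct ones. This is one possible reading of why perfect orthogonality does not translate into accuracy, offered here as an interpretation rather than a result from the paper.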

Encoder sensitivity is pronounced: replacing nomic-embed-text with mxbai-embed-large further raises cosine similarity (0.908) and reduces effective rank (2.09), coinciding with DALC variants losing their advantage over baselines. This indicates that embedding geometry directly governs detectable diversity, and hence, protocol effectiveness. The embedding proxy must be validated for reasoning-relevant distinctions prior to deployment.

Ablation Studies and Robustness

Three ablation studies elucidate protocol contributions:

Run-to-run variance: Stochasticity in agent output produces 1–3 point swings in each protocol’s accuracy between independent runs, so accuracy gaps of similar size between protocols cannot be distinguished from noise.
No-hints ablation: Hint sharing contributes more to protocol performance than diversity weighting alone.
Encoder ablation: The latent protocol’s effectiveness collapses with an encoder that induces excessive similarity, underscoring the dependency on representation space.

Collapse measurements are stable across runs, confirming their structural significance.
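A complementary back-of-the-envelope check, not taken from the paper, is the binomial sampling error implied by evaluating on only 100 questions. The sketch below treats each accuracy as an independent binomial proportion. That is an assumption: the protocols are scored on the same questions, so a paired analysis would be more appropriate, and this source of error is separate from the run-to-run variance caused by stochastic decoding.

```python
import math

def accuracy_standard_error(acc, n):
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)

se_dalc = accuracy_standard_error(0.87, 100)   # DALC-Id on GSM8K
se_sc = accuracy_standard_error(0.84, 100)     # self-consistency on GSM8K
se_gap = math.sqrt(se_dalc**2 + se_sc**2)      # assumes independent estimates

print(f"SE(DALC-Id) = {se_dalc:.4f}")
print(f"SE(SC)      = {se_sc:.4f}")
print(f"SE(gap)     = {se_gap:.4f}")
```

The standard errors come out at roughly 3.4 and 3.7 points, and about 5 points for the gap under the independence assumption, so the 3-point difference between DALC-Id and SC is well under one standard error at this sample size.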

Theoretical and Practical Implications

Representational collapse presents a critical bottleneck for scalable multi-agent committees. Measuring embedding overlap in a pilot run provides a diagnostic for effective committee size: with cosine $> 0.88$ and effective rank near 2, three agents deliver roughly the diversity of two chains, which undermines the rationale for scaling up. This has direct implications for resource allocation and error resilience in agentic systems. Furthermore, unanimous agreement within a collapsed committee is not an indicator of robustness, so confidence policies should be augmented with diversity diagnostics (e.g., effective rank).
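One way to fold a diversity diagnostic into a confidence policy is sketched below. The function name, the returned labels, and the `min_rank_fraction` threshold are hypothetical choices for illustration; the source does not prescribe a specific rule or cutoff.

```python
def committee_confidence(answers, eff_rank, min_rank_fraction=0.8):
    """Label a committee outcome using agreement plus a diversity check.

    eff_rank is the committee's effective rank (between 1 and len(answers));
    min_rank_fraction is a hypothetical threshold, not a value from the paper.
    """
    n = len(answers)
    unanimous = len(set(answers)) == 1
    diverse = eff_rank >= min_rank_fraction * n
    if unanimous and diverse:
        return "high"
    if unanimous:
        return "unanimous-but-collapsed"
    return "disagreement"

# With the reported effective rank of 2.17 for three agents, 2.17 < 0.8 * 3,
# so a unanimous answer is flagged rather than trusted outright.
print(committee_confidence(["18", "18", "18"], eff_rank=2.17))
```

The threshold would need to be calibrated on a pilot set with a validated encoder, since the effective rank itself shifts with the choice of embedding model.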

The protocol and its findings challenge assumptions underlying latent communication methods, such as LatentMAS and ThoughtComm, as their effectiveness hinges on the embedding proxy. Practical deployment demands careful encoder selection and validation to guarantee that diversity measurements reflect genuine reasoning independence.

Limitations and Future Directions

Accuracy improvements are modest and within stochastic variance; a factorial design varying hint sharing, weighting, and projection would clarify each component’s contribution. Results are restricted to two Qwen2.5 scales, two encoder families, and $N=3$ agent committees. Future work should cover cross-architecture committees, larger $N$, prompt diversity, expert-level benchmarks, and hidden-state probes with more advanced similarity measures (CKA, SVCCA).

Conclusion

This research provides rigorous quantification of representational collapse in multi-agent LLM committees, demonstrating its prevalence and operational consequences in resource allocation and aggregation schemes. DALC offers a protocol to exploit diversity, but its benefit is constrained by collapse severity and encoder characteristics. Effective deployment of agentic systems requires embedding-based diagnostics and careful validation of representation spaces to ensure genuine reasoning independence, with implications for both practical efficiency and theoretical understanding of committee dynamics.
