ChatGLM Family of Models
- The ChatGLM Family of Models is a collection of language models that use data-diverse strategies to improve uncertainty estimation and robustness.
- They rely on data partitioning, mutual information-based regularization, and determinantal point processes to maximize ensemble diversity and maintain performance under distribution shift.
- Reported results show reduced prediction error and enforced group fairness, in both large-scale language tasks and fairness-aware data summarization.
Data-Diverse Drafts (DDD) are a collection of algorithmic methods for maximizing the diversity of models, hypotheses, or data subsets in machine learning. Their common design principle is to exploit epistemic diversity (among distilled network drafts, predictive hypotheses, or sampled summaries) to improve robustness, uncertainty quantification, and fairness. DDD methods appear in epistemic uncertainty estimation for LLMs, in learning from underspecified data under distribution shift, and in fairness-aware data summarization. Core techniques include strategically partitioned training data, mutual information-based regularization, and constrained determinantal point processes.
1. Theoretical Rationale for Data-Diverse Drafts
The DDD paradigm amplifies model or hypothesis diversity to address fundamental sources of uncertainty and brittleness in machine learning. For token-level epistemic uncertainty in LLMs, the diversity of predictive distributions within an ensemble, quantified by the Jensen–Shannon divergence (JSD), serves as the principal "variance proxy" in a bias–variance decomposition of the ensemble's error.
The quantities involved are the predictive distribution $p_k$ of the $k$-th draft, the uniform mixture $\bar{p}$ of the drafts, and the teacher/target model $p^*$. Diversity that comes purely from initialization or injected noise is quickly destroyed by standard distillation; DDD circumvents this by training each draft on a distinct data partition, thereby maximizing the ensemble's JSD and improving uncertainty quantification (Park et al., 2 Feb 2026).
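One standard identity that fits this description is the ensemble ambiguity decomposition for KL divergence. It is stated here as an assumption about the intended form, not as the source's exact equation. Writing $p_k$ for the $k$-th of $K$ drafts, $\bar{p} = \frac{1}{K}\sum_k p_k$ for the uniform mixture, and $p^*$ for the teacher:

$$\frac{1}{K}\sum_{k=1}^{K} \mathrm{KL}(p_k \,\|\, p^*) \;=\; \mathrm{KL}(\bar{p} \,\|\, p^*) \;+\; \underbrace{\frac{1}{K}\sum_{k=1}^{K} \mathrm{KL}(p_k \,\|\, \bar{p})}_{\text{JSD}(p_1,\dots,p_K)}$$

The sketch below checks this identity numerically with NumPy. The distributions are hypothetical next-token probabilities chosen for illustration, and the function names are not taken from any library.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def ensemble_jsd(drafts):
    """Generalized JSD: mean KL from each draft to the uniform mixture."""
    drafts = np.asarray(drafts, dtype=float)
    mixture = drafts.mean(axis=0)
    return float(np.mean([kl(p, mixture) for p in drafts]))

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = np.array([0.70, 0.20, 0.05, 0.05])
drafts = np.array([
    [0.60, 0.30, 0.05, 0.05],
    [0.80, 0.10, 0.05, 0.05],
    [0.50, 0.20, 0.20, 0.10],
])
mixture = drafts.mean(axis=0)

avg_draft_div = float(np.mean([kl(p, teacher) for p in drafts]))  # left-hand side
mixture_div = kl(mixture, teacher)                                # mixture-vs-teacher term
diversity = ensemble_jsd(drafts)                                  # JSD term

print(f"average draft KL = {avg_draft_div:.6f}")
print(f"mixture KL + JSD = {mixture_div + diversity:.6f}")
```

Under this identity, an ensemble of identical drafts has zero JSD, so the mixture is no closer to the teacher than any single draft. That is the degenerate case data partitioning is meant to avoid.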
In distribution shift and underspecification settings, DDD-inspired methods such as DivDis learn structurally distinct hypotheses (predictive functions) that are constrained to agree with the labeled source data while disagreeing as much as possible on unlabeled target data. This makes it possible to discover solutions that are robust to spurious correlations (Lee et al., 2022).
2. Algorithmic Implementations in DDD Frameworks
DDD for Uncertainty-Aware LLM Distillation
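The DDD method partitions a teacher-generated dataset $\mathcal{D}$ into $K$ disjoint subsets. An ensemble of $K$ drafts is created, with draft $k$ distilled solely on its assigned partition $\mathcal{D}_k$:
- Partitioning: $\mathcal{D} = \bigcup_{k=1}^{K} \mathcal{D}_k$, with $\mathcal{D}_i \cap \mathcal{D}_j = \emptyset$ for $i \neq j$.
- Draft Training: Each draft is trained (e.g., with LoRA) using Online Stochastic Distillation, optimizing a distillation objective against the teacher on its own partition $\mathcal{D}_k$ only.
- Aggregation: The diversity of the resulting ensemble is quantified by JSD; the uniform mixture of the drafts approximates the Bayesian model average (Park et al., 2 Feb 2026).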
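The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, the distillation loss is a plain forward KL against teacher probabilities (Online Stochastic Distillation itself is not reproduced here), and no actual model training is performed.

```python
import numpy as np

def partition_indices(n_examples, n_drafts, seed=0):
    """Shuffle example indices and split them into disjoint, near-equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_examples), n_drafts)

def distillation_loss(teacher_probs, draft_logits):
    """Mean forward KL(teacher || draft) over a batch of token positions.

    Assumption: a generic distillation loss used only for illustration.
    """
    shifted = draft_logits - draft_logits.max(axis=-1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    log_p = np.log(teacher_probs + 1e-12)
    return float(np.mean(np.sum(teacher_probs * (log_p - log_q), axis=-1)))

def mixture_prediction(draft_probs):
    """Uniform mixture over drafts; draft_probs has shape (K, ..., vocab)."""
    return np.mean(draft_probs, axis=0)

# Draft k would only ever see the examples indexed by parts[k].
parts = partition_indices(n_examples=10, n_drafts=3)

# Hypothetical teacher distributions at two token positions.
teacher_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
perfect_loss = distillation_loss(teacher_probs, np.log(teacher_probs))
uniform_loss = distillation_loss(teacher_probs, np.zeros((2, 3)))

# Aggregate two hypothetical drafts into the ensemble prediction.
draft_probs = np.stack([teacher_probs, np.full((2, 3), 1.0 / 3.0)])
ensemble = mixture_prediction(draft_probs)
```

Because the partitions are disjoint, no two drafts share a training example, which is the property the method relies on to keep the drafts from collapsing onto the same predictive distribution.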
DDD in the Diversify-and-Disambiguate (DivDis) Setting
For distributional robustness, DivDis implements DDD as follows:
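- Diversification: $H$ network heads $f_1, \dots, f_H$ with shared parameters $\theta$ are trained to minimize cross-entropy on labeled source data while minimizing the mutual information between their predictions on unlabeled target inputs, using an objective of the form
  $$\mathcal{L}(\theta) = \mathcal{L}_{\text{xent}}(\theta) + \lambda\,\mathcal{L}_{\text{MI}}(\theta),$$
  where $\mathcal{L}_{\text{MI}}$ enforces low pairwise mutual information on target predictions and $\lambda$ weights the diversity term (Lee et al., 2022).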
- Disambiguation: A small number of labeled target samples identify the function (head) with minimum target risk, typically by querying points of maximal head disagreement and selecting by empirical accuracy.
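A rough NumPy sketch of the two stages follows. It rests on several assumptions: the mutual information is estimated from the batch-averaged joint of two heads' predictions, disagreement is scored by total pairwise L1 distance, and all function names and toy predictions are hypothetical. It evaluates these quantities only and does not train a network.

```python
import numpy as np

def pairwise_mutual_information(probs_a, probs_b, eps=1e-12):
    """Empirical MI between two heads' predictions over a target batch.

    probs_a, probs_b: arrays of shape (n_examples, n_classes). The joint is
    the batch average of per-example outer products; MI is
    KL(joint || product of its marginals).
    """
    joint = np.einsum("ni,nj->ij", probs_a, probs_b) / probs_a.shape[0]
    marg_a = joint.sum(axis=1, keepdims=True)
    marg_b = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * (np.log(joint + eps) - np.log(marg_a * marg_b + eps))))

def most_disagreed_indices(head_probs, n_queries):
    """Rank target examples by total pairwise L1 distance between heads."""
    n_heads = head_probs.shape[0]
    score = np.zeros(head_probs.shape[1])
    for i in range(n_heads):
        for j in range(i + 1, n_heads):
            score += np.abs(head_probs[i] - head_probs[j]).sum(axis=-1)
    return np.argsort(-score, kind="stable")[:n_queries]

def select_head(head_probs, query_idx, query_labels):
    """Pick the head with the highest accuracy on the labeled queries."""
    preds = head_probs[:, query_idx].argmax(axis=-1)  # (n_heads, n_queries)
    acc = (preds == query_labels).mean(axis=1)
    return int(acc.argmax()), acc

# Toy target batch: head A and head B rely on different (independent) features.
head_a = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
head_b = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
head_probs = np.stack([head_a, head_b])

mi_identical = pairwise_mutual_information(head_a, head_a)  # redundant heads
mi_diverse = pairwise_mutual_information(head_a, head_b)    # independent heads

# Query the two most contested examples; suppose head A matches the true labels.
true_labels = np.array([0, 0, 1, 1])
query_idx = most_disagreed_indices(head_probs, n_queries=2)
best_head, accuracies = select_head(head_probs, query_idx, true_labels[query_idx])
```

Querying the points where heads disagree most is what makes a handful of labels sufficient: examples on which all heads agree carry no information about which head to keep.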
DDD via Fair and Diverse Subset Selection
Determinantal Point Processes (DPPs), enhanced by groupwise constraints, instantiate DDD for fair summarization:
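- Probabilistic Model: For a ground set $X$ partitioned into sensitive groups $X_1, \dots, X_p$, a fair summary $S \subseteq X$ (with $|S \cap X_j| = k_j$ for each group $j$) is sampled from the Partition DPP, $\Pr[S] \propto \det(L_S)$ over subsets that satisfy the group quotas, where $L_S$ is the principal submatrix of the kernel $L$ indexed by $S$.
- Sampling Algorithm: Algorithmic efficiency is achieved via a Sample-and-Project routine under a partition-balance assumption (Celis et al., 2018).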
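To make the probabilistic model concrete, the sketch below samples from a quota-constrained DPP by exhaustive enumeration. This brute-force approach is deliberately not the Sample-and-Project routine: it is only feasible for tiny ground sets and is meant to show what distribution is being targeted. The feature vectors, group assignment, and function names are hypothetical.

```python
import itertools
import numpy as np

def partition_dpp_distribution(L, groups, quotas):
    """Enumerate all subsets meeting the per-group quotas, with P(S) ∝ det(L_S).

    L: (n, n) PSD kernel; groups: list of index lists; quotas: items per group.
    Exhaustive enumeration, so only suitable for very small ground sets.
    """
    per_group = [itertools.combinations(g, k) for g, k in zip(groups, quotas)]
    subsets, weights = [], []
    for choice in itertools.product(*per_group):
        S = sorted(i for part in choice for i in part)
        subsets.append(tuple(S))
        weights.append(np.linalg.det(L[np.ix_(S, S)]))
    weights = np.clip(np.array(weights), 0.0, None)  # guard against tiny negatives
    return subsets, weights / weights.sum()

def sample_partition_dpp(L, groups, quotas, seed=0):
    """Draw one quota-satisfying subset from the enumerated distribution."""
    subsets, probs = partition_dpp_distribution(L, groups, quotas)
    rng = np.random.default_rng(seed)
    return subsets[rng.choice(len(subsets), p=probs)]

# Hypothetical 2-D feature vectors; items 0-2 form group A, items 3-4 group B.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
L = features @ features.T
groups = [[0, 1, 2], [3, 4]]
quotas = [1, 1]

subsets, probs = partition_dpp_distribution(L, groups, quotas)
summary = sample_partition_dpp(L, groups, quotas)
```

In this toy kernel, pairs of nearly orthogonal items such as (0, 4) receive far more probability than nearly parallel pairs such as (0, 3), while every candidate subset contains exactly one item from each group.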
3. Theoretical Guarantees and Analytical Metrics
The bias–variance decomposition underpins DDD for LLMs and makes the gain explicit: it indicates a reduction in epistemic-uncertainty error relative to diverse-initialization or noise-injected baselines (Park et al., 2 Feb 2026). In the 8B→3B setting reported in the table in Section 4, RMSE falls from 0.3266 to 0.2036, about 38% lower. Data partitioning directly affects the JSD-based variance proxy, and ablations show only marginal further improvement as the number of partitions increases.
For DPP-based DDD, the Sample-and-Project algorithm satisfies an approximation bound that depends on a parameter quantifying partition balance and on a combinatorial term; the loss in geometric diversity remains negligible compared to unconstrained sampling.
In DivDis, theory demonstrates that a minimal set of target-labeled queries suffices for confident head selection, with the number of required queries bounded in terms of the risk gap between top heads.
4. Empirical Performance and Benchmarks
Table: Selected Empirical Gains
| Domain | Baseline (Metric) | DDD/DivDis/Partition DPP (Metric) | Reference |
|---|---|---|---|
| Uncertainty (8B→3B) | RMSE=0.3266 (Baseline) | 0.2036 (DDD) | (Park et al., 2 Feb 2026) |
| OOD Accuracy (Waterbirds-CC) | 7% (ERM), 47% (Group DRO) | 82% (DivDis, 16 target labels) | (Lee et al., 2022) |
| Subset Diversity (CelebA) | Highest ($k$-DPP) | ~equal ($P$-DPP), perfect fairness | (Celis et al., 2018) |
| Hallucination AUROC | 0.7823 (TokUR) | 0.7839 (DDD, 6×3B+1×3B drafts) | (Park et al., 2 Feb 2026) |
On GSM8K (arithmetic reasoning), DDD yields state-of-the-art epistemic uncertainty estimation and matches the AUROC of compute-heavy full-ensemble baselines at 0.58× FLOPs. In complete-correlation vision and language tasks (Waterbirds, CelebA, MultiNLI), DivDis outperforms classical ERM and group-robust objectives, often by large margins, with minimal additional target labeled supervision (Lee et al., 2022). For fair data summarization, Partition DPPs enforce strict group proportions with negligible geometric diversity loss (Celis et al., 2018).
5. Limitations and Contextual Boundaries
While DDD maximizes ensemble diversity and quantifies epistemic uncertainty effectively, the improvements depend on the partitioning regime and on the data being sufficiently diverse: if subpopulations are not well represented across the data splits, the gains in the variance proxy and in robustness diminish. Standard distillation or fine-tuning without DDD can produce near-degenerate diversity (JSD near zero), which leads to underestimation of epistemic risk. In fairness-driven summarization, the theoretical price of group constraints is a multiplicative factor in the sampling distribution, but the trade-off observed empirically is minimal (Celis et al., 2018).
6. Extensions and Interdisciplinary Relevance
The DDD framework is extensible across supervised, semi-supervised, and unsupervised tasks. It interfaces with active learning (label-efficient disambiguation), out-of-distribution robustness, model interpretability (via head-level feature attribution), and fairness in automated data curation. For LLM inference, DDD integrates naturally with scalable distillation and draft-ensemble pipelines, while for fairness, it advances kernel-geometric approaches to ensure subgroup representation. The general principle—explicit maximization of epistemic or geometric diversity under operational constraints—offers a unifying design axis in modern robust and responsible ML practice.
Principal references: (Park et al., 2 Feb 2026, Lee et al., 2022, Celis et al., 2018).