
ChatGLM Family of Models

Updated 2 March 2026
  • ChatGLM Family of Models is a collection of state-of-the-art language models that incorporate advanced data-diverse strategies for improved uncertainty estimation and robustness.
  • They employ principled techniques such as data partitioning, mutual information-based regularization, and determinantal point processes to maximize ensemble diversity and performance under distribution shifts.
  • These models demonstrate empirical gains in reducing prediction error and ensuring fairness, making them effective for large-scale language tasks and fairness-aware data summarization.

Data-Diverse Drafts (DDD) comprise a principled collection of algorithmic methodologies aimed at maximizing the diversity of models, hypotheses, or data subsets in modern machine learning. Central to their design is the use of epistemic diversity—whether among distilled network drafts, predictive hypotheses, or sampled summaries—to enhance robustness, uncertainty quantification, and fairness. DDD methods are featured in epistemic uncertainty estimation for LLMs, learning from underspecified data in distribution shift scenarios, and fairness-aware data summarization. Core techniques include strategically partitioned training data, mutual information-based regularization, and constrained determinantal point processes.

1. Theoretical Rationale for Data-Diverse Drafts

The DDD paradigm targets the amplification of model or hypothesis diversity to address fundamental sources of uncertainty and brittleness in machine learning. For token-level epistemic uncertainty in LLMs, the diversity of predictive distributions within an ensemble (quantified by Jensen–Shannon divergence, JSD) constitutes the principal "variance proxy" in bias–variance decompositions:

$$\mathbb{E}_{k}\left[\mathrm{KL}\big(q_k \Vert p_T\big)\right] = \underbrace{\frac{1}{K}\sum_{k=1}^K \mathrm{KL}\big(q_k \Vert q_{mix}\big)}_{\text{Variance Proxy (JSD)}} + \underbrace{\mathrm{KL}\big(q_{mix} \Vert p_T\big)}_{\text{Bias Proxy}}$$

Here, $q_k$ denotes the predictive distribution of the $k$-th draft, $q_{mix}$ the uniform mixture of the $K$ drafts, and $p_T$ the teacher/target model. Purely initialization- or noise-based diversity is quickly destroyed by standard distillation; DDD circumvents this via distinct data partitioning for draft training, thereby maximizing the ensemble's JSD and improving uncertainty quantification (Park et al., 2 Feb 2026).
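The decomposition is an exact identity for any finite set of discrete distributions, which can be checked numerically. The sketch below uses made-up next-token distributions for three drafts and a made-up teacher distribution; the numbers and the helper names `kl` and `decompose` are illustrative assumptions, not values or code from the cited work.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decompose(drafts, p_T):
    """Return (average KL to teacher, variance proxy, bias proxy)."""
    K = len(drafts)
    # Uniform mixture q_mix of the K draft distributions.
    q_mix = [sum(q[i] for q in drafts) / K for i in range(len(p_T))]
    avg_kl = sum(kl(q, p_T) for q in drafts) / K
    variance_proxy = sum(kl(q, q_mix) for q in drafts) / K  # generalized JSD
    bias_proxy = kl(q_mix, p_T)
    return avg_kl, variance_proxy, bias_proxy

# Hypothetical distributions over a 4-token vocabulary (illustrative only).
drafts = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.60, 0.20, 0.10],
]
p_T = [0.40, 0.30, 0.20, 0.10]

avg_kl, variance_proxy, bias_proxy = decompose(drafts, p_T)
print(f"E_k[KL(q_k || p_T)]    = {avg_kl:.6f}")
print(f"variance + bias proxy  = {variance_proxy + bias_proxy:.6f}")

# Identical drafts carry no diversity: the variance proxy collapses to zero.
_, collapsed_variance, _ = decompose([drafts[0]] * 3, p_T)
print(f"variance proxy, identical drafts = {collapsed_variance:.6f}")
```

The last call illustrates the failure mode described above: when all drafts converge to the same distribution, the JSD term vanishes and only the bias term remains.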

In distribution shift and underspecification settings, DDD-inspired methods such as DivDis select structurally distinct hypotheses (predictive functions) that are constrained to be consistent with the labeled source data but to disagree maximally on unlabeled target data, enabling discovery of solutions robust to spurious correlations (Lee et al., 2022).

2. Algorithmic Implementations in DDD Frameworks

DDD for Uncertainty-Aware LLM Distillation

The DDD method partitions a teacher-generated dataset $D$ into $S$ disjoint subsets. An ensemble of $K = S \cdot M$ drafts is created, each draft $q_{s,m}$ distilled solely on its assigned partition $D_s$:

  1. Partitioning: $D = D_1 \cup \cdots \cup D_S$, with $D_s \cap D_{s'} = \emptyset$ for $s \neq s'$
  2. Draft Training: Each draft $q_{s,m}$ is trained (e.g., with LoRA) using Online Stochastic Distillation, optimizing:

$$\min_{\phi} \mathbb{E}_{x \sim D_s} \mathbb{E}_{\theta \sim \pi_T} \left[ \mathrm{KL}\left(p_\theta(\cdot|x) \Vert q_{s,m}(\cdot|x; \phi)\right)\right]$$

  3. Aggregation: The diversity of the resulting ensemble is quantified by JSD; the mixture approximates the Bayesian model average (Park et al., 2 Feb 2026).
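The partitioning and draft-assignment steps can be sketched as follows. This is a minimal illustration that assumes a uniformly random split; the source does not specify how partitions are chosen, and the function names `partition_dataset` and `assign_drafts` are hypothetical. The distillation step itself is omitted.

```python
import random

def partition_dataset(examples, num_partitions, seed=0):
    """Shuffle and split examples into disjoint, near-equal subsets D_1..D_S."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    # Striding over the shuffled list guarantees the subsets are disjoint
    # and that their union is the full dataset.
    return [shuffled[s::num_partitions] for s in range(num_partitions)]

def assign_drafts(partitions, drafts_per_partition):
    """Map each draft index (s, m) to the single partition D_s it trains on."""
    return {
        (s, m): partitions[s]
        for s in range(len(partitions))
        for m in range(drafts_per_partition)
    }

# Placeholder teacher-generated prompts (illustrative only).
D = [f"prompt_{i}" for i in range(10)]
S, M = 3, 2  # S partitions, M drafts per partition, so K = S * M drafts

partitions = partition_dataset(D, S)
assignment = assign_drafts(partitions, M)

for (s, m), subset in sorted(assignment.items()):
    print(f"draft q_{s},{m} trains on {len(subset)} examples from D_{s}")
```

Drafts that share a partition see the same data and would differ only through whatever randomness the training procedure introduces; drafts on different partitions never see the same example.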

DDD in the Diversify-and-Disambiguate (DivDis) Setting

For distributional robustness, DivDis implements DDD as follows:

  • Diversification: $N$ network heads with shared parameters $\theta$ are optimized to minimize cross-entropy loss on labeled source data while minimizing the pairwise mutual information between head predictions on unlabeled target inputs:

$$\text{Objective:}\quad \sum_{i=1}^N L_{xent}(f_i) + \lambda_1 \sum_{i<j} L_{MI}(f_i, f_j) + \lambda_2 \sum_{i=1}^N L_{reg}(f_i)$$

where $L_{MI}$ enforces low pairwise mutual information on target predictions (Lee et al., 2022).

  • Disambiguation: A small number of labeled target samples identify the function (head) with minimum target risk, typically by querying points of maximal head disagreement and selecting by empirical accuracy.
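Both stages can be illustrated in a few lines. The sketch below estimates the mutual information between two heads' predicted labels on a batch of target inputs, using the batch-averaged outer product of their class probabilities as the joint distribution, and then picks a head by accuracy on a handful of labeled target points. The estimator, the function names, and the toy inputs are assumptions made for illustration; the regularizer $L_{reg}$, gradient-based training, and the query-selection strategy are not shown.

```python
import math

def pairwise_mi(probs_i, probs_j):
    """Mutual information between two heads' predicted labels on a batch.

    probs_i, probs_j: lists of per-example class-probability lists.
    """
    n, c = len(probs_i), len(probs_i[0])
    # Empirical joint over (label from head i, label from head j).
    joint = [
        [sum(probs_i[t][a] * probs_j[t][b] for t in range(n)) / n for b in range(c)]
        for a in range(c)
    ]
    marg_i = [sum(joint[a]) for a in range(c)]
    marg_j = [sum(joint[a][b] for a in range(c)) for b in range(c)]
    mi = 0.0
    for a in range(c):
        for b in range(c):
            if joint[a][b] > 0:
                mi += joint[a][b] * math.log(joint[a][b] / (marg_i[a] * marg_j[b]))
    return mi

def select_head(head_predictions, labels):
    """Disambiguation: index of the head with highest accuracy on labeled target points."""
    accuracies = [
        sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
        for preds in head_predictions
    ]
    return max(range(len(accuracies)), key=accuracies.__getitem__)

# Toy target batch: head_b's predictions carry no information about head_a's.
head_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
head_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

mi_same = pairwise_mi(head_a, head_a)   # two identical heads
mi_diverse = pairwise_mi(head_a, head_b)
print(f"MI(identical heads) = {mi_same:.4f}")
print(f"MI(diverse heads)   = {mi_diverse:.4f}")

# Three labeled target points decide between the two heads.
best = select_head([[0, 1, 1], [1, 1, 0]], labels=[1, 1, 0])
print(f"selected head: {best}")
```

Identical heads give the largest mutual information and statistically independent heads give zero, which is why penalizing this quantity pushes the heads toward different predictive functions.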

DDD via Fair and Diverse Subset Selection

Determinantal Point Processes (DPPs), enhanced by groupwise constraints, instantiate DDD for fair summarization:

  • Probabilistic Model: For a ground set $X$ partitioned into $p$ sensitive groups $X_j$, a fair summary $S$ (with $|S \cap X_j| = k_j$) is sampled from the Partition DPP:

$$q^*(S) \propto \det(V_S V_S^\top),\quad S \in \mathcal{B} = \left\{ S : |S \cap X_j| = k_j\ \forall j \right\}$$

  • Sampling Algorithm: Algorithmic efficiency is achieved via a Sample-and-Project routine under $\beta$-balance assumptions (Celis et al., 2018).
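For very small ground sets, the Partition DPP distribution can be computed exactly by enumerating every feasible subset. The sketch below does this by brute force and is not the Sample-and-Project routine; it only illustrates the target distribution $q^*$. The feature vectors, group assignment, and function names are hypothetical, and the code assumes at least one feasible subset has a nonzero determinant.

```python
import itertools

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def gram(vectors):
    """Gram matrix V_S V_S^T for the rows in `vectors`."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]

def partition_dpp(features, groups, quotas):
    """Exact q*(S) over all S that take exactly quotas[j] items from groups[j]."""
    per_group = [itertools.combinations(g, k) for g, k in zip(groups, quotas)]
    weights = {}
    for choice in itertools.product(*per_group):
        subset = tuple(sorted(i for part in choice for i in part))
        weights[subset] = det(gram([features[i] for i in subset]))
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Four items with 2-D features; items 0-1 form one group, items 2-3 the other.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = [[0, 1], [2, 3]]
quotas = [1, 1]  # one item per group

dist = partition_dpp(features, groups, quotas)
for subset, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(subset, round(prob, 4))
```

Every subset in the support satisfies the group quotas by construction, and among those the pair with the most nearly orthogonal feature vectors receives the highest probability.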

3. Theoretical Guarantees and Analytical Metrics

The bias–variance decomposition underpins DDD for LLMs, providing explicit quantification of gains:

$$\mathrm{RMSE}(\mathrm{DDD}) = 0.2036\ (<\ 0.3029\ \text{for MiniLLM})$$

indicating up to 37% reduction in error for epistemic uncertainty compared to diverse initialization or noise-injected baselines (Park et al., 2 Feb 2026). Data partitioning directly impacts the JSD-based variance proxy, with additional ablation showing marginal further improvement for more partitions.
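As a quick arithmetic check on the reported figures, the relative RMSE reduction can be computed against both baselines that appear in this article: the MiniLLM value of 0.3029 quoted above and the baseline value of 0.3266 listed in the table in Section 4. Which baseline the "up to 37%" figure refers to is not stated in the text, so the reading in the note after the code is an inference, not a claim from the cited paper.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

rmse_ddd = 0.2036
rmse_minillm = 0.3029         # quoted alongside the DDD RMSE above
rmse_table_baseline = 0.3266  # baseline row of the table in Section 4

vs_minillm = relative_reduction(rmse_minillm, rmse_ddd)
vs_table = relative_reduction(rmse_table_baseline, rmse_ddd)

print(f"reduction vs MiniLLM (0.3029):        {vs_minillm:.1%}")
print(f"reduction vs table baseline (0.3266): {vs_table:.1%}")
```

The reduction relative to 0.3266 is a little over 37%, while the reduction relative to MiniLLM is closer to 33%, so the "up to 37%" figure appears to line up with the table baseline.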

For DPP-based DDD, the Sample-and-Project algorithm meets the approximation bound:

$$\tilde{q}(S) \leq \eta_k \beta^{2k} q^*(S)$$

where $\beta$ quantifies partition balance, and $\eta_k$ is a combinatorial term; geometric diversity loss remains negligible compared to unconstrained sampling.

In DivDis, theory demonstrates that a minimal set of target-labeled queries suffices for confident head selection, with the number of required queries bounded in terms of the risk gap between top heads.

4. Empirical Performance and Benchmarks

Table: Selected Empirical Gains

| Domain | Baseline (Metric) | DDD/DivDis/Partition DPP (Metric) | Reference |
|---|---|---|---|
| Uncertainty (8B→3B) | RMSE = 0.3266 (Baseline) | 0.2036 (DDD) | (Park et al., 2 Feb 2026) |
| OOD Accuracy (Waterbirds-CC) | 7% (ERM), 47% (Group DRO) | 82% (DivDis, $m = 16$ labels) | (Lee et al., 2022) |
| Subset Diversity (CelebA) | Highest ($k$-DPP) | ~equal ($P$-DPP), perfect fairness | (Celis et al., 2018) |
| Hallucination AUROC | 0.7823 (TokUR) | 0.7839 (DDD, 6×3B+1×3B drafts) | (Park et al., 2 Feb 2026) |

On GSM8K (arithmetic reasoning), DDD yields state-of-the-art epistemic uncertainty estimation and matches the AUROC of compute-heavy full-ensemble baselines at 0.58× FLOPs. In complete-correlation vision and language tasks (Waterbirds, CelebA, MultiNLI), DivDis outperforms classical ERM and group-robust objectives, often by large margins, with minimal additional target labeled supervision (Lee et al., 2022). For fair data summarization, Partition DPPs enforce strict group proportions with negligible geometric diversity loss (Celis et al., 2018).

5. Limitations and Contextual Boundaries

While DDD maximizes ensemble diversity and quantifies epistemic uncertainty effectively, the improvements depend on the partitioning regime and on sufficient data diversity: if subpopulations are not well represented in the data splits, gains in the variance proxy and in robustness diminish. Standard distillation or fine-tuning without DDD can result in near-degenerate diversity (JSD near zero), leading to underestimation of epistemic risk. In fairness-driven summarization, the price of group constraints is theoretically a multiplicative factor in the sampling distribution, but the observed trade-off is empirically minimal (Celis et al., 2018).

6. Extensions and Interdisciplinary Relevance

The DDD framework is extensible across supervised, semi-supervised, and unsupervised tasks. It interfaces with active learning (label-efficient disambiguation), out-of-distribution robustness, model interpretability (via head-level feature attribution), and fairness in automated data curation. For LLM inference, DDD integrates naturally with scalable distillation and draft-ensemble pipelines, while for fairness, it advances kernel-geometric approaches to ensure subgroup representation. The general principle—explicit maximization of epistemic or geometric diversity under operational constraints—offers a unifying design axis in modern robust and responsible ML practice.


Principal references: (Park et al., 2 Feb 2026, Lee et al., 2022, Celis et al., 2018).
