Delta Decomposition: DeRS Paradigm

Updated 20 November 2025
  • Delta Decomposition is a dual-framework method that splits complex objects into a shared base and compact deltas, applicable to positive Boolean DNFs and to neural MoE models.
  • In Boolean analysis, the approach employs polynomial factorization to achieve the finest Δ-partition, ensuring efficient and unique DNF decomposition.
  • In deep learning, the DeRS paradigm compresses expert weights using sparse, quantized, or low-rank representations, significantly reducing memory and computation costs.

Delta Decomposition (DeRS Paradigm) encompasses two distinct but conceptually related frameworks for structured decomposition: (1) the Δ-decomposition of positive Disjunctive Normal Forms (DNFs) in Boolean function analysis, as formalized using the Delta-and-Rooted-Semiring (DeRS) paradigm (Ponomaryov, 2018); and (2) the Decompose-Replace-Synthesis (DeRS) paradigm for parameter-efficient upcycled Mixture-of-Experts (MoE) models in deep learning (Huang et al., 3 Mar 2025). Both leverage the same core principle, decomposing a complex object into a shared "base" plus compact "deltas", but the mathematical and algorithmic contexts differ substantially.

1. Definition and Theoretical Foundation

In Boolean function analysis, Δ-decomposition refers to expressing a positive DNF $\varphi$ as a conjunction of DNFs $\psi_1,\dots,\psi_k$ whose variable sets may intersect only on a shared set of "delta" variables $\Delta$, with each non-$\Delta$ variable block non-empty. The decomposition is called finest if it admits no further nontrivial refinement.
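
A toy illustration (constructed here for exposition, not taken from Ponomaryov, 2018): take $\Delta=\{d\}$; then the positive DNF below splits into two sub-DNFs whose variable sets overlap only in $d$.

```latex
% Toy Δ-decomposition with Δ = {d} (illustrative example).
\[
  \varphi \;=\; (x\wedge z)\vee(x\wedge d)\vee(y\wedge d)
  \;\equiv\;
  \underbrace{\bigl(x\vee(y\wedge d)\bigr)}_{\psi_1,\ \text{variables } \{x,y\}\cup\Delta}
  \;\wedge\;
  \underbrace{(z\vee d)}_{\psi_2,\ \text{variables } \{z\}\cup\Delta}
\]
% The non-Δ blocks {x, y} and {z} are disjoint and non-empty, and the two
% factors share only the Δ-variable d, so this is a valid Δ-decomposition.
```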

In neural model upcycling, DeRS refers to decomposing dense expert weights $W_i\in\mathbb{R}^{d\times d_h}$ as $W_i = W_{\mathrm{base}} + \Delta_i$, optimizing storage and computation by expressing $\Delta_i$ in a lightweight representation while $W_{\mathrm{base}}$ remains expert-shared.

Both frameworks exploit high redundancy, whether logical or algebraic, in the composed object, enabling a transition to more compact or structured representations without loss of essential information (Ponomaryov, 2018; Huang et al., 3 Mar 2025).

2. DeRS for Positive DNF Decomposition

A positive DNF is a disjunction of terms over Boolean variables, where terms are conjunctions of unnegated variables. In the DeRS paradigm for Boolean functions, a positive DNF $\varphi(x_1,\dots,x_n)=\bigvee_{t\in T}\bigwedge_{i\in t} x_i$ is represented as a multilinear Boolean polynomial $f(x) = \sum_{t\in T}\prod_{i\in t} x_i$. Disjunctions correspond to addition, conjunctions to multiplication, and variables are interpreted in the Boolean ring.
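
Continuing the toy example above (illustrative, not from the source), the mapping acts term by term, replacing disjunction with addition and conjunction with multiplication:

```latex
% Polynomial form of the toy DNF (illustrative).
\[
  \varphi = (x\wedge z)\vee(x\wedge d)\vee(y\wedge d)
  \quad\longmapsto\quad
  f(x,y,z,d) = xz + xd + yd .
\]
```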

The key insight is that $\Delta$-decomposition corresponds precisely to the factorization of $f(x)$ into irreducible multilinear Boolean polynomials whose variable sets intersect only at $\Delta$. Each irreducible factor maps back to a sub-DNF, producing the finest partitioning of the original function (Ponomaryov, 2018).
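
A minimal Python sketch of this correspondence on the toy example, assuming a positive DNF is represented as a set of term variable-sets. It does not perform the factorization itself; it only brute-force checks that a candidate set of sub-DNFs overlaps only on Δ and that their conjunction reproduces $\varphi$:

```python
from itertools import product

def dnf_eval(terms, assignment):
    """Evaluate a positive DNF, given as a set of frozensets of variables, under a 0/1 assignment."""
    return any(all(assignment[v] for v in t) for t in terms)

def is_delta_decomposition(phi, factors, delta):
    """Check that the factor DNFs overlap only on delta and that their conjunction equals phi."""
    var_sets = [{v for t in f for v in t} for f in factors]
    for i in range(len(var_sets)):
        for j in range(i + 1, len(var_sets)):
            if (var_sets[i] & var_sets[j]) - set(delta):
                return False  # variables shared outside Δ are not allowed
    variables = sorted({v for t in phi for v in t} | set().union(*var_sets))
    for bits in product([0, 1], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if dnf_eval(phi, a) != all(dnf_eval(f, a) for f in factors):
            return False  # the conjunction of factors disagrees with phi
    return True

# Toy example from above: phi = xz ∨ xd ∨ yd, Δ = {d}, ψ1 = x ∨ yd, ψ2 = z ∨ d.
phi  = {frozenset("xz"), frozenset("xd"), frozenset("yd")}
psi1 = {frozenset("x"), frozenset("yd")}
psi2 = {frozenset("z"), frozenset("d")}
print(is_delta_decomposition(phi, [psi1, psi2], delta={"d"}))  # expected: True
```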

The following table summarizes the logical correspondence:

| Aspect | DNF Decomposition | Polynomial Factorization |
| --- | --- | --- |
| Object | Positive DNF $\varphi$ | Multilinear polynomial $f(x)$ |
| Decomposition | $\varphi = \psi_1\wedge\cdots\wedge\psi_k$ | $f = f_1 \cdots f_k$ |
| Shared variables | $\Delta$ | Overlaps $U_i\cap U_j$ |
| Fineness | No further $\Delta$-splitting | All factors irreducible |

This correspondence enables exploitation of algebraic factoring algorithms for logic decomposition, yielding a unique, finest $\Delta$-partition in polynomial time for positive DNFs.

3. Algorithmic Framework and Complexity

The DeRS algorithm for positive DNF Δ-decomposition consists of: (1) removing redundant terms, (2) computing Δ-atoms (intersections of terms with $\Delta$), (3) for each Δ-atom pair, testing decomposition via specialized restrictions and polynomial partitioning (FindPartition subroutine), (4) constructing a partition graph from obtained blocks, and (5) extracting the finest partition via connected components.

Algorithmic steps include:

  1. Reduce $\varphi$ by eliminating redundant terms.
  2. Extract all Δ-atoms $a_j$.
  3. For each pair $(a_1,a_2)$, restrict $\varphi$ to the assignment $L=a_1\cup a_2$ (forcing the other $\Delta$ variables to zero), and apply polynomial factorization (FindPartition) to the restricted DNF's polynomial form.
  4. Aggregate all variable blocks into a graph with cliques for shared blocks (see the sketch after this list).
  5. Identify connected components as distinct variable blocks for DNF decomposition.
  6. Project and minimize the original DNF onto each block $\cup\,\Delta$ to obtain the corresponding $\psi_i$.
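
The following minimal Python sketch covers steps 4 and 5 only, assuming the FindPartition-style calls have already produced variable blocks as plain Python sets (the block contents below are illustrative):

```python
from collections import defaultdict

def finest_partition(blocks):
    """Merge variable blocks into connected components via union-find.
    `blocks` is an iterable of sets of (non-Δ) variables returned by the
    FindPartition-style subcalls; variables that co-occur in any block must
    end up in the same final block of the Δ-decomposition."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for block in blocks:
        block = list(block)
        for v in block:
            find(v)                         # register every variable, even singletons
        for v in block[1:]:
            union(block[0], v)              # the clique over a block collapses to a star

    components = defaultdict(set)
    for v in list(parent):
        components[find(v)].add(v)
    return list(components.values())

# Blocks gathered from restricted factorizations (illustrative values).
print(finest_partition([{"x", "y"}, {"z"}, {"y"}]))  # e.g. [{'x', 'y'}, {'z'}]
```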

The complexity is $O(\mathrm{poly}(m,n))$ for $m$ terms and $n$ variables, with the polynomial time bound achieved by efficient factoring algorithms leveraging formal derivatives and substructure exploitation (Ponomaryov, 2018).

4. Delta Decomposition in Upcycled Mixture-of-Experts Models

In upcycled MoE neural models, DeRS employs the decomposition $W_i = W_{\mathrm{base}} + \Delta_i$, where $W_i$ is the $i$-th expert's weight matrix, $W_{\mathrm{base}}$ the shared base (often a pretrained FFN weight), and $\Delta_i$ a small expert-specific correction. Empirical cosine similarity ($\cos(W_i, W_{\mathrm{base}})>0.999$) supports the intuition that $\Delta_i$ is structurally redundant, motivating storage reduction (Huang et al., 3 Mar 2025).
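
A small numpy sketch of the decomposition and the cosine-similarity check on flattened weights; the shapes, perturbation scale, and random matrices are illustrative stand-ins, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 64, 256

# Shared base weight (e.g. an upcycled pretrained FFN matrix) and two experts
# that deviate from it only slightly (illustrative stand-ins, not real weights).
W_base = rng.standard_normal((d, d_h))
experts = [W_base + 1e-2 * rng.standard_normal((d, d_h)) for _ in range(2)]

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for i, W_i in enumerate(experts):
    delta_i = W_i - W_base            # Decompose: W_i = W_base + Δ_i
    print(i, cosine(W_i, W_base))     # close to 1 when the expert stays near the base
```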

To exploit this redundancy, DeRS replaces the full $\Delta_i$ with one of several lightweight encodings:

  • Sparse-matrix (DeRS-SM): Store only a small subset of nonzero entries, defined by a binary mask with high drop rate ($p\geq 0.9$).
  • Quantized form (DeRS-Q): Uniformly quantize $\Delta_i$ to low bit-width ($k\ll K$).
  • Low-rank factorization (DeRS-LM): Represent $\Delta_i$ as $U_i V_i^\top$, with low rank $r$.

Each representation yields drastic reductions in parameter and memory cost.
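
A minimal numpy sketch of the three encodings (magnitude-based sparsification, uniform quantization, and truncated SVD); the drop rate, bit-width, rank, and shapes are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def sparsify(delta, drop_rate=0.9):
    """DeRS-SM style: keep only the largest-magnitude (1 - drop_rate) fraction of entries."""
    k = max(1, int(round(delta.size * (1 - drop_rate))))
    thresh = np.sort(np.abs(delta).ravel())[-k]
    mask = np.abs(delta) >= thresh
    return delta * mask               # in practice stored as (indices, values)

def quantize(delta, bits=2):
    """DeRS-Q style: uniform quantization of delta to a low bit-width."""
    levels = 2 ** bits - 1
    lo, hi = delta.min(), delta.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((delta - lo) / scale)
    return q * scale + lo             # dequantized approximation

def low_rank(delta, rank=4):
    """DeRS-LM style: truncated SVD, stored as U_i V_i^T."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W_base = rng.standard_normal((64, 256))
delta = 1e-2 * rng.standard_normal((64, 256))
for f in (sparsify, quantize, low_rank):
    approx = W_base + f(delta)        # synthesized expert weight
    print(f.__name__, np.linalg.norm(W_base + delta - approx))
```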

5. Practical Algorithms for Compression and Training

Inference-Time Compression (DeRS Compression)

  1. Decompose: $\Delta_i \leftarrow W_i - W_{\mathrm{base}}$.
  2. Compress: $\Delta_i \mapsto \mathcal{F}_{\mathrm{post}}(\Delta_i)$ via sparsification, quantization, or low-rank factorization.
  3. (Optional) Fine-tune the compression parameters to minimize $\sum_{i=1}^N\|W_i - (W_{\mathrm{base}} + \mathcal{F}_{\mathrm{post}}(\Delta_i))\|_F^2 + \lambda R$ (sketched after this list).
  4. At inference, synthesize $W_i$ on demand.
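
A compact sketch of the optional step-3 objective, where `F_post` stands for any of the encodings above and the L1 penalty is an assumed placeholder for the regularizer $R$:

```python
import numpy as np

def ders_reconstruction_loss(experts, W_base, F_post, lam=1e-4):
    """Sum of squared Frobenius reconstruction errors plus a simple regularization term."""
    loss, reg = 0.0, 0.0
    for W_i in experts:
        delta_i = W_i - W_base
        approx = F_post(delta_i)              # e.g. sparsify / quantize / low_rank from above
        loss += np.linalg.norm(W_i - (W_base + approx), "fro") ** 2
        reg += np.abs(approx).sum()           # assumed L1 penalty standing in for R
    return loss + lam * reg

# Usage with the earlier sketches: ders_reconstruction_loss(experts, W_base, low_rank)
```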

Training-Time Upcycling (DeRS Upcycling)

  1. Instantiate expert deltas $\mathcal{F}_{\mathrm{pre}}(\Delta_i)$ efficiently (zero-filled sparse or low-rank).
  2. Forward pass: route input, synthesize $W_i$ for active experts.
  3. Backpropagate and update $W_{\mathrm{base}}$ and the compact deltas. Optionally, regularize for sparsity or rank.

This decomposed approach lets upcycled MoEs grow in expert count at an added parameter and memory cost orders of magnitude below naive per-expert weight allocation.
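
A compact PyTorch-style sketch of this training-time setup with low-rank deltas; the module name, rank, initialization, and single-expert routing are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class DeRSLowRankExperts(nn.Module):
    """Shared base weight plus per-expert low-rank deltas: W_i = W_base + U_i @ V_i^T."""
    def __init__(self, d_in, d_out, num_experts, rank=8):
        super().__init__()
        self.W_base = nn.Parameter(torch.empty(d_out, d_in))
        nn.init.xavier_uniform_(self.W_base)      # in practice: copy of the pretrained FFN weight
        # Zero-initialized U so every expert starts exactly at the shared base.
        self.U = nn.Parameter(torch.zeros(num_experts, d_out, rank))
        self.V = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.02)

    def forward(self, x, expert_idx):
        # Synthesize the active expert's weight on demand, then apply it.
        delta = self.U[expert_idx] @ self.V[expert_idx].transpose(-1, -2)
        return x @ (self.W_base + delta).T

# Illustrative usage: send a batch to one expert (routing logic omitted).
layer = DeRSLowRankExperts(d_in=64, d_out=256, num_experts=4, rank=8)
x = torch.randn(10, 64)
y = layer(x, expert_idx=2)            # shape (10, 256)
y.sum().backward()                    # gradients reach W_base and the active expert's factors
```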

6. Empirical Results and Application Domains

DeRS achieves high compression while maintaining or slightly improving accuracy on a range of benchmarks:

| Task/Model | Vanilla MoE Params Added | DeRS-SM Params | DeRS-LM Params | Accuracy Delta (SM, LM) |
| --- | --- | --- | --- | --- |
| MoE-LLaVA-Phi | +2.52B | 1.11M | 2.42M | +0.3%, +0.2% |
| Med-MoE-StableLM | +1.24B | 0.26M | 1.20M | +0.2%, +0.3% |
| Coder-MoE | +2.43B | 325M | 9.09M | +0.8%, +0.7% |

In Coder-MoE scenarios, model size is reduced by up to 52.7%, training memory by 21.2%, and inference memory by 43.8% (Huang et al., 3 Mar 2025). Application domains include multi-modal learning, medical VQA, program synthesis, and vision-language models.

7. Limitations, Extensions, and Open Problems

For Boolean DNFs, Δ-decomposition is polynomial-time tractable only in the positive (negation-free) case; extending these results to general or non-Boolean settings, or to other logical formats (BDD, CNF), remains open and is known to be coNP-hard in some cases. Selection of the "optimal" $\Delta$ is combinatorial and not addressed by existing algorithms (Ponomaryov, 2018).

In upcycled MoE models, DeRS achieves maximal efficiency when all experts are close to $W_{\mathrm{base}}$; large deviations may exceed what sparse or low-rank $\Delta_i$ can represent. The choice between sparse and low-rank forms trades off memory efficiency against expressivity. Keeping $W_{\mathrm{base}}$ trainable is empirically preferable.

In both domains, the DeRS paradigm demonstrates that leveraging structural redundancy yields substantial gains in computational efficiency, compactness, and interpretability, while open questions remain about extensions to less constrained settings and criteria for choosing an optimal decomposition.
