Delta Decomposition: DeRS Paradigm
- Delta Decomposition is a dual-framework method that splits complex objects into a shared base and compact deltas, applied to positive Boolean DNFs and to expert weights in neural MoE models.
- In Boolean analysis, the approach employs polynomial factorization to achieve the finest Δ-partition, ensuring efficient and unique DNF decomposition.
- In deep learning, the DeRS paradigm compresses expert weights using sparse, quantized, or low-rank representations, significantly reducing memory and computation costs.
Delta Decomposition (DeRS Paradigm) encompasses two distinct but conceptually related frameworks for structured decomposition: (1) the Δ-decomposition of positive Disjunctive Normal Forms (DNFs) in Boolean function analysis, as formalized using the Delta‐and‐Rooted-Semiring (DeRS) paradigm (Ponomaryov, 2018); and (2) the Decompose‐Replace‐Synthesis (DeRS) paradigm for parameter-efficient upcycled Mixture-of-Experts (MoE) models in deep learning (Huang et al., 3 Mar 2025). Both leverage a core principle: decomposing a complex object into a shared “base” and compact “deltas” or components, but the mathematical and algorithmic contexts differ substantially.
1. Definition and Theoretical Foundation
In Boolean function analysis, Δ-decomposition refers to expressing a positive DNF φ as a conjunction φ = φ₁ ∧ … ∧ φₖ of DNFs whose variable sets pairwise intersect only within a shared set Δ of "delta" variables, with each non-Δ block vars(φᵢ) ∖ Δ non-empty. The decomposition is called finest if it admits no further nontrivial refinement.
In neural model upcycling, DeRS refers to decomposing dense expert weights as W_i = W_base + ΔW_i, optimizing storage and computation by expressing ΔW_i in a lightweight representation while the base W_base remains expert-shared.
Both frameworks exploit high redundancy—either logical or algebraic—in composition, enabling a transition to more compact or structured representations without loss of essential information (Ponomaryov, 2018, Huang et al., 3 Mar 2025).
2. DeRS for Positive DNF Decomposition
A positive DNF is a disjunction of terms over Boolean variables, where terms are conjunctions of unnegated variables. In the DeRS paradigm for Boolean functions, a positive DNF φ is represented as a multilinear Boolean polynomial F_φ. Disjunctions correspond to addition, conjunctions to multiplication, and variables are interpreted in the Boolean ring.
The key insight is that Δ-decomposition corresponds precisely to the factorization of F_φ into irreducible multilinear Boolean polynomials whose variable sets intersect only at Δ. Each irreducible factor maps back to a sub-DNF, producing the finest Δ-partitioning of the original function (Ponomaryov, 2018).
The following table summarizes the logical correspondence:
| Aspect | DNF Decomposition | Polynomial Factorization |
|---|---|---|
| Object | Positive DNF φ | Multilinear polynomial F_φ |
| Decomposition | φ = φ₁ ∧ … ∧ φₖ | F_φ = F₁ · … · Fₖ |
| Shared variables | Overlaps confined to Δ | Factor variable sets meet only in Δ |
| Fineness | No further Δ-splitting | All factors irreducible |
This correspondence enables exploitation of algebraic factoring algorithms for logic decomposition, yielding a unique, finest Δ-partition in polynomial time for positive DNFs.
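As a concrete illustration, the defining conditions of a Δ-decomposition can be checked by brute force on small instances. The sketch below is not the paper's algorithm; it simply represents a positive DNF as a list of terms and verifies that a candidate partition both preserves the function and overlaps only inside Δ (helper names are hypothetical):

```python
from itertools import product

def eval_dnf(dnf, assignment):
    """Evaluate a positive DNF given as a list of terms (frozensets of variables)."""
    return any(all(assignment[v] for v in term) for term in dnf)

def is_delta_decomposition(dnf, components, delta, variables):
    """Check that the conjunction of `components` equals `dnf` on all assignments
    and that component variable sets pairwise intersect only inside `delta`."""
    comp_vars = [set().union(*c) for c in components]
    for i in range(len(comp_vars)):
        for j in range(i + 1, len(comp_vars)):
            if (comp_vars[i] & comp_vars[j]) - delta:
                return False  # non-delta overlap between blocks
    for values in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, values))
        if eval_dnf(dnf, a) != all(eval_dnf(c, a) for c in components):
            return False
    return True

# phi = (x AND y) OR d  decomposes over delta = {d} into (x OR d) AND (y OR d)
phi = [frozenset({"x", "y"}), frozenset({"d"})]
parts = [[frozenset({"x"}), frozenset({"d"})],
         [frozenset({"y"}), frozenset({"d"})]]
print(is_delta_decomposition(phi, parts, {"d"}, ["x", "y", "d"]))  # True
```

With Δ = {d}, the two blocks {x} and {y} are disjoint outside Δ, and absorption makes the conjunction equal to the original DNF on all eight assignments.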
3. Algorithmic Framework and Complexity
The DeRS algorithm for positive DNF Δ-decomposition consists of: (1) removing redundant terms, (2) computing Δ-atoms (intersections of terms with Δ), (3) for each Δ-atom pair, testing decomposability via specialized restrictions and polynomial partitioning (the FindPartition subroutine), (4) constructing a partition graph from the obtained blocks, and (5) extracting the finest partition via connected components.
Algorithmic steps include:
- Reduce by eliminating redundancy.
- Extract all Δ-atoms of φ.
- For each Δ-atom pair, restrict φ to the corresponding assignment (forcing the remaining variables to zero), and apply polynomial factorization (FindPartition) to the restricted DNF's polynomial form.
- Aggregate all variable blocks into a graph with cliques for shared blocks.
- Identify connected components as distinct variable blocks for DNF decomposition.
- Project and minimize the original DNF onto each block to obtain the components φᵢ.
The overall complexity is polynomial in the number of terms and variables, with this bound achieved by efficient factoring algorithms leveraging formal derivatives and substructure exploitation (Ponomaryov, 2018).
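Steps (4) and (5), aggregating the blocks reported by the factorization runs and reading off connected components, can be sketched with a union-find structure. The block contents below are illustrative stand-ins for FindPartition output, and the helper names are assumptions:

```python
def finest_partition(blocks, variables):
    """Merge overlapping variable blocks (each a set of non-delta variables)
    into connected components; each component is one block of the finest
    delta-partition."""
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(u, v):
        parent[find(u)] = find(v)

    # variables appearing together in any reported block must share a component
    for block in blocks:
        block = list(block)
        for other in block[1:]:
            union(block[0], other)

    components = {}
    for v in variables:
        components.setdefault(find(v), set()).add(v)
    return sorted(sorted(c) for c in components.values())

# blocks reported by (hypothetical) FindPartition runs on restricted DNFs
blocks = [{"x", "y"}, {"y", "z"}, {"u"}]
print(finest_partition(blocks, ["x", "y", "z", "u"]))  # [['u'], ['x', 'y', 'z']]
```

Blocks {x, y} and {y, z} share y, so x, y, z collapse into one component; u stays a singleton block.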
4. Delta Decomposition in Upcycled Mixture-of-Experts Models
In upcycled MoE neural models, DeRS employs the decomposition W_i = W_base + ΔW_i, where W_i is the i-th expert's weight matrix, W_base the shared base (often a pretrained FFN weight), and ΔW_i a small expert-specific correction. High empirical cosine similarity between expert weights supports the intuition that the set of W_i is structurally redundant, motivating storage reduction (Huang et al., 3 Mar 2025).
To exploit this redundancy, DeRS replaces the full ΔW_i with one of several lightweight encodings:
- Sparse matrix (DeRS-SM): store only a small subset of nonzero entries of ΔW_i, defined by a binary mask with a high drop rate.
- Quantized form (DeRS-Q): uniformly quantize ΔW_i to a low bit-width.
- Low-rank factorization (DeRS-LM): represent ΔW_i as a product of two thin factor matrices with small rank r.
Each representation yields drastic reductions in parameter and memory cost.
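A minimal NumPy sketch of the three encodings, with illustrative drop rate, bit-width, and rank (the paper's exact hyperparameters are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W_base = rng.standard_normal((d, d))
W_i = W_base + 0.01 * rng.standard_normal((d, d))  # expert close to the base
delta = W_i - W_base

# DeRS-SM: keep only the largest-magnitude entries (illustrative 95% drop rate)
k = int(0.05 * delta.size)
thresh = np.sort(np.abs(delta), axis=None)[-k]
sm = np.where(np.abs(delta) >= thresh, delta, 0.0)

# DeRS-Q: uniform symmetric quantization to a low bit-width (here 4 bits)
bits = 4
scale = np.abs(delta).max() / (2 ** (bits - 1) - 1)
q = np.round(delta / scale).astype(np.int8)   # stored integer codes
dq = q.astype(delta.dtype) * scale            # dequantized delta

# DeRS-LM: rank-r truncation via SVD (best Frobenius-norm rank-r approximation)
r = 8
U, s, Vt = np.linalg.svd(delta, full_matrices=False)
A, B = U[:, :r] * s[:r], Vt[:r, :]            # delta ~ A @ B
lm = A @ B

for name, approx in [("SM", sm), ("Q", dq), ("LM", lm)]:
    err = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
    print(f"DeRS-{name}: relative reconstruction error {err:.3f}")
```

Each variant stores far fewer values than the dense d×d delta: k entries plus a mask for SM, d² low-bit codes plus one scale for Q, and 2dr floats for LM.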
5. Practical Algorithms for Compression and Training
Inference-Time Compression (DeRS Compression)
- Decompose: ΔW_i = W_i − W_base.
- Compress: encode ΔW_i via sparsification, quantization, or low-rank factorization.
- (Optional) Fine-tune the compression parameters to minimize the reconstruction error between ΔW_i and its compressed approximation.
- At inference, synthesize W_i = W_base + ΔW_i on demand from the compressed representation.
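The steps above can be sketched end to end with the low-rank encoding; `compress_low_rank` and `synthesize` are hypothetical helper names, not the paper's API:

```python
import numpy as np

def compress_low_rank(W_i, W_base, r):
    """Decompose: delta = W_i - W_base; compress it to rank-r factors (A, B)."""
    U, s, Vt = np.linalg.svd(W_i - W_base, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :]

def synthesize(W_base, A, B):
    """At inference, rebuild the expert weight on demand."""
    return W_base + A @ B

rng = np.random.default_rng(1)
d, r = 128, 4
W_base = rng.standard_normal((d, d))
experts = [W_base + 0.05 * rng.standard_normal((d, d)) for _ in range(8)]

factors = [compress_low_rank(W, W_base, r) for W in experts]
dense_params = len(experts) * d * d                      # storing every expert in full
ders_params = sum(A.size + B.size for A, B in factors)   # deltas only (base is shared)
print(f"per-expert delta params: {2 * d * r} (dense per-expert: {d * d})")

W_hat = synthesize(W_base, *factors[0])
rel_err = np.linalg.norm(experts[0] - W_hat) / np.linalg.norm(experts[0])
```

Because each expert sits close to the shared base, a very small rank already reconstructs W_i with low relative error while storing 2dr instead of d² values per expert.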
Training-Time Upcycling (DeRS Upcycling)
- Instantiate expert deltas efficiently (zero-filled sparse or low-rank).
- Forward pass: route the input and synthesize W_i = W_base + ΔW_i for the active experts.
- Backpropagate and update W_base and the compact deltas ΔW_i. Optionally, regularize for sparsity or rank.
This decomposed approach enables upcycled MoEs to scale in parameter count and memory footprint orders-of-magnitude below naive multi-expert allocation.
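A toy forward pass in this style, assuming top-1 routing and zero-initialized low-rank deltas (so every expert initially behaves exactly like the shared base); all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n_experts, r = 32, 32, 4, 2

W_base = rng.standard_normal((d_in, d_out)) * 0.1        # shared, trainable
A = [np.zeros((d_in, r)) for _ in range(n_experts)]      # zero-filled delta factors
B = [rng.standard_normal((r, d_out)) * 0.01 for _ in range(n_experts)]
W_router = rng.standard_normal((d_in, n_experts)) * 0.1

def forward(x):
    """Route the input to its top-1 expert and synthesize that expert's
    weight as W_base + A_i @ B_i on the fly."""
    logits = x @ W_router
    expert = int(np.argmax(logits))
    W_i = W_base + A[expert] @ B[expert]   # synthesis step
    return x @ W_i, expert

x = rng.standard_normal(d_in)
y, chosen = forward(x)
# with zero-filled A, every expert initially matches the base exactly
assert np.allclose(y, x @ W_base)
```

During training, gradients flow into W_base, the router, and the per-expert factors A_i, B_i, so experts specialize only through their compact deltas.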
6. Empirical Results and Application Domains
DeRS achieves high compression while maintaining or slightly improving accuracy on a range of benchmarks:
Whereas vanilla MoE upcycling adds expert parameters at the billion scale, the DeRS encodings require only:
| Task/Model | DeRS-SM Params | DeRS-LM Params |
|---|---|---|
| MoE-LLaVA-Phi | 1.11 M | 2.42 M |
| Med-MoE-StableLM | 0.26 M | 1.20 M |
| Coder-MoE | 325 M | 9.09 M |
Memory and computational use are reduced by up to 52.7% for model size, 21.2% for training memory, and 43.8% for inference memory in Coder-MoE scenarios (Huang et al., 3 Mar 2025). Application domains include multi-modal learning, medical VQA, program synthesis, and vision-LLMs.
7. Limitations, Extensions, and Open Problems
For Boolean DNFs, Δ-decomposition is polynomial-time tractable only in the positive (negation-free) case; extending these results to general or non-Boolean settings, or to other logical formats (BDD, CNF), remains open and is known to be coNP-hard in some cases. Selection of the "optimal" Δ is combinatorial and not addressed by existing algorithms (Ponomaryov, 2018).
In upcycled MoE models, DeRS achieves maximal efficiency when all experts are close to W_base; large deviations may reduce the representational sufficiency of sparse or low-rank ΔW_i. The choice between sparse and low-rank forms trades off memory efficiency against expressivity. Keeping W_base trainable is empirically preferable.
In both domains, the DeRS paradigm demonstrates that leveraging structural redundancy enables substantial gains in computational efficiency, memory footprint, and interpretability, while open questions remain on extensions to less constrained settings and on criteria for optimal decomposition.