DecompT5: Transformer Decomposition
- DecompT5 is a family of Transformer-based techniques that decompose tasks or model parameterizations to reduce parameters and enhance efficiency and interpretability.
- In its principal instantiation, it factorizes T5 soft-prompt embeddings into low-rank matrices, achieving a roughly 9-fold reduction in tunable parameters while matching or surpassing baseline performance.
- Variants extend DecompT5 for robust natural language understanding, multi-hop reasoning, and decompiled code summarization, highlighting its versatility across domains.
DecompT5 encompasses a family of Transformer-based architectures and methodologies unified by their focus on decomposition: either decomposing tasks or decomposing model parameterizations to improve efficiency or interpretability. Within the research literature, DecompT5 appears in several distinct but conceptually aligned branches—including prompt tuning with low-rank parameterization for efficient adaptation, compositional question/task decomposition for robust NLU, and fine-tuning T5 variants for summarization of decompiled binaries. All approaches leverage the T5 encoder–decoder infrastructure, but target different forms of “decomposition” to enhance performance, efficiency, or generalization.
1. Low-Rank Prompt Tuning: DecompT5 Parameterization
The principal instantiation of DecompT5 in "Decomposed Prompt Tuning via Low-Rank Reparameterization" introduces a low-rank reparameterization of soft prompts for prompt tuning in T5. Conventional prompt tuning learns a full prompt embedding matrix $P \in \mathbb{R}^{n \times d}$, where $n$ is the prompt length and $d$ the model hidden size, typically initialized randomly or copied from vocabulary embeddings. Empirical analysis reveals that, post-training, these matrices exhibit low intrinsic rank: most singular values collapse towards zero, indicating substantial redundancy.
DecompT5 enforces a low-rank structure from the outset by factorizing $P = AB$, where $A \in \mathbb{R}^{n \times b}$ and $B \in \mathbb{R}^{b \times d}$, for a small bottleneck dimension $b \ll \min(n, d)$. By construction (Theorem A.1 of Xiao et al., 2023), this guarantees that $P$ has rank at most $b$.
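A minimal PyTorch-style sketch of this parameterization follows; the class name, initialization scale, and default sizes are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class LowRankPrompt(nn.Module):
    """Soft prompt P = A @ B with an explicit rank-b bottleneck.

    Sketch of the DecompT5 reparameterization described above; the prompt is
    materialized and prepended to the input embeddings of a frozen T5.
    """

    def __init__(self, prompt_len: int = 100, d_model: int = 1024, bottleneck: int = 10):
        super().__init__()
        # Both factors are initialized with i.i.d. Gaussian entries.
        self.A = nn.Parameter(torch.randn(prompt_len, bottleneck) * 0.02)
        self.B = nn.Parameter(torch.randn(bottleneck, d_model) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        prompt = self.A @ self.B                      # (prompt_len, d_model), rank <= b
        return prompt.unsqueeze(0).expand(batch_size, -1, -1)

prompt_module = LowRankPrompt()
print(sum(p.numel() for p in prompt_module.parameters()))  # 10*(100+1024) = 11,240
```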
Key technical characteristics include:
- Initialization: Both factor matrices $A$ and $B$ are initialized with i.i.d. Gaussian entries; SVD-based warm starts yield no additional empirical benefit.
- Optimization: Standard negative log-likelihood is optimized for downstream sequence-to-sequence tasks, using AdamW and 100 epochs of training.
- Parameter Count: The total number of tunable parameters drops from $n \times d$ for vanilla prompt tuning (PT) to $b \times (n + d)$ for DecompT5. For T5-Large ($d = 1024$, prompt length $n = 100$, bottleneck $b = 10$), this implies a reduction from 102.4K to 11.2K parameters, a roughly 9-fold decrease (see the worked count after this list).
- Performance: On the SuperGLUE benchmark, DecompT5 matches or outperforms vanilla and residual prompt tuning in both high-resource and low-resource (few-shot) regimes. For T5-Large, average SuperGLUE scores are: vanilla PT: 77.08, residual PT: 76.67, DecompT5: 79.72. Few-shot advantages persist, with 1–3 point gains per task.
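For concreteness, the quoted figures follow directly from the counting formula above, assuming prompt length $n = 100$, hidden size $d = 1024$, and bottleneck $b = 10$:

$$n \, d = 100 \times 1024 = 102{,}400 \qquad \text{vs.} \qquad b\,(n + d) = 10 \times (100 + 1024) = 11{,}240,$$

i.e. roughly a 9.1-fold reduction in tunable parameters.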
This suggests that directly encoding a low-rank prior into prompt tuning not only yields parameter efficiency but also stabilizes or improves downstream performance (Xiao et al., 2023).
2. Task and Hypothetical Question Decomposition with DecompT5
In the context of NLU, DecompT5 is realized as a decomposition-aware T5 model, further pre-trained on distant supervision from comparable texts such as parallel news. Explicit decomposition is operationalized by auto-regressively generating a chain of supporting facts or intermediate representations per query and employing entailment classification for final answer aggregation.
Technical highlights include:
- Model: No alteration to the T5-large (770M parameter) Transformer; only the pre-training corpus and objectives are modified.
- Pre-training: Leverages 2.6M parallel-news sentence pairs and 0.9M Gutenberg sentences for a combination of seq2seq complementary-sentence and standard T5 denoising objectives.
- Decomposition Pipeline: At each step, the model auto-regressively generates aspects/facts, applies factual correction via GPT-3 for robustness, and finally aggregates entailment predictions via a weighted majority vote.
- Empirical Gains: On semantic parsing tasks (Overnight, TORQUE), DecompT5 achieves substantial improvements over base T5 (e.g., +26.8 points Hit@1 on Overnight). On QA (StrategyQA, HotpotQA), DecompQA, a pipeline built on DecompT5, yields +4.4 and +8 point improvements over strong RoBERTa baselines and outperforms GPT-3 Chain-of-Thought by +8 points on HotpotQA.
- Ablation: Distant supervision pre-training and factual correction (via GPT-3) are both essential for maximal performance, contributing 3–12 points depending on task (Zhou et al., 2022).
The explicit modeling of decompositions facilitates better compositional generalization and interpretability, particularly in tasks demanding multi-hop reasoning.
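As a rough illustration of the final aggregation stage, the sketch below implements a weighted majority vote over per-fact entailment predictions; the input format (label, confidence weight) and function name are assumptions for illustration, not the authors' interface.

```python
from typing import List, Tuple

def aggregate_entailment_votes(votes: List[Tuple[str, float]]) -> str:
    """Weighted majority vote over per-fact entailment predictions.

    Each vote pairs an entailment label for one generated supporting fact
    with a confidence weight from the entailment classifier (assumed inputs).
    """
    scores = {"entailment": 0.0, "contradiction": 0.0}
    for label, weight in votes:
        scores[label] += weight
    return max(scores, key=scores.get)

# Example: three generated supporting facts scored by the entailment classifier.
print(aggregate_entailment_votes([("entailment", 0.9), ("contradiction", 0.4), ("entailment", 0.7)]))
# -> "entailment"
```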
3. DecompT5 for Structured Code Summarization in Reverse Engineering
A further variant, sometimes referred to as BinT5, applies T5-based architectures to the summarization of decompiled binaries, a task central to program analysis and reverse engineering:
- Architecture: Starts from CodeT5-base, adapted only in input sequence length and fine-tuned on a newly constructed dataset, CAPYBARA, comprising 214K (function, summary) pairs from decompiled C code at various compiler optimization levels.
- Tokenizer: No architecture-specific “binary” embeddings; decompiled pseudo-C is tokenized via SentencePiece.
- Fine-tuning: Standard cross-entropy between autoregressively generated summary tokens and target summary.
- Performance: On the decompiled C subset (with duplicates), BinT5 achieves 58.82 BLEU-4, only modestly lower than source-code summarization (60.83). Ablations reveal sharp performance drop (to 11.3 BLEU-4) with fully stripped binaries, underscoring the importance of identifier preservation for semantic summarization.
- Data Efficiency: BLEU-4 remains within 5 points of the full-data score even when training on only 25% of the data, indicating robustness to reduced data scale.
- Limitations: Summarization from stripped binaries or with high obfuscation remains challenging, motivating ongoing work on enhanced function-boundary detection and pre-training on decompiler outputs (Al-Kaswan et al., 2023).
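A minimal Hugging Face-style inference sketch for this setting is shown below. It loads the public CodeT5-base checkpoint as a stand-in (the fine-tuned BinT5/CAPYBARA weights are not assumed here), and the example decompiled function is fabricated for illustration.

```python
# Summarize a decompiled (pseudo-C) function with a CodeT5-style seq2seq model.
# NOTE: "Salesforce/codet5-base" is the public starting checkpoint; without
# fine-tuning on CAPYBARA the generated text will not be a meaningful summary.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

decompiled_fn = """
undefined8 FUN_00101149(long param_1, int param_2) {
  long sum = 0;
  for (int i = 0; i < param_2; i = i + 1) {
    sum = sum + *(int *)(param_1 + (long)i * 4);
  }
  return sum;
}
"""

# BinT5 extends the maximum input length to accommodate long decompiler output;
# here we simply truncate to keep the sketch self-contained.
inputs = tokenizer(decompiled_fn, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, num_beams=4, max_length=48)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```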
4. Decomposition Frameworks for Commonsense and Reasoning Tasks
An early use of DecompT5 (as a methodology rather than a model name) is found in WinoGrande commonsense reasoning, where sequence-to-sequence models are leveraged by decomposing each multiple-choice example into independent hypothesis–premise "entailment" subproblems:
- Strategy: Each example is split into two inputs, one per answer option, with T5 trained to output "entailment" or "contradiction".
- Scoring: At inference, a softmax over the logits of the "entailment" and "contradiction" target tokens yields a score for each option, and the higher-scoring option is selected (sketched below).
- Result: This “decomposition plus entailment” approach achieves 0.7673 AUC on the held-out test set, outperforming the previous RoBERTa-based state of the art by over five points.
- Ablations: “Logit-trick” and using appropriate target tokens (“entailment/contradiction” vs. “true/false”) account for significant gains (Lin et al., 2020).
The decomposition paradigm is found effective in reducing task complexity, enhancing transparency, and leveraging pre-trained entailment knowledge.
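A minimal sketch of this decomposition-plus-entailment scoring is given below, using a generic T5 checkpoint; the prompt template, checkpoint, and first-subword label comparison are illustrative assumptions rather than the authors' released setup.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# NOTE: the raw t5-large checkpoint is not yet trained on the decomposed
# entailment examples, so this scoring only becomes meaningful after fine-tuning.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.eval()

# One WinoGrande-style example: filling the blank "_" with each option turns
# the multiple-choice item into two independent classification inputs.
sentence = "The trophy doesn't fit into the suitcase because _ is too small."
options = ["the trophy", "the suitcase"]

# First sub-word token of each target label (the "logit trick" compares these).
ent_id = tokenizer("entailment", add_special_tokens=False).input_ids[0]
con_id = tokenizer("contradiction", add_special_tokens=False).input_ids[0]

def entailment_score(option: str) -> float:
    """Probability mass the model places on 'entailment' vs. 'contradiction'."""
    enc = tokenizer(sentence.replace("_", option), return_tensors="pt")
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=start).logits[0, 0]
    return torch.softmax(logits[[ent_id, con_id]], dim=-1)[0].item()

scores = [entailment_score(o) for o in options]
print(options[max(range(len(options)), key=lambda i: scores[i])])
```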
5. Parameter Efficiency and Empirical Characterization
The explicit imposition of low-rank structure in prompt tuning yields dramatic reductions in trainable parameter counts:
| Model | # Params (T5-Large) | SuperGLUE Avg |
|---|---|---|
| Vanilla PT | 102K | 77.08 |
| Residual PT | 925K | 76.67 |
| DecompT5 (b=10) | 11K | 79.72 |
In few-shot settings (32 shots), DecompT5 outperforms alternatives by roughly 2 points on average, and consistently achieves 1–3 point gains on individual tasks such as WiC, CB, RTE, and COPA. This efficiency, coupled with competitive or superior performance, positions DecompT5 as a preferred parameterization for T5 prompt adaptation when downstream storage or update friendliness is a constraint (Xiao et al., 2023).
6. Limitations and Future Directions
While DecompT5’s low-rank parameterization delivers parameter efficiency, and its decomposition-based pre-training variants drive advances in multi-hop reasoning and code summarization, certain limitations persist:
- Task Scope: Low-rank prompt tuning is currently validated only in encoder–decoder models and NLU tasks (SuperGLUE).
- Convergence: DecompT5, like vanilla PT, converges relatively slowly.
- Transfer: Summarization performance on highly obfuscated or stripped binaries remains limited.
- Methodological Gaps: Adaptive bottleneck/rank selection and extension to encoder-only or decoder-only backbones are identified as future work.
- Extensible Pre-training: Opportunities exist in joint pre-training on decompiler outputs and hybrid prompt pre-training strategies to enhance cross-domain transfer (Xiao et al., 2023, Al-Kaswan et al., 2023, Zhou et al., 2022).
A plausible implication is that further integration of decomposition-based pre-training and parameter-efficient adaptation will yield gains across compositionally complex and resource-limited scenarios.
7. Summary
DecompT5 represents a family of techniques exploiting decomposition at multiple levels of NLP modeling: parameter factorization for efficient prompt tuning, compositional output pipelines for interpretability and reasoning, and T5 adaptation for summarizing decompiled code. This versatility is unified by empirical rigor and tangible gains in both efficiency and performance across benchmarks and modalities (Xiao et al., 2023, Zhou et al., 2022, Al-Kaswan et al., 2023, Lin et al., 2020).