DLGNet-Task: Autoregressive End-to-End Transformers

Updated 7 April 2026

Autoregressive End-to-End Transformers are transformer-based architectures that serialize multi-step tasks with delimiter tokens, allowing joint optimization and transparent intermediate states.
They employ maximum likelihood training on a left-to-right token sequence, eliminating the need for separate per-module losses and enhancing end-to-end differentiability.
DLGNet-Task demonstrates competitive performance on dialogue benchmarks while reducing engineering overhead compared to traditional modular pipelines.

Autoregressive End-to-End Transformers, typified by DLGNet-Task and its descendants, denote a class of transformer-based architectures that unify the solution of multi-component real-world tasks by modeling the entire process as a single, left-to-right sequence modeling problem. These systems contrast with both classical pipelined architectures, which train and connect modules individually, and with “black-box” end-to-end schemes that do not expose or control internal structure. By leveraging a single transformer’s powerful autoregressive conditioning and explicit token delimiters, Autoregressive End-to-End Transformers enable simultaneous joint optimization, interpretable intermediate representations, and flexible integration of open-domain and task-oriented operations (Olabiyi et al., 2020).

1. Core Architectural Principles

Autoregressive End-to-End Transformers recast modular multi-step pipelines as single-sequence generation problems, where all intermediate representations (e.g., in dialogue: intent, entities, dialogue state, policy action, surface realization) are serialized with special delimiter tokens and predicted token-by-token.

Formally, for a multi-turn, multi-domain dialogue system, the model ingests a sequence

$\langle {\tt <turn\_sep>},\,{\tt usr:},\,U_t,\,{\tt <intent>},\,I_t,\dots,{\tt <sys>},\,R_t \rangle,$

where $U_t$ is the current user utterance, $I_t$ the predicted intent, and so on. Each segment is serialized into tokens and appended sequentially. The network models the joint distribution over these tokens via the standard left-to-right factorization: $P_\theta(x_1,\dots,x_N) = \prod_{i=1}^N P_\theta(x_i|x_{<i}),$ where each $x_i$ is either a natural language subword or a symbolic token representing a functional slot or action (Olabiyi et al., 2020).

This design enforces “module boundaries” via delimiters, but trains the entire inference process end-to-end via a single cross-entropy loss over the complete linearized sequence. All modules—NLU, dialogue state tracking (DST), policy, and NLG—are implemented as sub-sequences within a single transformer’s output (Olabiyi et al., 2020).
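As a minimal sketch of this linearization, the snippet below flattens one turn into a delimiter-marked string. The segment names in `SEGMENT_ORDER`, the helper `serialize_turn`, and the example values are illustrative assumptions; the source only shows `<turn_sep>`, `usr:`, `<intent>`, and `<sys>`.

```python
from typing import Dict, List

# Hypothetical segment order; the real delimiter inventory is not given in the text.
SEGMENT_ORDER = ["intent", "entities", "state", "action"]

def serialize_turn(user_utterance: str, segments: Dict[str, str], response: str) -> str:
    """Flatten one dialogue turn into a single delimiter-marked string."""
    parts: List[str] = ["<turn_sep>", "usr:", user_utterance]
    for name in SEGMENT_ORDER:
        if name in segments:  # absent segments are simply skipped
            parts += [f"<{name}>", segments[name]]
    parts += ["<sys>", response]
    return " ".join(parts)

turn = serialize_turn(
    "I need a cheap hotel in the north",
    {"intent": "find_hotel", "state": "hotel price=cheap area=north"},
    "Sure, how many nights?",
)
print(turn)
```

Because every segment sits at a fixed position relative to its delimiter, the same string serves both as a training target and as a structure that can be parsed back into module outputs.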

2. Probabilistic Modeling and Maximum Likelihood Training

Given the full sequence encoding user input, intermediate states, and system output, training proceeds via maximum likelihood estimation (MLE): $\mathcal{L}_{\mathrm{DLGNet}} = -\sum_{i=1}^N \log P_\theta(x_i | x_{<i}),$ with tokenization typically using byte-pair encoding (BPE) for full vocabulary coverage and subword granularity (Olabiyi et al., 2019, Olabiyi et al., 2020). Intermediate functional blocks (intent, slots, etc.) do not require individual loss terms or weights; their prediction is handled in-line as a byproduct of the global likelihood objective.
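The objective can be illustrated with a toy calculation. The sketch below assumes the per-token conditional probabilities are already available (the values are made up) and checks that the summed negative log-probabilities equal the negative log of the joint probability from the left-to-right factorization.

```python
import math
from typing import Sequence

def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one sequence, given P(x_i | x_<i) for each token."""
    return -sum(math.log(p) for p in token_probs)

# Toy conditional probabilities for a 4-token sequence (illustrative values only).
probs = [0.5, 0.25, 0.8, 0.1]
loss = sequence_nll(probs)

# The same quantity computed from the product form of the joint probability.
joint = math.prod(probs)
print(loss, -math.log(joint))
```

Every token contributes to the sum in the same way, whether it belongs to an intent label, a slot value, or the response, which is why no per-module weighting appears.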

For tasks with constrained choices (e.g., slot value selection from an ontology), the model scores each candidate explicitly using a probe: $P_\theta(V^j_i | K_i, C) = \frac{\exp\left(\frac{1}{T_i} \log P_\theta(\ldots, {\tt DL_{key}}, K_i, {\tt DL_{value}}, V^j_i)\right)}{\sum_{j'} \exp\left(\frac{1}{T_i} \log P_\theta(\ldots, {\tt DL_{key}}, K_i, {\tt DL_{value}}, V^{j'}_i)\right)}$ where $T_i$ is a temperature. This mechanism can be used for runtime enforceability and verifiability (Olabiyi et al., 2020).
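A minimal sketch of this probe follows, assuming the sequence log-probability for each candidate value has already been obtained from the model. The function name `score_candidates` and the numbers are hypothetical.

```python
import math
from typing import Dict

def score_candidates(seq_logprobs: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Softmax over temperature-scaled sequence log-probabilities, one per candidate value."""
    scaled = {v: lp / temperature for v, lp in seq_logprobs.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {v: math.exp(s - m) for v, s in scaled.items()}
    z = sum(exps.values())
    return {v: e / z for v, e in exps.items()}

# Made-up log-probabilities of the full sequence ending in each candidate value.
logprobs = {"cheap": -12.0, "moderate": -14.0, "expensive": -15.5}
dist = score_candidates(logprobs, temperature=2.0)
best = max(dist, key=dist.get)
print(best, dist)
```

Lower temperatures sharpen the distribution toward the highest-scoring candidate, while higher temperatures flatten it.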

Random informative padding is applied at the sequence edges during training to flatten entropy curves and prevent overfitting to trivial positional artifacts, as previously shown effective in DLGNet (Olabiyi et al., 2019).
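One way to picture this padding is sketched below. It assumes the padding is drawn from other tokenized dialogues in the training set and placed on both sides of the example. The split of the budget, the helper name `pad_with_random_turns`, and the toy data are assumptions, not details confirmed by the source.

```python
import random
from typing import List

def pad_with_random_turns(example: List[str], pool: List[List[str]],
                          max_len: int, rng: random.Random) -> List[str]:
    """Fill the room left around `example` with tokens drawn from other dialogues."""
    room = max_len - len(example)
    if room <= 0:
        return example[:max_len]  # nothing to pad; truncate if too long
    left_budget = rng.randint(0, room)  # random split of the free space
    right_budget = room - left_budget
    # Guard the zero case: a slice of [-0:] would return the whole list.
    left = rng.choice(pool)[-left_budget:] if left_budget else []
    right = rng.choice(pool)[:right_budget]
    return left + example + right

rng = random.Random(0)
pool = [["a", "b", "c"], ["d", "e", "f", "g"]]
example = ["x", "y"]
padded = pad_with_random_turns(example, pool, max_len=6, rng=rng)
print(padded)
```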

3. Preservation of Module Boundaries and Explainability

While achieving full end-to-end differentiability, Autoregressive End-to-End Transformers explicitly retain functional boundaries via delimiter tokens and the ordering of outputs in the sequence. At inference time, any intermediate representation (such as intent, slot values, or dialogue acts) can be extracted, supplied in advance, or overridden without breaking the single-model structure. This yields key explainability and verifiability properties: module behaviors can be probed, debugged, or replaced at runtime within the unified architecture (Olabiyi et al., 2020).

This token-level serial partitioning enables hybrid operation modes: the system can perform purely open-domain-to-response mappings if required, or enforce fully modular pipelines by conditioning on user-supplied or external module outputs.
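The extract-and-override pattern can be sketched with plain string handling. The delimiter syntax `<name>`, the helpers `parse_segments` and `override_prefix`, and the example turn are illustrative assumptions. A real system would operate on token IDs and pass the returned prefix back to the model for continued generation.

```python
import re
from typing import Dict

DELIM = re.compile(r"<(\w+)>")

def parse_segments(serialized: str) -> Dict[str, str]:
    """Split a serialized turn into {delimiter_name: content}."""
    pieces = DELIM.split(serialized)
    # pieces = [text_before_first_delim, name1, content1, name2, content2, ...]
    return {name: content.strip() for name, content in zip(pieces[1::2], pieces[2::2])}

def override_prefix(serialized: str, name: str, new_value: str) -> str:
    """Keep everything before <name>, substitute its content, and drop later
    segments so they can be regenerated conditioned on the override."""
    head, _, _ = serialized.partition(f"<{name}>")
    return f"{head}<{name}> {new_value}"

generated = "<turn_sep> usr: book a table for two <intent> find_hotel <sys> Which hotel?"
segments = parse_segments(generated)
prompt = override_prefix(generated, "intent", "book_restaurant")
print(segments["intent"])
print(prompt)
```

This string-level version would misparse user text that itself contains `<word>` patterns; reserving dedicated delimiter tokens in the vocabulary avoids that ambiguity.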

4. Training Protocols, Tokenization, and Implementation

Training proceeds by flattening each turn of conversation into the full module sequence (user utterance, intent, entities, domains, slots, plans, API actions/results, dialogue acts, delexicalized responses, surface form) with special delimiter tokens. Standard transformer settings are used: GPT-2 (“small”) architecture (12 layers, 768-dim, 12 heads, 117 M parameters), BPE vocabulary (~50K tokens), and input sequences up to 1024 tokens.
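The stated hyperparameters can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and the vocabulary size is rounded because the text gives "~50K" rather than an exact figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    n_layer: int = 12
    d_model: int = 768
    n_head: int = 12
    max_positions: int = 1024
    vocab_size: int = 50_000  # "~50K" in the text; exact size not stated

    @property
    def head_dim(self) -> int:
        """Per-head width; d_model must split evenly across heads."""
        if self.d_model % self.n_head:
            raise ValueError("d_model must be divisible by n_head")
        return self.d_model // self.n_head

cfg = ModelConfig()
print(cfg.head_dim)
```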

Optimization typically uses Adam with learning rate $1 \times 10^{-4}$, effective batch sizes matched to GPU capacity, and early stopping once validation perplexity stabilizes. No explicit per-module loss weighting or curriculum is necessary due to the sequencing and delimiter structure (Olabiyi et al., 2020).
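The stopping rule can be sketched as follows. The `patience` and `min_delta` values are assumptions, since the text does not say how "stabilizes" is measured, and the validation losses are made up. Perplexity is taken as the exponential of the mean per-token cross-entropy.

```python
import math
from typing import Iterable, Optional

def early_stop_epoch(val_losses: Iterable[float], patience: int = 2,
                     min_delta: float = 0.01) -> Optional[int]:
    """Return the epoch index at which training would stop, or None if it never does."""
    best_ppl = math.inf
    stale = 0
    for epoch, loss in enumerate(val_losses):
        ppl = math.exp(loss)  # perplexity = exp(mean per-token cross-entropy)
        if ppl < best_ppl - min_delta:
            best_ppl, stale = ppl, 0  # meaningful improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

losses = [3.2, 2.9, 2.75, 2.75, 2.76, 2.70]
stop = early_stop_epoch(losses)
print(stop)
```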

5. Empirical Performance and Comparative Analysis

On MultiWOZ2.1, DLGNet-Task achieves the following in the Context+Results→Response setting:

Inform: 75.15%
Success: 57.31%
BLEU: 18.34
Combined: 87.52

Under pure end-to-end (User utterance→System response):

Inform: 72.65%
Success: 56.81%
BLEU: 15.40
Combined: 80.13
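The Combined score conventionally used on MultiWOZ is the mean of Inform and Success plus BLEU. The short sketch below applies that formula to the end-to-end row and reproduces the reported 80.13; the function name is hypothetical.

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """MultiWOZ Combined score: mean of Inform and Success, plus BLEU."""
    return round(0.5 * (inform + success) + bleu, 2)

end_to_end = combined_score(72.65, 56.81, 15.40)
print(end_to_end)
```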

These scores fall short of the strongest reported baselines (e.g., SOLOIST Combined = 102.5), but the approach provides a single model for all submodules, reduced engineering effort, and debuggable intermediate states (Olabiyi et al., 2020).

6. Benefits, Trade-offs, and Flexibility

Notable benefits:

Full end-to-end trainability, minimizing development and integration overhead.
Module-level control and observability through delimiter-driven representations.
Seamless open-domain and task-oriented unification within a single model instance.
Absence of explicit curriculum or per-module loss balancing.

Trade-offs include:

Marginally lower maximum aggregate metrics compared to best-of-breed modular pipelines.
Susceptibility to annotation inconsistency: noisy labels in intermediate blocks propagate through the global sequence.
No explicit mechanism for per-task custom training, relying on representation and ordering to balance learning (Olabiyi et al., 2020).

7. Application to Diverse Tasks and Generalization

The Autoregressive End-to-End Transformer approach embodied in DLGNet-Task has inspired extensions to vision-language multi-tasking, document understanding, and end-to-end autonomous driving. These adaptations extend the paradigm of serialized, tokenized submodule outputs, maintaining delimiter-based boundaries and autoregressive training across complex, multi-relational tasks (Beyer et al., 2023, Li et al., 2025, Jia et al., 2025).

Such models have enabled unification of NLU, DST, and NLG in dialogue; multi-task learning across VQA, captioning, and classification in vision; and complete document layout + content reconstruction from images in document understanding. The approach is extensible, with evidence of applicability to any domain where modular task decomposition can be reformulated as a serial sequence generation problem governed by explicit boundary tokens.