Globally Normalized Loss in QA Models

Updated 26 May 2026

Globally normalized loss is a structured prediction technique that scores complete answer spans jointly to mitigate label bias in extractive QA models.
It employs beam search to approximate the expensive partition function, making it feasible for large-scale document contexts.
Empirical studies show that this method improves span ranking and model accuracy by allowing corrections of early decision errors during training.

Globally normalized loss is a probabilistic objective for structured prediction in neural models, notably used in the Globally Normalized Reader (GNR) for extractive question answering (QA). Unlike locally normalized objectives, globally normalized loss scores and normalizes over all complete answer hypotheses, enabling direct competition and tradeoffs between different answer spans. This results in improved learning efficiency, reduced label bias, and better span selection fidelity, with practical implementations using beam search to render the approach tractable in large-scale settings (Raiman et al., 2017).

1. Formal Definition

The globally normalized loss in the GNR framework casts extractive QA, for a document $D$ and question $Q$, as a structured search problem over three sequential decisions: (1) select the answer’s sentence index $i \in \{1, \dots, S\}$, (2) select the start token index $j \in \{1, \dots, L_i\}$, and (3) select the end token index $k \in \{j, \dots, L_i\}$, where $S$ is the number of sentences and $L_i$ is the length of sentence $i$. An answer hypothesis is denoted $a = (i, j, k)$. Each decision contributes a neural-network–derived score, and the hypothesis score is their sum:

$s_\theta(a\,|\,D,Q) = s_\theta^{\text{sent}}(i\,|\,D,Q) + s_\theta^{\text{start}}(j\,|\,i,D,Q) + s_\theta^{\text{end}}(k\,|\,i,j,D,Q)$

The globally normalized probability of $a$ is

$P_\theta(a\,|\,D,Q) = \frac{\exp(s_\theta(a\,|\,D,Q))}{Z_\theta(D,Q)}$

where the partition function is

$Z_\theta(D,Q) = \sum_{i'=1}^S \sum_{j'=1}^{L_{i'}} \sum_{k'=j'}^{L_{i'}} \exp(s_\theta(i',j',k'\,|\,D,Q))$

Given a single gold answer $a^* = (i^*, j^*, k^*)$, the training loss is the negative log likelihood:

$\mathcal{L}(\theta) = -\log P_\theta(a^*\,|\,D,Q) = -s_\theta(a^*\,|\,D,Q) + \log Z_\theta(D,Q)$
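The definitions above can be checked by brute force on a tiny document. The following is a minimal sketch, not the GNR implementation: the score tables, the `end_score` rule, and all function names are hypothetical stand-ins for the neural scoring networks, and indices are 0-based.

```python
import math

# Hypothetical toy scores for a 2-sentence document (sentence lengths 3 and 2).
# Indices are 0-based here, whereas the text uses 1-based indices.
sent_scores = [0.5, 1.0]                        # stand-in for s_sent(i)
start_scores = [[0.2, 1.5, -0.3], [0.0, 0.4]]   # stand-in for s_start(j | i)

def end_score(i, j, k):
    """Toy stand-in for s_end(k | i, j): penalises longer spans."""
    return -0.5 * (k - j)

def span_score(i, j, k):
    """Additive score s(a) = s_sent + s_start + s_end for a = (i, j, k)."""
    return sent_scores[i] + start_scores[i][j] + end_score(i, j, k)

def all_spans():
    """Enumerate every valid (i, j, k) with j <= k inside sentence i."""
    for i, starts in enumerate(start_scores):
        for j in range(len(starts)):
            for k in range(j, len(starts)):
                yield (i, j, k)

def log_partition():
    """Exact log Z, using a max-shifted log-sum-exp for numerical stability."""
    scores = [span_score(*a) for a in all_spans()]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def global_nll(gold):
    """Negative log-likelihood -s(a*) + log Z of the gold span."""
    return -span_score(*gold) + log_partition()

gold = (0, 1, 1)
log_z = log_partition()
probs = {a: math.exp(span_score(*a) - log_z) for a in all_spans()}
loss = global_nll(gold)
```

Because every span shares the single normalizer `log_z`, the probabilities in `probs` sum to one over the whole document rather than within each sentence.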

2. Comparison to Locally Normalized Objectives

Locally normalized, or stepwise, objectives factor the prediction probability as a chain:

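$P_\theta(a\,|\,D,Q) = P_\theta^{\text{sent}}(i\,|\,D,Q)\; P_\theta^{\text{start}}(j\,|\,i,D,Q)\; P_\theta^{\text{end}}(k\,|\,i,j,D,Q)$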

Each component is normalized independently, e.g.:

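$P_\theta^{\text{sent}}(i\,|\,D,Q) = \frac{\exp(s_\theta^{\text{sent}}(i\,|\,D,Q))}{\sum_{i'=1}^{S} \exp(s_\theta^{\text{sent}}(i'\,|\,D,Q))}$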

This local decomposition introduces the label bias problem: once a partial decision (such as sentence selection) is made, subsequent normalization is confined to that context, preventing later evidence from correcting early errors. In contrast, global normalization computes likelihoods over all $(i, j, k)$ candidates simultaneously, mitigating irreversible premature decisions and enabling joint optimization over the entire answer span space.
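A small numerical illustration of label bias, under assumed toy scores: sentence 0 has a single token, so its locally normalized start and end probabilities are 1 no matter how poor their scores are, while sentence 1 contains a strongly scored span. All values and names below are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores (0-based indices). Sentence 0 has one token,
# sentence 1 has three tokens.
sent = [1.0, 0.8]
start = [[-3.0], [2.0, 0.0, 0.0]]

def end(i, j, k):
    """Toy end scores: bad in sentence 0, strong for k == 1 in sentence 1."""
    if i == 0:
        return -3.0
    return 2.0 if k == 1 else 0.0

spans = [(i, j, k)
         for i in range(len(sent))
         for j in range(len(start[i]))
         for k in range(j, len(start[i]))]

def local_probs():
    """Chain of three independently normalised softmaxes."""
    p_sent = softmax(sent)
    out = {}
    for i in range(len(sent)):
        p_start = softmax(start[i])
        n = len(start[i])
        for j in range(n):
            ks = list(range(j, n))
            p_end = softmax([end(i, j, k) for k in ks])
            for k, pe in zip(ks, p_end):
                out[(i, j, k)] = p_sent[i] * p_start[j] * pe
    return out

def global_probs():
    """One softmax over the summed scores of all complete spans."""
    p = softmax([sent[i] + start[i][j] + end(i, j, k) for i, j, k in spans])
    return dict(zip(spans, p))

local, glob = local_probs(), global_probs()
best_local = max(local, key=local.get)    # span preferred by the local model
best_global = max(glob, key=glob.get)     # span preferred by the global model
```

In this setup the local model keeps all of sentence 0's probability mass on its only span, so the low start and end scores never count against it. The global model sums the three scores before normalizing, which lets the strong span in sentence 1 win.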

3. Beam Search Approximation and Gradient Propagation

Direct computation of the global partition function $Z_\theta(D,Q)$ requires scoring every (sentence, start, end) triple, $\sum_{i=1}^{S} L_i(L_i+1)/2$ hypotheses in total, which is impractical for long contexts. The GNR employs beam search to approximate both the partition function and the training signal. The procedure:

Beam expansion: Begin with an empty hypothesis; expand all sentence-choice scores and retain the top $B$ options. For each, expand and score all start indices and then all end indices, keeping the top $B$ candidates at each step.
Approximate partition: The normalizer is computed as $Z_{\mathcal{B}} = \sum_{a' \in \mathcal{B}} \exp(s_\theta(a'\,|\,D,Q))$, where $\mathcal{B}$ is the final beam of size $B$.
Loss under beam: If the gold answer remains in the beam at all steps, the loss is

$\mathcal{L}_{\text{beam}}(\theta) = -s_\theta(a^*\,|\,D,Q) + \log \sum_{a' \in \mathcal{B}} \exp(s_\theta(a'\,|\,D,Q))$

If the gold answer falls off at step $t$, an "early update" is performed immediately, forming a partial-hypothesis loss.

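Gradient backpropagation: Once $\mathcal{L}_{\text{beam}}$ is computed, its gradient with respect to the model parameters is the expected score gradient over the beam minus the score gradient of the gold answer. With $\mathcal{B}$ the final beam and $a^*$ the gold answer:

$\nabla_\theta \mathcal{L}_{\text{beam}} = \sum_{a' \in \mathcal{B}} \frac{\exp(s_\theta(a'\,|\,D,Q))}{\sum_{a'' \in \mathcal{B}} \exp(s_\theta(a''\,|\,D,Q))}\, \nabla_\theta s_\theta(a'\,|\,D,Q) - \nabla_\theta s_\theta(a^*\,|\,D,Q)$

A descent step therefore raises the gold score and lowers the scores of competing beam hypotheses in proportion to their beam-normalized probability. Gradients are backpropagated through all differentiable score components.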

Approximate search and early updates enable efficiency and direct focus on errors made during search, concentrating learning resources on challenging examples.
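The beam procedure described in this section can be sketched on toy scores. This is a rough illustration under stated assumptions, not the GNR code: the score tables and function names are hypothetical, and the early-update loss normalizes over the current beam plus the gold prefix, which is one common convention rather than something the text specifies.

```python
import math

# Hypothetical toy scores (0-based indices); sentence lengths are 3, 2 and 4.
sent_scores = [0.5, 1.0, -0.2]
start_scores = [[0.2, 1.5, -0.3], [0.0, 0.4], [2.0, 0.1, 0.3, 0.0]]

def end_score(i, j, k):
    """Toy stand-in for s_end(k | i, j): penalises longer spans."""
    return -0.5 * (k - j)

def expand(prefix):
    """Yield (longer_prefix, added_score) for the next decision."""
    if len(prefix) == 0:                       # choose sentence i
        for i, s in enumerate(sent_scores):
            yield (i,), s
    elif len(prefix) == 1:                     # choose start j
        i = prefix[0]
        for j, s in enumerate(start_scores[i]):
            yield (i, j), s
    else:                                      # choose end k >= j
        i, j = prefix
        for k in range(j, len(start_scores[i])):
            yield (i, j, k), end_score(i, j, k)

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def beam_loss(gold, beam_width):
    """Return (loss, fell_off_step, beam); fell_off_step is None if gold survives."""
    beam = {(): 0.0}
    for step in (1, 2, 3):
        candidates = {}
        for prefix, score in beam.items():
            for longer, delta in expand(prefix):
                candidates[longer] = score + delta
        gold_prefix = gold[:step]
        gold_score = candidates[gold_prefix]   # present: gold survived earlier steps
        kept = sorted(candidates, key=candidates.get, reverse=True)[:beam_width]
        beam = {p: candidates[p] for p in kept}
        if gold_prefix not in beam:
            # Early update: normalise over the beam plus the gold prefix
            # (an assumed convention for the partial-hypothesis loss).
            loss = -gold_score + log_sum_exp(list(beam.values()) + [gold_score])
            return loss, step, beam
    return -beam[gold] + log_sum_exp(list(beam.values())), None, beam

loss_ok, step_ok, final_beam = beam_loss(gold=(0, 1, 1), beam_width=2)
loss_early, step_early, _ = beam_loss(gold=(2, 0, 0), beam_width=2)
```

With a beam width of 2, the gold span `(0, 1, 1)` stays in the beam through all three steps, while `(2, 0, 0)` is pruned at the sentence step and triggers an early update. Setting the width large enough to hold every hypothesis recovers the exact partition function.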

4. Parameter Update Workflow

A single parameter update for a document–question pair proceeds as follows:

Encode $D$ and $Q$ via shared BiLSTM or Transformer layers.
Compute sentence scores $s_\theta^{\text{sent}}(i\,|\,D,Q)$ for $i = 1, \dots, S$.
Select the top-$B$ sentences, where $B$ is the beam width (Beam $\mathcal{B}_1$).
For each $i$ in Beam $\mathcal{B}_1$, compute start token scores, retaining top-$B$ (Beam $\mathcal{B}_2$).
For each $(i, j)$ in Beam $\mathcal{B}_2$, compute end token scores, retaining top-$B$ (Beam $\mathcal{B}_3$).
If the gold answer $a^*$ falls off the beam, compute a partial-beam loss at the earliest step where it was lost and backpropagate immediately.
Otherwise, compute the full-beam loss as above.
Backpropagate $\nabla_\theta \mathcal{L}$ through all scoring networks.
Update parameters: $\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}$ (learning rate $\eta$) using SGD, Adam, or similar.

This procedure is repeated for all training pairs over multiple epochs.
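The last two steps of an update can be illustrated with a deliberately simplified model in which every score component is a free parameter. This is a hypothetical sketch: a real system would obtain the gradients by automatic differentiation through the encoder, and the fixed `beam` list, parameter values, and learning rate here are assumptions chosen for illustration.

```python
import math

# Hypothetical tabular parameters: each score component is a free parameter.
theta = {
    "sent": {0: 0.5, 1: 1.0},
    "start": {(0, 1): 1.5, (1, 1): 0.4},
    "end": {(0, 1, 1): 0.0, (0, 1, 2): -0.5, (1, 1, 1): 0.0},
}

def score(theta, a):
    """s(a) = s_sent(i) + s_start(j | i) + s_end(k | i, j)."""
    i, j, k = a
    return theta["sent"][i] + theta["start"][(i, j)] + theta["end"][(i, j, k)]

def beam_probs(theta, beam):
    """Softmax of hypothesis scores, normalised over the beam only."""
    scores = [score(theta, a) for a in beam]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def beam_nll(theta, beam, gold):
    """Full-beam loss: negative log of the gold's beam-normalised probability."""
    return -math.log(beam_probs(theta, beam)[beam.index(gold)])

def sgd_step(theta, beam, gold, lr):
    """One plain SGD update, theta <- theta - lr * grad, done in place."""
    probs = beam_probs(theta, beam)            # evaluated at the old parameters
    for a, p in zip(beam, probs):
        # dL/ds(a) = p(a) - 1[a == gold]; s(a) is a sum of three parameters,
        # so the same value flows into each component that a touches.
        g = p - (1.0 if a == gold else 0.0)
        i, j, k = a
        theta["sent"][i] -= lr * g
        theta["start"][(i, j)] -= lr * g
        theta["end"][(i, j, k)] -= lr * g

beam = [(0, 1, 1), (0, 1, 2), (1, 1, 1)]       # assumed final beam, contains gold
gold = (1, 1, 1)
loss_before = beam_nll(theta, beam, gold)
sgd_step(theta, beam, gold, lr=0.5)
loss_after = beam_nll(theta, beam, gold)
```

The per-hypothesis gradients sum to zero across the beam, so the step moves score mass from the competing hypotheses to the gold span and lowers the loss on this example.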

5. Practical Advantages and Theoretical Implications

Globally normalized loss offers critical advantages:

Label bias reduction: By globally tying the sentence, start, and end decisions, poor choices in one stage can be rectified by strong evidence in another.
Enhanced span ranking: All full-answer spans $(i, j, k)$ compete jointly, providing robust, calibrated rankings over candidates.
Empirical improvements: In the GNR experiments (Raiman et al., 2017), global normalization outperformed a locally normalized version by several points of exact-match (EM) accuracy, particularly for questions where precise span boundary identification is required.
Efficient inference: The same beam used during inference can approximate the partition during training, eliminating complex dynamic programming.
Stable learning: Early updates curb computational waste and concentrate optimization on cases where errors manifest during search.

Collectively, globally normalized loss blends tractable training (through beam search) with the modeling prowess of a full-space CRF over answer spans, ensuring both theoretical soundness and practical effectiveness in large-scale QA contexts (Raiman et al., 2017).

6. Context and Relation to Broader Methodologies

The globally normalized loss embodies key principles from conditional random fields (CRFs) adapted to neural architectures, extending classical structured prediction to modern NLP with differentiable, context-sensitive scoring functions. Its development responds to the inefficiencies and limitations encountered in bi-directional attention mechanisms and locally normalized scoring strategies, especially in extractive QA and span selection.

A plausible implication is that similar global normalization techniques could generalize to other multi-stage neural decision processes, particularly where label bias or joint optimization over large hypothesis spaces are pressing concerns. The use of beam search for tractable approximate inference and partition computation is also widely relevant in structured prediction and sequence generation tasks, indicating the broader methodological applicability of the framework proposed in the GNR (Raiman et al., 2017).

References

Raiman et al. (2017). Globally Normalized Reader.
