Structured Extraction Language (SEL)
- SEL is a tree-structured formalism that unifies diverse information extraction tasks using a schema-driven 'spot-and-associate' approach.
- It employs a context-free grammar to recursively encode nested and overlapping entities, relations, events, and sentiment structures in a compact format.
- Prompt-based extraction with SEL enables robust adaptation across supervised, low-resource, and few-shot settings through dynamic schema prompts and pre-training.
Structured Extraction Language (SEL) is a tree-structured formalism that unifies disparate information extraction (IE) objectives under a single, expressive syntax. Designed as the core representation for the Unified Information Extraction (UIE) framework, SEL enables compact, lossless encoding of entities, relations, n-ary events, nested or overlapping extractions, and sentiment structures. All extraction outputs are cast as a sequence of “spot-and-associate” operations that generate a parseable tree of typed spans and their inter-relationships under a strict, explicit grammar. This formalism facilitates robust, prompt-driven IE model training and supports adaptive extraction across supervised, low-resource, and few-shot settings (Lu et al., 2022).
1. SEL Formal Grammar and Structural Abstraction
SEL is founded on a context-free grammar that defines allowable extraction structures. At the atomic level, SEL consists of two primitives: spotting a span (identifying a substring corresponding to a specific type or role) and associating this span with others under role-named edges. The grammar is specified as follows:
```
〈SEL〉         ::= 〈NodeList〉
〈NodeList〉    ::= 〈Node〉 | 〈Node〉 〈NodeList〉
〈Node〉        ::= "(" 〈SpotName〉 ":" 〈InfoSpan〉 〈AssocList〉? ")"
〈AssocList〉   ::= 〈Association〉 | 〈Association〉 〈AssocList〉
〈Association〉 ::= "(" 〈AssoName〉 ":" 〈InfoSpan〉 ")"
〈SpotName〉    ::= token sequence (e.g. "person", "start-position")
〈AssoName〉    ::= token sequence (e.g. "work for", "employee")
〈InfoSpan〉    ::= contiguous substring of the input text
```
Each (SpotName : InfoSpan) identifies a root node; zero or more child associations (AssoName : InfoSpan) may attach beneath, recursively forming the tree. SpotNames and AssoNames are tokens or phrases directly drawn from the extraction schema prompt and are type-constrained accordingly.
This abstraction automatically subsumes:
- Flat and nested entity recognition
- Binary and n-ary relations
- Event structures with role-argument lists
- Sentiment triplets with arbitrary nesting
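Because SEL is defined by the small context-free grammar above, a short recursive-descent routine suffices to recover the tree. The following is a minimal sketch (function names, the token representation, and the tuple-based tree are illustrative, not from the paper; spans containing parentheses are not handled):

```python
import re

def tokenize_sel(s):
    # Split an SEL string into parentheses and "Name: Span" fields.
    # Assumes InfoSpans themselves contain no parentheses.
    return [t for t in re.split(r'(\(|\))', s) if t.strip()]

def parse_sel(tokens):
    """Recursive-descent parse of an SEL token stream.

    Returns (nodes, consumed), where nodes is a list of
    (name, span, children) triples mirroring the grammar's
    Node / Association structure.
    """
    nodes = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            # A node opens: the next token is "Name: Span".
            name, span = tokens[i + 1].split(":", 1)
            # Children (associations) are parsed recursively.
            children, consumed = parse_sel(tokens[i + 2:])
            nodes.append((name.strip(), span.strip(), children))
            i += 2 + consumed
        elif tokens[i] == ")":
            # This node's scope closes; hand control back to the caller.
            return nodes, i + 1
        else:
            i += 1  # stray token; skip defensively
    return nodes, i
```

For example, `parse_sel(tokenize_sel("(person: Steve (work for: Apple))"))` yields a single `person` node with one `work for` association, matching the relation row of the table in the next section.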
2. Unified Semantics: Spot-and-Associate
SEL semantics are schema-driven:
- SpotName: specifies a semantic type or trigger, e.g., “person”, “start-position”, “aspect.” Each indicates the existence of an information span of that type within the input text.
- InfoSpan: a contiguous text substring.
- AssoName: a role linking its InfoSpan child to the parent spot node, e.g., “work for”, “employee”, “time.”
A standalone node (SpotName : InfoSpan) corresponds to an entity mention. A spot node with child associations encodes a relation, event, or structured sentiment, and deeper nesting handles nested or overlapping structures. Only SpotNames and AssoNames listed in the current schema prompt are valid in the output; this enforces schema specificity and restricts the output space, which is critical for task adaptation.
| Construct | Example | Semantic Interpretation |
|---|---|---|
| Entity mention | (person: Steve) | “Steve” as a person entity |
| Relation | (person: Steve (work for: Apple)) | Steve works for Apple |
| Event | (start-position: became (employee: Steve) (employer: Apple) (time: 1997)) | Steve becomes employee at Apple in 1997 |
| Sentiment triplet | (aspect: pizza (positive: excellent)) | “Pizza” positively described as “excellent” |
3. Prompt-Based Generation with Structural Schema Instructor (SSI)
SEL is embedded within a prompt-driven extraction model. For each input, a schema prompt s lists the permitted SpotNames and AssoNames (e.g., [spot] person … [asso] work for … [text]) and is prepended to the input text x.
The model then autoregressively generates the SEL-formatted output, y = UIE(s ⊕ x), where ⊕ denotes concatenation of prompt and text.
The schema prompt, referred to as the Structural Schema Instructor (SSI), constrains generation to schema-defined types. During fine-tuning or inference, the prompt determines which SpotNames/AssoNames—and thus, which extraction subspaces—the model will target. This prompt-based approach is essential for dynamic schema adaptation, multi-type unification, and efficient few-shot transfer.
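The paper does not prescribe a particular decoding implementation; one common way to realize SSI-style constraints is to mask decoder logits so that only structure symbols, schema-prompt names, and input-text tokens can receive probability mass. A minimal sketch under that assumption (function names and the toy vocabulary are illustrative):

```python
def allowed_token_mask(vocab, schema_names, input_tokens):
    """Build a 0/1 mask over the vocabulary.

    Only SEL structure symbols, names from the schema prompt, and
    tokens copied from the input text are permitted -- one way to
    restrict generation to the schema-defined output space.
    """
    structure = {"(", ")", ":", "<eos>"}
    allowed = structure | set(schema_names) | set(input_tokens)
    return [1 if tok in allowed else 0 for tok in vocab]

def mask_logits(logits, mask, neg_inf=-1e9):
    # Disallowed tokens get a large negative logit before softmax,
    # so the decoder can never emit an out-of-schema token.
    return [l if m else neg_inf for l, m in zip(logits, mask)]
```

Applied at every decoding step, this guarantees the generated sequence stays inside the SEL grammar's terminal vocabulary for the current schema.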
4. Pre-training and Fine-tuning Objectives
SEL-based models are pre-trained using heterogeneous corpora and combined sequence-generation objectives:
- Text-to-Structure (L_pair), maximizing the likelihood of gold SEL records over parallel text–record pairs: L_pair = Σ_{(x,y) ∈ D_pair} log p(y | x, s; θ)
- Structure-only language modeling (L_record), maximizing the likelihood of SEL records decoded on their own: L_record = Σ_{y ∈ D_record} Σ_i log p(y_i | y_{<i}; θ)
- Span-corruption masked language modeling (L_text) on raw text, with x′ the corrupted input and x″ the masked target spans: L_text = Σ_{x ∈ D_text} log p(x″ | x′; θ)
- The total pre-training objective sums the three: L = L_pair + L_record + L_text
During downstream task fine-tuning, standard teacher-forcing cross-entropy is minimized: L_FT = −Σ_{(s,x,y) ∈ D_task} log p(y | x, s; θ)
To mitigate exposure bias, a small amount of rejection noise is injected during fine-tuning—randomly inserting (SpotName: [null]) or (AssoName: [null]) spans into target sequences—so the model learns to emit and then ignore null-valued spans.
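The rejection mechanism can be sketched as follows; the [null] marker, the sampling probability, and the list-of-node-strings representation are illustrative choices, not the paper's exact procedure:

```python
import random

def add_rejection_noise(sel_nodes, negative_spots, p=0.1, rng=random):
    """Inject rejection noise into a gold SEL target.

    sel_nodes: list of SEL node strings, e.g. ["(person: Steve)"].
    negative_spots: schema SpotNames absent from the gold record.
    Each negative spot is inserted as a null-valued span with
    probability p, teaching the model to produce and then discard
    [null] spans for types not present in the text.
    """
    noisy = list(sel_nodes)
    for spot in negative_spots:
        if rng.random() < p:
            # Insert the null span at a random position in the target.
            noisy.insert(rng.randrange(len(noisy) + 1), f"({spot}: [null])")
    return noisy
```

At decoding time, any node whose InfoSpan is [null] is simply dropped when the SEL tree is converted to extraction records.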
5. Concrete Extraction Examples
SEL enables compact, interpretable, and explicit encoding of diverse IE outputs. The following two examples illustrate SEL’s versatility.
Example 1: Relation, Entity, and Event Extraction
- Text: “Steve became CEO of Apple in 1997.”
- Schema prompt: [spot] person, [spot] organization, [spot] time, [asso] work for, [asso] employee, [asso] employer, [asso] time, [text]…
Generated SEL output (one linearization consistent with the grammar; entity, relation, and event nodes may share spans):
```
((person: Steve (work for: Apple))
 (organization: Apple)
 (time: 1997)
 (start-position: became (employee: Steve) (employer: Apple) (time: 1997)))
```
Decoded extraction:
- Entities: (person, “Steve”), (organization, “Apple”), (time, “1997”)
- Relation: work-for(Steve, Apple)
- Event: start-position = “became” (employee = Steve, employer = Apple, time = 1997)
Example 2: Sentiment Triplet Extraction
- Text: “The staff were horrible but the pizza was excellent.”
- Schema prompt: [spot] aspect, [spot] opinion, [asso] negative, [asso] positive, [text]…
Generated SEL output:
```
((aspect: staff (negative: horrible)) (aspect: pizza (positive: excellent)))
```
Decoded extraction:
- (staff, horrible, negative)
- (pizza, excellent, positive)
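For flat sentiment output of this shape, decoding can be done with a single pattern; a minimal sketch for the non-nested case only (a full SEL parser must handle recursion, and the polarity names are taken from the schema above):

```python
import re

def decode_triplets(sel):
    """Decode non-nested SEL sentiment output of the form
    (aspect: SPAN (polarity: SPAN)) into (aspect, opinion, polarity)
    triples. Handles exactly one association per spot node."""
    pattern = (r'\(aspect:\s*([^()]+?)\s*'      # aspect span
               r'\((positive|negative|neutral):\s*'  # polarity role
               r'([^()]+?)\s*\)\s*\)')          # opinion span
    return [(aspect, opinion, polarity)
            for aspect, polarity, opinion in re.findall(pattern, sel)]
```

For instance, `decode_triplets("((aspect: pizza (positive: excellent)))")` returns `[("pizza", "excellent", "positive")]`.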
6. Inference Workflow and Adaptation to New Schemas
End-to-end SEL-based extraction (as implemented in UIE) proceeds via:
- Prompt construction with relevant SpotNames and AssoNames
- Prompt and text encoding using a transformer encoder
- Autoregressive SEL sequence generation via a transformer decoder
- Tree parsing of SEL tokens to extraction records
```python
def UIE_extract(text, schema_spots, schema_assos):
    # 1. Build SSI prompt
    prompt = []
    for spot in schema_spots:
        prompt += ["[spot]", spot]
    for asso in schema_assos:
        prompt += ["[asso]", asso]
    prompt += ["[text]"] + tokenize(text)
    # 2. Encode prompt and text jointly
    H = TransformerEncoder(prompt)
    # 3. Autoregressively decode into SEL tokens
    y = []
    h_dec = []
    while True:
        token, h_new = TransformerDecoderStep(H, h_dec)
        if token == "<eos>":
            break
        y.append(token)
        h_dec.append(h_new)
    # 4. Parse SEL tokens y into extraction records
    return parse_SEL(y)
```
Pre-training over the joint corpora (D_pair, D_record, D_text) instills general SEL expressiveness, robust text encoding, and universal IE capability within the transformer. For new tasks, specifying SpotNames and AssoNames in the prompt suffices to adapt the model with only a small labeled set, enabling high performance in supervised, low-resource, and few-shot settings.
7. Universality and Significance in Information Extraction
SEL achieves theoretical and practical universality for information extraction, acting as a common scaffold for all IE targets—flat and nested entities, binary and n-ary relations, complex event records, and sentiment analysis. Its schema-driven prompt mechanism enables “on-demand” focus, supporting rapid transfer to new schemas or domain-specific requirements. When coupled with prompt-based autoregressive decoding and large-scale pre-training, SEL permits the instantiation of a single transformer-based UIE model capable of broad, heterogeneous, and jointly learned extraction objectives across text corpora (Lu et al., 2022). This design unifies IE methodology, simplifies adaptation and deployment, and enhances robustness, particularly in low-data scenarios.