Structured Extraction Language (SEL)
- SEL is a tree-structured formalism that unifies diverse information extraction tasks using a schema-driven 'spot-and-associate' approach.
- It employs a context-free grammar to recursively encode nested and overlapping entities, relations, events, and sentiment structures in a compact format.
- Prompt-based extraction with SEL enables robust adaptation across supervised, low-resource, and few-shot settings through dynamic schema prompts and pre-training.
Structured Extraction Language (SEL) is a tree-structured formalism that unifies disparate information extraction (IE) objectives under a single, expressive syntax. Designed as the core representation for the Unified Information Extraction (UIE) framework, SEL enables compact, lossless encoding of entities, relations, n-ary events, nested or overlapping extractions, and sentiment structures. All extraction outputs are cast as a sequence of “spot-and-associate” operations that generate a parseable tree of typed spans and their inter-relationships under a strict, explicit grammar. This formalism facilitates robust, prompt-driven IE model training and supports adaptive extraction across supervised, low-resource, and few-shot settings (Lu et al., 2022).
1. SEL Formal Grammar and Structural Abstraction
SEL is founded on a context-free grammar that defines allowable extraction structures. At the atomic level, SEL consists of two primitives: spotting a span (identifying a substring corresponding to a specific type or role) and associating this span with others under role-named edges. The grammar is specified as follows:
```
〈SEL〉         ::= 〈NodeList〉
〈NodeList〉    ::= 〈Node〉 | 〈Node〉 〈NodeList〉
〈Node〉        ::= "(" 〈SpotName〉 ":" 〈InfoSpan〉 〈AssocList〉? ")"
〈AssocList〉   ::= 〈Association〉 | 〈Association〉 〈AssocList〉
〈Association〉 ::= "(" 〈AssoName〉 ":" 〈InfoSpan〉 ")"
〈SpotName〉    ::= token sequence (e.g. "person", "start-position")
〈AssoName〉    ::= token sequence (e.g. "work for", "employee")
〈InfoSpan〉    ::= contiguous substring of the input text
```
Each (SpotName : InfoSpan) identifies a root node; zero or more child associations (AssoName : InfoSpan) may attach beneath, recursively forming the tree. SpotNames and AssoNames are tokens or phrases directly drawn from the extraction schema prompt and are type-constrained accordingly.
This abstraction automatically subsumes:
- Flat and nested entity recognition
- Binary and n-ary relations
- Event structures with role-argument lists
- Sentiment triplets with arbitrary nesting
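Because SEL is defined by the small context-free grammar above, a short recursive-descent routine suffices to recover the tree. The following is a minimal sketch (function names, the token representation, and the tuple-based tree are illustrative, not from the paper; spans containing parentheses are not handled):

```python
import re

def tokenize_sel(s):
    # Split an SEL string into parentheses and "Name: Span" fields.
    # Assumes InfoSpans themselves contain no parentheses.
    return [t for t in re.split(r'(\(|\))', s) if t.strip()]

def parse_sel(tokens):
    """Recursive-descent parse of an SEL token stream.

    Returns (nodes, consumed), where nodes is a list of
    (name, span, children) triples mirroring the grammar's
    Node / Association structure.
    """
    nodes = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            # A node opens: the next token is "Name: Span".
            name, span = tokens[i + 1].split(":", 1)
            # Children (associations) are parsed recursively.
            children, consumed = parse_sel(tokens[i + 2:])
            nodes.append((name.strip(), span.strip(), children))
            i += 2 + consumed
        elif tokens[i] == ")":
            # This node's scope closes; hand control back to the caller.
            return nodes, i + 1
        else:
            i += 1  # stray token; skip defensively
    return nodes, i
```

For example, `parse_sel(tokenize_sel("(person: Steve (work for: Apple))"))` yields a single `person` node with one `work for` association, matching the relation row of the table in the next section.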
2. Unified Semantics: Spot-and-Associate
SEL semantics are schema-driven:
- SpotName: specifies a semantic type or trigger, e.g., “person”, “start-position”, “aspect.” Each indicates the existence of an information span of that type within the input text.
- InfoSpan: a contiguous text substring.
- AssoName: a role linking its InfoSpan child to the parent spot node, e.g., “work for”, “employee”, “time.”
A standalone node (SpotName : InfoSpan) corresponds to an entity mention. A spot node with child associations encodes a relation, event, or structured sentiment, and deeper nesting handles nested or overlapping structures. Only SpotNames and AssoNames listed in the current schema prompt are valid in the output; this enforces schema specificity and restricts the output space, which is critical for task adaptation.
| Construct | Example | Semantic Interpretation |
|---|---|---|
| Entity mention | (person: Steve) | “Steve” as a person entity |
| Relation | (person: Steve (work for: Apple)) | Steve works for Apple |
| Event | (start-position: became (employee: Steve) (employer: Apple) (time: 1997)) | Steve becomes employee at Apple in 1997 |
| Sentiment triplet | (aspect: pizza (positive: excellent)) | “Pizza” positively described as “excellent” |
3. Prompt-Based Generation with Structural Schema Instructor (SSI)
SEL is embedded within a prompt-driven extraction model. For each input, a schema prompt s lists the permitted SpotNames and AssoNames (e.g., [spot] person … [asso] work for … [text]) and is prepended to the input text x.
The model then autoregressively generates the SEL-formatted output, y = UIE(s ⊕ x), where ⊕ denotes concatenation of prompt and text.
The schema prompt, referred to as the Structural Schema Instructor (SSI), constrains generation to schema-defined types. During fine-tuning or inference, the prompt determines which SpotNames/AssoNames—and thus, which extraction subspaces—the model will target. This prompt-based approach is essential for dynamic schema adaptation, multi-type unification, and efficient few-shot transfer.
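The paper does not prescribe a particular decoding implementation; one common way to realize SSI-style constraints is to mask decoder logits so that only structure symbols, schema-prompt names, and input-text tokens can receive probability mass. A minimal sketch under that assumption (function names and the toy vocabulary are illustrative):

```python
def allowed_token_mask(vocab, schema_names, input_tokens):
    """Build a 0/1 mask over the vocabulary.

    Only SEL structure symbols, names from the schema prompt, and
    tokens copied from the input text are permitted -- one way to
    restrict generation to the schema-defined output space.
    """
    structure = {"(", ")", ":", "<eos>"}
    allowed = structure | set(schema_names) | set(input_tokens)
    return [1 if tok in allowed else 0 for tok in vocab]

def mask_logits(logits, mask, neg_inf=-1e9):
    # Disallowed tokens get a large negative logit before softmax,
    # so the decoder can never emit an out-of-schema token.
    return [l if m else neg_inf for l, m in zip(logits, mask)]
```

Applied at every decoding step, this guarantees the generated sequence stays inside the SEL grammar's terminal vocabulary for the current schema.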
4. Pre-training and Fine-tuning Objectives
SEL-based models are pre-trained using heterogeneous corpora and combined sequence-generation objectives:
- Text-to-Structure (L_pair), maximizing the likelihood of gold SEL records over parallel text–record pairs: L_pair = Σ_{(x,y) ∈ D_pair} log p(y | x, s; θ)
- Structure-only language modeling (L_record), maximizing the likelihood of SEL records decoded on their own: L_record = Σ_{y ∈ D_record} Σ_i log p(y_i | y_{<i}; θ)
- Span-corruption masked language modeling (L_text) on raw text, with x′ the corrupted input and x″ the masked target spans: L_text = Σ_{x ∈ D_text} log p(x″ | x′; θ)
- The total pre-training objective sums the three: L = L_pair + L_record + L_text
During downstream task fine-tuning, standard teacher-forcing cross-entropy is minimized: L_FT = −Σ_{(s,x,y) ∈ D_task} log p(y | x, s; θ)
To mitigate exposure bias, a small amount of rejection noise is injected during fine-tuning—randomly inserting (SpotName: [null]) or (AssoName: [null]) spans into target sequences—so the model learns to emit and then ignore null-valued spans.
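The rejection mechanism can be sketched as follows; the [null] marker, the sampling probability, and the list-of-node-strings representation are illustrative choices, not the paper's exact procedure:

```python
import random

def add_rejection_noise(sel_nodes, negative_spots, p=0.1, rng=random):
    """Inject rejection noise into a gold SEL target.

    sel_nodes: list of SEL node strings, e.g. ["(person: Steve)"].
    negative_spots: schema SpotNames absent from the gold record.
    Each negative spot is inserted as a null-valued span with
    probability p, teaching the model to produce and then discard
    [null] spans for types not present in the text.
    """
    noisy = list(sel_nodes)
    for spot in negative_spots:
        if rng.random() < p:
            # Insert the null span at a random position in the target.
            noisy.insert(rng.randrange(len(noisy) + 1), f"({spot}: [null])")
    return noisy
```

At decoding time, any node whose InfoSpan is [null] is simply dropped when the SEL tree is converted to extraction records.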
5. Concrete Extraction Examples
SEL enables compact, interpretable, and explicit encoding of diverse IE outputs. The following two examples illustrate SEL’s versatility.
Example 1: Relation, Entity, and Event Extraction
- Text: “Steve became CEO of Apple in 1997.”
- Schema prompt: [spot] person, [spot] organization, [spot] time, [asso] work for, [asso] employee, [asso] employer, [asso] time, [text]…
Generated SEL output (one linearization consistent with the grammar; entity, relation, and event nodes may share spans):
```
((person: Steve (work for: Apple))
 (organization: Apple)
 (time: 1997)
 (start-position: became (employee: Steve) (employer: Apple) (time: 1997)))
```
Decoded extraction:
- Entities: (person, “Steve”), (organization, “Apple”), (time, “1997”)
- Relation: work-for(Steve, Apple)
- Event: start-position = “became” (employee = Steve, employer = Apple, time = 1997)
Example 2: Sentiment Triplet Extraction
- Text: “The staff were horrible but the pizza was excellent.”
- Schema prompt: [spot] aspect, [spot] opinion, [asso] negative, [asso] positive, [text]…
Generated SEL output:
```
((aspect: staff (negative: horrible)) (aspect: pizza (positive: excellent)))
```
Decoded extraction:
- (staff, horrible, negative)
- (pizza, excellent, positive)
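For flat sentiment output of this shape, decoding can be done with a single pattern; a minimal sketch for the non-nested case only (a full SEL parser must handle recursion, and the polarity names are taken from the schema above):

```python
import re

def decode_triplets(sel):
    """Decode non-nested SEL sentiment output of the form
    (aspect: SPAN (polarity: SPAN)) into (aspect, opinion, polarity)
    triples. Handles exactly one association per spot node."""
    pattern = (r'\(aspect:\s*([^()]+?)\s*'      # aspect span
               r'\((positive|negative|neutral):\s*'  # polarity role
               r'([^()]+?)\s*\)\s*\)')          # opinion span
    return [(aspect, opinion, polarity)
            for aspect, polarity, opinion in re.findall(pattern, sel)]
```

For instance, `decode_triplets("((aspect: pizza (positive: excellent)))")` returns `[("pizza", "excellent", "positive")]`.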
6. Inference Workflow and Adaptation to New Schemas
End-to-end SEL-based extraction (as implemented in UIE) proceeds via:
- Prompt construction with relevant SpotNames and AssoNames
- Prompt and text encoding using a transformer encoder
- Autoregressive SEL sequence generation via a transformer decoder
- Tree parsing of SEL tokens to extraction records
```python
def UIE_extract(text, schema_spots, schema_assos):
    # 1. Build SSI prompt
    prompt = []
    for spot in schema_spots:
        prompt += ["[spot]", spot]
    for asso in schema_assos:
        prompt += ["[asso]", asso]
    prompt += ["[text]"] + tokenize(text)
    # 2. Encode prompt and text jointly
    H = TransformerEncoder(prompt)
    # 3. Autoregressively decode into SEL tokens
    y = []
    h_dec = []
    while True:
        token, h_new = TransformerDecoderStep(H, h_dec)
        if token == "<eos>":
            break
        y.append(token)
        h_dec.append(h_new)
    # 4. Parse SEL tokens y into extraction records
    return parse_SEL(y)
```
Pre-training over the joint corpora (D_pair, D_record, D_text) instills general SEL expressiveness, robust text encoding, and universal IE capability within the transformer. For new tasks, specifying SpotNames and AssoNames in the prompt suffices to adapt the model with only a small labeled set, enabling high performance in supervised, low-resource, and few-shot settings.
7. Universality and Significance in Information Extraction
SEL achieves theoretical and practical universality for information extraction, acting as a common scaffold for all IE targets—flat and nested entities, binary and n-ary relations, complex event records, and sentiment analysis. Its schema-driven prompt mechanism enables “on-demand” focus, supporting rapid transfer to new schemas or domain-specific requirements. When coupled with prompt-based autoregressive decoding and large-scale pre-training, SEL permits the instantiation of a single transformer-based UIE model capable of broad, heterogeneous, and jointly learned extraction objectives across text corpora (Lu et al., 2022). This design unifies IE methodology, simplifies adaptation and deployment, and enhances robustness, particularly in low-data scenarios.