RexUniNLU: Universal NLU Framework
- RexUniNLU is a universal NLU framework that unifies information extraction and classification by leveraging a recursive extraction paradigm with an Explicit Schema Instructor.
- It employs a recursive pipeline with custom query construction, isolated prompts, and advanced attention mechanisms to ensure consistent and type-correct schema-based decoding.
- The framework demonstrates state-of-the-art performance across full-shot, few-shot, and multi-modal benchmarks in multiple languages, validating its innovative design.
RexUniNLU is an encoder-only neural framework introducing a recursive extraction paradigm with an Explicit Schema Instructor (ESI) to achieve universal natural language understanding (NLU). It unifies information extraction (IE) and text classification (CLS) tasks within a single architecture, covering arbitrary extraction schemas—spanning from named entity recognition (NER) and relation extraction (RE) to previously unsolved quadruple and quintuple schemas—as well as CLS and multi-modal understanding. RexUniNLU formalizes true Universal Information Extraction (UIE) and applies schema constraints at each decoding step, ensuring consistency and type correctness for both IE and CLS, and demonstrates state-of-the-art results across diverse NLU tasks and languages (Liu et al., 2024).
1. Formal Foundation of Universal Information Extraction
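RexUniNLU redefines UIE to generalize beyond previous models limited to extracting fixed-arity tuples, such as subject–object–relation triples. The RexUniNLU UIE objective addresses an arbitrary schema $\mathbf{C}^n$ of arity $n$, where extraction corresponds to identifying a sequence of span–type pairs $(s, t)_1, \ldots, (s, t)_n$ along root-to-leaf paths in a schema tree. Let $\mathbf{x}$ denote the input token sequence and $\mathcal{A}$ the set of annotated tuples $(\mathbf{s}, \mathbf{t})$, with $\mathbf{t}$ a type path and $\mathbf{s}$ the corresponding spans.
The probabilistic extraction objective is:
$$\prod_{(\mathbf{s}, \mathbf{t}) \in \mathcal{A}} \; \prod_{i=1}^{n} p\big((s, t)_i \mid (s, t)_{<i}, \mathbf{C}^n, \mathbf{x}\big)$$
where $(s, t)_{<i}$ denotes all extracted pairs up to depth $i-1$.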
This general formulation subsumes common tasks:
- NER ($n = 1$): extract entities as single spans.
- RE ($n = 2$): extract subject–object–relation tuples.
- Event Extraction ($n = 2$ or $3$): event trigger and argument role extraction.
- Quadruple/Quintuple Extraction (larger $n$): higher-arity schemas previously unsupported by UIE.
Classification tasks are modeled as a degenerate case in which the span is fixed to a special “[CLST]” token that encodes the entire input, so the objective reduces to one over label types, $\prod_{i} p(t_i \mid t_{<i}, \mathbf{C}^n, \mathbf{x})$ for input $\mathbf{x}$ and schema $\mathbf{C}^n$. This covers single- and multi-label classification, NLI, and multiple-choice MRC, and extends to multi-modal cases by including non-text features.
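To make the schema-tree view concrete, the following minimal sketch represents a schema as a nested dictionary and enumerates its root-to-leaf type paths. The nested-dict encoding, the helper names `type_paths` and `arity`, and the example type names are illustrative assumptions, not the paper's data structures.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: keys are type names, values are the child schema.
Schema = Dict[str, "Schema"]


def type_paths(schema: Schema, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Enumerate every root-to-leaf type path of the schema tree."""
    paths: List[Tuple[str, ...]] = []
    for type_name, children in schema.items():
        path = prefix + (type_name,)
        if children:
            paths.extend(type_paths(children, path))
        else:
            paths.append(path)
    return paths


def arity(schema: Schema) -> int:
    """Arity n, taken here as the length of the longest type path."""
    return max((len(p) for p in type_paths(schema)), default=0)


# NER: every type is a leaf, so each tuple is a single span-type pair.
ner_schema: Schema = {"person": {}, "organization": {}}

# RE: a relation type hangs below the subject entity type.
re_schema: Schema = {"person": {"works_for (organization)": {}}, "organization": {}}

print(type_paths(re_schema), arity(ner_schema), arity(re_schema))
```

Under this encoding, a higher-arity schema is simply a deeper tree, which is why a single recursive procedure can serve all of the tasks listed above.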
2. Model Architecture: Recursive Pipeline with Explicit Schema Instructor
The recursive pipeline operates as follows:
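- Query Construction: At recursion step $i$, construct
$$Q_i = [\mathrm{CLS}]\,[\mathrm{P}]\,p_i\,[\mathrm{T}]\,t_i^1\,[\mathrm{T}]\,t_i^2 \cdots [\mathrm{Text}]\,x_0\,x_1 \cdots x_l$$
where $p_i$ captures previously extracted pairs and $t_i^1, t_i^2, \ldots$ are the eligible types at depth $i$. This forms the ESI prompt, explicitly guiding extraction or classification within the schema constraints.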
- Encoder: A transformer encoder (e.g., DeBERTa-v2) processes the query using custom position IDs and attention masks to achieve "Prompts Isolation," preventing information leakage between schema branches and allowing blocks to attend only to relevant segments.
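- Score Matrix: The encoder representations $Z_i$ feed two FFNN heads (query and key), with rotary embeddings (RoPE) encoding the positional difference between tokens $j$ and $k$:
$$\tilde{Y}_i^{j,k} = \mathrm{FFNN}_q(Z_i^j)^{\top}\, R(P_i^k - P_i^j)\, \mathrm{FFNN}_k(Z_i^k)$$
where $P_i$ holds the position IDs and $R(\cdot)$ is the rotary position matrix.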
- Decoding: After thresholding the score matrix at a threshold $\delta$, the resulting binary matrix is decoded via three token-linking operations: head–tail (span detection), head–type (type assignment), and type–tail (type–tail associations).
- Recursion: Newly extracted pairs are used as prefixes for the next query; recursion halts when no new extractions are made.
- Isolation Mechanism: Disjoint position ID intervals for different “[P]” blocks, and attention masks blocking cross-prefix or cross-type communication, strictly enforce schema separation.
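The control flow of the recursive pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the query is built as a plain string with one “[P]” block per prefix and one “[T]” marker per eligible type, and the encoder, score matrix, and token-linking decoder are all replaced by a caller-supplied `step` function. The names `build_query`, `recursive_extract`, `toy_step`, and the exact prompt layout are hypothetical and not taken from the released implementation.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]        # (span text, type name)
Prefix = Tuple[Pair, ...]     # pairs extracted so far along one schema path
Schema = Dict[str, "Schema"]  # hypothetical nested-dict schema tree
StepFn = Callable[[str, List[Prefix]], List[Tuple[Prefix, Pair]]]


def eligible_types(schema: Schema, prefix: Prefix) -> List[str]:
    """Types allowed at the next depth, given the type path already taken."""
    node = schema
    for _span, type_name in prefix:
        node = node[type_name]
    return list(node)


def build_query(prefixes: List[Prefix], schema: Schema, text: str) -> str:
    """Assumed layout: one [P] block per prefix, one [T] per eligible type."""
    parts = ["[CLS]"]
    for prefix in prefixes:
        parts.append("[P]" + ", ".join(f"{t}: {s}" for s, t in prefix))
        parts.extend("[T]" + t for t in eligible_types(schema, prefix))
    parts.append("[Text]" + text)
    return "".join(parts)


def recursive_extract(text: str, schema: Schema, step: StepFn) -> List[Prefix]:
    """Run the recursion; `step` stands in for encoder + scoring + decoding."""
    extracted: List[Prefix] = []
    frontier: List[Prefix] = [()]  # depth 0 starts from the empty prefix
    while frontier:                # halts once nothing new is extracted
        query = build_query(frontier, schema, text)
        extended = [prefix + (pair,) for prefix, pair in step(query, frontier)]
        extracted.extend(extended)
        # Only prefixes that still have child types are queried again.
        frontier = [p for p in extended if eligible_types(schema, p)]
    return extracted


def toy_step(query: str, prefixes: List[Prefix]) -> List[Tuple[Prefix, Pair]]:
    """Hard-coded stand-in for the model, for illustration only."""
    out: List[Tuple[Prefix, Pair]] = []
    for prefix in prefixes:
        if prefix == ():
            out.append((prefix, ("Marie Curie", "person")))
        elif prefix == (("Marie Curie", "person"),):
            out.append((prefix, ("Sorbonne", "works_for (organization)")))
    return out


schema: Schema = {"person": {"works_for (organization)": {}}, "organization": {}}
text = "Marie Curie worked at the Sorbonne."
result = recursive_extract(text, schema, toy_step)
print(result)
```

In this sketch the schema constraint is enforced purely by `eligible_types`: a prefix can only ever be extended with types that are children of its current node, which mirrors the role the ESI prompt plays in the model. Prompts Isolation (position IDs and attention masks) is not modeled here.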
3. Training Objectives and Decoding
Distinct loss functions are employed for IE and CLS:
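- IE Training (Circle Loss):
$$\mathcal{L}_i = \log\Big(1 + \sum_{\hat{y}_j = 0} e^{\bar{s}_j}\Big) + \log\Big(1 + \sum_{\hat{y}_k = 1} e^{-\bar{s}_k}\Big)$$
where $\bar{s}$ flattens the score matrix of step $i$ and $\hat{y}$ is the ground truth mask. Total IE loss is $\mathcal{L} = \sum_i \mathcal{L}_i$.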
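- CLS Training/Decoding:
- Apply sigmoid to the score matrix, producing link probabilities.
- Single-label: The prediction for a “[CLST]” position is the highest-scoring label type (arg max over the eligible types).
- Multi-label: Both link directions (head–type and type–tail) are thresholded at $\delta$ (e.g., 0.9).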
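The two objectives can be illustrated with a small dependency-free sketch. It assumes the common log-sum-exp form of the circle-style loss over flattened scores, and it assumes that multi-label decoding keeps a label only when both link directions clear the threshold. The function names and the way the two directions are passed in (`forward_scores`, `backward_scores`) are illustrative assumptions, not the paper's interface.

```python
import math
from typing import List, Sequence


def circle_style_loss(scores: Sequence[float], targets: Sequence[int]) -> float:
    """log(1 + sum_neg e^s) + log(1 + sum_pos e^-s) over flattened scores."""
    neg = sum(math.exp(s) for s, y in zip(scores, targets) if y == 0)
    pos = sum(math.exp(-s) for s, y in zip(scores, targets) if y == 1)
    return math.log1p(neg) + math.log1p(pos)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_single_label(label_scores: Sequence[float], labels: Sequence[str]) -> str:
    """Pick the label with the highest score (arg max)."""
    best = max(range(len(labels)), key=lambda i: label_scores[i])
    return labels[best]


def decode_multi_label(
    forward_scores: Sequence[float],
    backward_scores: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.9,
) -> List[str]:
    """Keep a label only if both link directions exceed the threshold."""
    return [
        label
        for label, f, b in zip(labels, forward_scores, backward_scores)
        if sigmoid(f) > threshold and sigmoid(b) > threshold
    ]


labels = ["sports", "finance", "tech"]
print(circle_style_loss([5.0, -5.0], [1, 0]))   # positives high, negatives low
print(decode_single_label([0.2, 2.0, -1.0], labels))
print(decode_multi_label([3.0, 3.0, -1.0], [3.0, 0.0, 3.0], labels))
```

The loss is small when positive entries score well above zero and negative entries well below it, so a fixed threshold can separate them at decoding time.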
4. Experimental Protocol and Benchmarks
RexUniNLU is pre-trained on approximately 30 million samples (Chinese and English), including distant supervision for NER/RE (9.6M), supervised IE (NER, RE, EE, ABSA), and CLS (sentiment, NLI, match, MRC). English pre-training draws from OntoNotes, NYT, SciERC, SQuAD, HellaSwag, HyperRED, and COQE.
Downstream tasks include:
- Chinese IE: CMeEE-NER, Youku (NER); ACE05, CoNLL04, NYT, SciERC, CoAE2016 (RE); ACE05, CASIE, CCKS (EE); pCLUE, CMRC2018 (MRC IE); 14-res, 15-res, 16-res (ABSA); HyperRED (quadruple); Camera-COQE (quintuple).
- Chinese CLS: Toutiao (general), NLPCC14-SC (sentiment), AFQMC (match), OCNLI (NLI), C³ (MRC).
- English IE: ACE04, ACE05-Ent/Rel, CoNLL03, CoNLL04, NYT, SciERC, ACE05-Evt, CASIE, 14-res, 15-res, 16-res, HyperRED, Camera-COQE.
- Multi-modal NLU: PPN benchmark (20 document types), evaluated with Entity Strict F1.
Standard metrics include various strict F1 scores (Entity, Relation, Triplet, Quadruple, Universal), Trigger, Argument, and Sentiment F1.
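As a reminder of what "strict" means for these metrics, the sketch below computes a micro-averaged strict F1 in which a prediction counts only if it matches a gold item exactly. Representing an entity as a `(start, end, type)` tuple is an assumption made for illustration; the benchmark scripts may use a different representation.

```python
from typing import Hashable, Iterable


def strict_f1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """F1 over exact matches: a partial span or a wrong type earns no credit."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical entities as (start, end, type); the second span is off by one.
gold_entities = [(0, 2, "person"), (5, 6, "organization")]
pred_entities = [(0, 2, "person"), (5, 7, "organization")]
score = strict_f1(pred_entities, gold_entities)
print(score)
```

The same function applies to relation, triplet, or quadruple scoring if each item is a tuple containing all of its spans and types, since any mismatch in one element makes the whole tuple a miss.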
5. Quantitative Results and Performance Analysis
Summary of Key Benchmark Results
| Model | IE Avg | CLS Avg | All Avg | Modality | Entity F1 |
|---|---|---|---|---|---|
| PromptCLUE (mT5-B) | 50.92 | 76.23 | 63.85 | text | — |
| mT5-ZSAC | 60.42 | 76.78 | 68.22 | text | — |
| SiameseUniNLU (RoB) | 60.42 | 76.01 | 68.22 | text | — |
| RexUniNLU-Base | 68.64 | 80.97 | 74.81 | text | 34.83 |
| RexUniNLU-Large | 69.24 | 81.65 | 75.45 | text | 40.96 |
| MRexUniNLU | — | — | — | text+layout+image | 66.84 |
RexUniNLU demonstrates:
- Full-shot gains: +8–10 points over previous unified models across 12 tasks.
- Few/Zero-shot: Gains of up to +42 points on IE+MRC in the zero-shot setting; for example, RexUniNLU-Large scores 63.37 zero-shot vs. 49.07 (SiameseUniNLU) and 38.74 (mT5).
- Complex Schemas: +8 points over T5-UIE on quintuples (Camera-COQE); +1–2 points from additional pre-training on event extraction.
- Few-shot (English): 1-shot F1 on CoNLL03: 89.07 (RexUIE-EN) vs. 79.65 (USM).
- Zero-shot comparison: CoNLL++ (NER): 76.77 (RexUIE-EN) vs. 58.40 (ChatGPT).
In multi-modal (text+layout+image) NLU, MRexUniNLU achieves 66.84 Entity F1 on PPN, outperforming both the RexUniNLU-human-text and the layout-only variants.
Ablation analysis shows performance drops when removing Prompts Isolation (−0.52), RoPE (−0.82), or both (−1.92), which supports the architectural choices. F1 gains correlate positively with schema complexity, with the relative gain characterized in terms of the number of schema leaf types and the training size (Liu et al., 2024).
6. Strengths, Limitations, and Directions for Further Research
Strengths:
- Unified encoder-only framework supporting all main IE and CLS schema types, multi-modal, and multi-language tasks.
- Explicit Schema Instructor enforces type constraints and mitigates incorrect extraction, critical in low-data and complex schemas.
- Recursive decoding accommodates arbitrary schema arity without the computational cost of generative approaches.
Limitations:
- High pre-training cost due to reliance on large IE/MRC corpora; lighter pre-training or adapter modules could improve efficiency.
- Inference currently requires enumeration over all schema paths, limiting efficiency in rare-type queries; dynamic pruning or learned schema selection is a prospective improvement.
- Modalities beyond text, layout, and image (e.g. audio, video), wider language coverage, and open-schema IE remain open challenges.
- Incorporation of continual schema learning for evolving ontologies is an ongoing direction.
7. Significance and Outlook
RexUniNLU introduces a principled, scalable method for universal NLU, bridging longstanding divides between information extraction and classification. Its recursive, schema-constrained inference and generalization to complex and multimodal schemas provide a foundation for robust universal NLU, with performance validated under full-shot, few-shot, zero-shot, and multi-modal regimes on numerous benchmarks in Chinese and English (Liu et al., 2024).