The paper introduces RUIE (Retrieval-based Unified Information Extraction), a novel framework for unified information extraction (UIE) that leverages in-context learning with LLMs. RUIE addresses the limitations of instruction-tuning methods, such as high computational costs and poor generalization to unseen tasks, by using a retrieval-based approach to select the most beneficial demonstrations for LLMs.
The key components and contributions of RUIE are:
- A trainable retrieval framework for UIE, which, to the best of the authors' knowledge, is the first of its kind. It uses in-context learning to reduce computational costs and enable rapid generalization.
- A demonstration selection mechanism that incorporates LLM preferences for ranking candidates.
- A keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations.
The authors formally define the information extraction tasks addressed, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). For a given sentence $x$, NER extracts tuples $(e, t)$, where $e$ is the entity span and $t$ is the entity type. RE extracts triples $(e_{head}, r, e_{tail})$, where $r$ is the relation type, and $e_{head}$ and $e_{tail}$ are the head and tail entities. EE includes Event Detection (ED), extracting event triggers $(e_{trig}, t)$ with $t \in \mathcal{T}$, where $\mathcal{T}$ is the event type ontology, and Event Argument Extraction (EAE), extracting arguments $(e_{arg}, r)$ for a given event trigger $e_{trig}$, with $r \in \mathcal{R}$, where $\mathcal{R}$ is the role type ontology.
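To make these definitions concrete, here is a small hypothetical example (not from the paper) of the target tuples for a single sentence, using illustrative type and role ontologies:

```python
# Hypothetical outputs (not from the paper) for one sentence.
sentence = "Barack Obama was born in Honolulu in 1961."

ner = [("Barack Obama", "person"), ("Honolulu", "location")]   # (span e, type t)
re_triples = [("Barack Obama", "born_in", "Honolulu")]         # (head, relation r, tail)
ed = [("born", "Life.Be-Born")]                                # (trigger, event type t)
eae = [("Barack Obama", "person"), ("Honolulu", "place")]      # (argument, role r) for trigger "born"
```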
The RUIE framework operates as follows:
- Initialization of Candidates: A sparse retriever, BM25, is used to retrieve an initial set of candidate demonstrations from a candidate pool $P$ (a minimal sketch follows).
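A minimal sketch of this initialization step, assuming the `rank_bm25` package; the pool contents and candidate count are illustrative, not the paper's settings:

```python
# Sketch: BM25 retrieval of initial candidate demonstrations from pool P.
from rank_bm25 import BM25Okapi

pool = [  # toy candidate pool; RUIE's pool holds real training examples
    "Steve Jobs founded Apple in Cupertino .",
    "The earthquake struck Tokyo on Monday .",
    "Marie Curie won the Nobel Prize in 1903 .",
]
bm25 = BM25Okapi([doc.lower().split() for doc in pool])

query = "Bill Gates founded Microsoft ."
candidates = bm25.get_top_n(query.lower().split(), pool, n=2)  # initial candidate set
```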
- LLM Preference Scoring: The extraction instruction $I$, a candidate example $z_i$, and an input $x$ are concatenated and fed into the LLM. The LLM assigns a score $s(z_i)$ to each candidate example based on the token-level average log-likelihood of generating the ground-truth output $y$ (see the scoring sketch after this step):

  $$s(z_i) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log p_{\mathrm{LLM}}\left(y_t \mid I, x_i, y_i, x, y_{<t}\right)$$

  * $(x, y)$: input in the training set
  * $z_i$: example in the candidate pool $P$, where $z_i = (x_i, y_i)$
  * $y$: ground-truth output
  * $I$: extraction instruction
  * $x_i$: input context to be extracted from example $z_i$
  * $y_i$: structured output of example $z_i$, linearized in natural language
  * $x$: input context to be extracted from the training input $(x, y)$
The top-$k$ and last-$n$ scored examples are then selected as positive and negative examples, respectively.
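A minimal sketch of this scoring step using Hugging Face `transformers`; the model choice, prompt layout, and string formatting are assumptions for illustration, not the paper's exact setup:

```python
# Sketch: score a candidate demonstration by the token-level average
# log-likelihood the LLM assigns to the ground-truth output y.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in scoring LLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score_candidate(instruction: str, x_i: str, y_i: str, x: str, y: str) -> float:
    prompt = f"{instruction}\n{x_i}\n{y_i}\n{x}\n"      # I, candidate (x_i, y_i), input x
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(y, return_tensors="pt").input_ids  # ground-truth output y
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = logits[:, :-1].log_softmax(-1)          # next-token log-probabilities
    token_lp = log_probs.gather(2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    y_lp = token_lp[:, prompt_ids.size(1) - 1:]         # keep only tokens of y
    return y_lp.mean().item()                           # higher = preferred demonstration
```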
- Keyword-enhanced Reward: Special tags `<Keyword>` and `</Keyword>` are added around the information snippets $(e, t)$ in context $x$, where $t$ is the label of span $e$. A positive example $(\hat{x}^+, \hat{y}^+)$ is sampled from the top-$k$ ranked candidates, and the last $n$ examples are taken as negative examples $\{(\hat{x}^-_i, \hat{y}^-_i)\}_{i=1}^{n}$. A cross-encoder is trained using the cross-entropy loss, where $R(\cdot)$ denotes the cross-encoder's relevance score (a sketch follows this step):

  $$\mathcal{L}_{reward} = -\log \frac{\exp\!\big(R(\hat{x}, \hat{y}, \hat{x}^+, \hat{y}^+)\big)}{\exp\!\big(R(\hat{x}, \hat{y}, \hat{x}^+, \hat{y}^+)\big) + \sum_{i=1}^{n} \exp\!\big(R(\hat{x}, \hat{y}, \hat{x}^-_i, \hat{y}^-_i)\big)}$$

  * $\mathcal{L}_{reward}$: cross-entropy loss
  * $\hat{x}$: enhanced input with keywords
  * $\hat{y}$: label of the enhanced input
  * $\hat{x}^+$: positive example
  * $\hat{y}^+$: label of the positive example
  * $\hat{x}^-_i$: negative examples
  * $\hat{y}^-_i$: labels of the negative examples
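A minimal sketch of the keyword tagging and the contrastive cross-entropy objective; the tag-insertion helper and the scalar scores `s_pos` and `s_negs` (as would be produced by the cross-encoder) are illustrative assumptions:

```python
# Sketch: keyword enhancement plus the reward model's training loss.
import torch
import torch.nn.functional as F

def add_keywords(context: str, spans: list[tuple[str, str]]) -> str:
    """Wrap each labeled information snippet (e, t) in <Keyword> tags."""
    for span, _label in spans:
        context = context.replace(span, f"<Keyword>{span}</Keyword>")
    return context

def reward_loss(s_pos: torch.Tensor, s_negs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over one positive score and n negative scores."""
    logits = torch.cat([s_pos.view(1), s_negs])    # positive at index 0
    target = torch.zeros(1, dtype=torch.long)      # index of the positive
    return F.cross_entropy(logits.unsqueeze(0), target)
```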
- UIE Retriever Training: A bi-encoder-based UIE retriever is trained using two supervision signals: the Info-NCE loss $\mathcal{L}_{cont}$ for contrastive learning between positives and in-batch negatives, and the KL divergence $\mathcal{L}_{distill} = \mathrm{KL}\big(p_{reward} \,\|\, p_{retriever}\big)$ to align the output distributions of the reward model and the retriever. The final training loss is (see the sketch after this step):

  $$\mathcal{L} = \mathcal{L}_{distill} + \alpha \mathcal{L}_{cont}$$

  * $\mathcal{L}$: final training loss
  * $\mathcal{L}_{distill}$: KL divergence loss
  * $\mathcal{L}_{cont}$: Info-NCE loss
  * $\alpha$: hyper-parameter balancing the importance of the two losses
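A minimal PyTorch sketch of the combined objective, assuming in-batch negatives (positives on the diagonal) and a frozen reward model supplying `reward_scores`; the temperature and `alpha` values are illustrative:

```python
# Sketch: Info-NCE over in-batch negatives + KL distillation from the reward model.
import torch
import torch.nn.functional as F

def retriever_loss(q, d, reward_scores, alpha=0.2, tau=0.05):
    """q, d: (batch, dim) L2-normalized query/demo embeddings from the bi-encoder;
    reward_scores: (batch, batch) scores from the frozen reward model."""
    sim = q @ d.t() / tau                            # retriever similarities
    labels = torch.arange(q.size(0))                 # diagonal entries are positives
    info_nce = F.cross_entropy(sim, labels)
    distill = F.kl_div(sim.log_softmax(-1),          # KL(p_reward || p_retriever)
                       reward_scores.softmax(-1), reduction="batchmean")
    return distill + alpha * info_nce
```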
During inference, the trained dense retriever selects the best demonstrations from the candidate pool and passes them to the LLM to produce the output.
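A minimal sketch of this inference step with a bi-encoder retriever; the `sentence-transformers` checkpoint is a stand-in for RUIE's trained retriever:

```python
# Sketch: select the top-k demonstrations for a query at inference time.
import numpy as np
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def select_demonstrations(query: str, pool: list[str], k: int = 8) -> list[str]:
    q = retriever.encode([query], normalize_embeddings=True)
    d = retriever.encode(pool, normalize_embeddings=True)
    top = np.argsort(-(q @ d.T).ravel())[:k]    # highest cosine similarity first
    return [pool[i] for i in top]               # prepended to the LLM prompt
```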
The experimental setup involves training RUIE on 31 held-in datasets and evaluating its generalization ability on 8 held-out datasets. The datasets include NER datasets such as ACE2004, ACE2005, Broad Twitter, CoNLL2003, MultiNERD, Ontonotes, Polyglot-NER, tweetNER7, wikiANN, wikineural, AnatEM, bc2gm, bc4chemd, bc5cdr, FabNER, FindVehicle, GENIA, and HarveyNER; RE datasets such as ADE corpus, CoNLL04, GIDS, kbp37, NYT, NYT11 HRL, SciERC, semeval RE, FewRel, and Wiki-ZSL; and, for the EE task, ACE2005, CASIE, GENIA, PHEE, CrudeOilNews, RAMS, and WikiEvents. Span-based Micro-F1 is used as the evaluation metric.
The baseline methods selected for comparison include traditional UIE methods, instruction-tuning-based methods (InstructUIE, YAYI-UIE, LLaMA2-IEPILE), and retrieval-based methods (BM25, E5, BGE).
The experimental results demonstrate RUIE's effectiveness in generalizing to unseen tasks, where it achieves the best performance across the four information extraction tasks. Compared to supervised fine-tuning (SFT)-based methods, RUIE exhibits better generalization, delivering F1 improvements of 8.91, 14.89, 27.03, and 26.05 on the NER, RE, ED, and EAE tasks, respectively, despite using a smaller model. Compared to general-purpose retrievers, RUIE retrieves higher-quality examples, achieving improvements of 5.84, 4.05, and 3.07 on the NER, RE, and EE tasks, respectively, over BM25.
Ablation studies confirm the importance of the keyword enhancement and the reward model. Removing keyword enhancement led to a 0.62 decrease in the average F1-score across the four tasks; removing the distillation loss decreased it by 10.73; and removing the reward model decreased it by 8.87.
The effects of varying the number of demonstrations ($k$-shot) and of using different scoring and inference LLMs were also investigated. Task performance improves as $k$ increases, but excessive examples can introduce noise. The size of the scoring LLM has only a minor influence on final performance, and the base version of an LLM is more suitable as the scoring model than the instruct version. In contrast, the capability of the inference LLM strongly influences extraction performance: within the Qwen1.5 family, increasing model size from 7B to 14B yielded an average improvement of 9.21 across the four tasks.
The authors discuss RUIE's limitations, including context length constraints, the remaining performance gap relative to SFT-based methods on seen tasks, and the fact that RUIE is currently trained and evaluated only on English data.