
RUIE: Retrieval-based Unified Information Extraction using Large Language Model

Published 18 Sep 2024 in cs.CL (arXiv:2409.11673v2)

Abstract: Unified information extraction (UIE) aims to extract diverse structured information from unstructured text. While LLMs have shown promise for UIE, they require significant computational resources and often struggle to generalize to unseen tasks. We propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning for efficient task generalization. RUIE introduces a novel demonstration selection mechanism combining LLM preferences with a keyword-enhanced reward model, and employs a bi-encoder retriever trained through contrastive learning and knowledge distillation. As the first trainable retrieval framework for UIE, RUIE serves as a universal plugin for various LLMs. Experimental results on eight held-out datasets demonstrate RUIE's effectiveness, with average F1-score improvements of 19.22 and 3.22 compared to instruction-tuning methods and other retrievers, respectively.

Summary

  • The paper introduces RUIE, a novel retrieval-based framework that improves LLM-based unified information extraction through effective in-context learning, addressing limitations of instruction tuning.
  • RUIE significantly outperforms instruction-tuning and general retrieval methods on unseen tasks, achieving substantial performance improvements across NER, RE, and Event Extraction tasks.
  • Ablation studies demonstrate the critical importance of RUIE's trainable retrieval mechanism, keyword enhancement, and reward model for selecting high-quality demonstrations and achieving robust performance.

The paper introduces RUIE (Retrieval-based Unified Information Extraction), a novel framework for unified information extraction (UIE) that leverages in-context learning with LLMs. RUIE addresses the limitations of instruction-tuning methods, such as high computational costs and poor generalization to unseen tasks, by using a retrieval-based approach to select the most beneficial demonstrations for LLMs.

The key components and contributions of RUIE are:

  • A trainable retrieval framework for UIE, which, to the best of the authors' knowledge, is the first of its kind. It uses in-context learning to reduce computational costs and enable rapid generalization.
  • A demonstration selection mechanism that incorporates LLM preferences for ranking candidates.
  • A keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations.

The authors formally define the information extraction tasks addressed, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). For a given sentence $x$, NER extracts tuples $\{s, e\}$, where $s$ is the entity span and $e$ is the entity type. RE extracts triples $\{e_s, e_t, r\}$, where $r$ is the relation type and $e_s$ and $e_t$ are the head and tail entities. EE includes Event Detection (ED), which extracts event triggers $t \in \mathcal{O}$, where $\mathcal{O}$ is the event type ontology, and Event Argument Extraction (EAE), which extracts arguments $a \in \mathcal{R}$ for a given event trigger $t$, with $\mathcal{R}$ being the role type ontology.
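To make the four task definitions concrete, the snippet below shows what each task's structured output might look like for one toy sentence. The exact linearization format is an assumption for illustration, not the paper's own serialization.

```python
# Illustrative structured outputs for each IE task on one toy sentence.
# The tuple/dict formats here are assumptions, not the paper's exact linearization.

sentence = "Steve Jobs founded Apple in Cupertino."

# NER: tuples (entity span s, entity type e)
ner_output = [("Steve Jobs", "person"),
              ("Apple", "organization"),
              ("Cupertino", "location")]

# RE: triples (head entity e_s, tail entity e_t, relation type r)
re_output = [("Steve Jobs", "Apple", "founded")]

# ED: event triggers t drawn from the event type ontology O
ed_output = [("founded", "Business:Start-Org")]

# EAE: arguments a with role types from R, for a given trigger t
eae_output = {"trigger": "founded",
              "arguments": [("Steve Jobs", "Agent"), ("Apple", "Org")]}
```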

The RUIE framework operates as follows:

  1. Initialization of Candidates: A sparse retriever, BM25, is used to retrieve an initial set of candidate demonstrations from a candidate pool $P$.
  2. LLM Preference Scoring: The extraction instruction $I$, a candidate example $e_i = (x_i, y_i)$, and an input $s = (x, y)$ are concatenated and fed into the LLM. The LLM assigns a score to each candidate example based on the token-level average log-likelihood of generating the ground-truth output $y$:
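The candidate initialization step can be sketched with a plain-Python BM25 scorer over a toy pool (the paper uses an off-the-shelf BM25 retriever; this minimal implementation and the `k1`/`b` defaults are illustrative assumptions):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """BM25 score of the query against every document in the pool."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter()                      # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf, dl, s = Counter(doc), len(doc), 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

# Toy candidate pool P and query
pool = ["Barack Obama visited Paris",
        "The cat sat on the mat",
        "Obama met the president of France in Paris"]
tokenized = [d.lower().split() for d in pool]
scores = bm25_scores("obama in paris".split(), tokenized)
ranked = sorted(range(len(pool)), key=lambda i: -scores[i])  # candidate order
```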

    $$\mathrm{Score}(s, e_i) = \log p(y \mid I; x_i; y_i; x)$$

  • $s$: input in the training set, $s = (x, y)$
  • $e_i$: example in the candidate pool $P$, $e_i = (x_i, y_i)$
  • $y$: ground-truth output
  • $I$: extraction instruction
  • $x_i$: input context of example $e_i$
  • $y_i$: structured output of example $e_i$, linearized into natural language
  • $x$: input context of input $s$

The top $k$ and last $n$ examples are then selected as positive and negative examples, respectively.
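The scoring step above can be sketched as follows. A real system would query an LLM for per-token log-probabilities of $y$; here a toy stand-in function (`toy_logprobs`, an assumption for illustration) plays that role so the ranking logic is runnable:

```python
import math

def token_avg_loglikelihood(token_logprobs):
    """Token-level average log-likelihood of the ground-truth output y."""
    return sum(token_logprobs) / len(token_logprobs)

def score_candidates(llm_logprobs_fn, instruction, query_x, gold_y, candidates):
    """Rank candidate demonstrations e_i = (x_i, y_i) by how much they help
    the LLM generate gold_y: higher average log p(y | I; x_i; y_i; x) is better."""
    scored = []
    for (x_i, y_i) in candidates:
        prompt = f"{instruction}\n{x_i}\n{y_i}\n{query_x}"
        logprobs = llm_logprobs_fn(prompt, gold_y)   # per-token log p of y
        scored.append(((x_i, y_i), token_avg_loglikelihood(logprobs)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Toy stand-in for an LLM: pretends demonstrations whose text overlaps
# with the gold output make that output more likely.
def toy_logprobs(prompt, y):
    overlap = len(set(prompt.lower().split()) & set(y.lower().split()))
    return [math.log(0.1 + 0.1 * overlap)] * len(y.split())

ranked = score_candidates(toy_logprobs, "Extract entities.",
                          "Paris is lovely", "Paris: location",
                          [("Berlin is big", "Berlin: location"),
                           ("The cat sat", "none")])
top_k = ranked[:1]     # positives
last_n = ranked[-1:]   # negatives
```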

  3. Keyword-enhanced Reward: Special tags `<Keyword>` and `</Keyword>` are added around the information snippets $(s, o)$ in context $x$, where $o$ is the label of span $s$. A positive example $(x'_{+}, y'_{+})$ is sampled from the top-$k$ ranked candidates, and the last $n$ examples are taken as negative examples $(x'_{-i}, y'_{-i})$. A cross-encoder is trained using the cross-entropy loss:

    $$\mathcal{L}_{reward} = -\log\frac{e^{s(x', y', x'_{+}, y'_{+})}}{e^{s(x', y', x'_{+}, y'_{+})} + \sum_{i=1}^{n} e^{s(x', y', x'_{-i}, y'_{-i})}}$$

  • $\mathcal{L}_{reward}$: cross-entropy loss
  • $x'$: keyword-enhanced input
  • $y'$: label of the enhanced input $x'$
  • $x'_{+}$, $y'_{+}$: positive example and its label
  • $x'_{-i}$, $y'_{-i}$: negative examples and their labels
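A minimal sketch of this step, covering both the keyword tagging and the one-positive-vs-$n$-negatives cross-entropy over cross-encoder scores (the scores themselves are made-up numbers; in the paper they come from the trained cross-encoder):

```python
import math

def add_keyword_tags(text, spans):
    """Wrap each labeled information snippet (s, o) in <Keyword> tags."""
    for span, _label in spans:
        text = text.replace(span, f"<Keyword>{span}</Keyword>")
    return text

def reward_loss(s_pos, s_negs):
    """Cross-entropy loss: softmax over one positive and n negative scores."""
    denom = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    return -math.log(math.exp(s_pos) / denom)

x_enhanced = add_keyword_tags("Paris is the capital of France",
                              [("Paris", "location"), ("France", "location")])
# Assumed cross-encoder scores: positive scores higher than the negatives.
loss = reward_loss(s_pos=2.0, s_negs=[0.5, -1.0])
```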

  4. UIE Retriever Training: A bi-encoder-based UIE retriever is trained using two supervision signals: the InfoNCE loss $\mathcal{L}_{contrastive}$ for contrastive learning between positives and in-batch negatives, and a KL-divergence loss $\mathcal{L}_{distill}$ that aligns the output distributions of the reward model and the retriever. The final training loss is:

    $$\mathcal{L}_{retriever} = \mathcal{L}_{distill} + \alpha\mathcal{L}_{contrastive}$$

  • $\mathcal{L}_{retriever}$: final training loss
  • $\mathcal{L}_{distill}$: KL-divergence loss
  • $\mathcal{L}_{contrastive}$: InfoNCE loss
  • $\alpha$: hyper-parameter balancing the two losses
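The combined training objective can be sketched with scalar toy scores (the similarity values and the $\alpha = 0.2$ weight are assumptions; the paper trains over batches of embeddings):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def info_nce(sim_pos, sim_all):
    """Contrastive loss: positive vs in-batch negatives (sim_all includes the positive)."""
    return -math.log(math.exp(sim_pos) / sum(math.exp(s) for s in sim_all))

def kl_divergence(p_teacher, q_student):
    """KL(teacher || student), aligning reward-model and retriever distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student))

# Toy scores over the same candidate list; index 0 is the positive.
retriever_sims = [0.9, 0.2, -0.3]   # bi-encoder similarities
reward_scores = [2.1, 0.4, -0.5]    # cross-encoder reward scores

l_contrastive = info_nce(retriever_sims[0], retriever_sims)
l_distill = kl_divergence(softmax(reward_scores), softmax(retriever_sims))
alpha = 0.2                          # assumed loss-balancing hyper-parameter
l_retriever = l_distill + alpha * l_contrastive
```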

During inference, the trained dense retriever selects the best demonstrations from the candidate pool $P$ and passes them to the LLM to produce the output.
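The inference loop reduces to: embed the query, rank the pool, and assemble the in-context prompt. The bag-of-words `embed` stand-in and the prompt template below are illustrative assumptions for the trained bi-encoder and the paper's prompt format:

```python
def embed(text):
    """Toy stand-in for the trained bi-encoder (bag-of-words over a tiny vocab)."""
    vocab = ["paris", "obama", "france", "cat", "capital"]
    toks = text.lower().split()
    return [toks.count(w) for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_and_prompt(query, pool, instruction, k=2):
    """Rank the pool by similarity to the query, then build the ICL prompt."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: dot(q, embed(ex[0])), reverse=True)
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in ranked[:k])
    return f"{instruction}\n{demos}\nInput: {query}\nOutput:"

pool = [("Paris is in France", "Paris: location; France: location"),
        ("The cat sat", "none")]
prompt = retrieve_and_prompt("Is Paris the capital of France", pool,
                             "Extract all entities.")
```

The assembled prompt is then sent to the inference LLM, which completes the final `Output:` slot.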

The experimental setup involves training RUIE on 31 held-in datasets and evaluating its generalization ability on 8 held-out datasets. The datasets used include NER datasets such as ACE2004, ACE2005, Broad Twitter, CoNLL2003, MultiNERD, Ontonotes, Polyglot-NER, tweetNER7, wikiANN, wikineural, AnatEM, bc2gm, bc4chemd, bc5cdr, FabNER, FindVehicle, GENIA, and HarveyNER, and RE datasets such as ADE corpus, CoNLL04, GIDS, kbp37, NYT, NYT11 HRL, SciERC, semeval RE, FewRel, and Wiki-ZSL. For the EE task, ACE2005, CASIE, GENIA, PHEE, CrudeOilNews, RAMS, and WikiEvents are used. Span-based Micro-F1 is used as the evaluation metric.

The baseline methods selected for comparison include traditional UIE methods, instruction-tuning-based methods (InstructUIE, YAYI-UIE, LLaMA2-IEPILE), and retrieval-based methods (BM25, E5, BGE).

The experimental results demonstrate RUIE's effectiveness in generalizing to unseen tasks: it achieves the best performance across the four information extraction tasks. Compared to supervised fine-tuning (SFT)-based methods, RUIE exhibits better generalization ability, delivering improvements of 8.91, 14.89, 27.03, and 26.05 on NER, RE, ED, and EAE, respectively, despite using a smaller model. Compared to general-purpose retrievers, RUIE retrieves higher-quality examples, achieving improvements of 5.84, 4.05, and 3.07 on NER, RE, and EE, respectively, over BM25.

Ablation studies confirm the importance of the keyword enhancement and the reward model components. Removing keyword enhancement led to a 0.62 decrease in the average F1-score across the four tasks; removing the distillation loss decreased it by 10.73, and removing the reward model decreased it by 8.87.

The effects of varying the number of k-shot demonstrations and using different scoring and inference LLMs were also investigated. The experiments show that task performance improves as k increases, but excessive examples can introduce noise. The size of the scoring LLM has a minor influence on final performance, and the base version of the LLM is more suitable as a scoring model than the instruct version. The capability of the inference LLM, by contrast, significantly influences extraction performance: for Qwen1.5, increasing model size from 7B to 14B yielded an average improvement of 9.21 across the four tasks.

The authors discuss the limitations of RUIE, including context-length constraints, a remaining performance gap relative to SFT-based methods on seen tasks, and the fact that RUIE is currently trained and tested only on English data.
