InstructIE: A Bilingual Instruction-based Information Extraction Dataset (2305.11527v4)

Published 19 May 2023 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: LLMs can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE). Recent works indicate that the main reason lies in the lack of extensive data on IE instructions. Note that the existing datasets on IE instructions not only have limited coverage but also involve high construction costs. To address this issue, we introduce InstructIE, a bilingual instruction-based IE dataset, which covers 12 diverse domains. We propose KG2Instruction, a framework specifically for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that LLMs trained with InstructIE can not only obtain better IE capabilities but also enhance zero-shot performance compared with baselines.

The paper introduces InstructIE, a bilingual (Chinese and English) information extraction (IE) dataset built around an instruction-based paradigm. It targets two weaknesses of traditional IE datasets, rigid label sets and redundant labels, by letting the label structure vary with the instruction given to an LLM.

Key Contributions

  1. Introduction of InstructIE Dataset: The authors present InstructIE, a dataset of 371,700 instances in Chinese and English, organized into 12 thematic categories. Each instance is aligned with a theme-specific schema, which removes the redundant labels found in existing IE datasets and restricts the label space to what is relevant for that theme.
  2. KG2Instruction Framework: The paper also introduces KG2Instruction, a framework for automatically constructing theme-centric IE instruction datasets. It draws on existing knowledge graphs and corpora to generate the data, and it allows label granularity to be adjusted instead of being fixed in a static label set.
  3. Experimental Evaluation: The paper evaluates several LLMs on InstructIE, including open-source models such as Baichuan2, LLaMA, and mT5, and proprietary models such as ChatGPT. The models are tested under zero-shot, in-context learning, and fine-tuning settings. Fine-tuning on InstructIE improves performance on instruction-based IE tasks, though models still struggle to identify entity boundaries precisely and to avoid generating spurious relations.
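
To make the idea of pairing a knowledge graph with a theme-specific schema more concrete, here is a minimal sketch of how an instruction instance could be assembled. It is not the authors' KG2Instruction pipeline: the toy knowledge graph, the `TOPIC_SCHEMAS` mapping, the `build_instance` helper, the exact-substring matching, and the instruction wording are all assumptions made for illustration.

```python
import json

# Hypothetical toy knowledge graph: (head, relation, tail) triples.
KG = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "spouse", "Pierre Curie"),
]

# Hypothetical theme schema: only these relations are kept for the theme.
TOPIC_SCHEMAS = {"Person": ["place of birth", "occupation"]}


def build_instance(text, topic, kg=KG, schemas=TOPIC_SCHEMAS):
    """Pair a text with the KG triples it mentions, limited to the theme schema."""
    schema = schemas[topic]
    # Keep a triple only if its relation is in the schema and both
    # entities appear verbatim in the text (a crude matching heuristic).
    triples = [
        {"head": h, "relation": r, "tail": t}
        for h, r, t in kg
        if r in schema and h in text and t in text
    ]
    instruction = (
        "Extract the relations listed in the schema from the input text. "
        f"Schema: {json.dumps(schema)}"
    )
    return {"instruction": instruction, "input": text, "output": triples}


example = build_instance(
    "Marie Curie was a physicist born in Warsaw; she married Pierre Curie.",
    "Person",
)
print(json.dumps(example, indent=2))
```

In this sketch the "spouse" triple is dropped even though both entities occur in the text, because the relation is outside the theme schema. That mirrors, in a simplified way, how a theme-specific schema narrows the label space.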

Numerical Results and Model Performance

Fine-tuning LLMs on InstructIE yields a substantial improvement over the zero-shot and few-shot settings. For instance, fine-tuned Baichuan2-13B-Base attained an F1 score of 50.08% on the Chinese dataset, outperforming the other models and methods. These results support the potential of instruction-based IE, but they also show where further work is needed: handling entity misalignment and reducing erroneous relation predictions.
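
As a rough illustration of how boundary errors and spurious relations lower an F1 score, the following sketch computes exact-match F1 over extracted triples. The exact-match criterion and the `triple_f1` helper are assumptions for illustration; the paper's own scoring procedure may differ.

```python
def triple_f1(predicted, gold):
    """Exact-match F1 between predicted and gold (head, relation, tail) triples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Marie Curie", "occupation", "physicist"),
]
predicted = [
    ("Marie Curie", "place of birth", "Warsaw"),  # correct
    ("Marie", "occupation", "physicist"),         # entity boundary error
]

score = triple_f1(predicted, gold)
print(f"F1 = {score:.2f}")
```

Under exact matching, the truncated head entity counts both as a false positive and as a missed gold triple, so a single boundary error reduces precision and recall at the same time.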

Implications and Future Directions

The implications of the InstructIE dataset and the KG2Instruction framework span both practical and theoretical aspects of IE. Practically, they enable the development of more adaptable and scalable IE systems that can swiftly align with evolving user requirements. Theoretically, the framework challenges the static conception of label sets, paving the way for a more fluid understanding of IE tasks where LLMs are guided by dynamically structured instructions.

The paper also points to directions for further work. Future research could focus on improving models' ability to delineate entity boundaries and on raising relation extraction accuracy. Broader thematic coverage and further study of the cross-lingual capabilities of instruction-based IE systems could also validate and extend the proposed framework.

In conclusion, the paper contributes a bilingual dataset and an automatic construction framework that address some of the persistent limitations of traditional IE systems, and it provides a foundation for further work on instruction-driven information extraction.

Authors (9)
  1. Honghao Gui (8 papers)
  2. Jintian Zhang (11 papers)
  3. Hongbin Ye (16 papers)
  4. Ningyu Zhang (148 papers)
  5. Shuofei Qiao (19 papers)
  6. Mengshu Sun (41 papers)
  7. Lei Liang (37 papers)
  8. Jeff Z. Pan (78 papers)
  9. Huajun Chen (198 papers)
Citations (5)