InstructIE: A Bilingual Instruction-based Information Extraction Dataset
The paper introduces InstructIE, a bilingual information extraction (IE) dataset built around an instruction-based paradigm. It targets two weaknesses of traditional IE methodologies: rigid label sets and redundant labels. By letting the label structure vary with the instruction, the dataset is intended to make LLM-driven IE systems more adaptable.
Key Contributions
- Introduction of InstructIE Dataset: The authors present InstructIE, a novel dataset comprising 371,700 instances across Chinese and English, organized around 12 thematic categories. This dataset remedies the issue of redundant labels found in existing IE datasets by aligning with theme-specific schemas, which simplifies the label space and focuses on relevant extraction tasks only.
- KG2Instruction Framework: Alongside the dataset, the paper introduces KG2Instruction, a framework for automatically constructing theme-centric IE instruction datasets. It draws on existing knowledge graphs and corpora to generate the data and allows label granularity to be adjusted, which addresses the limitations of static label sets.
- Experimental Evaluation: The paper explores the performance of various LLMs, including open-source models like Baichuan2, LLaMA, and mT5, as well as proprietary models like ChatGPT, on the InstructIE dataset. These evaluations occur under different learning paradigms, including zero-shot, in-context learning, and fine-tuning. The results demonstrate that fine-tuning LLMs on the InstructIE dataset enhances performance in instruction-based IE tasks, though challenges remain in achieving precise entity boundary identification and avoiding the generation of spurious relations.
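To make the idea of a theme-specific schema with adjustable label granularity more concrete, the following is a minimal sketch of how one instruction-style IE prompt might be assembled. It is not the InstructIE data format or the KG2Instruction implementation: the `THEME_SCHEMAS` table, the relation labels, the `build_instruction` helper, and the JSON field names are all assumptions chosen for illustration.

```python
import json

# Hypothetical theme-specific schemas. The themes and relation labels below are
# illustrative placeholders, not the actual InstructIE label set.
THEME_SCHEMAS = {
    "person": ["birth place", "occupation", "spouse"],
    "building": ["located in", "architect", "opening date"],
}


def build_instruction(theme, text, labels=None):
    """Build one instruction-style IE prompt as a JSON string.

    `labels` optionally narrows the theme schema, mimicking adjustable
    label granularity; labels outside the theme schema are ignored.
    """
    schema = THEME_SCHEMAS[theme]
    if labels is not None:
        schema = [label for label in schema if label in labels]
    prompt = {
        "instruction": (
            "Extract (head, relation, tail) triples from the input, "
            "using only the relations listed in the schema."
        ),
        "schema": schema,
        "input": text,
    }
    # ensure_ascii=False keeps non-ASCII (e.g. Chinese) text readable.
    return json.dumps(prompt, ensure_ascii=False)


example = build_instruction(
    "person",
    "Marie Curie was born in Warsaw.",
    labels=["birth place", "architect"],  # "architect" is not a person label
)
print(example)
```

Because the schema is restricted to the labels of one theme, the model is never asked about relations that cannot apply to the input, which is the redundancy the theme-specific design is meant to avoid.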
Numerical Results and Model Performance
Fine-tuning LLMs on the InstructIE dataset yields a substantial improvement over the zero-shot and few-shot settings. For instance, fine-tuned Baichuan2-13B-Base attained an F1 score of 50.08% on the Chinese dataset, the best result among the models and methods compared. While these results affirm the potential of instruction-based IE, they also show where further optimization is needed, notably in handling entity misalignment and reducing erroneous relation predictions.
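The two error types mentioned above affect F1 in different ways, which a small worked example can illustrate. The sketch below assumes exact-match scoring over (head, relation, tail) triples; the paper's actual evaluation protocol may differ, and the `triple_f1` helper and the sample triples are hypothetical.

```python
def triple_f1(predicted, gold):
    """Return (precision, recall, f1) under exact triple matching.

    Exact matching is an assumption made for this sketch.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [
    ("Marie Curie", "birth place", "Warsaw"),
    ("Marie Curie", "occupation", "physicist"),
]
predicted = [
    ("Marie Curie", "birth place", "Warsaw"),  # correct
    ("Curie", "occupation", "physicist"),      # entity boundary error
    ("Marie Curie", "spouse", "Warsaw"),       # spurious relation
]

precision, recall, f1 = triple_f1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this scoring, the boundary error counts against both precision and recall, because the truncated head entity matches no gold triple and leaves the gold triple unrecovered. The spurious relation lowers precision only.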
Implications and Future Directions
The implications of the InstructIE dataset and the KG2Instruction framework span both practical and theoretical aspects of IE. Practically, they enable the development of more adaptable and scalable IE systems that can swiftly align with evolving user requirements. Theoretically, the framework challenges the static conception of label sets, paving the way for a more fluid understanding of IE tasks where LLMs are guided by dynamically structured instructions.
The paper highlights the potential for continued development in this domain. Future research could focus on improving models' ability to delineate entity boundaries and on raising relation extraction accuracy. Broader thematic coverage and further study of the cross-lingual capabilities of instruction-based IE systems could also validate and extend the proposed framework.
In conclusion, the paper contributes a bilingual dataset and an automated construction framework that address some of the persistent limitations of traditional IE systems, and it provides a foundation for further work on instruction-driven information extraction.