Fine-tuning and aligning question answering models for complex information extraction tasks (2309.14805v1)

Published 26 Sep 2023 in cs.CL and cs.AI

Abstract: The emergence of LLMs has boosted performance and possibilities in various NLP tasks. While the usage of generative AI models like ChatGPT opens up new opportunities for several business use cases, their current tendency to hallucinate fake content strongly limits their applicability to document analysis, such as information retrieval from documents. In contrast, extractive LLMs like question answering (QA) or passage retrieval models guarantee that query results are found within the boundaries of the corresponding context document, which makes them candidates for more reliable information extraction in productive company environments. In this work we propose an approach that integrates extractive QA models into a document analysis solution for improved feature extraction from German business documents such as insurance reports or medical leaflets. We further show that fine-tuning existing German QA models boosts performance for tailored extraction tasks of complex linguistic features like damage cause explanations or descriptions of medication appearance, even when using only a small set of annotated data. Finally, we discuss the relevance of scoring metrics for evaluating information extraction tasks and deduce a combined metric from Levenshtein distance, F1-Score, Exact Match and ROUGE-L to mimic the assessment criteria of human experts.

The paper introduces a method for improved feature extraction from German business documents using extractive question answering (QA) models integrated into an information retrieval system (IRS). The work addresses the challenge of automating feature extraction from unstructured text data in business processes, such as customer service, insurance claims assessment, and medical literature review. The authors highlight the limitations of generative LLMs due to their tendency to hallucinate, making them less reliable for document analysis than extractive QA models. The paper focuses on fine-tuning existing German QA models to enhance their performance in extracting complex linguistic features from domain-specific documents.

The authors investigate the use of QA models for extracting complex information from textual documents in specific industrial use cases, the influence of fine-tuning QA models on performance across different domains and textual features, and the appropriateness of metrics for automated performance evaluations that resemble human expert examination.

The paper classifies the difficulty levels of feature extraction into three categories: Simple (e.g., IBAN, email addresses), Dynamic (e.g., named entities), and Complex (e.g., cause of an event). QA models are particularly suitable for extracting complex features that are difficult to define with rule-based approaches.
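
To make the distinction concrete, the toy snippet below contrasts a simple feature (an e-mail address, captured by a fixed regular expression) with a complex feature (the damage cause, which has no fixed surface form and is therefore delegated to an extractive QA model). The report text, pattern, and question are illustrative assumptions, not the authors' production rules.

```python
import re

# Toy damage report (illustrative only).
report = "Schaden gemeldet von max@example.com: Rohrbruch nach Frost im Keller."

# Simple feature: an e-mail address follows a fixed pattern and can be
# extracted with a rule.
email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", report).group(0)

# Complex feature: the damage cause has no fixed surface form, so an
# extractive QA model is queried instead (see the pipeline sketch below).
question = "Was war die Ursache des Schadens?"

print(email, "|", question)
```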

The proposed information extraction pipeline includes several steps (a minimal sketch of the final QA and validation steps follows the list):

  1. Converting scanned text documents to raw image data.
  2. Detecting and classifying text blocks using region detection algorithms and OCR (optical character recognition).
  3. Saving the results as an extended document model containing region, text content, and coordinates.
  4. Restricting the search scope to relevant text regions using a rule-based approach.
  5. Querying the QA model with the extracted text and a suitable question.
  6. Performing a rule-based validation of the model answer.
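
A minimal sketch of steps 5 and 6 is shown below, assuming the publicly available deepset/gelectra-large-germanquad checkpoint from the Hugging Face Hub and the transformers question-answering pipeline; the context text, question, and validation rule are illustrative placeholders rather than the authors' production logic.

```python
from transformers import pipeline

# Step 5: query the extractive QA model with the text of the relevant region.
qa = pipeline("question-answering", model="deepset/gelectra-large-germanquad")

context = "Der Schaden entstand durch einen Rohrbruch infolge von Frost."  # toy region text
result = qa(question="Was war die Ursache des Schadens?", context=context)

# Step 6: simple rule-based validation of the model answer (placeholder rule).
answer = result["answer"]
if result["score"] > 0.3 and answer.strip() and answer in context:
    print("accepted:", answer)
else:
    print("rejected: low confidence or answer not grounded in the context")
```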

For domain-specific QA fine-tuning, the authors used the model gelectra-large-germanquad, which was pre-trained on GermanQuAD. They constructed two distinct datasets: a drug leaflet dataset consisting of 170 medication leaflet documents with three QA pairs per document (Ingredient, Look, Application) and an elemental damage report dataset consisting of 47 elemental damage reports with two QA pairs per document (Damage Cause, Assessor Name). The documents were annotated using the QA annotation tool Haystack.
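
For illustration, a single annotated training example exported in SQuAD-style format might look roughly like the simplified sketch below; the context, question wording, and character offset are hypothetical placeholders, not entries from the authors' datasets.

```python
# Simplified SQuAD-style record for one "Look" question of the leaflet dataset
# (hypothetical example; real exports wrap such records in a "data"/"paragraphs"
# structure).
leaflet_example = {
    "context": "Die Tabletten sind rund, weiß und mit der Prägung 'AB12' versehen.",
    "qas": [
        {
            "id": "leaflet-001-look",
            "question": "Wie sehen die Tabletten aus?",  # "Look" feature
            "answers": [
                {
                    "text": "rund, weiß und mit der Prägung 'AB12' versehen",
                    "answer_start": 19,  # character offset of the answer span
                }
            ],
        }
    ],
}
```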

The fine-tuning process involved 5-fold cross-validation with 80%/20% train/test splits to find the optimal hyperparameter settings, including epoch number, batch size, learning rate, and doc stride. Model performance was compared before and after fine-tuning using automatically computable metrics.
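
The sketch below illustrates such a 5-fold cross-validated grid search; the candidate hyperparameter values, the document list, and the train_and_score helper are hypothetical stand-ins for the actual fine-tuning and evaluation code.

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid

documents = [f"doc_{i}" for i in range(47)]  # e.g. the 47 damage reports

# Candidate hyperparameter values (assumed for illustration).
param_grid = ParameterGrid({
    "n_epochs": [2, 4],
    "batch_size": [8, 16],
    "learning_rate": [1e-5, 3e-5],
    "doc_stride": [64, 128],
})


def train_and_score(train_docs, test_docs, params):
    """Stand-in for fine-tuning gelectra-large-germanquad on train_docs with
    the given hyperparameters and scoring the result on test_docs."""
    return np.random.rand()  # placeholder score


best_params, best_score = None, -np.inf
for params in param_grid:
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # 80%/20% splits
    fold_scores = [
        train_and_score([documents[i] for i in train_idx],
                        [documents[i] for i in test_idx], params)
        for train_idx, test_idx in kfold.split(documents)
    ]
    if np.mean(fold_scores) > best_score:
        best_params, best_score = params, float(np.mean(fold_scores))

print("best hyperparameters:", best_params)
```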

The evaluation metrics used in the paper include the following (a Python sketch of the automated metrics appears after the list):

  • Manual Expert Assessment: A ground-truth baseline where model answers are manually evaluated by experts.
  • EM (Exact Match): Measures the exact agreement of the model output with regard to the labeled answer(s) on a character basis.

    $\mathcal{L}_{EM}^{(k)} = \min\bigg\{1,\; \sum_{i=1}^{N_k} \mathbb{1}\Big( \hat{y}^{(k)} = y_i \Big)\bigg\}$

    where:

    • $\mathcal{L}_{EM}^{(k)}$ is the exact match score for question $k$
    • $N_k$ is the number of annotators for question $k$
    • $\hat{y}^{(k)}$ is the model's response to question $k$
    • $y_i$ is the labeled answer from annotator $i$ for question $k$
    • $\mathbb{1}$ is the indicator function
  • Levenshtein Distance: Measures the number of character-level operations (insertion, deletion, substitution) needed to transform one string into the other.
  • F1-score: Computes the word-wise harmonic mean of precision and recall.

    $\mathcal{L}_{F1}^{(k)} = \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{2}{\frac{|\mathcal{S}_{\hat{y}^{(k)}}|}{|\mathcal{S}_{y_i}^{(k)} \cap \mathcal{S}_{\hat{y}^{(k)}}|} + \frac{|\mathcal{S}_{y_i}^{(k)}|}{|\mathcal{S}_{y_i}^{(k)} \cap \mathcal{S}_{\hat{y}^{(k)}}|}}$

    where:

    • $\mathcal{L}_{F1}^{(k)}$ is the F1-score for question $k$
    • $N_k$ is the number of annotators for question $k$
    • $\mathcal{S}_{\hat{y}^{(k)}}$ is the set of distinct words in the model prediction for question $k$
    • $\mathcal{S}_{y_i}^{(k)}$ is the word set of the labeled answer from annotator $i$
    • $|\mathcal{S}_{\star}|$ is the set size, i.e. the number of unique words in the set
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)-L: Scores the longest common subsequence of the two given token sequences.
  • Weighted Average: A weighted average of the automated metrics as a single score.

    $\mathcal{L}_{WA} = \dfrac{\sum_{l \in C} w_l \, \mathcal{L}_l}{\sum_{l \in C} w_l}$

    where:

    • $\mathcal{L}_{WA}$ is the weighted average score
    • $C$ is the set of metrics $\{\text{EM}, \text{Lev}, \text{F1}, \text{RGE}\}$
    • $w_l$ is the weight of metric $l$
    • $\mathcal{L}_l$ is the value of metric $l$
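
A minimal Python sketch of the automated metrics and the combined weighted average is given below. Normalisation details, the aggregation over multiple annotated answers, and the example weights are assumptions made for illustration and may differ from the authors' exact implementation.

```python
from difflib import SequenceMatcher


def exact_match(pred, labels):
    """EM: 1.0 if the prediction matches any labeled answer exactly, else 0.0."""
    return float(any(pred == y for y in labels))


def levenshtein_similarity(pred, label):
    """Character-level similarity in [0, 1]; SequenceMatcher.ratio() serves as
    a stand-in for a normalised Levenshtein ratio."""
    return SequenceMatcher(None, pred, label).ratio()


def token_f1(pred, label):
    """Word-wise harmonic mean of precision and recall over distinct tokens."""
    pred_set, label_set = set(pred.split()), set(label.split())
    overlap = len(pred_set & label_set)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_set), overlap / len(label_set)
    return 2 * precision * recall / (precision + recall)


def rouge_l(pred, label):
    """ROUGE-L F-measure based on the longest common subsequence of tokens."""
    a, b = pred.split(), label.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]  # LCS dynamic program
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


def weighted_average(scores, weights):
    """Combined score: sum_l(w_l * L_l) / sum_l(w_l)."""
    return sum(weights[m] * scores[m] for m in scores) / sum(weights[m] for m in scores)


# Toy evaluation of one model answer against two annotated answers.
pred = "Rohrbruch infolge von Frost"
labels = ["Rohrbruch durch Frost", "ein Rohrbruch infolge von Frost"]
scores = {
    "EM": exact_match(pred, labels),
    "Lev": sum(levenshtein_similarity(pred, y) for y in labels) / len(labels),
    "F1": sum(token_f1(pred, y) for y in labels) / len(labels),
    "RGE": sum(rouge_l(pred, y) for y in labels) / len(labels),
}
# Example weights: EM excluded, the remaining metrics weighted equally.
print(scores, weighted_average(scores, weights={"EM": 0.0, "Lev": 1.0, "F1": 1.0, "RGE": 1.0}))
```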

The experimental setup involved training two final models, one for the leaflet document use case and one for the damage report use case, using 80% of the data for training and 20% for testing. Model performances were compared before and after the training process.

The results indicate a notable increase in model performance on the specific tasks, with varying degrees of improvement across datasets and questions. The fine-tuned models showed significant improvements in extracting features like "Look" from the leaflet dataset and "Assessor Name" from the damage report dataset. The F1 and ROUGE-L metrics showed similar trends, while the EM (Exact Match) metric provided little insight into how useful human assessors found the results. The weighted average score, combining Levenshtein, ROUGE-L, and F1, closely approximated the manual human expert evaluation.

The authors trained a linear model to estimate importance coefficients for the individual metrics such that their combination resembles the manual expert assessment score. The model reconstructed the human scoring with 93.87% accuracy. However, this approach did not generalize across datasets and tasks from different domains.
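
A hedged sketch of such a fit is shown below, mapping automated metric values to the manual expert score with an ordinary linear regression; the score arrays are hypothetical placeholders, not the authors' evaluation data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per evaluated answer: [EM, Levenshtein, F1, ROUGE-L] scores (toy data).
metric_scores = np.array([
    [1.0, 1.00, 1.00, 1.00],
    [0.0, 0.92, 0.86, 0.86],
    [0.0, 0.78, 0.67, 0.62],
    [0.0, 0.55, 0.40, 0.35],
    [0.0, 0.30, 0.10, 0.08],
])
expert_scores = np.array([1.0, 0.9, 0.8, 0.4, 0.1])  # manual expert assessment per answer

model = LinearRegression().fit(metric_scores, expert_scores)
print("learned metric weights:", model.coef_)
print("fit on the toy data (R^2):", model.score(metric_scores, expert_scores))
```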

In conclusion, the paper demonstrates that applying extractive QA models for complex feature extraction in industrial use cases leads to good performance. Fine-tuning these QA models significantly improves performance and supports document analysis automation. A weighted average of Levenshtein, ROUGE-L, and F1 effectively approximates manual human expert evaluation.

Future work will focus on further fine-tuning QA models, optimizing prompts, experimenting with multiple-choice questions, improving page segmentation and region detection, applying rule-based post-validation strategies, and investigating multi-modal QA models. The authors also plan to integrate the best results and models into their industrial platform solution Aikido.

Authors (5)
  1. Matthias Engelbach
  2. Dennis Klau
  3. Felix Scheerer
  4. Jens Drawehn
  5. Maximilien Kintz