
Datasets for Portuguese Legal Semantic Textual Similarity: Comparing weak supervision and an annotation process approaches (2306.00007v1)

Published 29 May 2023 in cs.CL and cs.LG

Abstract: The Brazilian judiciary has a large workload, resulting in a long time to finish legal proceedings. Brazilian National Council of Justice has established in Resolution 469/2022 formal guidance for document and process digitalization opening up the possibility of using automatic techniques to help with everyday tasks in the legal field, particularly in a large number of texts yielded on the routine of law procedures. Notably, AI techniques allow for processing and extracting useful information from textual data, potentially speeding up the process. However, datasets from the legal domain required by several AI techniques are scarce and difficult to obtain as they need labels from experts. To address this challenge, this article contributes with four datasets from the legal domain, two with documents and metadata but unlabeled, and another two labeled with a heuristic aiming at its use in textual semantic similarity tasks. Also, to evaluate the effectiveness of the proposed heuristic label process, this article presents a small ground truth dataset generated from domain expert annotations. The analysis of ground truth labels highlights that semantic analysis of domain text can be challenging even for domain experts. Also, the comparison between ground truth and heuristic labels shows that heuristic labels are useful.


Summary

Exploring Approaches to Portuguese Legal Semantic Textual Similarity

The paper "Datasets for Portuguese Legal Semantic Textual Similarity: Comparing weak supervision and an annotation process approaches" contributes to work on automating legal data processing with AI techniques, focusing on Portuguese legal texts. The Brazilian judiciary carries a heavy workload that lengthens legal proceedings, so tools that help process the large volume of routine legal text are of practical interest.

The authors introduce four datasets for Portuguese legal semantic textual similarity (STS) tasks. Two of them, TCU Votes and STJ Judgments, contain documents and metadata obtained from the Federal Court of Accounts (TCU) and the Superior Court of Justice (STJ), respectively. These two datasets are unlabeled and can serve as a general resource for computational analysis of legal documents. The other two, TCU Votes for Textual Semantic Similarity and STJ Judgments for Textual Semantic Similarity, are derived from the first two and carry heuristic labels for the STS task.

These labeled datasets rely on a heuristic labeling process meant to address the scarcity of annotated legal texts for semantic similarity analysis. The process assigns an automatically generated similarity score to each pair of documents. The score is derived from document metadata: the heuristic sets a base value and adds noise to simulate the behavior of human annotators. This makes it a low-cost alternative to the time-consuming manual annotation that domain-specific semantic analysis usually requires.
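The idea of a metadata-derived base value plus noise can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `heuristic_similarity`, the rule that matching metadata yields a high base value, the base values themselves, the Gaussian noise, and the 0 to 5 score range.

```python
import random


def heuristic_similarity(meta_a, meta_b, rng,
                         base_same=4.0, base_diff=1.0,
                         noise_sd=0.5, score_range=(0.0, 5.0)):
    """Return a weak similarity label for a pair of documents.

    Hypothetical rule: documents whose metadata values match get a high
    base score, all other pairs get a low one. Gaussian noise is then
    added to mimic annotator variability, and the result is clipped to
    the score range. None of these constants come from the paper.
    """
    base = base_same if meta_a == meta_b else base_diff
    noisy = base + rng.gauss(0.0, noise_sd)
    low, high = score_range
    return min(high, max(low, noisy))


# Usage with made-up metadata values (illustrative only).
rng = random.Random(42)  # fixed seed so the labels are reproducible
pairs = [("subject-A", "subject-A"), ("subject-A", "subject-B")]
labels = [heuristic_similarity(a, b, rng) for a, b in pairs]
print(labels)
```

Clipping keeps every label inside the score range even when the noise is large, and passing the random generator explicitly makes the labeling reproducible.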

To gauge the reliability of the heuristic labels, the authors evaluated them against a small ground truth dataset annotated by legal experts. The comparison showed a moderate correlation between heuristic labels and expert judgments, which supports using heuristic labels in the early stages of AI model development for legal applications. The ground truth dataset also exposes the difficulty of manual annotation: the expert annotations themselves vary substantially.
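A comparison of this kind can be sketched as follows. The summary does not say which correlation coefficient the authors used or how they aggregated expert scores, so the Pearson coefficient and the per-pair mean over experts are assumptions, and the scores below are toy values invented for illustration, not results from the paper.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data (invented): one row per document pair, one column per expert.
expert_scores = [[4, 5, 4], [1, 2, 0], [3, 2, 3], [0, 1, 1]]
heuristic_labels = [4.2, 1.1, 3.4, 0.7]

# Assumed aggregation: average the experts' scores for each pair.
expert_mean = [mean(row) for row in expert_scores]
r = pearson(heuristic_labels, expert_mean)
print(round(r, 3))
```

Averaging over experts hides their disagreement, so a fuller analysis would also report inter-annotator variability alongside the correlation.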

The authors also review other datasets of Portuguese legal texts, none of which targets semantic textual similarity specifically. By providing labeled datasets for this task, the paper fills a gap in the resources available to researchers working on natural language processing for legal applications.

Looking ahead, the datasets and the heuristic method could support the development of machine learning models for legal document retrieval and analysis. The methodology might also be adapted to other languages or domains, prompting further work on automatic annotation.

In sum, the paper contributes datasets and a heuristic labeling technique for semantic textual similarity in the Portuguese legal domain, which could help make computational methods more practical for processing legal data.
