DALE: Generative Data Augmentation for Low-Resource Legal NLP (2310.15799v1)

Published 24 Oct 2023 in cs.CL and cs.AI

Abstract: We present DALE, a novel and effective generative Data Augmentation framework for low-resource LEgal NLP. DALE addresses the challenges existing frameworks pose in generating effective data augmentations of legal documents - legal language, with its specialized vocabulary and complex semantics, morphology, and syntax, does not benefit from data augmentations that merely rephrase the source sentence. To address this, DALE, built on an Encoder-Decoder LLM, is pre-trained on a novel unsupervised text denoising objective based on selective masking - our masking strategy exploits the domain-specific language characteristics of templatized legal documents to mask collocated spans of text. Denoising these spans helps DALE acquire knowledge about legal concepts, principles, and language usage. Consequently, it develops the ability to generate coherent and diverse augmentations with novel contexts. Finally, DALE performs conditional generation to generate synthetic augmentations for low-resource Legal NLP tasks. We demonstrate the effectiveness of DALE on 13 datasets spanning 6 tasks and 4 low-resource settings. DALE outperforms all our baselines, including LLMs, qualitatively and quantitatively, with improvements of 1%-50%.
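The abstract outlines DALE's core recipe: selectively mask collocated spans in templatized legal text, pre-train an encoder-decoder model to denoise those spans, then use the same model for conditional generation of augmentations. Below is a minimal sketch of that idea using Hugging Face `transformers` with BART as a stand-in encoder-decoder; the example span, model choice, and generation settings are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of DALE-style selective span masking and denoising.
# Assumption: a collocation scorer (e.g., PMI over a legal corpus) has already
# flagged which spans to mask; here the span is hard-coded for illustration.
from transformers import BartTokenizer, BartForConditionalGeneration

def mask_collocated_spans(tokens, spans, mask_token="<mask>"):
    """Replace each selected (start, end) token span with a single mask token."""
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[i:start])
        out.append(mask_token)
        i = end
    out.extend(tokens[i:])
    return out

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

sentence = "The lessee shall indemnify and hold harmless the lessor from any claims."
tokens = sentence.split()
# Suppose the scorer flagged the collocation "indemnify and hold harmless" (tokens 3-6).
noisy = " ".join(mask_collocated_spans(tokens, [(3, 7)]))

# Denoising objective: reconstruct the original sentence from the masked input.
batch = tokenizer(noisy, return_tensors="pt")
labels = tokenizer(sentence, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss

# At augmentation time, the pre-trained model conditionally generates new text
# for the masked spans, yielding diverse variants of the source document.
gen_ids = model.generate(**batch, num_beams=4, max_length=64)
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```

In the paper's setting the denoising pre-training is what teaches the model legal phrasing, so that generation over masked spans produces coherent, domain-consistent augmentations rather than simple paraphrases.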

Authors (7)
  1. Sreyan Ghosh (46 papers)
  2. Chandra Kiran Evuru (2 papers)
  3. Sonal Kumar (30 papers)
  4. S Ramaneswaran (6 papers)
  5. S Sakshi (11 papers)
  6. Utkarsh Tyagi (18 papers)
  7. Dinesh Manocha (366 papers)
Citations (9)
