
BiSECT: Learning to Split and Rephrase Sentences with Bitexts (2109.05006v1)

Published 10 Sep 2021 in cs.CL

Abstract: An important task in NLP applications such as sentence simplification is the ability to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. We introduce a novel dataset and a new model for this 'split and rephrase' task. Our BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. We obtain these by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language. BiSECT contains higher quality training examples than previous Split and Rephrase corpora, with sentence splits that require more significant modifications. We categorize examples in our corpus, and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited. Moreover, we show that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.
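
The data construction step described in the abstract (keep 1-2 sentence alignments, then translate so both sides are in the same language) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `extract_split_pairs`, the `Alignment` type, and the `translate` callable are hypothetical, and the sketch assumes the single sentence is on the foreign side and the two sentences are already in English.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation of one alignment in a bilingual corpus:
# (sentences on the foreign side, sentences on the English side).
Alignment = Tuple[List[str], List[str]]


def extract_split_pairs(
    alignments: Iterable[Alignment],
    translate: Callable[[str], str],
) -> List[Tuple[str, List[str]]]:
    """Keep 1-2 alignments and translate the single foreign sentence.

    `translate` is an assumed stand-in for any machine translation
    system that maps a foreign sentence to English.
    """
    pairs = []
    for foreign, english in alignments:
        # Only 1-2 alignments are kept: one long sentence on one side,
        # two shorter sentences on the other.
        if len(foreign) == 1 and len(english) == 2:
            long_sentence = translate(foreign[0])
            pairs.append((long_sentence, list(english)))
    return pairs
```

Each returned pair holds one long English sentence and the two shorter English sentences it aligns with, which is the shape of training example the abstract describes. Any filtering or quality control beyond the alignment count is outside the scope of this sketch.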

Authors (5)
  1. Joongwon Kim
  2. Mounica Maddela
  3. Reno Kriz
  4. Wei Xu
  5. Chris Callison-Burch
Citations (23)
