Clinical Text Deduplication Practices for Efficient Pretraining and Improved Clinical Tasks

Published 29 Sep 2023 in cs.CL, cs.AI, and cs.LG | arXiv:2312.09469v1

Abstract: Despite being a unique source of information on patients' status and disease progression, clinical notes are characterized by high levels of duplication and information redundancy. In general domain text, it has been shown that deduplication does not harm language model (LM) pretraining, thus helping reduce the training cost. Although large LMs have proven able to learn medical knowledge, they still require specialized domain adaptation for improved downstream clinical tasks. By leveraging large real-world clinical corpora, we first provided a fine-grained characterization of duplicates stemming from common writing practices and clinical relevancy. Second, we demonstrated that deduplicating clinical text can help clinical LMs encode less redundant information in a more efficient manner and does not harm classification tasks via prompt-based learning.
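
The abstract does not specify which deduplication procedure the authors applied. As a rough illustration of the general idea of corpus deduplication before pretraining, here is a minimal sketch combining exact-match hashing with character-shingle Jaccard similarity for near duplicates. All helper names, the 5-gram shingle size, and the similarity thresholds below are assumptions for this example, not the paper's settings.

```python
import hashlib
import re

def normalize(note: str) -> str:
    """Lowercase and collapse whitespace so trivially reformatted copies hash alike."""
    return re.sub(r"\s+", " ", note.lower()).strip()

def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingle set used for Jaccard similarity."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(notes, near_threshold: float = 0.9):
    """Drop exact duplicates by hash, then near duplicates by shingle Jaccard.

    The O(n^2) pairwise comparison is for illustration only; a real clinical
    corpus would use MinHash/LSH or similar indexing to avoid comparing
    every pair of notes.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for note in notes:
        norm = normalize(note)
        digest = hashlib.sha256(norm.encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= near_threshold for other in kept_shingles):
            continue  # near duplicate (e.g., a copy-forwarded note with small edits)
        seen_hashes.add(digest)
        kept.append(note)
        kept_shingles.append(sh)
    return kept

if __name__ == "__main__":
    corpus = [
        "Patient stable. Continue current meds.",
        "Patient stable.  continue current meds.",  # exact dup after normalization
        "Patient stable. Continue current meds. Follow up in 2 weeks.",  # near dup
        "New onset chest pain; EKG ordered.",
    ]
    # A loose threshold is used here so the toy near-duplicate is caught.
    for note in deduplicate(corpus, near_threshold=0.6):
        print(note)
```

The two-stage design mirrors the common split between exact duplicates (verbatim copies, cheap to remove by hashing) and near duplicates such as copy-forwarded notes, which require a similarity measure and a tunable threshold.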
