Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain (2205.06885v1)

Published 13 May 2022 in cs.CL

Abstract: Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role to advance 'big data' cancer research like similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is a growing interest in developing LLMs for more specific clinical domains, no pathology-specific language space exist to support the rapid data-mining development in pathology space. In literature, a few approaches fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT - a pre-trained masked LLM which was trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training of transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnose Classification when compared to nonspecific LLMs.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (7)
  1. Thiago Santos (5 papers)
  2. Amara Tariq (9 papers)
  3. Susmita Das (34 papers)
  4. Kavyasree Vayalpati (1 paper)
  5. Geoffrey H. Smith (1 paper)
  6. Hari Trivedi (19 papers)
  7. Imon Banerjee (41 papers)
Citations (12)

Summary

We haven't generated a summary for this paper yet.

Youtube Logo Streamline Icon: https://streamlinehq.com