Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation (2401.17904v2)

Published 31 Jan 2024 in cs.CV

Abstract: The Segment Anything Model (SAM), a profound vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four hierarchies, including pixel-level text, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate the pixel-level text labels in a semi-automatical manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both automatic mask generation (AMG) mode and promptable segmentation (PS) mode. In the AMG mode, Hi-SAM segments pixel-level text foreground masks initially, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. As for the PS mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.

Summary

  • The paper introduces Hi-SAM, which adapts the Segment Anything Model through fine-tuning to perform hierarchical text segmentation across strokes, words, lines, and paragraphs.
  • The text segmentation model reaches state-of-the-art fgIOU (84.86% on Total-Text, 88.96% on TextSeg), and on HierText Hi-SAM improves on the previous specialist for joint detection and layout analysis while requiring 20× fewer training epochs.
  • The work establishes a unified framework for text structure analysis, paving the way for advancements in document understanding and automated content extraction.

Hi-SAM: A Unified Model for Hierarchical Text Segmentation Leveraging the Segment Anything Model

Introduction

Hierarchical text segmentation extracts structure from text in images at several levels: strokes (pixel-level text), words, text lines, and paragraphs. Prior approaches have typically targeted one level in isolation, which leaves fragmented solutions that do not capture text structure as a whole. This paper introduces Hi-SAM, a model that builds on the Segment Anything Model (SAM) to segment text at all four levels and to perform layout analysis at the same time, within a single unified framework.

Methodology

Hi-SAM first adapts the generalist SAM into a specialist model for text stroke segmentation (TSS) through parameter-efficient fine-tuning. The adapted model, named SAM-TSS, is the foundation of Hi-SAM and is also used to generate pixel-level text labels for HierText semi-automatically, so that labels exist for all four hierarchies. After segmenting text strokes, Hi-SAM samples foreground points from the stroke mask and uses them as prompts to segment the higher hierarchies: words, text lines, and paragraphs. The architecture consists of an image encoder, a self-prompting module, a stroke mask decoder, a hierarchical mask decoder, and a layout analysis mechanism.
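The step of sampling foreground points from a stroke mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_foreground_points`, the uniform random sampling strategy, and the list-of-lists mask representation are all assumptions made for illustration, since the summary does not specify how points are chosen.

```python
import random
from typing import List, Tuple

# Assumed representation: rows of 0/1 values, where 1 marks a text (stroke) pixel.
Mask = List[List[int]]


def sample_foreground_points(
    mask: Mask, num_points: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Return up to num_points (x, y) coordinates drawn from text pixels.

    Uniform sampling is an assumption; the actual strategy may differ.
    """
    # Collect every foreground pixel as an (x, y) pair.
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    # If there are too few text pixels, return all of them.
    if len(foreground) <= num_points:
        return foreground
    # Seeded generator so the sketch is reproducible.
    rng = random.Random(seed)
    return rng.sample(foreground, num_points)


toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
points = sample_foreground_points(toy_mask, num_points=2)
print(points)
```

In a pipeline of the kind described above, each sampled point would then be passed as a prompt to the hierarchical mask decoder; that decoder is not sketched here.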

Experimental Results

Experiments show state-of-the-art performance for SAM-TSS on text stroke segmentation: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg. On HierText, compared to the previous specialist for joint hierarchical detection and layout analysis, Hi-SAM improves text-line results by 4.73% PQ and 5.39% F1 and paragraph-level layout analysis by 5.49% PQ and 7.39% F1, while requiring 20× fewer training epochs.
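For readers unfamiliar with the fgIOU metric, the following sketch computes foreground intersection-over-union for a single pair of binary masks. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `foreground_iou` is hypothetical, benchmarks may accumulate intersection and union over a whole dataset rather than per image, and returning 1.0 when both masks are empty is a convention chosen here.

```python
from typing import List

# Assumed representation: rows of 0/1 values, where 1 marks a text pixel.
Mask = List[List[int]]


def foreground_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of the foreground (text) pixels of two masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is text in both masks
            if p or t:
                union += 1  # pixel is text in at least one mask
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [
    [1, 1, 0],
    [0, 1, 0],
]
target = [
    [1, 0, 0],
    [0, 1, 1],
]
print(foreground_iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```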

Implications and Future Work

Because Hi-SAM analyzes text structure at all levels in one model, it is relevant to applications such as document understanding, information retrieval, and automated content extraction. Potential directions for future work include optimizing Hi-SAM for real-time performance, improving its adaptability to unseen domains, and using synthetic data or generative methods to improve robustness and accuracy.

Conclusion

In summary, Hi-SAM adapts the Segment Anything Model to text stroke segmentation and extends it to word, text-line, and paragraph segmentation with layout analysis in a single model. It reports state-of-the-art results with far fewer training epochs than the previous specialist, and it provides a foundation for further work on text segmentation and layout analysis.