I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection

Published 3 Aug 2021 in cs.CV | (2108.01343v3)

Abstract: Existing methods for arbitrary-shaped text detection in natural scenes face two critical issues, i.e., 1) fracture detections at the gaps in a text instance; and 2) inaccurate detections of arbitrary-shaped text instances with diverse background context. To address these issues, we propose a novel method named Intra- and Inter-Instance Collaborative Learning (I3CL). Specifically, to address the first issue, we design an effective convolutional module with multiple receptive fields, which is able to collaboratively learn better character and gap feature representations at local and long ranges inside a text instance. To address the second issue, we devise an instance-based transformer module to exploit the dependencies between different text instances and a global context module to exploit the semantic context from the shared background, which are able to collaboratively learn more discriminative text feature representation. In this way, I3CL can effectively exploit the intra- and inter-instance dependencies together in a unified end-to-end trainable framework. Besides, to make full use of the unlabeled data, we design an effective semi-supervised learning method to leverage the pseudo labels via an ensemble strategy. Without bells and whistles, experimental results show that the proposed I3CL sets new state-of-the-art results on three challenging public benchmarks, i.e., an F-measure of 77.5% on ICDAR2019-ArT, 86.9% on Total-Text, and 86.4% on CTW-1500. Notably, our I3CL with the ResNeSt-101 backbone ranked 1st place on the ICDAR2019-ArT leaderboard. The source code will be available at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Scene-Text-Detection.

Authors (5)

Citations (27)

Summary

Intra- and Inter-Instance Collaborative Learning for Scene Text Detection

The paper introduces Intra- and Inter-Instance Collaborative Learning (I3CL), a method for detecting arbitrary-shaped text in natural scenes. Existing detectors suffer from two problems: fractured detections at the gaps inside a text instance, and inaccurate detections when text appears against diverse background context. I3CL addresses both in a unified, end-to-end trainable framework that exploits intra-instance and inter-instance dependencies together.

The intra-instance collaborative learning (Intra-CL) module uses multiple receptive fields to learn better feature representations of both characters and the gaps between them, at local and long ranges inside a text instance. Its convolutional module is a series of blocks with asymmetric horizontal and vertical kernels, which lets it adapt to multi-oriented text. Combining these paths links features that would otherwise be scattered across a text instance, which reduces fractured detections.
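The idea of asymmetric horizontal and vertical kernels can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the fixed averaging weights stand in for learned weights, the kernel sizes are illustrative, and fusing the branches by summation is an assumption. The helper names (`conv2d_same`, `asymmetric_branches`, `multi_receptive_field`) are hypothetical.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that keeps the input size."""
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    pad_h, pad_w = k_h // 2, k_w // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    yy, xx = y + i - pad_h, x + j - pad_w
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def asymmetric_branches(image, k):
    """Return (horizontal, vertical) responses for 1 x k and k x 1 kernels.

    Averaging weights are placeholders for learned weights (assumption).
    """
    horizontal_kernel = [[1.0 / k] * k]              # shape 1 x k
    vertical_kernel = [[1.0 / k] for _ in range(k)]  # shape k x 1
    return (conv2d_same(image, horizontal_kernel),
            conv2d_same(image, vertical_kernel))


def multi_receptive_field(image, kernel_sizes=(3, 5)):
    """Sum both branches over several odd kernel sizes (fusion by sum is an assumption)."""
    height, width = len(image), len(image[0])
    fused = [[0.0] * width for _ in range(height)]
    for k in kernel_sizes:
        horizontal, vertical = asymmetric_branches(image, k)
        for y in range(height):
            for x in range(width):
                fused[y][x] += horizontal[y][x] + vertical[y][x]
    return fused


# Toy horizontal "text line": two characters (1s) separated by a one-pixel gap (0).
text_map = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
horizontal, vertical = asymmetric_branches(text_map, k=3)
gap_h = horizontal[1][2]  # horizontal branch sees the characters on both sides of the gap
gap_v = vertical[1][2]    # vertical branch sees only background in this column
fused = multi_receptive_field(text_map)
```

In this toy case the horizontal branch gives the gap pixel a non-zero response because its receptive field covers the neighboring characters, while the vertical branch does not. For vertically oriented text the roles would swap, which is one way to read the motivation for using both orientations.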

The inter-instance collaborative learning (Inter-CL) module has two parts. An instance-based transformer module models the dependencies between different text instances, and a global context module draws semantic context from the shared background. Together they learn more discriminative text feature representations, which reduces the inaccurate detections that occur in complex scenes.
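The core operation of a transformer over text instances is self-attention, where each instance feature is updated as a weighted mix of all instance features. The following is a minimal sketch of scaled dot-product self-attention in pure Python. It assumes one feature vector per text instance and omits the learned query, key, and value projections, multiple heads, and feed-forward layers of a full transformer; the function names are hypothetical, and the paper's actual module may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def instance_self_attention(features):
    """Scaled dot-product self-attention over per-instance feature vectors.

    Queries, keys, and values are the raw features (no learned projections),
    which is a simplifying assumption for illustration.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for query in features:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in features]
        weights = softmax(scores)
        mixed = [sum(w * value[c] for w, value in zip(weights, features))
                 for c in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy instance features: the first two are similar, the third differs.
instance_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attended, attention = instance_self_attention(instance_features)
```

Here the first instance attends more strongly to the second (similar) instance than to the third, so similar text instances reinforce each other's features.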

Experiments on three public benchmarks report state-of-the-art results: an F-measure of 77.5% on ICDAR2019-ArT, 86.9% on Total-Text, and 86.4% on CTW-1500. With a ResNeSt-101 backbone, I3CL ranked first on the ICDAR2019-ArT leaderboard. The authors also design a semi-supervised learning (SSL) method that leverages pseudo labels, produced via an ensemble strategy, to make use of unlabeled data.
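For readers unfamiliar with the metric, the F-measure is conventionally the harmonic mean of precision and recall. The short sketch below assumes that standard definition; the precision and recall values in the example are illustrative and are not taken from the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not results from the paper.
example_score = f_measure(precision=0.8, recall=0.6)
```

Because the harmonic mean is pulled toward the smaller of the two values, a detector cannot reach a high F-measure by trading away recall for precision or the reverse.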

I3CL shows that combining collaborative learning within and across text instances can help resolve positional and contextual ambiguities in varied scenes. The findings suggest future work on improving efficiency and on incorporating domain knowledge to better handle multi-oriented text. Exploring linguistic information could also improve the semantic grouping of detections.

In conclusion, I3CL combines intra- and inter-instance learning to improve detection accuracy for arbitrary-shaped scene text, and it offers a basis for future work on scene text detection and recognition systems.
