Recursive Text Character Splitter (RTCS)

Updated 17 October 2025
  • RTCS provides a recursive segmentation framework that leverages neural embeddings and SRN models to enhance character-level context retention, yielding improved F1 scores.
  • Techniques like virtual decompression and graph-based min-cut enable efficient segmentation directly in compressed domains and challenging handwritten documents.
  • Deep recursive neural architectures and convolutional auto-encoders boost OCR accuracy (up to 94.2%) while supporting complex applications such as retrieval-augmented generation.

Recursive Text Character Splitter (RTCS) denotes a family of methodologies in computational linguistics, optical character recognition (OCR), and information retrieval that address the problem of segmenting text into discrete character-level units using recursive or iterative algorithms. RTCS approaches emphasize context preservation, robustness across heterogeneous data types, and efficiency, especially in compressed or large-scale document settings. Multiple instantiations and algorithmic frameworks have been proposed for RTCS systems, with implementations ranging from character-level neural embedding recomputation to compressed-domain dynamic splitting and recursive convolutional architectures.

1. Character-Level Representation Learning and Segmentation

A foundational RTCS approach centers on learning high-dimensional text embeddings directly from sequences of raw characters, bypassing the need for prior tokenization. In (Chrupała, 2013), a Simple Recurrent Network (SRN) is trained to predict the next character, where each character is represented by a one-hot vector and the hidden layer encodes the context of preceding characters via the relation:

s_j(t) = f\left( \sum_{i=1}^{I} w_i(t)\, U_{ji} + \sum_{l=1}^{J} s_l(t-1)\, W_{jl} \right)

with f(a) = 1/(1 + \exp(-a)). The output layer predicts the next character using a softmax formulation. The resulting SRN hidden activations form rich, context-sensitive text embeddings at each character position.
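The recurrence can be made concrete in a short numpy sketch; the vocabulary size, hidden width, and random weights below are illustrative assumptions rather than the configuration used in (Chrupała, 2013).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def srn_step(w_t, s_prev, U, W):
    """One SRN update: hidden state at time t from the current one-hot
    character w_t and the previous hidden state s_prev."""
    return sigmoid(U @ w_t + W @ s_prev)

def srn_embed(char_ids, vocab_size, U, W):
    """Return the context-sensitive hidden activation at each character
    position; these vectors serve as downstream CRF features."""
    s = np.zeros(U.shape[0])
    states = []
    for c in char_ids:
        w = np.zeros(vocab_size)
        w[c] = 1.0                      # one-hot character vector
        s = srn_step(w, s, U, W)
        states.append(s)
    return np.stack(states)

# Illustrative sizes only: 128-character vocabulary, 100 hidden units.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(100, 128))
W = rng.normal(scale=0.1, size=(100, 100))
embeddings = srn_embed([72, 105, 33], 128, U, W)  # e.g., "Hi!"
```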

In downstream segmentation tasks, these embeddings are employed as features for Conditional Random Field (CRF) models to label spans such as code segments or language fragments. Empirical results demonstrate that SRN-based features substantially improve segmentation F1 scores over character n-gram baselines, yielding gains comparable to those obtained by quadrupling the training data. This embedding-based RTCS paradigm is robust for tasks involving code-mixed or structurally complex inputs, as context is recursively encoded at the character level.

2. Recursive and Compressed-Domain Splitting Algorithms

Efficient realizations of RTCS operate directly on run-length compressed text images, obviating full decompression (Javed et al., 2014, R et al., 2019). The core mechanism is a "virtual decompression": for each compressed row (alternating white/black pixel runs), runs are decremented synchronously to simulate columnwise extraction, thereby allowing simultaneous word and character segmentation (a minimal sketch of this scheme follows the list below):

  • Horizontal Projection Profile (HPP) is derived from compressed data for line segmentation via local minima detection.
  • For vertical segmentation, the sequential popping of runs reconstructs columns, with sustained zero transitions marking inter-character or inter-word spaces.
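The sketch below illustrates the idea on a toy run-length layout, in which each row is a list of run lengths alternating white and black (white first); this layout and the gap threshold are assumptions for illustration, not the exact encoding used in the cited work.

```python
def column_profile(rows, width):
    """Simulate columnwise extraction from run-length rows without
    decompressing to a full image. Each row is a list of run lengths
    that alternate white/black, starting with white."""
    profile = [0] * width
    # Working copy: a queue of (remaining_length, is_black) runs per row.
    queues = [[(length, i % 2 == 1) for i, length in enumerate(row)] for row in rows]
    for col in range(width):
        for q in queues:
            # Discard exhausted runs, then consume one pixel of the current run.
            while q and q[0][0] == 0:
                q.pop(0)
            if not q:
                continue
            length, is_black = q[0]
            if is_black:
                profile[col] += 1       # black pixel contributes to the vertical projection
            q[0] = (length - 1, is_black)
    return profile

def gap_columns(profile, min_gap=2):
    """Columns belonging to sustained zero runs (candidate character/word gaps)."""
    gaps, run = [], []
    for col, count in enumerate(profile + [1]):   # sentinel flushes a trailing run
        if count == 0:
            run.append(col)
        else:
            if len(run) >= min_gap:
                gaps.extend(run)
            run = []
    return gaps
```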

A graph-based min-cut method is introduced to further refine character segmentation in handwritten compressed documents (R et al., 2019). The region of interest (ROI) is extracted, banded horizontally, and logical OR operations between bands enable the identification of minimal connecting edges—ideal splitting points for touching characters. Segmentation boundaries are dynamically corrected via insertion/deletion operations contingent on empirical span thresholds, allowing recursive refinement of character splits in noisy or variable handwriting.
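As a much-simplified stand-in for the graph-based formulation, the sketch below merely scores columns of a binary ROI by how many horizontal bands contain foreground pixels after a per-band logical OR; the cited method builds an actual graph and computes a min-cut with dynamic boundary correction, so this is only meant to convey where candidate cut points between touching characters come from.

```python
import numpy as np

def split_candidates(roi, num_bands=4):
    """Rank columns of a binary ROI (foreground = 1) as potential cut points:
    columns touched by few horizontal bands are the weakest connections."""
    bands = np.array_split(roi, num_bands, axis=0)        # horizontal banding
    band_hits = np.zeros(roi.shape[1], dtype=int)
    for band in bands:
        band_hits += band.any(axis=0).astype(int)          # logical OR within each band
    return np.argsort(band_hits)                            # weakest-connected columns first
```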

3. Deep Recursive Neural Architectures for Character Splitting

Recursive convolutional neural network models introduce further parametric efficiency for RTCS in natural scene OCR (Lee et al., 2016). By applying tied convolutional weights across multiple recurrences,

h_{(i,j,k)}(t) = \begin{cases} \sigma\left((w_k^{hh,\text{untied}})^T x_{(i,j)} + b_k\right) & \text{if } t = 0 \\ \sigma\left((w_k^{hh,\text{tied}})^T h_{(i,j)}(t-1) + b_k\right) & \text{if } t > 0 \end{cases}

the network efficiently enlarges its receptive field and deepens hierarchical abstraction. These features are then processed by RNNs that model implicit character-level language dependencies, with an integrated soft-attention mechanism guiding selective focus to the image regions most likely to contain character boundaries.
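A compact PyTorch sketch of the tied-weight recursion in isolation (channel widths, kernel size, and the number of recurrences are assumptions; the published architecture additionally interleaves pooling and feeds the features to the attention-equipped RNN described above):

```python
import torch
import torch.nn as nn

class RecursiveConvBlock(nn.Module):
    """One recursive convolutional block: an untied convolution maps the input
    at t = 0, then a single tied convolution is re-applied for t > 0."""
    def __init__(self, in_ch, hidden_ch, recurrences=3):
        super().__init__()
        self.untied = nn.Conv2d(in_ch, hidden_ch, kernel_size=3, padding=1)
        self.tied = nn.Conv2d(hidden_ch, hidden_ch, kernel_size=3, padding=1)
        self.recurrences = recurrences

    def forward(self, x):
        h = torch.sigmoid(self.untied(x))       # t = 0: untied weights on the input
        for _ in range(self.recurrences):
            h = torch.sigmoid(self.tied(h))     # t > 0: shared (tied) weights re-applied
        return h

# Illustrative usage on a batch of 32x100 grayscale word images.
block = RecursiveConvBlock(in_ch=1, hidden_ch=32, recurrences=3)
features = block(torch.randn(8, 1, 32, 100))    # -> (8, 32, 32, 100)
```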

End-to-end training using backpropagation through time (BPTT) allows optimization of segmentation and recognition error jointly. Benchmarked on datasets such as Synth90k, SVT, IIIT5k, and ICDAR, these architectures achieve up to 94.2% accuracy and significant improvements over prior work, underscoring their suitability for context-aware, lexicon-free RTCS implementations.

4. Non-Sequential Recursive Convolutional Auto-Encoders

The byte-level recursive convolutional auto-encoder (Zhang et al., 2018) exemplifies non-sequential RTCS, where deep recurrent application of convolutional groups (with pooling in encoding, upsampling in decoding) auto-encodes text sequences in parallel. Input is padded to power-of-two lengths; recursion groups reduce the feature map size by factors of two until a fixed-size vector is reached, then the decoder reconstructs the output in a single pass.
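The halving recursion can be sketched in a few lines of PyTorch; the channel width, target length, and single shared convolution per group are simplifying assumptions, and the published model is far deeper and relies on residual connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveByteAutoEncoder(nn.Module):
    """Encode a byte sequence by repeatedly applying one shared conv group with
    pooling until a fixed-size representation remains, then decode by
    repeatedly upsampling with another shared group."""
    def __init__(self, channels=64, target_len=4):
        super().__init__()
        self.embed = nn.Embedding(256, channels)          # raw bytes -> channel vectors
        self.enc = nn.Conv1d(channels, channels, 3, padding=1)
        self.dec = nn.Conv1d(channels, channels, 3, padding=1)
        self.out = nn.Conv1d(channels, 256, 1)            # per-position byte logits
        self.target_len = target_len

    def forward(self, bytes_in):                          # (batch, length), length a power of two
        h = self.embed(bytes_in).transpose(1, 2)          # (batch, channels, length)
        steps = 0
        while h.size(-1) > self.target_len:               # recursive encoding: halve each pass
            h = F.max_pool1d(F.relu(self.enc(h)), kernel_size=2)
            steps += 1
        for _ in range(steps):                            # decoding: double each pass
            h = F.relu(self.dec(F.interpolate(h, scale_factor=2)))
        return self.out(h)                                # (batch, 256, length)

model = RecursiveByteAutoEncoder()
logits = model(torch.randint(0, 256, (2, 64)))            # toy batch of 64-byte sequences
```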

Residual connections facilitate training at scale (up to 160 layers), and empirical evaluations on multilingual paragraph datasets yield byte-level error rates of less than 5%. These models vastly outperform sequential (LSTM-based) auto-encoders, particularly for long texts. The ability to handle recursion at sub-word levels supports recursive splitting and reassembly of textual elements, applicable to RTCS systems focused on compression and efficient generation.

5. Recursive Splitting for Information Retrieval: Comparative Evaluation

Recent research on document splitting for Retrieval-Augmented Generation (RAG) systems (Narimissa et al., 13 Sep 2024) validates recursive character-level chunking as a context-preserving strategy superior to token-based splitting for complex text. The Recursive Character Splitter (RCS) iteratively segments text into overlapping fixed-length character chunks, formally expressed as:

\begin{align*}
&\text{function } \text{RCS}(T, C, O) \\
&\quad \text{if } \text{length}(T) \leq C \text{ return } [T] \\
&\quad \text{chunk} \gets T[0{:}C] \\
&\quad \text{remaining} \gets T[C-O{:}\,] \\
&\quad \text{return } [\text{chunk}] \oplus \text{RCS}(\text{remaining}, C, O)
\end{align*}
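A direct Python transcription of this recursion (the default chunk size is an illustrative choice; the 200-character overlap matches the example mentioned below):

```python
def rcs(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Recursive Character Splitter: fixed-length character chunks that
    overlap by `overlap` characters to preserve cross-boundary context."""
    if len(text) <= chunk_size:
        return [text]
    chunk = text[:chunk_size]
    remaining = text[chunk_size - overlap:]
    return [chunk] + rcs(remaining, chunk_size, overlap)

# Example: 5,000 characters split into 1,000-character chunks with 200-character overlap.
chunks = rcs("x" * 5000)
assert all(len(c) <= 1000 for c in chunks)
```

For very long documents, an equivalent iterative loop avoids Python's recursion-depth limit while producing the same chunks.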

Chunk overlaps (e.g., 200 characters) ensure cross-boundary context integrity. Comparative analyses demonstrate that RCS outperforms token-based approaches, particularly for narrative coherence in novels and structural continuity in textbooks. Objective evaluation utilizes weighted scores across metrics: SequenceMatcher (0.30), BLEU (0.30), METEOR (0.20), and BERT Score (0.20)—balancing literal and semantic accuracy. Automated generation of Q&A pairs for evaluation leverages an open-source transformer model to simulate realistic retrieval scenarios, establishing rigorous standards for RAG accuracy.
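A sketch of how the weighted aggregate might be computed, assuming the four per-metric scores (each normalized to [0, 1]) have been produced elsewhere; the metric implementations themselves are not shown.

```python
# Weights from the comparative evaluation: literal-overlap metrics vs. semantic ones.
WEIGHTS = {"sequence_matcher": 0.30, "bleu": 0.30, "meteor": 0.20, "bert_score": 0.20}

def weighted_answer_score(scores: dict[str, float]) -> float:
    """Combine per-metric scores into a single evaluation figure."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

print(weighted_answer_score(
    {"sequence_matcher": 0.82, "bleu": 0.47, "meteor": 0.55, "bert_score": 0.90}
))  # -> 0.677
```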

6. Limitations, Challenges, and Future Directions

Limitations of RTCS methodologies include sensitivity to inconsistent input labeling (noted in (Chrupała, 2013) for HTML markup), challenges posed by vanishing/exploding gradients in recurrent architectures, and empirical dependence on compression or segmentation thresholds. Expanding corpus sizes does not always yield proportional performance gains due to concept drift. In graph-based compressed-domain approaches, threshold tuning and adaptation are critical, and skewed/noisy input may compromise recursive split quality.

Future research directions involve joint optimization of prediction and segmentation tasks, adaptive parameter selection for overlaps and chunk sizes, and architectural innovations to improve sequence modeling (e.g., leveraging multiplicative connections or advanced residual methods). Broader applications encompass multilingual text segmentation, large-scale OCR analysis, and refined document splitting for dense retrieval systems.

7. Applications and Significance

RTCS systems find application in OCR (compressed and handwritten documents), computational linguistics for languages with complex tokenization, code-mixed and technical text segmentation (forums, bug reports, emails), and document retrieval within RAG frameworks. By accommodating non-trivial segmentation challenges and heterogeneity in document structure, RTCS demonstrates efficacy in maintaining narrative, technical, or contextual integrity across myriad formats and languages. These methods underpin improved efficiency, accuracy, and scalability for text-analytic systems in both research and applied domains.
