
UM-DATA-200K Visual Text Dataset

Updated 20 January 2026
  • UM-DATA-200K is a large-scale dataset comprising 200K high-quality product poster images with diverse text layouts and paired clean backgrounds.
  • It utilizes a multi-stage filtering and annotation process, including OCR extraction, aesthetic scoring, and diffusion-based inpainting to generate paired clean backgrounds.
  • The dataset underpins the UM-Designer model pre-training, enhancing visual-text editing performance and multimodal understanding in real-world e-commerce scenarios.

UM-DATA-200K is a large-scale dataset of visual text images, constructed specifically for advancing multimodal models focused on visual text editing and context understanding within product-centric poster imagery. Designed to ground the UM-Designer visual-LLM in diverse, real-world text layouts and stylistic variations, the dataset comprises paired annotated poster images and clean backgrounds derived from genuine e-commerce product posters. The following sections delineate its structure, collection methodology, annotation pipeline, statistics, and benchmark role, providing a comprehensive reference for its composition and intended applications (Ma et al., 13 Jan 2026).

1. Dataset Structure and Image Composition

UM-DATA-200K consists of 200,000 manually verified images, each depicting a “product poster” where the main product is highlighted against a background with overlaid textual elements. The dataset originates from a large-scale web crawl of approximately 40 million product posters sourced from major e-commerce platforms. Subsequent multi-stage filtering and selection reduced this pool to 5 million candidates, finally arriving at 200,000 high-quality poster images.

All examples are drawn from e-commerce or poster-style settings, with no explicit separation into categories such as indoor, outdoor, or other scene types. Most images focus on a primary product central to the layout, surrounded by diverse textual presentations. Textual content spans both English and Chinese languages, with a range of font families, font sizes, colors, and free-form poster layouts—including both single- and multi-line text blocks. The precise distributions of these textual attributes are not reported; only their diversity is qualitatively described.

2. Data Collection and Annotation Protocols

The data curation process for UM-DATA-200K employs systematic procedures ensuring high annotation fidelity and layout quality:

  • Image Sourcing: All images are real photographs scraped from leading online retail platforms. No synthetic or generative techniques are involved in the initial image creation.
  • Paired Clean Backgrounds: After text detection and segmentation, the textual content is erased, and a “clean” background is regenerated using FLUX-Fill, a diffusion-based inpainting model. Thus, each example comprises (a) the original poster with text, and (b) a background-only image for context-aware editing tasks.
  • Text and Layout Annotation: PPOCRv4 is used to extract detailed bounding boxes and textual content at the word or character level across all images. This yields exact coordinates for every text region, each matched with its recognized text string.
  • Aesthetic Filtering: Candidates are passed through Aesthetic Predictor V2.5, retaining only those with high visual layout quality.
  • Semantic Foreground Segmentation: The Segment Anything Model (SAM2) demarcates the main product region, systematically eliminating images where text overlays coincide with essential product features and thus risk semantic or perceptual ambiguity.
  • Manual Verification: Human annotators check for consistency among OCR results, the textual content, and the inpainted backgrounds.
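
The SAM2-based filtering step above can be made concrete with a simple geometric check. The paper does not specify the exact rejection criterion, so the following is a minimal sketch assuming axis-aligned boxes and a hypothetical overlap-ratio threshold: an image is discarded when any text box intrudes on the product-foreground box by more than a fraction of the text box's own area.

```python
# Hypothetical sketch of the text/foreground overlap filter described above.
# The paper does not publish the exact criterion; we assume axis-aligned
# boxes (x1, y1, x2, y2) and an illustrative overlap threshold.

def overlap_ratio(text_box, product_box):
    """Intersection area divided by the text box's own area."""
    x1, y1, x2, y2 = text_box
    px1, py1, px2, py2 = product_box
    iw = max(0.0, min(x2, px2) - max(x1, px1))
    ih = max(0.0, min(y2, py2) - max(y1, py1))
    text_area = max(1e-9, (x2 - x1) * (y2 - y1))
    return (iw * ih) / text_area

def keep_image(text_boxes, product_box, max_overlap=0.1):
    """Reject the image if any text box intrudes on the product region."""
    return all(overlap_ratio(tb, product_box) <= max_overlap
               for tb in text_boxes)
```

In the real pipeline the product box would come from the SAM2 mask rather than being given directly; the threshold value here is an assumption.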

The annotation schema distinguishes three primary classes of side-information for each text block: the text content (string), the spatial layout (bounding box coordinates), and style attributes (such as font, size, and color). However, no formal hierarchical taxonomy of style (e.g., “serif vs. sans-serif”) is reported; style information is implicitly represented and freely predicted by the associated UM-Designer model.
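
The three-part annotation schema can be modeled as a per-example record. The paper does not publish a serialization format, so the field names below are purely illustrative:

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one UM-DATA-200K example, mirroring the
# schema described above: text content, spatial layout, free-form style
# attributes, plus the paired clean background. Field names are
# illustrative assumptions, not the dataset's actual format.

@dataclass
class TextBlock:
    content: str                  # resolved text string (OCR output)
    bbox: tuple                   # (x1, y1, x2, y2) in pixel coordinates
    style: dict = field(default_factory=dict)  # e.g. font, size, color

@dataclass
class PosterExample:
    poster_path: str              # original poster with text overlays
    background_path: str          # inpainted, text-free counterpart
    text_blocks: list = field(default_factory=list)

ex = PosterExample(
    "poster_00001.jpg", "bg_00001.jpg",
    [TextBlock("50% OFF", (120, 40, 360, 110), {"color": "#ff0000"})],
)
```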

3. Preprocessing Pipeline and Dataset Properties

The preprocessing of UM-DATA-200K tightly controls both the normalization procedures and the selection for stylistic and semantic diversity. The sequential pipeline executes as follows:

  1. OCR extraction (PPOCRv4): bounding boxes and textual content.
  2. Aesthetic scoring: selection of the top 5 million most visually appealing layouts.
  3. Foreground segmentation (SAM2): removal of images with text-foreground overlap.
  4. Text erasure and inpainting (FLUX-Fill): paired clean images for each example.
  5. Manual spot checking: verification of annotation and image integrity, culminating in the final dataset of 200,000 images.
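
The filtering stages of this funnel can be sketched as a sequential reduction. The stage functions below are stubs standing in for the real models (PPOCRv4, Aesthetic Predictor V2.5, SAM2, FLUX-Fill), which this sketch does not call; the threshold and toy scores are assumptions for illustration.

```python
# Sketch of the curation funnel described above. Each stage is a stub;
# in the real pipeline these steps invoke Aesthetic Predictor V2.5 and
# SAM2 respectively. Thresholds and scores are illustrative only.

def curate(candidates, score_fn, overlap_fn, top_k, max_overlap=0.1):
    # Stage 2: keep the top_k most visually appealing layouts.
    ranked = sorted(candidates, key=score_fn, reverse=True)[:top_k]
    # Stage 3: drop images whose text overlaps the product foreground.
    kept = [c for c in ranked if overlap_fn(c) <= max_overlap]
    # Stages 4-5 (inpainting and manual checks) happen downstream.
    return kept

# Toy run: candidates as (id, aesthetic_score, text/foreground overlap).
pool = [("a", 0.9, 0.0), ("b", 0.8, 0.5), ("c", 0.7, 0.05),
        ("d", 0.6, 0.0), ("e", 0.5, 0.0), ("f", 0.4, 0.0)]
final = curate(pool, score_fn=lambda c: c[1],
               overlap_fn=lambda c: c[2], top_k=4)
```

At dataset scale the same funnel shape maps 40 million crawled posters to 5 million aesthetic candidates and, after the remaining filters and checks, to the final 200,000.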

All images are resized to 512×512 pixels prior to training. Glyphs for the visual-encoder branch are rendered at 80×80 pixels. The UM-Text paper does not mention any additional data augmentations (such as color jitter or random cropping) beyond the necessary inpainting.
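
Resizing a non-square poster to 512×512 changes the per-axis scale, so the OCR bounding boxes must be rescaled in step. The paper does not describe this bookkeeping explicitly; the following is a minimal sketch of the standard transformation:

```python
# When a src_w x src_h poster is resized to dst x dst (512 x 512 here),
# each bounding-box coordinate is scaled by the matching axis factor.
# A minimal sketch; the paper does not detail this step.

def rescale_bbox(bbox, src_w, src_h, dst=512):
    x1, y1, x2, y2 = bbox
    sx, sy = dst / src_w, dst / src_h
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

box = rescale_bbox((100, 200, 300, 400), src_w=1024, src_h=2048)
```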

Language diversity is ensured by the presence of both English and Chinese texts, yet no specific balancing or per-language count is provided. Free-form layouts and a broad style range are maintained, but concrete statistical breakdowns for fonts, sizes, colors, or layout categories are omitted.

4. Quantitative Characteristics and Statistics

Statistical parameters computed over the dataset are described as follows:

  • Let x_i denote the number of separate text instances in the i-th image, for i = 1, …, N with N = 200,000. The mean and variance are:
    • Mean text-instance count per image: μ = (1/N) Σ_{i=1}^{N} x_i.
    • Variance: σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)².
  • The explicit numerical values for μ and σ² are not reported, nor are analogous statistics for font-size or color attribute distributions.
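
The formulas above can be computed directly from per-image instance counts. Since the real counts are not published, the sketch below runs on toy values:

```python
# Population mean and variance of per-image text-instance counts,
# matching the formulas above. The input counts are toy values; the
# real per-image counts for UM-DATA-200K are not published.

def instance_stats(counts):
    n = len(counts)
    mu = sum(counts) / n
    var = sum((x - mu) ** 2 for x in counts) / n  # population variance
    return mu, var

mu, var = instance_stats([3, 5, 4, 8])
```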

No information is provided regarding train/validation/test splits; in reported experiments, the full set of 200,000 samples is used to pre-train the UM-Designer model. The dataset is not partitioned for benchmarking purposes, nor is it evaluated as a held-out test set.

5. Role in Model Pre-training and Benchmarking

UM-DATA-200K is used exclusively for the pre-training phase of the UM-Designer visual-language component within the UM-Text framework. It is not a benchmark dataset and is not used for evaluation or comparison; consequently, no baseline accuracy, Intersection-over-Union (IoU), Fréchet Inception Distance (FID), or other recognition/generation scores are reported on the dataset itself.

Rather, quantitative and qualitative model performance is assessed using established public datasets (such as AnyText-benchmark, UDiffText, and others) following pre-training with UM-DATA-200K. Its primary function is to provide robust model grounding in authentic poster text layouts and styles encountered in real-world e-commerce settings.

6. Diversity, Limitations, and Future Directions

Diversity within UM-DATA-200K is achieved via aesthetic quality control, foreground segmentation, and language variety. However, the absence of statistical attribute breakdowns and the lack of a formal style taxonomy constrain detailed analyses of representational balance. The dataset’s focus on paired original and clean backgrounds enables context-aware text editing tasks, yet its exclusive use during pre-training precludes standalone benchmarking.

A plausible implication is that further annotation granularity—such as explicit style ontologies, finer-grained language statistics, and per-font distributions—could enhance future research utility. Additional augmentations or partitioning strategies may be required for applications involving generalization or fair evaluation protocols.


Summary Table: Key Properties of UM-DATA-200K

| Attribute | Details | Notes |
| --- | --- | --- |
| Image count | 200,000 | Fully manually verified |
| Scene category | Product posters (e-commerce) | No indoor/outdoor breakdown |
| Language coverage | English and Chinese | No per-language distribution given |
| Annotation | Text box coords, content, paired clean BG | Via PPOCRv4, FLUX-Fill, manual checks |
| Layout/style stats | Diverse (fonts, sizes, colors, layouts) | Distributions not numerically reported |
| Usage | UM-Designer pre-training | Not an evaluation test set |

UM-DATA-200K constitutes a foundational resource for visual-language grounding in image understanding and editing, built with emphasis on semantic and stylistic fidelity but without exhaustively documented attribute distributions or benchmark splits (Ma et al., 13 Jan 2026).
