CapFilt: Aesthetic Captioning & Filtering
- CapFilt is a framework that curates high-quality, informative captions by filtering out noisy, generic user comments in aesthetic image captioning.
- It employs probabilistic n-gram analysis and LDA-based topic modeling to enhance caption relevance and improve weakly supervised CNN training.
- The approach scales to web-scale datasets and has broad applicability for refining annotations in domains with noisy, crowd-sourced data.
Captioning and Filtering (CapFilt) is a framework for processing multimodal web-scale or crowd-sourced datasets to obtain high-quality, discriminative, and relevant textual annotations—especially in domains such as aesthetic image captioning where raw user comments are typically noisy or generic. CapFilt encompasses both the generation of informative captions and the elimination of uninformative or noisy ones using automatic, probabilistic filtering methods tailored to the composition and frequency of word n-grams. In parallel, CapFilt informs training procedures that leverage the filtered annotations for weakly supervised learning, notably enabling feature representations that capture nuanced, domain-specific attributes in vision–language tasks.
1. Problem Setting and Motivation
Aesthetic image captioning tasks require models not only to recognize image content but to generate detailed, feedback-style textual critiques aligned with photographic qualities such as composition, lighting, or style. In contrast to natural image captioning—which benefits from large-scale curated datasets (e.g., MS-COCO)—aesthetic domains lack clean, comprehensive datasets. Web-crawled or community-contributed comments are abundant but typically plagued by
- typographical errors, non-standard phrasing, or non-English segments,
- excessively generic or “safe” comments (e.g., “Nice shot!”) with little discriminative content, and
- overused exclamations or pleasantries that fail to capture finer aesthetic details.
The CapFilt framework directly addresses the need for both high-quality dataset curation and the extraction of informative, diverse textual feedback for use in training and evaluation.
2. Probabilistic Caption Filtering Methodology
CapFilt’s core filtering mechanism leverages two primary intuitions: (a) meaningful word composition (as manifested in n-grams) conveys more specific information, and (b) low-frequency n-grams are typically less generic and more discriminative, echoing inverse-document-frequency (IDF) weighting in information retrieval.
- N-gram Vocabulary Construction: From the initial web-crawled corpus, vocabularies are constructed for both unigrams (primarily nouns) and bigrams, where bigrams with forms like descriptor-object (e.g., “tilted horizon”, “post processing”) are prioritized for their higher informational content.
- Corpus Probability: Each n-gram $w$ in the vocabulary $V$ is assigned a corpus probability $p(w) = f(w) \big/ \sum_{j=1}^{\lvert V \rvert} f(w_j)$, where $f(w)$ is the corpus frequency of $w$ and $\lvert V \rvert$ is the vocabulary size.
- Informativeness Score: For a comment $c$ decomposed into the union $N_c$ of its unigrams and bigrams, the informativeness score is computed as $s(c) = \sum_{w \in N_c} -\log p(w)$.
This metric penalizes comments dominated by high-frequency (and hence generic) n-grams, retaining only those whose score exceeds a strict threshold (experimentally set to 20). Thus, only comments with rare, information-dense language are kept for subsequent training; a minimal sketch of this scoring appears after this list.
- Dataset Curation After Filtering: Application of this filter discards approximately 55% of raw comments and removes images left with no informative comments, resulting in the AVA-Captions dataset: a collection of 230,000 images and roughly 1.3–1.5 million captions (an average of 5–6 per image). The resulting dataset is large and diverse, and provides a markedly cleaner foundation for subsequent modeling.
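As a concrete illustration of the scoring above, the following Python sketch builds a joint unigram/bigram vocabulary, estimates corpus probabilities, and keeps comments whose summed negative log-probability exceeds the threshold. Tokenization, the restriction of unigrams to nouns, and any length normalization are simplified here; the threshold constant is the value quoted in the text and would need retuning for other corpora.

```python
import math
from collections import Counter

THRESHOLD = 20.0  # retention cutoff quoted in the text; tune for new domains

def ngrams(tokens):
    """Unigrams plus adjacent bigrams of a tokenized comment."""
    return [(t,) for t in tokens] + list(zip(tokens, tokens[1:]))

def corpus_probabilities(corpus):
    """p(w) = f(w) / total n-gram count over the corpus vocabulary."""
    counts = Counter()
    for tokens in corpus:
        counts.update(ngrams(tokens))
    total = sum(counts.values())
    return {w: f / total for w, f in counts.items()}

def informativeness(tokens, p):
    """s(c) = sum over the comment's n-grams of -log p(w):
    rare (low-probability) n-grams raise the score, generic ones barely add to it."""
    return sum(-math.log(p[w]) for w in ngrams(tokens) if w in p)

def filter_comments(corpus, threshold=THRESHOLD):
    """Keep only comments whose informativeness score exceeds the threshold."""
    p = corpus_probabilities(corpus)
    return [tokens for tokens in corpus if informativeness(tokens, p) > threshold]

if __name__ == "__main__":
    comments = [c.split() for c in [
        "nice shot",
        "the tilted horizon and heavy post processing distract from the composition",
    ]]
    p = corpus_probabilities(comments)
    print({" ".join(c): round(informativeness(c, p), 2) for c in comments})
```

Because the score sums over n-grams, longer comments tend to score higher; the original procedure may normalize for this, which the sketch does not attempt.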
3. Weakly Supervised Visual Representation Learning
CapFilt supports weakly supervised learning for visual feature extractors tailored to aesthetic domains:
- Latent Topic Discovery via LDA: All captions per image are treated as a single document and analyzed with Latent Dirichlet Allocation (LDA) to uncover latent aesthetic “topics” (with a fixed number of topics $K$). Each image is assigned a topic distribution, which effectively serves as a soft label vector capturing the presence of higher-level compositional or stylistic concepts.
- CNN Training with Weak Supervision: A ResNet101 architecture is adapted by replacing its final classification layer with a $K$-way topic output; the network is trained with a cross-entropy loss against the LDA-inferred topic distributions (a minimal training sketch follows this list). The key properties are:
- No need for labor-intensive ground-truth aesthetic labels, as supervision comes from latent structure in free-form comments.
- The resulting CNN captures aesthetic and compositional attributes (rather than simple object categories), making its features especially suitable for downstream aesthetic captioning.
- Representation of Latent Associations: The LDA process naturally clusters related visual attributes (e.g., “cute expression”, “ear”, “face”), resulting in features that abstract over redundant or overlapping concepts (e.g., “grayscale” and “black and white”).
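A minimal sketch of this weak-supervision recipe is given below, combining scikit-learn's LDA with a PyTorch ResNet101. The topic count, vectorizer settings, and optimizer handling are assumptions for illustration rather than the original training configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

K = 200  # assumed number of latent aesthetic topics; not fixed by the text

def lda_soft_labels(docs_per_image, n_topics=K):
    """Treat the concatenated captions of each image as one document and
    return its LDA topic distribution as a soft label vector of length n_topics."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs_per_image)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(counts)  # shape (num_images, n_topics), rows sum to 1
    return torch.tensor(theta, dtype=torch.float32)

def build_topic_cnn(n_topics=K):
    """ResNet101 with its final classification layer swapped for a K-way topic head."""
    net = models.resnet101()
    net.fc = nn.Linear(net.fc.in_features, n_topics)
    return net

def soft_cross_entropy(logits, target_dist):
    """Cross-entropy between the CNN's predicted topic distribution and the LDA soft labels."""
    return -(target_dist * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def train_step(net, optimizer, images, topic_targets):
    """One weakly supervised update: no manually annotated aesthetic labels are involved."""
    optimizer.zero_grad()
    loss = soft_cross_entropy(net(images), topic_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```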
4. Captioning Pipeline and Evaluation
- Caption Generation Model: The visual features (from either an ImageNet-trained CNN or the weakly supervised CNN) are fed into an LSTM decoder to produce aesthetic captions.
- Automatic Evaluation Metrics: Output is evaluated using BLEU, METEOR, ROUGE, CIDEr, and SPICE, while diversity is quantified via n-gram uniqueness analysis (a small sketch of one such statistic follows this list).
- Findings:
- Models trained on the filtered captions dramatically outperform baselines trained on the unfiltered, noisy comments, regardless of whether the visual encoder is the ImageNet-supervised CNN or the weakly supervised aesthetic CNN; the two encoders themselves perform comparably.
- Output diversity is substantially higher for the filtered-dataset models, reflecting a reduced tendency to produce “safe,” repetitive captions.
- Generalization to out-of-domain datasets (e.g., PCCD) is robust, and subjective studies confirm that high informativeness scores are well-aligned with human judgments of aesthetic relevance.
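As one concrete form of the n-gram uniqueness analysis mentioned above, a distinct n-gram ratio over the generated captions can be computed as below; the exact statistic used in the original evaluation may differ.

```python
from collections import Counter

def distinct_n(captions, n=2):
    """Share of n-grams across all generated captions that are distinct
    (1.0 = no repeated n-grams; low values signal 'safe', repetitive output)."""
    grams = Counter()
    for tokens in captions:
        grams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(grams.values())
    return len(grams) / total if total else 0.0

# Example: a repetitive pair of captions scores lower than a varied one.
repetitive = [["nice", "shot"], ["nice", "shot"]]
varied = [["lovely", "tones"], ["tilted", "horizon"]]
print(distinct_n(repetitive), distinct_n(varied))  # 0.5 vs 1.0
```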
5. Practical Implications, Limitations, and Future Directions
- Empirical Validation: CapFilt’s filtering strategy not only increases the discriminative capacity of the dataset but directly enhances downstream captioning performance—both quantitatively (standard metrics) and qualitatively (human evaluation).
- Scalability: The framework is scalable to web-scale data, can be adopted for other domains where ground-truth is noisy/unreliable, and is compatible with deeper and more complex vision architectures.
- Limitations:
- The static threshold selection for informativeness may need tuning for new domains.
- The reliance on statistical properties of n-grams may not capture subtleties such as sarcasm or context-dependent relevance.
- Broader Applications: The described CapFilt approach is directly applicable to image captioning tasks with abundant, weakly-labeled data—both for creative industries (e.g., photographic feedback) and smart content moderation pipelines.
6. Mathematical Summary and Integration
| Step | Formula | Description |
|---|---|---|
| N-gram probability | $p(w) = f(w) / \sum_{j=1}^{\lvert V \rvert} f(w_j)$ | Corpus probability of n-gram $w$ |
| Informativeness score | $s(c) = \sum_{w \in N_c} -\log p(w)$ | Score for each comment $c$ |
| LDA topic assignment | LDA over each image's caption set yields a $K$-topic soft label | Image-level topic distribution |
| Weak supervision loss | Cross-entropy between CNN output and LDA topic vector | Visual feature extractor training |
This formalism underpins a CapFilt system that robustly identifies and leverages discriminative, high-quality labels for both representation learning and automated evaluation in aesthetic image captioning.
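For completeness, the weak-supervision loss in the table can be written out explicitly. Denoting by $q_k(x)$ the LDA-inferred probability of topic $k$ for image $x$ and by $\hat{p}_k(x)$ the CNN's softmax output, the objective is the standard soft-label cross-entropy (the notation here is ours, not fixed by the text):

$$\mathcal{L}(x) = -\sum_{k=1}^{K} q_k(x) \log \hat{p}_k(x).$$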