AVA-Captions: Aesthetic Image Captioning Dataset
- AVA-Captions is a large-scale dataset for aesthetic image captioning, featuring over 240,000 images and 1.3 million refined captions derived from user critiques.
- A two-stage cleaning process using standard NLP techniques and probabilistic n-gram filtering effectively removes noise and uninformative comments from the raw data.
- A weakly-supervised learning pipeline leveraging LDA and CNNs generates attribute-rich image features that significantly boost captioning performance and generalization.
The AVA-Captions dataset is a large-scale, rigorously cleaned resource for aesthetic image captioning (AIC), targeting the benchmarking and development of models that generate critical, attribute-rich textual feedback for photographs. Derived from user-contributed critiques associated with ≈250,000 photographs on dpchallenge.com (the AVA dataset), AVA-Captions enables the training and evaluation of deep neural models that move beyond generic natural image captioning to address fine-grained photographic and aesthetic considerations (Ghosal et al., 2019).
1. Source Data and Cleaning Methodology
The initial dataset, termed AVA-raw-caption, comprises approximately 250,000 photographs accompanied by about 3 million user comments ranging from highly specific critiques to generic praise. Free-form comments in this corpus exhibit substantial noise, including typos, non-English text, excessive punctuation, acronyms, and a high frequency of uninformative or “safe” remarks (e.g., “nice shot,” “awesome”). These artifacts risk biasing downstream captioning models toward vacuous outputs.
AVA-Captions applies a two-stage cleaning pipeline:
- Stage A: Standard NLP Cleaning uses NLTK to strip non-English tokens, remove excessive punctuation, and normalize case.
- Stage B: Probabilistic n-gram Filtering introduces an informativeness score for each comment. Unigrams (nouns) and bigrams (descriptor–object pairs) are extracted to construct vocabularies $U$ and $B$. For each n-gram $\omega$, a corpus probability $P(\omega)$ is calculated as its relative frequency within its vocabulary. Every comment $S$ is scored via:
$$\rho_S = -\frac{1}{2}\left[\log \prod_{i=1}^{N} P(u_i) + \log \prod_{j=1}^{M} P(b_j)\right]$$
where $u_i \in U$ and $b_j \in B$ are the unigrams and bigrams in $S$. Comments with $\rho_S$ above a threshold are retained (threshold set via validation); others are discarded, and images with zero surviving comments are dropped.
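The Stage B scoring can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: tokens come from whitespace splitting rather than noun and descriptor–object extraction, the toy corpus is invented, and `THRESHOLD` is a hypothetical value (the source only states that the threshold is set via validation).

```python
import math
from collections import Counter

def extract(comment):
    """Return (unigrams, bigrams); whitespace tokens stand in for POS-based extraction."""
    tokens = comment.lower().split()
    return tokens, list(zip(tokens, tokens[1:]))

def ngram_probabilities(ngram_lists):
    """Relative corpus frequency of every n-gram seen in the corpus."""
    counts = Counter(g for grams in ngram_lists for g in grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def informativeness(comment, p_uni, p_bi):
    """Negative half-sum of log-probabilities: rare n-grams raise the score."""
    unigrams, bigrams = extract(comment)
    log_u = sum(math.log(p_uni[u]) for u in unigrams)
    log_b = sum(math.log(p_bi[b]) for b in bigrams)
    return -0.5 * (log_u + log_b)

# Toy corpus (invented): generic praise is frequent, specific critique is rare.
comments = [
    "nice shot", "nice shot", "nice shot",
    "great colors",
    "strong diagonal leading lines draw the eye",
]
p_uni = ngram_probabilities([extract(c)[0] for c in comments])
p_bi = ngram_probabilities([extract(c)[1] for c in comments])

THRESHOLD = 3.0  # hypothetical value chosen for this toy corpus only
scores = {c: informativeness(c, p_uni, p_bi) for c in comments}
kept = [c for c in comments if scores[c] >= THRESHOLD]
```

On this toy corpus the repeated "nice shot" scores lowest and is filtered out, while the longer, rarer critique scores highest. Because the score sums over n-grams, it also grows with comment length.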
2. Dataset Statistics and Attribute Coverage
After filtering, AVA-Captions contains 240,060 images and 1,318,359 captions, with an average of approximately 5.5 highly-informative critiques per image. Instead of hand-labeling predefined attributes such as “composition” or “lighting,” the dataset leverages Latent Dirichlet Allocation (LDA) to automatically discover aesthetic “topics.” These topics correspond to co-occurring n-gram clusters, capturing a comprehensive but long-tailed distribution of photographic feedback—ranging from common genres such as “portrait” or “landscape” to niche aspects like “motion-blur” or “cute-expression.” The resulting attribute distribution encompasses both frequent and rare photographic styles and phenomena.
3. Weakly-Supervised Learning of Aesthetic Representations
To address the absence of explicit attribute labels, AVA-Captions employs a weakly-supervised pipeline to induce aesthetic-tuned image features:
- For each image, all filtered comments are merged into a “document.”
- An LDA model with $K$ topics is trained on these documents, using a vocabulary of approximately 25,000 n-grams (excluding those present in >10% of documents to eliminate residual generic terms). The LDA output $\theta_i$ is a topic distribution for each image $i$.
- A ResNet-101 CNN is trained to predict the LDA topic distribution from image pixels by minimizing a cross-entropy loss between the CNN’s softmax output $\hat{\theta}_i$ and $\theta_i$:
$$\mathcal{L} = -\sum_{i} \sum_{k=1}^{K} \theta_{i,k} \log \hat{\theta}_{i,k}$$
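The document-construction and vocabulary-pruning steps in the list above can be sketched as follows. This is an illustrative sketch only: it uses whitespace unigrams instead of the extracted n-grams, and the function names, the `max_doc_fraction` parameter, and the toy data are assumptions introduced here.

```python
from collections import Counter

def build_documents(image_comments):
    """Merge each image's filtered comments into a single token list ("document")."""
    return {
        image_id: " ".join(comments).lower().split()
        for image_id, comments in image_comments.items()
    }

def prune_vocabulary(documents, max_doc_fraction=0.10):
    """Keep terms that occur in at most max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_doc_fraction * len(documents)
    return {term for term, df in doc_freq.items() if df <= limit}

# Toy data (invented): "photo" appears in every document, the other terms in one each.
image_comments = {f"img{i}": ["Photo", f"word{i}"] for i in range(20)}
documents = build_documents(image_comments)
vocabulary = prune_vocabulary(documents)
```

With 20 toy documents the 10% limit is two documents, so the ubiquitous term "photo" is dropped and the 20 image-specific terms survive. The pruned documents would then be passed to an LDA implementation of choice.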
Upon convergence, the final classifier head is discarded, and pool5 (2048-D) activations serve as the task-specific image descriptors. These may replace or augment standard ImageNet features for captioning.
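The training objective, a cross-entropy between the LDA topic distribution and the CNN's softmax output, can be illustrated for a single image with plain Python. This is a sketch under stated assumptions: the three-topic distribution and the logits are invented, and a real pipeline would compute the same quantity inside a deep learning framework over mini-batches.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(target, logits, eps=1e-12):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    predicted = softmax(logits)
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Invented LDA topic distribution for one image (K = 3 for illustration only).
theta = [0.7, 0.2, 0.1]

matching_logits = [math.log(t) for t in theta]  # softmax reproduces theta exactly
mismatched_logits = [0.0, 0.0, 5.0]             # mass on the wrong topic

loss_match = soft_cross_entropy(theta, matching_logits)
loss_mismatch = soft_cross_entropy(theta, mismatched_logits)
entropy = -sum(t * math.log(t) for t in theta)
```

The loss reaches its minimum, the entropy of the target distribution, when the predicted distribution equals the LDA output, and it grows as the prediction shifts mass to the wrong topics.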
4. Model Benchmarking and Evaluation
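Three training configurations are benchmarked using NeuralTalk2 (ResNet-101 features; LSTM decoder): noisy data with a supervised CNN (NS), clean data with a supervised CNN (CS), and clean data with a weakly-supervised CNN (CWS):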
| Method | Training Data | CNN Weights | Comment Cleanliness |
|---|---|---|---|
| NS | Raw AVA | ImageNet | Noisy |
| CS | AVA-Captions | ImageNet | Cleaned |
| CWS | AVA-Captions | LDA weakly supervised | Cleaned + WS |
Performance on the AVA-Captions validation set (≈9,362 images) demonstrates that cleaning yields substantial improvements: CS vs. NS shows roughly +32% BLEU-1 (0.379 to 0.500) and +58% CIDEr (0.038 to 0.060). The weakly-supervised CNN (CWS) matches the ImageNet-supervised CNN (CS) on most metrics and exceeds it on BLEU-1 (0.535 vs. 0.500), supporting the efficacy of attribute-driven visual representations.
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | SPICE-1 |
|---|---|---|---|---|---|---|---|---|---|
| NS | 0.379 | 0.219 | 0.122 | 0.061 | 0.079 | 0.233 | 0.038 | 0.044 | 0.135 |
| CS | 0.500 | 0.280 | 0.149 | 0.073 | 0.105 | 0.253 | 0.060 | 0.062 | 0.144 |
| CWS | 0.535 | 0.282 | 0.150 | 0.074 | 0.107 | 0.254 | 0.059 | 0.061 | 0.144 |
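The relative gains between configurations can be recomputed directly from the table above. The short sketch below copies the BLEU-1 and CIDEr columns; the helper name `relative_gain` is introduced here for illustration.

```python
# BLEU-1 and CIDEr values copied from the results table.
results = {
    "NS":  {"BLEU-1": 0.379, "CIDEr": 0.038},
    "CS":  {"BLEU-1": 0.500, "CIDEr": 0.060},
    "CWS": {"BLEU-1": 0.535, "CIDEr": 0.059},
}

def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old

bleu1_gain = relative_gain(results["CS"]["BLEU-1"], results["NS"]["BLEU-1"])
cider_gain = relative_gain(results["CS"]["CIDEr"], results["NS"]["CIDEr"])

print(f"CS vs. NS BLEU-1: +{bleu1_gain:.1f}%")  # +31.9%
print(f"CS vs. NS CIDEr:  +{cider_gain:.1f}%")  # +57.9%
```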
Caption diversity (n-gram analysis) shows 2–3× higher lexical variety for CS/CWS versus NS across token positions. Generalization experiments using the Photo Critique Captioning Dataset (PCCD) indicate that AVA-Captions-trained models match or exceed previous baselines—even though no hand-curated aspect labels are used—suggesting robust transfer among aesthetic captioning tasks.
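One simple way to quantify the lexical variety mentioned above is to count distinct n-grams across a set of generated captions. The source does not specify the exact diversity measure, so the sketch below is an assumed proxy with invented example captions, not the evaluation protocol used in the benchmark.

```python
def distinct_ngrams(captions, n):
    """Number of unique word n-grams across all captions."""
    seen = set()
    for caption in captions:
        tokens = caption.lower().split()
        seen.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(seen)

# Invented outputs: a model collapsed onto generic praise vs. a more varied one.
generic = ["nice shot", "nice shot", "nice colors"]
varied = ["soft light on the face", "strong leading lines", "nice colors"]

generic_unigrams = distinct_ngrams(generic, 1)
generic_bigrams = distinct_ngrams(generic, 2)
varied_unigrams = distinct_ngrams(varied, 1)
```

A model trained on noisy comments tends to repeat a small set of safe phrases, which shows up as a low distinct n-gram count relative to the number of captions generated.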
Subjective evaluation confirms the efficacy of the filtering process (≥80% agreement between high informativeness scores and human judgments) and the quality of generated captions: both experts and amateurs prefer CS and CWS captions over NS captions.
5. Data Access, Format, and Licensing
The AVA-Captions dataset is available in JSON format, organized by split (train/val).
Additional resources include the 25,000-entry n-gram vocabulary and LDA artifacts. The original raw images and comments are hosted by the AVA dataset creators, while the cleaned captions, code, and LDA topic distributions are provided on GitHub (https://github.com/kghosal/AVA-Captions). AVA images are released under a CC-BY-NC license; AVA-Captions annotations and code are under the MIT license.
6. Significance and Implications
AVA-Captions constitutes the first large-scale, systematically filtered dataset for aesthetic image captioning, enabling critical feedback generation at scale from weakly-labeled web sources. The simple, unsupervised n-gram filtering process removes roughly 55% of the raw comments as “safe” or uninformative, producing a corpus that supports rich aesthetic modeling. The weakly-supervised LDA-to-CNN attribute learning strategy demonstrates that automatically discovered aesthetic “topics” can substitute for human-annotated labels in training visually grounded models.
A plausible implication is that similar pipelines could be extended to other domains where crowd-sourced textual feedback is abundant but highly variable in quality. The pronounced generalization of AVA-Captions-trained systems to external benchmarks, without manual aspect annotation, suggests a broader utility for weakly-supervised, topic-driven representation learning in multi-modal critique and recommendation systems (Ghosal et al., 2019).