PITA Dataset: Multi-Domain Benchmarks
- “PITA dataset” names three distinct, unrelated benchmarks across visual food analysis, propositional logic theorem proving, and LPAD-based probabilistic inference.
- It features detailed annotation processes like ingredient canonicalization, Lean proof trace generation, and structured probabilistic rule evaluation.
- Each subset provides domain-specific insights, supporting evaluation of cross-modal estimation, reasoning trace models, and inference scaling challenges.
The designation “PITA dataset” appears as the name of separate, unrelated datasets in at least three research domains: (1) visual food analysis, (2) large-scale automated theorem proving in propositional logic, and (3) probabilistic logic programming. Each usage is closely connected to a namesake method or architectural contribution and constitutes a distinct resource. This article documents the construction, structure, and research context of all three datasets: PITA (Picture-to-Amount, visual ingredients), PITA (Propositional logic theorem proving), and PITA (Probabilistic Inference with Tabling and Answer subsumption). Each entry is scoped to its arXiv-cited usage; no cross-domain similarity is implied by nomenclature.
1. PITA: Picture-to-Amount Dataset for Visual Food Analysis
The PITA dataset introduced in "Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images" defines a novel corpus supporting relative ingredient quantity prediction from food images (Li et al., 2020). It is derived from the Recipe1M dataset, filtered and augmented through canonicalization, human annotation, and quantitative parsing.
Composition and Preprocessing
- Source: Recipe1M (Salvador et al. 2017), originally ~1 million recipes, with ~400 K having at least one image.
- Amount-annotated subset: ~80 K recipes containing ingredient amounts, all matched to at least one food image (as extracted in Li et al. 2019).
- Ingredient normalization: The initial vocabulary of over 16,000 distinct tokens (including spelling/plural variants) undergoes canonicalization to yield 1,400 unique ingredient names (95%+ coverage), further filtered to 1,362 standardized food ingredients.
- Substitution groups: The 1,362 ingredients are clustered into 172 functional groups: Word2Vec similarities are thresholded at cosine > 0.6, the resulting links are hand-curated, and connected components of the similarity graph define substitutability classes.
- Ingredient frequency: Highly skewed, with core ingredients (e.g., salt) appearing in up to 50% of recipes, while the majority occur in less than 1%.
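The grouping step above can be sketched as a connected-components pass over a thresholded similarity graph. This is a minimal illustration; the toy vectors below are stand-ins, not the paper's actual Word2Vec embeddings.

```python
# Cluster ingredients into substitution groups: connect any pair whose
# embedding cosine similarity exceeds 0.6, then take connected components
# (here via a small union-find). Vectors are illustrative toys.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def substitution_groups(names, vectors, threshold=0.6):
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) > threshold:
                union(i, j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

names = ["butter", "margarine", "salt"]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.0, 1.0)]
print(substitution_groups(names, vectors))  # butter/margarine group; salt alone
```

In the paper's pipeline the resulting components are additionally hand-curated before being fixed as the 172 groups.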
Annotation and Ground Truth
- Absolute amounts: Human-validated, semi-automatically parsed physical amounts (mostly grams, cups, etc.) converted using standard approximations (e.g., 1 cup flour ≈ 120g).
- Relative amounts: For a recipe with ingredients $i_1, \dots, i_n$ and absolute masses $m_1, \dots, m_n$, the annotated vector has entries $a_j = m_j / \sum_k m_k$, so that $\sum_j a_j = 1$. Absent ingredients are assigned $a_j = 0$.
- Human-in-the-loop steps: Canonicalization and substitution group validation performed by annotators.
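The relative-amount annotation above can be illustrated with a minimal sketch; the vocabulary and recipe below are hypothetical examples, not dataset entries.

```python
# Normalize absolute ingredient masses (grams) into a relative-amount
# vector over the full vocabulary; absent ingredients get 0.0 and the
# entries sum to 1.
def relative_amounts(vocabulary, absolute_masses):
    """absolute_masses: dict mapping ingredient -> grams for one recipe."""
    total = sum(absolute_masses.values())
    return [absolute_masses.get(ing, 0.0) / total for ing in vocabulary]

vocab = ["flour", "sugar", "salt", "butter"]
recipe = {"flour": 240.0, "sugar": 100.0, "salt": 5.0}  # 2 cups flour ≈ 240 g
vec = relative_amounts(vocab, recipe)
print(vec)  # butter absent -> 0.0; entries sum to 1
```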
Splits, Availability, and File Structure
- Training sets: Retrieval models train on ~371 K visual recipes; amount and detection models train on 48 K of the 80 K amount-annotated set. Precise validation/test counts are not specified; test sets are reported as held-out and presumably number 16–20 K.
- Format: Recipe data in JSON (title, ingredient lists with units, instructions) and image files organized by recipe ID.
- Access: Recipe1M is publicly available under fair-use; pointers for PITA data and demo are listed at http://foodai.cs.rutgers.edu.
Biases and Limitations
Major dataset limitations include uneven ingredient frequencies (long-tail), lack of explicit regional/cuisine labels, and the prevalence of invisible ingredients (e.g., salt, oil) requiring non-visual inference.
Modeling and Benchmarks
The PITA dataset supports a cross-modal deep architecture:
- Joint “Food Space” embedding: Uses ResNet-50 for image features, LSTM-based model for text, both projected into a shared 1024-D space.
- Ingredient detection: Binary cross-entropy over a sparse target vector, using positive weighting to balance class frequency skew.
- Amount prediction: Domain-driven Wasserstein (Earth Mover’s) loss leveraging substitution-group-based ingredient distances, with a softmax output ensuring the predicted mass vector sums to 1.
- Baselines and results:
| System | CVG | IOU | EMD |
|---|---|---|---|
| PITA retrieval | 0.51 | 0.34 | 191.1 |
| ATTEN (Chen '18) | 0.47 | 0.32 | 205.2 |
| ACME (Wang '19) | 0.48 | 0.33 | 199.9 |
| PITA (full, group) | 0.75 | 0.48 | 291.2 |
| PITA (full, ingr.) | 0.63 | 0.42 | 147.3 |
Metrics include coverage (CVG), intersection-over-union (IOU), and EMD (Earth Mover’s Distance), defined over predicted ingredient sets and amount vectors. The PITA dataset enables evaluation of ingredient-level and group-level predictions in a high-class-cardinality, dense detection setting (Li et al., 2020).
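The set-level metrics can be illustrated as follows. These are plausible standard definitions (CVG as recall of the true ingredient set, IOU as set intersection-over-union); the paper's exact formulations may differ.

```python
# Set-level metrics over predicted vs. ground-truth ingredient sets.
def coverage(pred, true):
    """Fraction of ground-truth ingredients that were predicted (recall)."""
    return len(pred & true) / len(true)

def iou(pred, true):
    """Intersection-over-union of the two ingredient sets."""
    return len(pred & true) / len(pred | true)

pred = {"flour", "sugar", "butter", "vanilla"}
true = {"flour", "sugar", "butter", "eggs", "salt"}
print(coverage(pred, true), iou(pred, true))  # 3/5 and 3/6
```

EMD, by contrast, is computed over the amount vectors with the substitution-group distance matrix, which is why it depends on the grouping rather than on set overlap alone.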
2. PITA Dataset: Large-Scale Benchmark for Propositional Logic Reasoning
The PITA dataset introduced in "Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces" is a benchmark of over 23 million propositional logic statements, each paired with Lean-generated proofs (Tong et al., 16 Feb 2026). It is designed for evaluating neural models on both proof-guided (“Reasoning Trace”, RT) and non-trace (“Direct Prediction”, DP) theorem proving modalities.
Dataset Structure
- Scale: Over 23 million formulas, comprising more than 95 billion tokens in their Lean representations.
- Prompt/Completion format: Each sample consists of (a) an XML-encoded Lean proof goal (premises, conclusion), and (b) a completion comprising interleaved tactic applications and proof states (RT format) or immediate success/failure label (DP format).
- Connectives: Formulas are built from propositional variables, the Boolean constants, and binary connectives.
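Each sample pairs a proof goal with a Lean tactic proof. A toy propositional theorem of the kind the dataset enumerates looks like this in Lean 4 syntax (the dataset's XML wrapping and exact tactic vocabulary are not reproduced here):

```lean
-- A small propositional theorem with an explicit tactic proof.
-- Premises h1 : p → q and h2 : p yield the conclusion q.
example (p q : Prop) (h1 : p → q) (h2 : p) : q := by
  exact h1 h2
```

In the RT format, each tactic application would be interleaved with the resulting proof state; in the DP format, only a success/failure label follows the goal.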
Task Topology: Depth and Breadth
- Task depth: The number of proof states traversed by the canonical proof of an example.
- Task breadth: The number of unique formulas at a given size or depth, modulo variable renaming and associative/commutative equivalence.
Splits
| Split | Formula class | #Statements | Topology |
|---|---|---|---|
| Full | All connectives, 5 atoms | 3.6M | Broad, shallow (“boule”) |
| Imply | Implication ($\to$) only, 7 atoms | 11.0M | Broad, shallow (“boule”) |
| Or | Disjunction chains (“membership form”) | 8.7M | Narrow, moderate depth |
| PHP | Pigeonhole principle instances | 0.051M | Very narrow, very deep |
- Full/Imply splits: Exponential breadth in atom count, median depths 4–8.
- Or/PHP splits: Very narrow, depths up to hundreds.
Construction Methodology
- Formula generation: Exhaustive enumeration (Full/Imply) or sampling (Or/PHP) subject to bounded atom count, modulo isomorphism by variable renaming and associative/commutative symmetries.
- Proof search: Automated generation of shortest (and backtracked) proofs in Lean, followed by translation to tactic/state pairs; backtracks injected explicitly into the trace.
- Tokenization: XML wrapping followed by BPE segmentation.
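Deduplication modulo variable renaming and commutative symmetry can be sketched as follows; this is an illustrative normal form, not the paper's exact procedure.

```python
# Canonicalize formulas (nested tuples, e.g. ("and", "x", ("or", "y", "x")))
# modulo variable renaming and commutativity of "and"/"or": try every
# renaming, sort commutative arguments, keep the lexicographically
# smallest result. Brute force, but fine for small atom counts.
from itertools import permutations

COMMUTATIVE = {"and", "or"}

def variables(f):
    if isinstance(f, str):
        return {f}
    return set().union(*(variables(a) for a in f[1:]))

def normalize(f, mapping):
    """Apply a variable renaming, then sort commutative arguments."""
    if isinstance(f, str):
        return mapping[f]
    op, *args = f
    args = [normalize(a, mapping) for a in args]
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)
    return (op, *args)

def canonical(f):
    """Smallest normal form over all variable renamings."""
    vs = sorted(variables(f))
    candidates = []
    for perm in permutations(range(len(vs))):
        mapping = {v: f"v{perm[i]}" for i, v in enumerate(vs)}
        candidates.append(normalize(f, mapping))
    return min(candidates, key=repr)

f1 = ("and", "x", ("or", "y", "x"))
f2 = ("and", ("or", "a", "b"), "b")
print(canonical(f1) == canonical(f2))  # isomorphic modulo renaming/commutativity
```

Two formulas are kept as one entry exactly when their canonical forms coincide.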
Protocol and Benchmarks
- Training: All formulas with proof depth at most the median depth of that split.
- Length generalization: The test set comprises all examples with depth strictly greater than the training cutoff.
- Metric: Classification accuracy (evaluates only the final success/failure token, disregarding intermediate trace correctness).
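The protocol above amounts to a split on proof depth, sketched here with a hypothetical list of (id, depth) records.

```python
# Length-generalization split: train on examples whose proof depth is at
# most the split's median, test on strictly deeper ones.
import statistics

def length_generalization_split(examples):
    """examples: list of (formula_id, proof_depth) pairs."""
    median = statistics.median(d for _, d in examples)
    train = [e for e in examples if e[1] <= median]
    test = [e for e in examples if e[1] > median]
    return train, test

examples = [("f1", 2), ("f2", 3), ("f3", 5), ("f4", 8), ("f5", 12)]
train, test = length_generalization_split(examples)
print(len(train), len(test))  # 3 2 (median depth is 5)
```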
Empirical Findings
- On broad, shallow splits (Full, Imply), reasoning trace (RT) models strongly outperform direct prediction (DP) models in length generalization.
- On narrow, deep splits (Or, PHP), DP models exceed RT performance; long RT chains in deep tasks induce high failure rates—up to 50 percentage point drops on PHP—demonstrating intrinsic trade-offs in reasoning trace-based LLM reasoning (Tong et al., 16 Feb 2026).
Usage Guidelines
- Practical loading: Available for direct loading via Hugging Face Datasets (e.g., williamtong105/pita).
- Input specification: Instance records are XML plus Lean-tactic sequences separated by ||.
- Maximum context: 32,000 tokens; nearly all examples in the Full/Imply/Or splits fit, while PHP sometimes requires truncation.
- Modeling: Supports both RT-style (autoregressive trace) and DP-style (single-shot classification) fine-tuning recipes.
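A record following the || convention above can be split into its tactic/state steps with a one-liner; the record string below is a made-up stand-in, not an actual dataset entry.

```python
# Split one RT-format record into its components on the "||" separator,
# stripping surrounding whitespace and dropping empty fragments.
def split_steps(record):
    return [step.strip() for step in record.split("||") if step.strip()]

record = "<goal>...</goal> || intro h || exact h"
print(split_steps(record))  # ['<goal>...</goal>', 'intro h', 'exact h']
```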
3. PITA: Probabilistic Inference with Tabling and Answer Subsumption—LPAD Dataset Benchmarks
In probabilistic logic programming, “PITA dataset” does not denote a monolithic corpus but refers to six benchmark domains used for evaluation of the PITA transformation algorithm under the distribution semantics for LPADs (Riguzzi et al., 2011). Each domain is structurally distinct and features its own explanation combinatorics and program characteristics.
Domains, Structure, and Scope
| Domain | Function Symbols | Query Depth/Branching | Program size | Source |
|---|---|---|---|---|
| Hidden Markov Model | Yes | Depth grows with sequence length $N$, branching 3 | $3(N+1)$ facts | Vennekens et al. (ICLP’04) |
| Biological Path Query | Yes/No | Varied, up to graph diameter | 200–10k edges | De Raedt et al. (IJCAI’07) |
| bloodtype | No | Depth 1–2, branching 4–6 | scales with number of loci | Meert et al. (ILP’09) |
| growingbody | No | Depth 1, growing body length | varying | Meert et al. (ILP’09) |
| growinghead | No | Depth 1, growing head (branching) | varying | Meert et al. (ILP’09) |
| UW-CSE | No | Depth up to 3–4, moderate branching | scaled | Meert et al. (ILP’09) |
- Probabilistic facts/rules: Each domain is encoded as LPAD clauses with probabilistic annotations (e.g., state-transition and emission clauses for the HMM domain).
- Query structure: Ranges from deep recurrences (HMM) to combinatorial proofs over possible genetic transmissions (bloodtype) or graph traversals (biological network).
- Program statistics: For the HMM, the number of explanations for a query grows exponentially with sequence length; for growingbody, the explanation count grows with the clause body length.
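The exponential growth for the HMM domain can be checked directly: a 3-state HMM admits $3^N$ state paths, each a distinct explanation, for a query of length $N$. The sketch below enumerates paths explicitly and is illustrative, not the benchmark's actual encoding.

```python
# Count distinct state paths (explanations) through an HMM with a given
# number of states and sequence length by exhaustive enumeration.
from itertools import product

def num_explanations(n_states, length):
    return sum(1 for _ in product(range(n_states), repeat=length))

print([num_explanations(3, n) for n in range(1, 5)])  # [3, 9, 27, 81]
```

This combinatorial blow-up is exactly what tabling with answer subsumption mitigates: shared sub-explanations are computed once rather than re-derived per path.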
Evaluation Metrics and Key Findings
- Performance: Wall-clock run-time and success count (the number of instances solved within a given time/memory envelope) are the principal metrics. PITA achieves consistently superior scaling, often solving large instances beyond the reach of ProbLog, cplint, or CVE.
- Tools and configuration: All experiments use XSB 3.3+ with PITA, tabling with answer subsumption (typically “or/3-zero/1”); run on uniform hardware (2.33 GHz Core 2 Duo, 4GB).
Probabilistic Parameters
- Numeric probabilistic values and rules are supplied directly or referenced from original sources (e.g., Meert et al., Raedt et al.; see paper for precise details).
4. Comparison of Dataset Types and Usages
While sharing the “PITA” acronym, each dataset targets disparate research goals:
| Domain | Data Type | Purpose | Key Properties |
|---|---|---|---|
| Food (PITA) | Images + Structured | Ingredient amount estimation | High ingredient cardinality, cross-modal annotation |
| Propositional logic (PITA) | Symbolic (proof traces) | RT vs. DP model generalization | Explicit depth/breadth splits, Lean proofs |
| LPAD inference (PITA) | Logic programs, facts | Probabilistic inference benchmarking | Explanation combinatorics, LPAD-centric |
In all cases, “PITA dataset” encapsulates large, structurally rich benchmarks, often paired tightly to a specific model architecture or inference paradigm.
5. Research Significance and Impact
PITA in food analysis establishes a new calibration standard for multi-label, large-scale ingredient regression, advancing cross-modal retrieval and quantity estimation. The PITA logic reasoning dataset provides a granular benchmark for LLM-based automated theorem proving, foregrounding challenges in length generalization by topology. PITA for probabilistic logic programming offers a unified stress test for distribution semantics inference engines, isolating bottlenecks associated with recursion depth, explanation count, and symbolic grounding. Collectively, these datasets reinforce the necessity of structural, diverse, and context-aware benchmarks for evaluating modern machine learning and symbolic inference systems (Li et al., 2020, Tong et al., 16 Feb 2026, Riguzzi et al., 2011).