
PITA Dataset: Multi-Domain Benchmarks

Updated 17 February 2026
  • The name “PITA dataset” denotes several distinct, unrelated benchmarks spanning visual food analysis, propositional theorem proving, and LPAD-based probabilistic inference.
  • It features detailed annotation processes like ingredient canonicalization, Lean proof trace generation, and structured probabilistic rule evaluation.
  • Each subset provides domain-specific insights, supporting evaluation of cross-modal estimation, reasoning trace models, and inference scaling challenges.

The designation “PITA dataset” appears as the name of separate, unrelated datasets in at least three research domains: (1) visual food analysis, (2) large-scale automated theorem proving in propositional logic, and (3) probabilistic logic programming. Each usage is closely connected to a namesake method or architectural contribution and constitutes a distinct resource. This article documents the construction, structure, and research context of all three datasets: PITA (Picture-to-Amount, visual ingredients), PITA (Propositional logic theorem proving), and PITA (Probabilistic Inference with Tabling and Answer subsumption). Each entry is scoped to its arXiv-cited usage; no cross-domain similarity is implied by nomenclature.

1. PITA: Picture-to-Amount Dataset for Visual Food Analysis

The PITA dataset introduced in "Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images" defines a novel corpus supporting relative ingredient quantity prediction from food images (Li et al., 2020). It is derived from the Recipe1M dataset, filtered and augmented through canonicalization, human annotation, and quantitative parsing.

Composition and Preprocessing

  • Source: Recipe1M (Salvador et al. 2017), originally ~1 million recipes, with ~400 K having at least one image.
  • Amount-annotated subset: ~80 K recipes containing ingredient amounts, all matched to at least one food image (as extracted in Li et al. 2019).
  • Ingredient normalization: The initial vocabulary of over 16,000 distinct tokens (including spelling/plural variants) undergoes canonicalization to yield 1,400 unique ingredient names (95%+ coverage), further filtered to 1,362 standardized food ingredients.
  • Substitution groups: 1,362 ingredients are clustered into 172 functional groups via a thresholded (cosine > 0.6) Word2Vec similarity, hand-curated, and grouped by connected components to encode substitutability classes.
  • Ingredient frequency: Highly skewed, with core ingredients (e.g., salt) appearing in up to 50% of recipes, while the majority occur in less than 1%.
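The substitution-grouping step above can be sketched as follows; this is a minimal illustration using toy embedding vectors and a union-find over the thresholded similarity graph, whereas the paper uses pretrained Word2Vec vectors plus hand-curation:

```python
import numpy as np

def substitution_groups(names, vecs, threshold=0.6):
    """Cluster ingredients into substitution groups: connect any pair whose
    cosine similarity exceeds the threshold, then take the connected
    components of the resulting graph (via a simple union-find)."""
    n = len(names)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T  # pairwise cosine similarities
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                union(i, j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

With toy 2-D vectors, `substitution_groups(["butter", "margarine", "flour"], np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))` groups butter with margarine and leaves flour alone.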

Annotation and Ground Truth

  • Absolute amounts: Human-validated, semi-automatically parsed physical amounts (grams, cups, etc.), converted to a common mass basis using standard approximations (e.g., 1 cup flour ≈ 120 g).
  • Relative amounts: For a recipe with $I$ ingredients and absolute masses $m_i$, the annotated vector $v \in \mathbb{R}_+^I$ has entries $v_i = (m_i / M) \times 1000$, where $M = \sum_i m_i$. Absent ingredients are assigned $v_i = 0$.
  • Human-in-the-loop steps: Canonicalization and substitution group validation performed by annotators.
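The relative-amount annotation above can be sketched directly from its definition; the vocabulary here is a tiny hypothetical stand-in for the 1,362 canonical ingredients:

```python
import numpy as np

# Hypothetical index for a (tiny) canonical ingredient vocabulary.
VOCAB = {"flour": 0, "sugar": 1, "butter": 2, "salt": 3}

def relative_amounts(absolute_grams, vocab=VOCAB):
    """Map absolute ingredient masses (grams) to the annotated relative
    vector v, with v_i = (m_i / M) * 1000 and v_i = 0 for absent items."""
    v = np.zeros(len(vocab))
    total = sum(absolute_grams.values())  # M = sum of absolute masses
    for name, grams in absolute_grams.items():
        v[vocab[name]] = grams / total * 1000.0
    return v
```

For example, `relative_amounts({"flour": 240, "sugar": 120, "butter": 120})` yields `[500.0, 250.0, 250.0, 0.0]`, which sums to 1000 by construction.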

Splits, Availability, and File Structure

  • Training sets: Retrieval models train on ~371K visual recipes; amount and detection models train on 48K recipes drawn from the 80K amount-annotated set. Precise validation/test sizes are not specified; test sets are reported as held-out and likely number 16–20K.
  • Format: Recipe data in JSON (title, ingredient lists with units, instructions) and image files organized by recipe ID.
  • Access: Recipe1M is publicly available under fair-use; pointers for PITA data and demo are listed at http://foodai.cs.rutgers.edu.

Biases and Limitations

Major dataset limitations include uneven ingredient frequencies (long-tail), lack of explicit regional/cuisine labels, and the prevalence of invisible ingredients (e.g., salt, oil) requiring non-visual inference.

Modeling and Benchmarks

The PITA dataset supports a cross-modal deep architecture:

  • Joint “Food Space” embedding: Uses ResNet-50 for image features, LSTM-based model for text, both projected into a shared 1024-D space.
  • Ingredient detection: Binary cross-entropy over a sparse target vector, using positive weighting to balance class frequency skew.
  • Amount prediction: Domain-driven Wasserstein (Earth Mover’s) loss leveraging substitution-group-based ingredient distances, with a softmax output ensuring the predicted mass vector satisfies $\sum_i \hat{v}_i = 1000$.
  • Baselines and results:
System | CVG | IOU | EMD
PITA retrieval | 0.51 | 0.34 | 191.1
ATTEN (Chen '18) | 0.47 | 0.32 | 205.2
ACME (Wang '19) | 0.48 | 0.33 | 199.9
PITA (full, group) | 0.75 | 0.48 | 291.2
PITA (full, ingr.) | 0.63 | 0.42 | 147.3

Metrics include coverage (CVG), intersection-over-union (IOU), and EMD (Earth Mover’s Distance), defined over predicted ingredient sets and amount vectors. The PITA dataset enables evaluation of ingredient-level and group-level predictions in a high-class-cardinality, dense detection setting (Li et al., 2020).
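Under one plausible reading of these metrics (the paper gives the exact definitions; `coverage` below is an assumption, and `emd_1d` uses a simple linear ground metric rather than the paper's substitution-group distances), they can be sketched as:

```python
import numpy as np

def iou(pred, true):
    """Intersection-over-union between predicted and true ingredient sets."""
    pred, true = set(pred), set(true)
    return len(pred & true) / len(pred | true)

def coverage(pred, true):
    """Fraction of ground-truth ingredients recovered by the prediction
    (an assumed reading of CVG; the paper defines the exact metric)."""
    pred, true = set(pred), set(true)
    return len(pred & true) / len(true)

def emd_1d(v_pred, v_true):
    """Earth Mover's Distance between two amount vectors under a linear
    ground metric: the L1 distance between their cumulative sums."""
    return np.abs(np.cumsum(v_pred) - np.cumsum(v_true)).sum()
```

For instance, `iou({"salt", "flour"}, {"flour", "egg"})` is 1/3, and moving all 1000 units of mass one position costs `emd_1d([1000.0, 0.0], [0.0, 1000.0]) == 1000.0`.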

2. PITA Dataset: Large-Scale Benchmark for Propositional Logic Reasoning

The PITA dataset introduced in "Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces" is a benchmark of over 23 million propositional logic statements, each paired with Lean-generated proofs (Tong et al., 16 Feb 2026). It is designed for evaluating neural models on both proof-guided (“Reasoning Trace”, RT) and non-trace (“Direct Prediction”, DP) theorem proving modalities.

Dataset Structure

  • Scale: Over 23 million formulas, totaling more than 95 billion tokens in their Lean representations.
  • Prompt/Completion format: Each sample consists of (a) an XML-encoded Lean proof goal (premises, conclusion), and (b) a completion comprising interleaved tactic applications and proof states (RT format) or immediate success/failure label (DP format).
  • Connectives: Variables ($p, q, r$), constants ($\top, \bot$), and binary connectives ($\wedge, \vee, \rightarrow$).
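To illustrate the kind of propositional goal and tactic step involved, consider a hand-written Lean example in this spirit (not a verbatim dataset record):

```lean
-- A propositional goal over ∧, → and ∨, closed by a single tactic step.
example (p q r : Prop) (h : p ∧ (p → q)) : q ∨ r := by
  -- h.2 : p → q applied to h.1 : p gives q; inject it into the disjunction.
  exact Or.inl (h.2 h.1)
```

In the RT format, each such tactic application would be interleaved with the proof state it produces; the DP format records only the final success/failure label.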

Task Topology: Depth and Breadth

  • Task depth $D$: Number of proof states traversed by the canonical proof of an example; formally, $D(e) = |\{\text{distinct state}_i\}|$.
  • Task breadth $B$: Number of unique formulas of size $s$ ($B(s)$) or depth $d$ ($B(d)$), modulo variable renaming and associative/commutative equivalence.

Splits

Split | Formula class | #Statements | Topology
Full | All formulas over $\{\top, \bot, p, q, r\}$ with $\wedge, \vee, \rightarrow$, $\leq 5$ atoms | 3.6M | Broad, shallow (“boule”)
Imply | Implication only ($\rightarrow$), $\leq 7$ atoms | 11.0M | Broad, shallow (“boule”)
Or | $p \to q_1 \vee \dots \vee q_n$ (“membership form”) | 8.7M | Narrow, moderate depth
PHP | Pigeonhole principle instances ($m > n$) | 0.051M | Very narrow, very deep
  • Full/Imply splits: Exponential breadth in atom count, median depths 4–8.
  • Or/PHP splits: Very narrow, depths up to hundreds.

Construction Methodology

  • Formula generation: Exhaustive enumeration (Full/Imply) or sampling (Or/PHP) subject to bounded atom count, modulo isomorphism by variable renaming and associative/commutative symmetries.
  • Proof search: Automated generation of shortest (and backtracked) proofs in Lean, followed by translation to tactic/state pairs; backtracks injected explicitly into the trace.
  • Tokenization: Samples are wrapped in XML, then segmented with BPE.
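The deduplication modulo variable renaming and associative/commutative symmetry can be sketched as a canonical-form computation; this simplified version sorts the arguments of ∧/∨ structurally and then renames variables in order of first occurrence (it collapses many, though not all, isomorphic formulas; the exact procedure is in the paper):

```python
def ac_sort(f):
    """Normalize associative/commutative connectives: flatten nested
    and/or chains and sort their arguments by a stable structural key.
    Formulas are nested tuples, e.g. ("and", "p", ("imp", "q", "r"))."""
    if isinstance(f, str):
        return f
    op, *args = f
    args = [ac_sort(a) for a in args]
    if op in ("and", "or"):
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] == op:
                flat.extend(a[1:])  # flatten same-connective children
            else:
                flat.append(a)
        args = sorted(flat, key=repr)
    return (op, *args)

def rename(f, seen=None):
    """Rename variables to p, q, r, ... in order of first occurrence."""
    if seen is None:
        seen = {}
    if isinstance(f, str):
        if f in ("top", "bot"):
            return f
        if f not in seen:
            seen[f] = "pqr"[len(seen)] if len(seen) < 3 else f"v{len(seen)}"
        return seen[f]
    op, *args = f
    return (op, *[rename(a, seen) for a in args])

def canonical(f):
    return rename(ac_sort(f))
```

Under this scheme, `("and", "q", "p")` and `("and", "r", "q")` map to the same canonical form `("and", "p", "q")`, so only one representative is kept.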

Protocol and Benchmarks

  • Training: All formulas with proof depth $D \leq D_{\text{med}}$, the median depth for that split.
  • Length generalization: The test set comprises all examples with $D > D_{\text{med}}$.
  • Metric: Classification accuracy, evaluated only on the final $\langle\text{success}\rangle$/$\langle\text{failure}\rangle$ token, disregarding intermediate trace correctness.
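The median-depth protocol above amounts to a one-line split; a minimal sketch (taking examples as pre-annotated (record, depth) pairs):

```python
import statistics

def depth_split(examples):
    """Split (example, depth) pairs at the median depth: train on
    D <= D_med, test length generalization on D > D_med."""
    d_med = statistics.median(d for _, d in examples)
    train = [e for e, d in examples if d <= d_med]
    test = [e for e, d in examples if d > d_med]
    return train, test
```

For instance, splitting four examples with depths 1–4 trains on the two shallowest and tests on the two deepest.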

Empirical Findings

  • On broad, shallow splits (Full, Imply), reasoning trace (RT) models strongly outperform direct prediction (DP) models in length generalization.
  • On narrow, deep splits (Or, PHP), DP models exceed RT performance; long RT chains in deep tasks induce high failure rates—up to 50 percentage point drops on PHP—demonstrating intrinsic trade-offs in reasoning trace-based LLM reasoning (Tong et al., 16 Feb 2026).

Usage Guidelines

  • Practical loading: Available for direct loading via Hugging Face Datasets (e.g., williamtong105/pita).
  • Input specification: Instance records are XML + Lean-tactic sequences separated by ||.
  • Maximum context: 32,000 tokens; nearly all examples in the Full/Imply/Or subsets fit within this limit, while PHP sometimes requires truncation.
  • Modeling: Supports both RT-style (autoregressive trace) and DP-style (single-shot classification) fine-tuning recipes.

3. PITA: Probabilistic Inference with Tabling and Answer Subsumption—LPAD Dataset Benchmarks

In probabilistic logic programming, “PITA dataset” does not denote a monolithic corpus but refers to six benchmark domains used for evaluation of the PITA transformation algorithm under the distribution semantics for LPADs (Riguzzi et al., 2011). Each domain is structurally distinct and features its own explanation combinatorics and program characteristics.

Domains, Structure, and Scope

Domain | Function symbols | Query depth/branching | Program size | Source
Hidden Markov Model | Yes | Depth $N$, branching 3 | $3(N+1)$ facts | Vennekens et al. (ICLP'04)
Biological path query | Yes/No | Varied, up to graph diameter | 200–10k edges | De Raedt et al. (IJCAI'07)
bloodtype | No | Depth 1–2, branching ~4–6 | Scales with loci | Meert et al. (ILP'09)
growingbody | No | Depth 1, $2^k$ | Varying $k$ | Meert et al. (ILP'09)
growinghead | No | Depth 1, branching $m$ | Varying $m$ | Meert et al. (ILP'09)
UW-CSE | No | Depth up to 3–4, moderate branching | Scaled | Meert et al. (ILP'09)
  • Probabilistic facts/rules: e.g., for the HMM, $s(0,1):\tfrac{1}{3} \vee s(0,2):\tfrac{1}{3} \vee s(0,3):\tfrac{1}{3}$.
  • Query structure: Ranges from deep recurrences (HMM) to combinatorial proofs over possible genetic transmissions (bloodtype) or graph traversals (biological network).
  • Program statistics: For the HMM, the number of explanations for a query grows as $3^N$; for growingbody, the explanation count is $2^k$ for body length $k$.
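To make the HMM domain concrete, its structure can be sketched as an LPAD fragment. This is a schematic reconstruction built around the disjunctive fact quoted above; the per-step layout is inferred from the $3(N+1)$-fact program size, and the cited sources give the exact programs:

```prolog
% One three-way annotated disjunction per time step t = 0..N
% (shown here for N = 2), each head carrying probability 1/3:
s(0,1):1/3 ; s(0,2):1/3 ; s(0,3):1/3.
s(1,1):1/3 ; s(1,2):1/3 ; s(1,3):1/3.
s(2,1):1/3 ; s(2,2):1/3 ; s(2,3):1/3.
% A query over the full chain composes the three-way choices at each
% step, so the number of explanations grows as 3^N.
```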

Evaluation Metrics and Key Findings

  • Performance: Wall-clock run-time and success count (i.e., the ability to solve all instances in a given time/memory envelope) are principal metrics. PITA achieves consistently superior scaling, often solving large instances beyond the reach of ProbLog, cplint, or CVE.
  • Tools and configuration: All experiments use XSB 3.3+ with PITA, tabling with answer subsumption (typically “or/3-zero/1”); run on uniform hardware (2.33 GHz Core 2 Duo, 4GB).

Probabilistic Parameters

  • Numeric probabilistic values and rules are supplied directly or referenced from original sources (e.g., Meert et al., Raedt et al.; see paper for precise details).

4. Comparison of Dataset Types and Usages

While sharing the “PITA” acronym, each dataset targets disparate research goals:

Domain | Data type | Purpose | Key properties
Food (PITA) | Images + structured recipes | Ingredient amount estimation | High ingredient cardinality, cross-modal annotation
Propositional logic (PITA) | Symbolic (proof traces) | RT vs. DP model generalization | Explicit depth/breadth splits, Lean proofs
LPAD inference (PITA) | Logic programs, facts | Probabilistic inference benchmarking | Explanation combinatorics, LPAD-centric

In all cases, “PITA dataset” encapsulates large, structurally rich benchmarks, often paired tightly to a specific model architecture or inference paradigm.

5. Research Significance and Impact

PITA in food analysis establishes a new calibration standard for multi-label, large-scale ingredient regression, advancing cross-modal retrieval and quantity estimation. The PITA logic reasoning dataset provides a granular benchmark for LLM-based automated theorem proving, foregrounding challenges in length generalization by topology. PITA for probabilistic logic programming offers a unified stress test for distribution semantics inference engines, isolating bottlenecks associated with recursion depth, explanation count, and symbolic grounding. Collectively, these datasets reinforce the necessity of structural, diverse, and context-aware benchmarks for evaluating modern machine learning and symbolic inference systems (Li et al., 2020, Tong et al., 16 Feb 2026, Riguzzi et al., 2011).
