PITA Dataset: Multi-Domain Benchmarks
- “PITA dataset” names three distinct, unrelated benchmarks across visual food analysis, propositional logic theorem proving, and LPAD-based probabilistic inference.
- It features detailed annotation processes like ingredient canonicalization, Lean proof trace generation, and structured probabilistic rule evaluation.
- Each subset provides domain-specific insights, supporting evaluation of cross-modal estimation, reasoning trace models, and inference scaling challenges.
The designation “PITA dataset” appears as the name of separate, unrelated datasets in at least three research domains: (1) visual food analysis, (2) large-scale automated theorem proving in propositional logic, and (3) probabilistic logic programming. Each usage is closely connected to a namesake method or architectural contribution and constitutes a distinct resource. This article documents the construction, structure, and research context of all three datasets: PITA (Picture-to-Amount, visual ingredients), PITA (Propositional logic theorem proving), and PITA (Probabilistic Inference with Tabling and Answer subsumption). Each entry is scoped to its arXiv-cited usage; no cross-domain similarity is implied by nomenclature.
1. PITA: Picture-to-Amount Dataset for Visual Food Analysis
The PITA dataset introduced in "Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images" defines a novel corpus supporting relative ingredient quantity prediction from food images (Li et al., 2020). It is derived from the Recipe1M dataset, filtered and augmented through canonicalization, human annotation, and quantitative parsing.
Composition and Preprocessing
- Source: Recipe1M (Salvador et al. 2017), originally ~1 million recipes, with ~400 K having at least one image.
- Amount-annotated subset: ~80 K recipes containing ingredient amounts, all matched to at least one food image (as extracted in Li et al. 2019).
- Ingredient normalization: The initial vocabulary of over 16,000 distinct tokens (including spelling/plural variants) undergoes canonicalization to yield 1,400 unique ingredient names (95%+ coverage), further filtered to 1,362 standardized food ingredients.
- Substitution groups: The 1,362 ingredients are clustered into 172 functional groups: Word2Vec similarities are thresholded at cosine > 0.6, the resulting links are hand-curated, and connected components of the similarity graph define substitutability classes.
- Ingredient frequency: Highly skewed, with core ingredients (e.g., salt) appearing in up to 50% of recipes, while the majority occur in less than 1%.
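The grouping step above can be sketched as a connected-components pass over a thresholded similarity graph. This is a minimal illustration; the toy vectors below are stand-ins, not the paper's actual Word2Vec embeddings.

```python
# Cluster ingredients into substitution groups: connect any pair whose
# embedding cosine similarity exceeds 0.6, then take connected components
# (here via a small union-find). Vectors are illustrative toys.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def substitution_groups(names, vectors, threshold=0.6):
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) > threshold:
                union(i, j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

names = ["butter", "margarine", "salt"]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.0, 1.0)]
print(substitution_groups(names, vectors))  # butter/margarine group; salt alone
```

In the paper's pipeline the resulting components are additionally hand-curated before being fixed as the 172 groups.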
Annotation and Ground Truth
- Absolute amounts: Human-validated, semi-automatically parsed physical amounts (mostly grams, cups, etc.) converted using standard approximations (e.g., 1 cup flour ≈ 120g).
- Relative amounts: For a recipe with ingredients $i_1, \dots, i_n$ and absolute masses $m_1, \dots, m_n$, the annotated vector has entries $a_j = m_j / \sum_k m_k$, so that $\sum_j a_j = 1$. Absent ingredients are assigned $a_j = 0$.
- Human-in-the-loop steps: Canonicalization and substitution group validation performed by annotators.
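The relative-amount annotation above can be illustrated with a minimal sketch; the vocabulary and recipe below are hypothetical examples, not dataset entries.

```python
# Normalize absolute ingredient masses (grams) into a relative-amount
# vector over the full vocabulary; absent ingredients get 0.0 and the
# entries sum to 1.
def relative_amounts(vocabulary, absolute_masses):
    """absolute_masses: dict mapping ingredient -> grams for one recipe."""
    total = sum(absolute_masses.values())
    return [absolute_masses.get(ing, 0.0) / total for ing in vocabulary]

vocab = ["flour", "sugar", "salt", "butter"]
recipe = {"flour": 240.0, "sugar": 100.0, "salt": 5.0}  # 2 cups flour ≈ 240 g
vec = relative_amounts(vocab, recipe)
print(vec)  # butter absent -> 0.0; entries sum to 1
```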
Splits, Availability, and File Structure
- Training sets: Retrieval models train on ~371 K visual recipes; amount and detection models train on 48 K of the 80 K amount-annotated set. Precise validation/test counts are not specified; test sets are reported as held-out and presumably number 16–20 K.
- Format: Recipe data in JSON (title, ingredient lists with units, instructions) and image files organized by recipe ID.
- Access: Recipe1M is publicly available under fair-use; pointers for PITA data and demo are listed at http://foodai.cs.rutgers.edu.
Biases and Limitations
Major dataset limitations include uneven ingredient frequencies (long-tail), lack of explicit regional/cuisine labels, and the prevalence of invisible ingredients (e.g., salt, oil) requiring non-visual inference.
Modeling and Benchmarks
The PITA dataset supports a cross-modal deep architecture:
- Joint “Food Space” embedding: Uses ResNet-50 for image features, LSTM-based model for text, both projected into a shared 1024-D space.
- Ingredient detection: Binary cross-entropy over a sparse target vector, using positive weighting to balance class frequency skew.
- Amount prediction: Domain-driven Wasserstein (Earth Mover’s) loss leveraging substitution-group-based ingredient distances, with a softmax output ensuring the predicted mass vector sums to 1.
- Baselines and results:
| System | CVG | IOU | EMD |
|---|---|---|---|
| PITA retrieval | 0.51 | 0.34 | 191.1 |
| ATTEN (Chen '18) | 0.47 | 0.32 | 205.2 |
| ACME (Wang '19) | 0.48 | 0.33 | 199.9 |
| PITA (full, group) | 0.75 | 0.48 | 291.2 |
| PITA (full, ingr.) | 0.63 | 0.42 | 147.3 |
Metrics include coverage (CVG), intersection-over-union (IOU), and EMD (Earth Mover’s Distance), defined over predicted ingredient sets and amount vectors. The PITA dataset enables evaluation of ingredient-level and group-level predictions in a high-class-cardinality, dense detection setting (Li et al., 2020).
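The set-level metrics can be illustrated as follows. These are plausible standard definitions (CVG as recall of the true ingredient set, IOU as set intersection-over-union); the paper's exact formulations may differ.

```python
# Set-level metrics over predicted vs. ground-truth ingredient sets.
def coverage(pred, true):
    """Fraction of ground-truth ingredients that were predicted (recall)."""
    return len(pred & true) / len(true)

def iou(pred, true):
    """Intersection-over-union of the two ingredient sets."""
    return len(pred & true) / len(pred | true)

pred = {"flour", "sugar", "butter", "vanilla"}
true = {"flour", "sugar", "butter", "eggs", "salt"}
print(coverage(pred, true), iou(pred, true))  # 3/5 and 3/6
```

EMD, by contrast, is computed over the amount vectors with the substitution-group distance matrix, which is why it depends on the grouping rather than on set overlap alone.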
2. PITA Dataset: Large-Scale Benchmark for Propositional Logic Reasoning
The PITA dataset introduced in "Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces" is a benchmark of over 23 million propositional logic statements, each paired with Lean-generated proofs (Tong et al., 16 Feb 2026). It is designed for evaluating neural models on both proof-guided (“Reasoning Trace”, RT) and non-trace (“Direct Prediction”, DP) theorem proving modalities.
Dataset Structure
- Scale: Over 23 million formulas, comprising more than 95 billion tokens in their Lean representations.
- Prompt/Completion format: Each sample consists of (a) an XML-encoded Lean proof goal (premises, conclusion), and (b) a completion comprising interleaved tactic applications and proof states (RT format) or immediate success/failure label (DP format).
- Connectives: Formulas are built from propositional variables, the Boolean constants, and binary connectives.
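Each sample pairs a proof goal with a Lean tactic proof. A toy propositional theorem of the kind the dataset enumerates looks like this in Lean 4 syntax (the dataset's XML wrapping and exact tactic vocabulary are not reproduced here):

```lean
-- A small propositional theorem with an explicit tactic proof.
-- Premises h1 : p → q and h2 : p yield the conclusion q.
example (p q : Prop) (h1 : p → q) (h2 : p) : q := by
  exact h1 h2
```

In the RT format, each tactic application would be interleaved with the resulting proof state; in the DP format, only a success/failure label follows the goal.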
Task Topology: Depth and Breadth
- Task depth: The number of proof states traversed by the canonical proof of an example.
- Task breadth: The number of unique formulas at a given size or depth, modulo variable renaming and associative/commutative equivalence.
Splits
| Split | Formula class | #Statements | Topology |
|---|---|---|---|
| Full | All connectives, 5 atoms | 3.6M | Broad, shallow (“boule”) |
| Imply | Implication ($\to$) only, 7 atoms | 11.0M | Broad, shallow (“boule”) |
| Or | Disjunction chains (“membership form”) | 8.7M | Narrow, moderate depth |
| PHP | Pigeonhole principle instances | 0.051M | Very narrow, very deep |
- Full/Imply splits: Exponential breadth in atom count, median depths 4–8.
- Or/PHP splits: Very narrow, depths up to hundreds.
Construction Methodology
- Formula generation: Exhaustive enumeration (Full/Imply) or sampling (Or/PHP) subject to bounded atom count, modulo isomorphism by variable renaming and associative/commutative symmetries.
- Proof search: Automated generation of shortest (and backtracked) proofs in Lean, followed by translation to tactic/state pairs; backtracks injected explicitly into the trace.
- Tokenization: XML wrapping followed by BPE segmentation.
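Deduplication modulo variable renaming and commutative symmetry can be sketched as follows; this is an illustrative normal form, not the paper's exact procedure.

```python
# Canonicalize formulas (nested tuples, e.g. ("and", "x", ("or", "y", "x")))
# modulo variable renaming and commutativity of "and"/"or": try every
# renaming, sort commutative arguments, keep the lexicographically
# smallest result. Brute force, but fine for small atom counts.
from itertools import permutations

COMMUTATIVE = {"and", "or"}

def variables(f):
    if isinstance(f, str):
        return {f}
    return set().union(*(variables(a) for a in f[1:]))

def normalize(f, mapping):
    """Apply a variable renaming, then sort commutative arguments."""
    if isinstance(f, str):
        return mapping[f]
    op, *args = f
    args = [normalize(a, mapping) for a in args]
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)
    return (op, *args)

def canonical(f):
    """Smallest normal form over all variable renamings."""
    vs = sorted(variables(f))
    candidates = []
    for perm in permutations(range(len(vs))):
        mapping = {v: f"v{perm[i]}" for i, v in enumerate(vs)}
        candidates.append(normalize(f, mapping))
    return min(candidates, key=repr)

f1 = ("and", "x", ("or", "y", "x"))
f2 = ("and", ("or", "a", "b"), "b")
print(canonical(f1) == canonical(f2))  # isomorphic modulo renaming/commutativity
```

Two formulas are kept as one entry exactly when their canonical forms coincide.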
Protocol and Benchmarks
- Training: All formulas with proof depth at most the median depth of that split.
- Length generalization: The test set comprises all examples with depth strictly greater than the training cutoff.
- Metric: Classification accuracy (evaluates only the final success/failure token, disregarding intermediate trace correctness).
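The protocol above amounts to a split on proof depth, sketched here with a hypothetical list of (id, depth) records.

```python
# Length-generalization split: train on examples whose proof depth is at
# most the split's median, test on strictly deeper ones.
import statistics

def length_generalization_split(examples):
    """examples: list of (formula_id, proof_depth) pairs."""
    median = statistics.median(d for _, d in examples)
    train = [e for e in examples if e[1] <= median]
    test = [e for e in examples if e[1] > median]
    return train, test

examples = [("f1", 2), ("f2", 3), ("f3", 5), ("f4", 8), ("f5", 12)]
train, test = length_generalization_split(examples)
print(len(train), len(test))  # 3 2 (median depth is 5)
```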
Empirical Findings
- On broad, shallow splits (Full, Imply), reasoning trace (RT) models strongly outperform direct prediction (DP) models in length generalization.
- On narrow, deep splits (Or, PHP), DP models exceed RT performance; long RT chains in deep tasks induce high failure rates—up to 50 percentage point drops on PHP—demonstrating intrinsic trade-offs in reasoning trace-based LLM reasoning (Tong et al., 16 Feb 2026).
Usage Guidelines
- Practical loading: Available for direct loading via Hugging Face Datasets (e.g., williamtong105/pita).
- Input specification: Instance records are XML plus Lean-tactic sequences separated by ||.
- Maximum context: 32,000 tokens; nearly all examples in the Full/Imply/Or splits fit, while PHP sometimes requires truncation.
- Modeling: Supports both RT-style (autoregressive trace) and DP-style (single-shot classification) fine-tuning recipes.
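A record following the || convention above can be split into its tactic/state steps with a one-liner; the record string below is a made-up stand-in, not an actual dataset entry.

```python
# Split one RT-format record into its components on the "||" separator,
# stripping surrounding whitespace and dropping empty fragments.
def split_steps(record):
    return [step.strip() for step in record.split("||") if step.strip()]

record = "<goal>...</goal> || intro h || exact h"
print(split_steps(record))  # ['<goal>...</goal>', 'intro h', 'exact h']
```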
3. PITA: Probabilistic Inference with Tabling and Answer Subsumption—LPAD Dataset Benchmarks
In probabilistic logic programming, “PITA dataset” does not denote a monolithic corpus but refers to six benchmark domains used for evaluation of the PITA transformation algorithm under the distribution semantics for LPADs (Riguzzi et al., 2011). Each domain is structurally distinct and features its own explanation combinatorics and program characteristics.
Domains, Structure, and Scope
| Domain | Function Symbols | Query Depth/Branching | Program size | Source |
|---|---|---|---|---|
| Hidden Markov Model | Yes | Depth grows with sequence length $N$, branching 3 | $3(N+1)$ facts | Vennekens et al. (ICLP’04) |
| Biological Path Query | Yes/No | Varied, up to graph diameter | 200–10k edges | De Raedt et al. (IJCAI’07) |
| bloodtype | No | Depth 1–2, branching 4–6 | scales with number of loci | Meert et al. (ILP’09) |
| growingbody | No | Depth 1, growing body length | varying | Meert et al. (ILP’09) |
| growinghead | No | Depth 1, growing head (branching) | varying | Meert et al. (ILP’09) |
| UW-CSE | No | Depth up to 3–4, moderate branching | scaled | Meert et al. (ILP’09) |
- Probabilistic facts/rules: Each domain is encoded as LPAD clauses with probabilistic annotations (e.g., state-transition and emission clauses for the HMM domain).
- Query structure: Ranges from deep recurrences (HMM) to combinatorial proofs over possible genetic transmissions (bloodtype) or graph traversals (biological network).
- Program statistics: For the HMM, the number of explanations for a query grows exponentially with sequence length; for growingbody, the explanation count grows with the clause body length.
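The exponential growth for the HMM domain can be checked directly: a 3-state HMM admits $3^N$ state paths, each a distinct explanation, for a query of length $N$. The sketch below enumerates paths explicitly and is illustrative, not the benchmark's actual encoding.

```python
# Count distinct state paths (explanations) through an HMM with a given
# number of states and sequence length by exhaustive enumeration.
from itertools import product

def num_explanations(n_states, length):
    return sum(1 for _ in product(range(n_states), repeat=length))

print([num_explanations(3, n) for n in range(1, 5)])  # [3, 9, 27, 81]
```

This combinatorial blow-up is exactly what tabling with answer subsumption mitigates: shared sub-explanations are computed once rather than re-derived per path.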
Evaluation Metrics and Key Findings
- Performance: Wall-clock run-time and success count (the number of instances solved within a given time/memory envelope) are the principal metrics. PITA achieves consistently superior scaling, often solving large instances beyond the reach of ProbLog, cplint, or CVE.
- Tools and configuration: All experiments use XSB 3.3+ with PITA, tabling with answer subsumption (typically “or/3-zero/1”); run on uniform hardware (2.33 GHz Core 2 Duo, 4GB).
Probabilistic Parameters
- Numeric probabilistic values and rules are supplied directly or referenced from original sources (e.g., Meert et al., Raedt et al.; see paper for precise details).
4. Comparison of Dataset Types and Usages
While sharing the “PITA” acronym, each dataset targets disparate research goals:
| Domain | Data Type | Purpose | Key Properties |
|---|---|---|---|
| Food (PITA) | Images + Structured | Ingredient amount estimation | High ingredient cardinality, cross-modal annotation |
| Propositional logic (PITA) | Symbolic (proof traces) | RT vs. DP model generalization | Explicit depth/breadth splits, Lean proofs |
| LPAD inference (PITA) | Logic programs, facts | Probabilistic inference benchmarking | Explanation combinatorics, LPAD-centric |
In all cases, “PITA dataset” encapsulates large, structurally rich benchmarks, often paired tightly to a specific model architecture or inference paradigm.
5. Research Significance and Impact
PITA in food analysis establishes a new calibration standard for multi-label, large-scale ingredient regression, advancing cross-modal retrieval and quantity estimation. The PITA logic reasoning dataset provides a granular benchmark for LLM-based automated theorem proving, foregrounding challenges in length generalization by topology. PITA for probabilistic logic programming offers a unified stress test for distribution semantics inference engines, isolating bottlenecks associated with recursion depth, explanation count, and symbolic grounding. Collectively, these datasets reinforce the necessity of structural, diverse, and context-aware benchmarks for evaluating modern machine learning and symbolic inference systems (Li et al., 2020, Tong et al., 16 Feb 2026, Riguzzi et al., 2011).