PizzaCommonSense: Learning to Model Commonsense Reasoning about Intermediate Steps in Cooking Recipes (2401.06930v2)
Abstract: Understanding procedural texts, such as cooking recipes, is essential for enabling machines to follow instructions and reason about tasks, a key aspect of intelligence. In cooking, these instructions can be interpreted as a series of modifications applied to a food preparation. For a model to reason effectively about a cooking recipe, it must accurately discern and understand the inputs and outputs of each intermediate step. We present a new corpus of cooking recipes enriched with descriptions of intermediate steps that specify the input and output of each step. PizzaCommonSense serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous, explicit input-output descriptions that demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-4 achieves only a 26% human-evaluated preference rate for its generations, leaving substantial room for improvement.
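To make the task concrete, below is a minimal Python sketch of how a recipe annotated with per-step input and output descriptions might be represented, and how one could prompt a model to fill in the missing input/output of an intermediate step. This is an illustrative assumption, not the released dataset format: the field names (`instruction`, `input_desc`, `output_desc`), the `build_prompt` helper, and the toy annotations are all hypothetical.

```python
# Hypothetical in-memory layout for a PizzaCommonSense-style recipe, where every
# instruction is paired with an explicit description of its input and output.
# Field names and the prompt format are illustrative assumptions, not the paper's spec.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RecipeStep:
    instruction: str            # the original recipe sentence
    input_desc: Optional[str]   # what the step operates on (None if to be predicted)
    output_desc: Optional[str]  # what the step produces (None if to be predicted)

@dataclass
class AnnotatedRecipe:
    title: str
    steps: List[RecipeStep]

def build_prompt(recipe: AnnotatedRecipe, target_idx: int) -> str:
    """Assemble a plain-text prompt asking a model to describe the input and output
    of one intermediate step, given the preceding, already-annotated steps."""
    lines = [f"Recipe: {recipe.title}", "Steps so far (instruction | input | output):"]
    for step in recipe.steps[:target_idx]:
        lines.append(f"- {step.instruction} | {step.input_desc} | {step.output_desc}")
    target = recipe.steps[target_idx]
    lines.append(f"Next instruction: {target.instruction}")
    lines.append("Describe the input and output of this step.")
    return "\n".join(lines)

# Toy usage with made-up annotations.
recipe = AnnotatedRecipe(
    title="basic pizza dough",
    steps=[
        RecipeStep("Dissolve the yeast in warm water.", "yeast; warm water", "yeast mixture"),
        RecipeStep("Stir in the flour and salt.", "yeast mixture; flour; salt", "shaggy dough"),
        RecipeStep("Knead until smooth.", None, None),  # input/output left for the model
    ],
)
print(build_prompt(recipe, target_idx=2))
```

The point of the sketch is that the supervision targets are the intermediate comestibles themselves (e.g., "yeast mixture", "shaggy dough"), which are rarely stated explicitly in the original recipe text and therefore require commonsense inference rather than memorization.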
Authors: Aissatou Diallo, Antonis Bikakis, Luke Dickens, Anthony Hunter, Rob Miller