LLMs of Code as Commonsense Reasoners: Insights and Evaluation
The paper "LLMs of Code are Few-Shot Commonsense Learners" explores the novel approach of utilizing LLMs pre-trained on code (Code-LLMs) for structured commonsense reasoning tasks. This research presents a unique perspective on framing such tasks as code generation problems rather than adapting natural LLMs for this purpose. The findings reveal that Code-LLMs, when employed in this manner, can outperform established natural LLMs, even those fine-tuned specifically for the task.
The approach centers on converting structured commonsense reasoning tasks into code generation tasks. Structured outputs such as graphs, which prior work serializes into flat text for model consumption, are instead rendered as Python code. This representation aligns more closely with the pre-training data of Code-LLMs and improves their ability to handle the structural complexity of these tasks.
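To make this concrete, the sketch below shows one way a small script graph could be rendered as Python. The class and attribute names here are illustrative assumptions, not the paper's exact prompt templates.

```python
# Illustrative sketch of the code framing: a small "plan a birthday party"
# script graph written out as Python. Class and attribute names are
# hypothetical, not the paper's exact templates.

class Node:
    def __init__(self, description):
        self.description = description
        self.children = []


class ScriptGraph:
    goal = "plan a birthday party"

    def __init__(self):
        # Steps of the script (graph nodes).
        decide_date = Node("decide on a date")
        invite_guests = Node("invite the guests")
        order_cake = Node("order a cake")
        celebrate = Node("celebrate the birthday")

        # Partial order between steps (graph edges).
        decide_date.children = [invite_guests, order_cake]
        invite_guests.children = [celebrate]
        order_cake.children = [celebrate]

        self.steps = [decide_date, invite_guests, order_cake, celebrate]
```

A Code-LLM prompted with a few such class definitions can then be asked to complete a new class given only its `goal` attribute, so that generating the graph amounts to completing familiar-looking Python rather than emitting a flattened text serialization.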
Evaluation Over Diverse Commonsense Reasoning Tasks
The paper evaluates this methodology across three distinct tasks, each requiring unique reasoning capabilities:
- Script Generation (ProScript Dataset): The task is to generate a script, represented as a graph of partially ordered steps, that accomplishes a given goal. The Code-LLM, Codex, is shown to outperform even fine-tuned natural language models such as T5, especially on the structured output format. Metrics such as BLEU, ROUGE-L, and graph edit distance (illustrated in the sketch after this list) confirm Codex's stronger performance in generating semantically and structurally sound scripts.
- Entity State Tracking (ProPara Dataset): Codex demonstrates a strong ability to track the states of entities across procedural text using only a handful of few-shot examples rather than fine-tuning. The precision and recall results indicate that Codex approaches state-of-the-art performance with significantly less data, underscoring the efficiency of the code framing.
- Argument Graph Generation (ExplaGraphs Dataset): Codex generates argument graphs that capture the intended reasoning, with higher structural and semantic correctness than comparable natural language models. These results underline the model's ability to perform abstract reasoning effectively when the task is represented as code.
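To give a feel for the structural metrics mentioned above, here is a small, hypothetical illustration of graph edit distance between a gold and a predicted script graph using networkx. It is a sketch for intuition only, not the paper's evaluation harness, and the step texts are invented.

```python
# Hypothetical illustration of graph edit distance (GED) as a structural
# metric for script graphs; not the paper's evaluation code.
import networkx as nx


def script_graph(edges):
    """Build a directed graph whose nodes carry their step text as an attribute."""
    g = nx.DiGraph()
    for src, dst in edges:
        g.add_node(src, desc=src)
        g.add_node(dst, desc=dst)
        g.add_edge(src, dst)
    return g


gold = script_graph([
    ("decide on a date", "invite the guests"),
    ("decide on a date", "order a cake"),
    ("invite the guests", "celebrate the birthday"),
    ("order a cake", "celebrate the birthday"),
])

# A prediction that wrongly chains two steps that should be parallel.
predicted = script_graph([
    ("decide on a date", "invite the guests"),
    ("invite the guests", "order a cake"),
    ("order a cake", "celebrate the birthday"),
])

# Nodes match only when their step text is identical; the distance counts
# the node and edge insertions, deletions, and substitutions needed to align
# the two graphs.
ged = nx.graph_edit_distance(
    gold,
    predicted,
    node_match=lambda a, b: a["desc"] == b["desc"],
)
print(f"graph edit distance: {ged}")
```

Lower is better: a perfect prediction has a distance of zero, and each missing or spurious step or ordering edge adds to the count.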
Implications and Future Developments
The paper carries substantial theoretical and practical implications for AI, particularly in how structured reasoning tasks can be reframed as code generation. Practically, the approach reduces reliance on fine-tuning large natural language models, offering a more efficient path for deploying AI systems in real-world settings where labeled data is scarce.
Theoretically, it opens further avenues for exploring the intersection of code and natural language processing, suggesting that the structured, logical nature of programming languages may align well with tasks that involve complex reasoning and the generation of structured outputs.
Future work could enhance this methodology by integrating more sophisticated domain-specific code representations or by exploring similar transformations in other structured reasoning domains. Extending the approach to non-English languages and other structured data formats could further broaden its applicability, advancing AI's capacity to understand and generate complex structured knowledge.
In conclusion, this research provides compelling evidence that code-based LLMs can serve as powerful commonsense reasoners. By aligning task outputs with the structured pre-training data of Code-LLMs, this approach not only enhances performance but also offers insights into leveraging code generation for complex AI tasks.