- The paper introduces Sketch-of-Thought (SoT), a novel framework that uses cognitive-inspired concise intermediate steps to significantly reduce LLM token usage during reasoning.
- SoT employs three distinct paradigms—Conceptual Chaining, Chunked Symbolism, and Expert Lexicons—dynamically selected by a router model to guide LLMs towards efficient reasoning expressions.
- Experiments demonstrate SoT reduces token usage by 76% while maintaining or improving accuracy across various reasoning tasks and languages, validating its efficiency.
The paper introduces Sketch-of-Thought (SoT), a novel prompting framework designed to enhance the efficiency of LLM reasoning by reducing token usage while preserving accuracy. SoT draws inspiration from cognitive science, specifically the concept of "sketches" as efficient intermediaries in cognitive processes. The core idea is to guide LLMs to express their reasoning processes more concisely, akin to how experts use abbreviated notations in their respective domains.
The authors observe that while Chain-of-Thought (CoT) prompting improves reasoning accuracy, it often produces verbose intermediate steps that inflate computational costs. Subsequent methods such as Self-Consistency, Tree of Thoughts, and Graph of Thoughts further amplify this token inefficiency by sampling or exploring multiple reasoning paths. SoT addresses this with specialized prompt templates that guide models to generate concise reasoning steps.
SoT incorporates three distinct paradigms grounded in cognitive science principles:
- Conceptual Chaining: Creates concise logical sequences between key concepts, drawing from episodic buffer integration and associative memory networks. This paradigm is effective for commonsense reasoning, multi-hop inference, and fact-based recall tasks. For example, when asked "What is the name of the currency used in Seoul?", the model responds with “#Seoul → #South Korea → Won”.
- Chunked Symbolism: Organizes numerical and symbolic reasoning into compact, structured steps, based on working memory chunking theory. This approach condenses mathematical reasoning into dense symbolic representations. Given the question "A car accelerates at 2.5 m/s² for 10 seconds. If its initial velocity was 15 m/s, what is its final velocity?", the model outputs “a = 2.5 m/s², t = 10 s, vi = 15 m/s; vf = 15 + (2.5 × 10), vf = 40 m/s”.
- Expert Lexicons: Leverages domain-specific shorthand and specialized notation to condense reasoning, inspired by expert schema research. This paradigm employs domain-specific abbreviations and symbols, packing multiple concepts into single tokens. For instance, for the question "A patient with STEMI (ST-Elevation Myocardial Infarction) is given MONA (Morphine, Oxygen, Nitrates, Aspirin) therapy. They are allergic to aspirin. Are they at risk with this treatment?", the model outputs “STEMI → ST-Elevation MI, MONA → {Morphine, O2, Nitrates, Aspirin}, so Aspirin ∈ MONA”.
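The three paradigms above can be operationalized as paradigm-specific system prompts with in-context exemplars. A minimal sketch follows; the prompt wording and the `build_prompt` helper are illustrative, not the paper's exact templates:

```python
# Illustrative paradigm-specific system prompts (a sketch, not the
# paper's actual templates). Each embeds one exemplar from the paper.
PARADIGM_PROMPTS = {
    "conceptual_chaining": (
        "Reason as a short chain of key concepts linked by arrows. "
        "Example: Q: What is the currency used in Seoul? "
        "A: #Seoul -> #South Korea -> Won"
    ),
    "chunked_symbolism": (
        "Condense mathematical reasoning into compact symbolic steps. "
        "Example: a = 2.5 m/s^2, t = 10 s, vi = 15 m/s; "
        "vf = 15 + (2.5 * 10) = 40 m/s"
    ),
    "expert_lexicons": (
        "Use domain shorthand and specialized notation. "
        "Example: STEMI -> ST-Elevation MI, MONA = {Morphine, O2, "
        "Nitrates, Aspirin}, Aspirin in MONA -> contraindicated"
    ),
}


def build_prompt(paradigm: str, question: str) -> str:
    """Prepend the chosen paradigm's system prompt to the question."""
    return f"{PARADIGM_PROMPTS[paradigm]}\n\nQ: {question}\nA:"
```

The returned string would then be sent to the target LLM as its prompt, so the model imitates the concise exemplar style rather than verbose CoT.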
A key component of SoT is a lightweight router model that dynamically selects the optimal reasoning paradigm for each query: P_SoT = ROUTER(q), where P_SoT is one of the three reasoning paradigms and ROUTER is a small pretrained language model (the paper uses DistilBERT). The router analyzes linguistic indicators in the question q and returns the most suitable paradigm.
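The router's interface can be sketched with a simple keyword heuristic. This rule-based stand-in is purely illustrative of the ROUTER(q) contract; the paper trains a DistilBERT classifier rather than using hand-written rules:

```python
import re

def route(question: str) -> str:
    """Heuristic stand-in for the paper's trained DistilBERT router:
    pick a reasoning paradigm from surface cues in the question."""
    # Digits suggest numerical/symbolic reasoning.
    if re.search(r"\d", question):
        return "chunked_symbolism"
    # Runs of 3+ capital letters crudely signal technical acronyms.
    if re.search(r"\b[A-Z]{3,}\b", question):
        return "expert_lexicons"
    # Default to conceptual chaining for commonsense/multi-hop queries.
    return "conceptual_chaining"
```

In the real system, the routed paradigm determines which prompt template and exemplars are used for the downstream LLM call.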
The authors conducted experiments across 15 reasoning datasets, including mathematical, commonsense, logical, multi-hop, scientific, and medical tasks. They used the Qwen-2.5 family of models in 7B, 14B, and 32B parameter sizes, as well as the Qwen-2.5-VL 7B model for multimodal experiments. The results demonstrate that SoT reduces token usage by 76% with minimal accuracy impact and even improves accuracy in certain domains like mathematical and multi-hop reasoning. The multilingual experiments in Korean, German, and Italian also show consistent token reductions. The multimodal experiments using the ScienceQA dataset further validate SoT's efficiency in scenarios requiring visual and textual integration.
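The headline 76% figure is a relative reduction in reasoning tokens versus CoT. The calculation can be made explicit with illustrative token counts (the numbers below are hypothetical, not the paper's measurements):

```python
def token_reduction(cot_tokens: float, sot_tokens: float) -> float:
    """Percentage of reasoning tokens saved by SoT relative to CoT."""
    return 100.0 * (cot_tokens - sot_tokens) / cot_tokens

# Hypothetical averages: CoT emits 200 reasoning tokens, SoT emits 48.
print(round(token_reduction(200, 48)))  # -> 76
```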
The paper also explores the integration of Self-Consistency with SoT, showing that the combination maintains comparable accuracy to Self-Consistency with CoT (Chain-of-Thought) while using significantly fewer tokens. This highlights the potential of SoT in ensemble-based approaches where computational efficiency is critical.
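The combination works because Self-Consistency only needs the final answers of several independently sampled reasoning chains, so each chain can be a short SoT sketch instead of a long CoT trace. A minimal majority-vote sketch, where `sample_answer` is a hypothetical stand-in for one LLM call that returns a chain's final answer:

```python
from collections import Counter
from typing import Callable, List

def self_consistency(sample_answer: Callable[[], str], n: int = 5) -> str:
    """Sample n answers, one per independent reasoning chain,
    and return the majority-vote result."""
    answers: List[str] = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a stubbed sampler standing in for n LLM calls:
stub = iter(["40 m/s", "40 m/s", "35 m/s", "40 m/s", "40 m/s"])
print(self_consistency(lambda: next(stub)))  # -> 40 m/s
```

Since each SoT chain uses far fewer tokens than a CoT chain, the total cost of the n-sample ensemble drops proportionally.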
The authors acknowledge some limitations, including the use of static exemplars within each paradigm and the need for domain-specific calibration in highly technical domains. Future work could involve implementing retrieval-augmented generation for dynamic exemplar selection and developing additional specialized paradigms.