Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching (2503.05179v2)

Published 7 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advances in LLMs have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms--Conceptual Chaining, Chunked Symbolism, and Expert Lexicons--each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 15 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 78% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.

Summary

  • The paper introduces Sketch-of-Thought (SoT), a novel framework that uses cognitive-inspired concise intermediate steps to significantly reduce LLM token usage during reasoning.
  • SoT employs three distinct paradigms—Conceptual Chaining, Chunked Symbolism, and Expert Lexicons—dynamically selected by a router model to guide LLMs towards efficient reasoning expressions.
  • Experiments demonstrate SoT reduces token usage by 76% while maintaining or improving accuracy across various reasoning tasks and languages, validating its efficiency.

The paper introduces Sketch-of-Thought (SoT), a novel prompting framework designed to enhance the efficiency of LLM reasoning by reducing token usage while preserving accuracy. SoT draws inspiration from cognitive science, specifically the concept of "sketches" as efficient intermediaries in cognitive processes. The core idea is to guide LLMs to express their reasoning processes more concisely, akin to how experts use abbreviated notations in their respective domains.

The authors identify that while Chain-of-Thought (CoT) prompting has been effective in improving reasoning accuracy, it often leads to verbose intermediate steps, increasing computational costs. Subsequent methods such as Self-Consistency, Tree of Thoughts, and Graph of Thoughts amplify token inefficiency. SoT addresses this by developing specialized prompt templates that guide models to generate concise reasoning steps.

SoT incorporates three distinct paradigms grounded in cognitive science principles:

  • Conceptual Chaining: Creates concise logical sequences between key concepts, drawing from episodic buffer integration and associative memory networks. This paradigm is effective for commonsense reasoning, multi-hop inference, and fact-based recall tasks. For example, when asked "What is the name of the currency used in Seoul?", the model responds with “#Seoul → #South Korea → Won”.
  • Chunked Symbolism: Organizes numerical and symbolic reasoning into compact, structured steps, based on working memory chunking theory. This approach condenses mathematical reasoning into dense symbolic representations. Given the question "A car accelerates at 2.5 m/s² for 10 seconds. If its initial velocity was 15 m/s, what is its final velocity?", the model outputs “a = 2.5 m/s², t = 10 s, v_i = 15 m/s; v_f = 15 + (2.5 × 10), v_f = 40 m/s”.
  • Expert Lexicons: Leverages domain-specific shorthand and specialized notation to condense reasoning, inspired by expert schema research. This paradigm employs domain-specific abbreviations and symbols, packing multiple concepts into single tokens. For instance, for the question "A patient with STEMI (ST-Elevation Myocardial Infarction) is given MONA (Morphine, Oxygen, Nitrates, Aspirin) therapy. They are allergic to aspirin. Are they at risk with this treatment?", the model outputs “STEMI → ST-Elevation MI, MONA → {Morphine, O2, Nitrates, Aspirin}, so Aspirin ∈ MONA”.
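
As a concrete illustration of how these paradigms can be operationalized, the sketch below assembles paradigm-specific system prompts. The prompt wording is hypothetical and abbreviated, not the paper's released templates:

```python
# Illustrative paradigm-specific system prompts (hypothetical wording,
# not the paper's released templates). Each instructs the model to
# "sketch" its reasoning in the paradigm's compressed notation.
SOT_PROMPTS = {
    "conceptual_chaining": (
        "Reason by linking key concepts in a short chain, e.g. "
        "#Seoul -> #South Korea -> Won. Use as few words as possible."
    ),
    "chunked_symbolism": (
        "Extract the variables, write compact equations, and solve, e.g. "
        "a = 2.5 m/s^2, t = 10 s, v_i = 15 m/s; v_f = v_i + a*t = 40 m/s."
    ),
    "expert_lexicons": (
        "Use domain shorthand and symbols, e.g. "
        "STEMI -> ST-Elevation MI, MONA -> {Morphine, O2, Nitrates, Aspirin}."
    ),
}

def build_sot_prompt(question: str, paradigm: str) -> list[dict]:
    """Assemble a chat-style prompt for the chosen SoT paradigm."""
    return [
        {"role": "system", "content": SOT_PROMPTS[paradigm]},
        {"role": "user", "content": question},
    ]
```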

A key component of SoT is a lightweight router model that dynamically selects the optimal reasoning paradigm for each query. The router, denoted $P_{\text{SoT}} = \text{ROUTER}(q)$, analyzes the question's linguistic indicators and selects the most suitable paradigm.

Here, $P_{\text{SoT}}$ is one of the three reasoning paradigms, and ROUTER is a smaller language model such as DistilBERT.
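
As a sketch, the routing step might look as follows, assuming a DistilBERT-style sequence classifier fine-tuned to map questions to the three paradigms. The checkpoint path is a placeholder, not the paper's released router:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

PARADIGMS = ["conceptual_chaining", "chunked_symbolism", "expert_lexicons"]

# Placeholder checkpoint path; the paper's router is a fine-tuned
# DistilBERT, but this identifier is illustrative.
ROUTER_NAME = "path/to/sot-router-distilbert"
tokenizer = AutoTokenizer.from_pretrained(ROUTER_NAME)
router = AutoModelForSequenceClassification.from_pretrained(
    ROUTER_NAME, num_labels=len(PARADIGMS)
)

def route(question: str) -> str:
    """Select the paradigm P_SoT = ROUTER(q) for a given question."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = router(**inputs).logits
    return PARADIGMS[int(logits.argmax(dim=-1))]
```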

The authors conducted experiments across 15 reasoning datasets, including mathematical, commonsense, logical, multi-hop, scientific, and medical tasks. They used the Qwen-2.5 family of models in 7B, 14B, and 32B parameter sizes, as well as the Qwen-2.5-VL 7B model for multimodal experiments. The results demonstrate that SoT reduces token usage by 76% with minimal accuracy impact and even improves accuracy in certain domains like mathematical and multi-hop reasoning. The multilingual experiments in Korean, German, and Italian also show consistent token reductions. The multimodal experiments using the ScienceQA dataset further validate SoT's efficiency in scenarios requiring visual and textual integration.

The paper also explores the integration of Self-Consistency with SoT, showing that the combination maintains comparable accuracy to Self-Consistency with CoT (Chain-of-Thought) while using significantly fewer tokens. This highlights the potential of SoT in ensemble-based approaches where computational efficiency is critical.
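
A minimal sketch of this combination, reusing the route and build_sot_prompt helpers from the sketches above: sample several concise SoT reasoning paths and majority-vote over the extracted answers. Here, generate and extract_answer are hypothetical stand-ins for the sampling model call and the answer parser:

```python
from collections import Counter

# Hypothetical stand-ins: `generate` calls a sampling LLM and returns its
# completion; `extract_answer` parses the final answer from that text.
def generate(prompt: list[dict], temperature: float = 0.7) -> str: ...
def extract_answer(completion: str) -> str: ...

def self_consistency_sot(question: str, n_samples: int = 5) -> str:
    """Sample n concise SoT reasoning paths and majority-vote the answer."""
    paradigm = route(question)              # P_SoT chosen once per question
    prompt = build_sot_prompt(question, paradigm)
    answers = [
        extract_answer(generate(prompt, temperature=0.7))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```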

The authors acknowledge some limitations, including the use of static exemplars within each paradigm and the need for domain-specific calibration in highly technical domains. Future work could involve implementing retrieval-augmented generation for dynamic exemplar selection and developing additional specialized paradigms.
