SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding (2406.18200v2)

Published 26 Jun 2024 in cs.CL

Abstract: LLMs demonstrate remarkable emergent abilities across various tasks, yet fall short on complex reasoning and planning tasks. Tree-search-based reasoning methods address this by surpassing chain-of-thought prompting and encouraging exploration of intermediate steps. However, such methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SeeD, a novel and efficient inference framework that optimizes runtime speed and GPU memory management concurrently. By employing scheduled speculative execution, SeeD efficiently handles the many iterations of thought generation and state evaluation, using a rounds-scheduled strategy to manage draft-model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate the superior speedup of SeeD, providing a viable path toward batched inference in training-free speculative decoding.
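
The abstract's core idea, speculative execution scheduled across many reasoning-tree branches, can be illustrated with a minimal sketch. The snippet below shows vanilla draft-then-verify speculative decoding plus a simple round-robin dispatch of one draft model over the active branches of a tree; the function names, the toy stand-in models, and the round-robin policy are illustrative assumptions, not SeeD's actual scheduling algorithm or a batched GPU implementation.

```python
# Minimal sketch: speculative decoding applied to several reasoning-tree
# branches, with a round-robin ("rounds-scheduled") dispatch of the draft model.
# The models here are toy callables over integer tokens, purely for illustration.
from collections import deque
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int], int], List[int]],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """Draft k tokens cheaply, then verify them with the target model.

    Accept the longest prefix of drafted tokens that the target agrees with,
    then append one token from the target so progress is always made
    (standard draft-then-verify speculative decoding).
    """
    drafted = draft(prefix, k)
    accepted: List[int] = []
    for tok in drafted:
        if target(prefix + accepted) == tok:  # target agrees with the draft
            accepted.append(tok)
        else:
            break
    accepted.append(target(prefix + accepted))  # guaranteed progress
    return prefix + accepted

def rounds_scheduled_expand(branches: List[List[int]],
                            draft: Callable[[List[int], int], List[int]],
                            target: Callable[[List[int]], int],
                            rounds: int = 3,
                            k: int = 4) -> List[List[int]]:
    """Round-robin the single draft model over all active tree branches."""
    queue = deque(range(len(branches)))
    for _ in range(rounds):
        for _ in range(len(queue)):
            i = queue.popleft()
            branches[i] = speculative_step(branches[i], draft, target, k)
            queue.append(i)
    return branches

def toy_target(prefix: List[int]) -> int:
    """Stand-in target model: the next token is the previous token + 1."""
    return prefix[-1] + 1 if prefix else 0

def toy_draft(prefix: List[int], k: int) -> List[int]:
    """Stand-in draft model: guesses the next k consecutive integers."""
    start = prefix[-1] if prefix else -1
    return [start + i + 1 for i in range(k)]

if __name__ == "__main__":
    out = rounds_scheduled_expand([[0], [10]], toy_draft, toy_target, rounds=2, k=3)
    print(out)  # each branch advances several verified tokens per round
```

Because the toy draft model happens to agree with the toy target, every drafted token is accepted here; with a real draft/target pair, the accepted prefix length would vary per step, which is what a scheduling strategy over branches has to account for.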

Authors (5)
  1. Zhenglin Wang (8 papers)
  2. Jialong Wu (36 papers)
  3. Yilong Lai (6 papers)
  4. Congzhi Zhang (5 papers)
  5. Deyu Zhou (42 papers)