Batch Prompting: Efficient Inference with Large Language Model APIs (2301.08721v2)

Published 19 Jan 2023 in cs.CL and cs.AI

Abstract: Performing inference on large volumes of samples with LLMs can be computationally and financially costly in industry and real-world use. We propose batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches, instead of one sample at a time. Our method reduces both token and time costs while retaining downstream performance. We theoretically demonstrate that under a few-shot in-context learning setting, the inference costs decrease almost inverse linearly with the number of samples in each batch. We extensively validate the effectiveness of batch prompting on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU: batch prompting significantly (up to 5x with six samples in batch) reduces the LLM (Codex) inference token and time costs while achieving better or comparable performance. For state-of-the-art Chat-based LLMs, e.g., GPT-3.5 and GPT-4, we show the benefits of batch prompting also hold. Further analysis shows that the number of samples in each batch and the complexity of tasks affect its performance. Moreover, batch prompting can be applied across different reasoning methods using LLMs. Our code can be found at the site https://github.com/xlang-ai/batch-prompting.

Batch Prompting: Efficient Inference with LLM APIs

The paper, "Batch Prompting: Efficient Inference with LLM APIs," introduces a method aimed at mitigating the computational and financial burdens associated with deploying LLMs in practical settings. In applications where inference over numerous samples is required, such as enterprise-level customer service automation or extensive dataset benchmarking, the costs—both in tokens and time—can escalate swiftly. The authors propose batch prompting as an effective solution to improve inference efficiency by enabling LLMs to process multiple samples concurrently instead of sequentially.

Key Contributions

Batch prompting leverages the structure of few-shot in-context learning, in which a fixed set of demonstrations (often including reasoning steps) is prepended to every query. The authors present a theoretical framework showing that, in this setting, increasing the number of samples per batch yields an almost inverse-linear reduction in per-sample token usage, because the shared demonstration context is amortized across the batch. The empirical validation is extensive, spanning ten datasets covering commonsense question answering, arithmetic reasoning, and natural language inference/understanding. Notably, batch prompting achieves up to a fivefold decrease in token and time costs when six samples are processed per batch, without compromising downstream performance.
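The intuition behind the near inverse-linear saving can be written out with a rough token count. In the sketch below, the symbols $L_d$, $L_q$, $L_a$, and $b$ are notation introduced here (not the paper's), standing for the total length of the shared few-shot demonstrations, the average question length, the average generated answer length, and the batch size.

```latex
% Rough per-sample token cost under batch prompting (notation introduced
% here for illustration; a simplified restatement of the paper's argument).
\[
  \text{tokens per sample} \;\approx\; \frac{L_d}{b} + L_q + L_a
\]
% The shared demonstration context L_d is amortized over the b samples in a
% batch, so as b grows the cost approaches the irreducible per-sample terms
% L_q + L_a; the reduction relative to one-sample prompting is therefore
% almost inverse-linear in b.
```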

The research further investigates how the method applies across different LLMs, such as Codex, GPT-3.5, and GPT-4, showing that the benefits of batch prompting carry over to state-of-the-art Chat-based models. The paper also examines task characteristics, finding that batch prompting performance is sensitive to factors such as task complexity and input context length.
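For Chat-based models, the same batched prompt can simply be sent as a single user message. The sketch below assumes the current OpenAI Python SDK and an example model name, and reuses the illustrative helpers from the earlier sketch; it is a rough adaptation, not the paper's code.

```python
# Hedged sketch: one request to a Chat-based model answers a whole batch.
# Assumes the OpenAI Python SDK (pip install openai); the model name is only
# an example, and build_batch_prompt / parse_batch_completion are the
# illustrative helpers defined above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def batch_infer_chat(demos, samples, model="gpt-3.5-turbo"):
    prompt = build_batch_prompt(demos, samples)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    completion = resp.choices[0].message.content
    return parse_batch_completion(completion, len(samples))

# preds = batch_infer_chat(demos, samples)  # one request, b answers
```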

Experimental Results

The findings indicate comparable or even superior performance relative to standard per-sample prompting across the datasets tested. For instance, on CommonsenseQA and MNLI, batch prompting slightly improved or maintained accuracy while considerably reducing time and token costs. However, datasets with longer inputs or higher inherent complexity, such as AQuA, showed more noticeable performance degradation, which the authors attribute to the model becoming confused when many lengthy samples are packed into a single prompt.

Implications and Future Directions

By demonstrating that batch prompting substantially lowers the cost of employing LLMs in real-world applications, the paper positions it as a practical choice for budget-conscious deployments. Its ability to integrate seamlessly with various reasoning paradigms, including end-to-end prompting and program-based methods, underscores its versatility.

The paper invites future research to refine sample selection techniques within batch prompting, potentially improving model performance further. Additionally, understanding how batch prompting interacts with emerging LLM architectures and inference services could offer valuable insights into optimizing this technique over time.

In conclusion, batch prompting is a promising approach for balancing downstream performance against inference cost when deploying LLMs, thereby enhancing their accessibility and applicability in industry-scale contexts. The authors provide a compelling case for its adoption and lay the groundwork for further advances in efficient LLM utilization.

Authors (3)
  1. Zhoujun Cheng (19 papers)
  2. Jungo Kasai (38 papers)
  3. Tao Yu (282 papers)
Citations (57)