Not All Layers of LLMs Are Necessary During Inference (2403.02181v3)
Abstract: Due to their large number of parameters, the inference phase of LLMs is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers; that is, not all layers of an LLM are necessary during inference. If we can predict the layer at which the intermediate result matches the final result (produced by evaluating all layers), we can significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer that adaptively terminates the inference process for each input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers such as SVM. Experiments on well-known LLMs, including the Llama 2 series and OPT, show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter model parameters, LLMs incorporating AdaInfer retain their generalizability across tasks.
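The abstract's recipe, computing cheap statistical features at each layer and feeding them to a lightweight classifier such as an SVM that decides when to stop, can be illustrated with a short sketch. Everything below (the `extract_features` helper, the toy training data, and the stand-in per-layer logits) is hypothetical scaffolding for illustration, not the authors' released implementation.

```python
# Illustrative sketch of per-layer early-exit gating with an SVM.
# All names and data here are placeholders, not the AdaInfer codebase.
import numpy as np
from sklearn.svm import SVC

def extract_features(layer_logits):
    """Simple per-layer statistics: top-1 probability and top-1/top-2 gap."""
    probs = np.exp(layer_logits - layer_logits.max())
    probs /= probs.sum()
    top2 = np.sort(probs)[-2:]                       # [second-best, best]
    return np.array([top2[1], top2[1] - top2[0]])    # [top-1 prob, gap]

# --- Offline: train the stopping classifier on labeled per-layer features ---
# X: features per (instance, layer); y: 1 if that layer's prediction already
# matches the final-layer prediction, else 0. Toy data stands in for both.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)
gate = SVC(kernel="rbf").fit(X_train, y_train)

# --- Online: walk the layers, exit as soon as the gate says "good enough" ---
def adaptive_infer(per_layer_logits, gate):
    for depth, logits in enumerate(per_layer_logits, start=1):
        feats = extract_features(logits)
        if gate.predict(feats.reshape(1, -1))[0] == 1:
            return int(logits.argmax()), depth       # early exit
    return int(per_layer_logits[-1].argmax()), depth # fall through to last layer

# Toy stand-in for the logits an LLM would produce at each of its layers.
fake_logits = [rng.normal(size=32) for _ in range(32)]
token, used_layers = adaptive_infer(fake_logits, gate)
print(f"predicted token id {token} using {used_layers}/32 layers")
```

In practice the gate would be trained on features collected from a real model's intermediate hidden states, with labels indicating whether the layer-wise prediction already matches the final-layer output; because the gate is external to the LLM, the model's parameters stay untouched.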
Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang