
Extend LIFE to VLMs, MoEs, and Speculative Decoding

Extend the LLM Inference Forecast Engine (LIFE), a hardware- and dataset-agnostic analytical framework, beyond dense large language models to support Vision Language Models (VLMs), Mixture-of-Experts (MoE) architectures, and speculative decoding. This requires developing appropriate operator-level analytical models and workload characterizations so that inference metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT), and tokens-per-second (TPS) can be forecast from hardware specifications alone.


Background

The paper introduces LIFE, a lightweight, modular, operator-level analytical framework for forecasting LLM inference performance in a hardware- and dataset-agnostic manner. LIFE models compute and memory dynamics across prefill and decode phases, capturing optimizations like quantization, KV cache compression, attention variants, operator fusion, and LoRA, and forecasts metrics such as TTFT, TPOT, and TPS using hardware TOPS and bandwidth.
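To make the forecasting idea concrete, here is a minimal roofline-style sketch of the kind of estimate such an analytical model produces from peak TOPS and memory bandwidth alone. This is a hypothetical simplification for illustration, not LIFE's actual operator-level model: the function names, the 2-FLOPs-per-parameter rule of thumb, and the max-of-compute-and-memory-time bound are assumptions of this sketch.

```python
def forecast_inference(prompt_tokens, params, bytes_per_param,
                       tops, bandwidth_gbps):
    """Roofline-style estimate of TTFT, TPOT, and TPS (hypothetical sketch).

    Each phase is bounded below by the max of its compute-bound and
    memory-bound time, given peak hardware TOPS and bandwidth.
    """
    peak_flops = tops * 1e12          # TOPS -> FLOP/s
    peak_bw = bandwidth_gbps * 1e9    # GB/s -> B/s

    flops_per_token = 2 * params      # ~2 FLOPs per parameter per token
    weight_bytes = params * bytes_per_param

    # Prefill: compute over the whole prompt vs. one pass over the weights.
    ttft = max(prompt_tokens * flops_per_token / peak_flops,
               weight_bytes / peak_bw)

    # Decode: one token's compute vs. streaming all weights per token
    # (decode is typically memory-bound on most hardware).
    tpot = max(flops_per_token / peak_flops, weight_bytes / peak_bw)
    tps = 1.0 / tpot
    return ttft, tpot, tps
```

For a Llama2-7B-like FP16 model (7e9 parameters, 2 bytes each) on hardware with 100 TOPS and 1000 GB/s, this sketch predicts a memory-bound decode of about 14 ms per token, i.e. roughly 71 tokens/s.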

All analyses in this work focus on dense LLMs (e.g., Llama2-7B variants) across diverse hardware (CPU, NPU, iGPU, GPU). The authors explicitly state that extending LIFE to other model classes—Vision Language Models (VLMs), Mixture-of-Experts (MoE) architectures, and speculative decoding—remains unaddressed and is deferred to future work, making this a concrete, unresolved extension of the framework.

References

While we showcase our study on dense LLMs, extending this to Vision LLMs (VLMs), Mixture-of-Experts (MoEs) and Speculative Decoding is left for future exploration.

Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling (2508.00904 - Patwari et al., 29 Jul 2025) in Section 7: Conclusion