Applicability of ConsMax Without Fine-Tuning

Determine whether the ConsMax softmax method proposed by Liu et al. (2024), which uses INT8 inputs/outputs with internal FP16 computations, can be applied to pretrained Transformer models without any fine-tuning of the model parameters.

Background

The paper surveys prior softmax acceleration approaches and notes that several fixed-point or quantized designs rely on fine-tuning to recover accuracy, which limits their deployability on pretrained large models. In their state-of-the-art comparison, the authors discuss the ConsMax design by Liu et al. (2024), which targets INT8 inputs and outputs while performing the internal computations in floating point (FP16).

While Liu et al. report that their approach converges to the same perplexity as the original GPT-2 during training, the authors of the present paper emphasize that it has not been established whether ConsMax can be used directly without any fine-tuning. This leaves open the question of whether ConsMax can be deployed for inference on pretrained models without retraining or other adjustment.
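
For illustration of the numerical setting only, the sketch below shows a generic INT8-in/INT8-out softmax that dequantizes to FP16, computes internally in FP16, and requantizes the probabilities. The function name, scale parameters, and symmetric quantization scheme are assumptions made for this example; it does not reproduce the ConsMax design itself.

```python
import numpy as np

def int8_softmax_fp16(x_q: np.ndarray, in_scale: float, out_scale: float) -> np.ndarray:
    """Illustrative INT8-in / INT8-out softmax with FP16 internal arithmetic.

    x_q       : quantized logits, dtype int8
    in_scale  : dequantization scale for the inputs (assumed symmetric quantization)
    out_scale : quantization scale applied to the output probabilities

    Generic sketch of the INT8-I/O, FP16-internal setting discussed above,
    not the ConsMax algorithm.
    """
    # Dequantize the INT8 logits to FP16 for the internal computation.
    x = x_q.astype(np.float16) * np.float16(in_scale)
    # Numerically stable softmax, computed entirely in FP16.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    p = e / e.sum(axis=-1, keepdims=True)
    # Requantize the probabilities back to INT8.
    q = np.clip(np.round(p / np.float16(out_scale)), -128, 127).astype(np.int8)
    return q

# Example: one row of logits quantized with scale 0.1; outputs quantized with scale 1/127.
logits_q = np.array([[10, 20, 30, 40]], dtype=np.int8)
print(int8_softmax_fp16(logits_q, in_scale=0.1, out_scale=1.0 / 127))
```

Whether such a drop-in INT8/FP16 path preserves accuracy on a pretrained model without fine-tuning is exactly the uncertainty the question above targets.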

References

Although Liu et al. achieves convergence to the same perplexity as the original GPT-2 during training, it remains unclear whether this approach can be applied without fine-tuning.

VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers (2504.11227 - Wang et al., 15 Apr 2025) in Section 7: Comparison with the State-of-the-Art