Uncertainty about GPT-4 training data contents and numerical rounding behavior

Determine the composition of GPT-4’s training data and quantify the extent to which large language models round or otherwise transform continuous numerical values during generation.

Background

In shifting from categorical outcomes (e.g., award winners) to continuous macroeconomic variables (e.g., monthly inflation), the authors note that model behavior on numeric outputs may be sensitive to factors such as rounding and the nature of the training data, both of which are opaque due to proprietary constraints.

This opacity limits interpretation of the model’s predictions and motivates a technical open question regarding the training corpus and numerical generation behavior.
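As a first empirical handle on the rounding half of this question, one could probe how many decimal places a model preserves when asked to echo continuous values. The sketch below is a minimal illustration, assuming the openai Python client (v1+) with an API key in the environment; the prompt wording, the use of gpt-4 as the target model, and the decimal-place-loss metric are assumptions of this sketch, not a procedure from the paper.

```python
# Minimal probe of how much decimal precision an LLM retains when
# echoing continuous values. Assumes openai>=1.0 and OPENAI_API_KEY
# set in the environment; the prompt and metric are illustrative.
import random
import re

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def decimal_places(s: str) -> int:
    """Count digits after the decimal point in a numeric string."""
    return len(s.split(".")[1]) if "." in s else 0


def echo_number(value_str: str, model: str = "gpt-4") -> str:
    """Ask the model to repeat a number and extract what comes back."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Repeat this number exactly: {value_str}"}],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"-?\d+\.\d+|-?\d+", text)
    return match.group(0) if match else ""


random.seed(0)
losses = []
for _ in range(20):
    true_val = f"{random.uniform(0, 10):.6f}"  # six decimal places in
    echoed = echo_number(true_val)
    if echoed:
        losses.append(decimal_places(true_val) - decimal_places(echoed))

if losses:
    print(f"mean decimal places lost: {sum(losses) / len(losses):.2f}")
```

Setting temperature to 0 keeps the comparison roughly deterministic; sweeping the input precision (say, 2 to 10 decimal places) and the magnitude of the values would show where truncation or re-rounding sets in.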

References

"It is unclear what is in the training data, or to what degree LLMs round continuous variables, as OpenAI has been secretive about the training data and has not shared the source code for ChatGPT-4."

Pham et al., "Can Base ChatGPT be Used for Forecasting without Additional Optimization?" (arXiv:2404.07396, 11 Apr 2024), Section: Predicting Macroeconomic Variables.