
Revealing economic facts: LLMs know more than they say (2505.08662v1)

Published 13 May 2025 in cs.CL, cs.LG, econ.GN, and q-fin.EC

Abstract: We investigate whether the hidden states of LLMs can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models' text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.


Summary

  • The paper shows that LLM embeddings encode detailed economic and financial data not easily retrieved through text outputs.
  • The study employs a Linear Model on Embeddings (LME) on county- and firm-level data, achieving superior accuracy, especially for less common statistics.
  • The work highlights practical benefits in data imputation and super-resolution, reaching near-peak performance with minimal labeled data and effective transfer learning.

This paper, "Revealing economic facts: LLMs know more than they say" (2505.08662), investigates whether the latent knowledge embedded in the hidden states (embeddings) of LLMs can be effectively used to estimate economic and financial statistics, particularly at granular levels. The central hypothesis is that LLMs, trained on vast text corpora including relevant data, store information about entities (like regions or firms) in their internal representations, even if they cannot explicitly retrieve or articulate this knowledge accurately in their text outputs.

The paper focuses on estimating county-level variables in the US, UK, EU, and Germany (e.g., unemployment, GDP per capita, population) and firm-level financial variables for US listed companies (e.g., total assets, profitability metrics). The key methodology involves training a simple linear model, termed Linear Model on Embeddings (LME), on the hidden states of the last token of a prompt that identifies the entity and the variable of interest. This approach is then compared against directly parsing numeric values from the LLM's text output for the same prompt.

For practical implementation, the authors used open-source LLMs from the Llama 3 (1B, 8B, 70B parameters) and Phi-3 (3.8B parameters) families, which provide access to hidden states. They also tested a reasoning-tuned model (Qwen QwQ) against its base model (Qwen 2.5). Prompting strategies were explored, with a simple completion prompt ("The {variable} in {region} in {year} was") generally yielding the best performance for LME. The LME is typically a ridge regression model trained on the full 4096 dimensions of the hidden-state vector, often from layer 25, though performance was relatively consistent across later layers. Performance was evaluated with grouped cross-validation (splitting data by state, country, etc.) and Spearman correlation, which mitigates the impact of outliers; these are especially prevalent in the LLMs' text outputs, which often cluster around specific values or fail to parse cleanly. Log or cubic transformations were applied to skewed variables before analysis.
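The LME recipe above can be sketched end to end. The snippet below is a minimal illustration in which the embeddings are synthetic stand-ins (in the paper they would be, e.g., 4096-dimensional last-token hidden states from Llama 3); the dimensions, ridge penalty, and group labels are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def spearman(a, b):
    """Spearman rank correlation (ignoring tie corrections)."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
n, d = 200, 64                          # paper: ~thousands of counties, 4096 dims
X = rng.normal(size=(n, d))             # stand-in for last-token hidden states
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)   # target, e.g. log unemployment
groups = rng.integers(0, 10, size=n)        # stand-in for grouping by US state

# Grouped cross-validation: hold out one group (e.g. one state) at a time
scores = []
for g in range(10):
    test = groups == g
    w = ridge_fit(X[~test], y[~test], alpha=10.0)
    scores.append(spearman(X[test] @ w, y[test]))
print(f"mean Spearman over held-out groups: {np.mean(scores):.3f}")
```

In practice the only substantive change is replacing the random `X` with hidden states extracted from the model (e.g., via `output_hidden_states=True` in Hugging Face transformers, taking the chosen layer at the final prompt token) and grouping by actual states or countries.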

The core findings demonstrate that the LME approach often significantly outperforms the direct text output from the LLMs, particularly for less common statistics where the text output is less reliable. For widely known statistics like population or aggregate unemployment rates, text output can be competitive or even superior. Performance of LME generally improves with increasing model size, and while text output accuracy also increases, the performance gap where LME is superior remains substantial even for larger models like Llama 3 70B. The reasoning model's text output did not consistently beat the base model's text output and was computationally much more expensive than LME.

A crucial practical implication explored is the data efficiency of LME. Learning curve analysis revealed that the LME model can achieve near-peak performance with relatively few labeled examples (often just a few dozen), making it viable for scenarios where extensive labeled data is scarce.
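That learning-curve behaviour can be reproduced in miniature: fitting the same kind of ridge probe on progressively larger labelled subsets shows scores saturating early. As before, the embeddings here are synthetic stand-ins and the subset sizes are illustrative, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
X = rng.normal(size=(500, d))           # stand-in for hidden-state embeddings
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=500)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

for n in (10, 25, 50, 100, 400):        # a few dozen labels already get close
    Xn, yn = X_tr[:n], y_tr[:n]
    w = np.linalg.solve(Xn.T @ Xn + 10.0 * np.eye(d), Xn.T @ yn)
    print(f"n={n:4d}  held-out Spearman={spearman(X_te @ w, y_te):.3f}")
```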

The paper also investigates transfer learning. A simple approach training on embeddings of other variables was not consistently successful. However, a refined transfer learning method leveraging the LLM's own text output as noisy labels for the target variable proved effective. By including both ground truth labels (from other variables) and noisy text labels (for the target variable) in the training data for a neural network, the model learned to predict the target variable from its embeddings without requiring any ground truth labels for that specific variable. This approach consistently outperformed direct text output for the target variable.
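A stripped-down version of the noisy-label idea: even when a probe is trained only on noisy labels (standing in for the LLM's text outputs), the linear fit averages out label noise, so its predictions can rank entities better than the noisy labels themselves. The paper's actual method uses a neural network and also mixes in ground-truth labels from other variables; this sketch keeps only the denoising intuition, on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 300, 64
X = rng.normal(size=(n, d))                  # stand-in entity embeddings
w_true = rng.normal(size=d)
y_true = X @ w_true                          # unobserved ground truth
y_text = y_true + 8.0 * rng.normal(size=n)   # noisy "LLM text output" labels

def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Fit the probe on the noisy labels only; the fit averages out label noise
w = np.linalg.solve(X.T @ X + 10.0 * np.eye(d), X.T @ y_text)
print(f"text labels vs truth:  {spearman(y_text, y_true):.3f}")
print(f"probe  preds vs truth: {spearman(X @ w, y_true):.3f}")
```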

Furthermore, the paper demonstrates the practical utility of LLM embeddings in downstream data processing tasks:

  1. Data Imputation: Adding the first few PCA components of generic entity embeddings (obtained from prompts containing only the entity name) as features to standard imputation algorithms (e.g., Gaussian copula, Bayesian Ridge, Random Forest) consistently improved imputation accuracy across all datasets compared to using only the existing numeric features.
  2. Super-resolution: Training an LME on data from a coarser geographic level (e.g., US states) and applying it to embeddings of entities at a finer level (e.g., US counties) successfully estimated statistics at the finer level. This super-resolution LME approach often outperformed both the text output for the finer level and a naive baseline that simply projected the coarser-level values down. This suggests embeddings capture sub-regional variations.
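The super-resolution setup in item 2 can be sketched as follows: fit the probe on coarse-level (state-mean) embeddings and targets, then apply it to fine-grained (county) embeddings, comparing against the naive baseline that copies each state's value onto all of its counties. All quantities are synthetic, and the linear relationship between embeddings and target is an assumption made so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_states, per_state = 32, 50, 10
w_true = rng.normal(size=d)

centers = rng.normal(size=(n_states, d))                # state-level structure
X_county = np.repeat(centers, per_state, axis=0) \
         + rng.normal(size=(n_states * per_state, d))   # sub-state variation
y_county = X_county @ w_true + 0.1 * rng.normal(size=n_states * per_state)
state_of = np.repeat(np.arange(n_states), per_state)

# Coarse-level training data: state-mean embeddings and state-mean targets
X_state = np.stack([X_county[state_of == s].mean(0) for s in range(n_states)])
y_state = np.array([y_county[state_of == s].mean() for s in range(n_states)])

def spearman(a, b):
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Super-resolution LME: fit at the state level, predict at the county level
w = np.linalg.solve(X_state.T @ X_state + 1.0 * np.eye(d), X_state.T @ y_state)
lme_pred = X_county @ w
naive_pred = y_state[state_of]      # naive: project state value onto counties

print(f"naive baseline Spearman: {spearman(naive_pred, y_county):.3f}")
print(f"super-res LME Spearman:  {spearman(lme_pred, y_county):.3f}")
```

The naive baseline cannot express any within-state variation, which is exactly the signal the county embeddings carry; that is the mechanism the paper's super-resolution result relies on.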

In summary, the paper provides strong evidence that LLMs possess significant latent knowledge about economic and financial entities and statistics, which is often more accessible and accurate via linear probing of hidden states than through natural language generation. The proposed LME approach, particularly with its ability to learn from limited data and even noisy text labels, offers a computationally efficient and often more accurate alternative or supplement to directly querying LLMs for specific statistics. The demonstrated improvements in data imputation and super-resolution highlight the practical value of leveraging LLM embeddings for typical economic data processing pipelines. Future work could explore even larger models, proprietary datasets, and other data tasks like outlier detection using these techniques.
