
Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models (2510.05702v1)

Published 7 Oct 2025 in q-fin.CP and cs.AI

Abstract: LLMs are increasingly adopted in financial applications to support investment workflows. However, prior studies have seldom examined how these models reflect biases related to firm size, sector, or financial characteristics, which can significantly impact decision-making. This paper addresses this gap by focusing on representation bias in open-source Qwen models. We propose a balanced round-robin prompting method over approximately 150 U.S. equities, applying constrained decoding and token-logit aggregation to derive firm-level confidence scores across financial contexts. Using statistical tests and variance analysis, we find that firm size and valuation consistently increase model confidence, while risk factors tend to decrease it. Confidence varies significantly across sectors, with the Technology sector showing the greatest variability. When models are prompted for specific financial categories, their confidence rankings best align with fundamental data, moderately with technical signals, and least with growth indicators. These results highlight representation bias in Qwen models and motivate sector-aware calibration and category-conditioned evaluation protocols for safe and fair financial LLM deployment.

Summary

  • The paper demonstrates that firm size and valuation drive LLM confidence, indicating a substantial representation bias in the open-source Qwen models studied.
  • The methodology employs balanced round-robin prompting, constrained decoding, and token-logit aggregation across approximately 150 U.S. firms to quantify model preferences.
  • One-way ANOVA shows that industry and sector classification explain a significant share of the variance in model confidence, and sectoral anchoring persists across prompt contexts, underscoring the need for calibration.

Systematic Analysis of Representation Bias in Open-Source Financial LLMs

Introduction

This paper presents a rigorous empirical investigation into representation bias in open-source Qwen LLMs applied to investment decision support. The paper addresses a critical gap in financial AI research: the extent to which LLMs encode and propagate biases related to firm size, sector, and financial characteristics, potentially distorting risk assessment and capital allocation. The authors introduce a balanced round-robin prompting protocol over a curated universe of U.S. equities, leveraging constrained decoding and token–logit aggregation to quantify firm-level confidence scores across multiple financial contexts. The analysis spans several Qwen model variants and employs robust statistical inference to dissect the determinants, stability, and empirical grounding of LLM confidence in investment scenarios.

Methodology

The experimental design encompasses approximately 150 U.S.-listed firms from 2017–2024, with monthly standardized financial features covering valuation, health, profitability, risk, market structure, growth, dividend, and technical metrics. Firms are classified by sector and industry using GICS codes. The core protocol involves pairwise firm comparisons under multiple prompt categories (e.g., fundamentals, technicals, risk, growth), with two prompt variants to control for phrasing effects. Each model returns a ticker selection and a confidence score derived from token-level log-probabilities, aggregated to yield firm-level preference intensities.
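The aggregation step described above can be illustrated with a minimal sketch. The summary does not give the exact aggregation rule, so the choices here are assumptions: each pairwise confidence is the two-way softmax over the log-probabilities of the two allowed tickers, both prompt orderings are queried, and a firm's score is the plain average over all of its comparisons. The `score_pair` callable and the toy tickers are hypothetical stand-ins for a real model call.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_confidence(logp_a: float, logp_b: float) -> float:
    """Probability mass on ticker A after renormalizing over the two allowed options."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def firm_confidence(tickers, score_pair):
    """Average each firm's pairwise confidence over a balanced round-robin.

    score_pair(first, second) is a hypothetical callable returning the model's
    log-probabilities (logp_first, logp_second) for a prompt that lists the
    two tickers in that order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b in combinations(tickers, 2):
        # Query both orderings so prompt position cannot favor either firm.
        for first, second in ((a, b), (b, a)):
            lp_first, lp_second = score_pair(first, second)
            p_first = pair_confidence(lp_first, lp_second)
            totals[first] += p_first
            totals[second] += 1.0 - p_first
            counts[first] += 1
            counts[second] += 1
    return {t: totals[t] / counts[t] for t in tickers}


# Toy scorer standing in for a constrained-decoding model call (illustrative values only).
TOY_LOGITS = {"AAA": 0.0, "BBB": -1.0, "CCC": -2.0}


def toy_score_pair(first, second):
    return TOY_LOGITS[first], TOY_LOGITS[second]


scores = firm_confidence(list(TOY_LOGITS), toy_score_pair)
```

Averaging over both orderings is one simple way to keep the comparison balanced; other aggregation rules (for example, summing logits rather than probabilities) would fit the same description.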

Statistical analysis includes Pearson, Spearman, and Kendall correlations (with BH-FDR correction), one-way ANOVA for categorical effects, and dispersion metrics (SD, MAD) for cross-context stability. The paper addresses three research questions: (1) Which firm-level features most influence LLM confidence? (2) How stable are LLM preferences across financial contexts? (3) Do high-confidence outputs align with superior empirical financial performance?
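For readers unfamiliar with the BH-FDR correction mentioned above, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure in plain Python. The function name, the `alpha` default, and the example p-values are illustrative assumptions, not values from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Illustrative p-values, e.g. one per feature-confidence correlation test.
example = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
```

Because the procedure is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.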

Determinants of LLM Confidence

The results demonstrate that LLM confidence is most strongly and consistently driven by firm size and valuation proxies: market capitalization, enterprise value, shares outstanding, float shares, and free cash flow. These features exhibit robust positive correlations with model confidence across all Qwen variants and correlation methods, whereas profitability, technical indicators, risk, and growth metrics show weaker or inconsistent associations.

Figure 1: Determinants of LLM confidence by model and correlation type, highlighting the dominance of size and valuation features.

One-way ANOVA reveals that industry classification explains a substantial share of variance in LLM confidence ($\eta^2 \approx 0.52$–$0.67$), with sector effects also significant but more modest ($\eta^2 \approx 0.16$–$0.31$). This indicates that LLMs are particularly sensitive to industry category, with sector playing a secondary role. Communication Services and Technology sectors elicit higher confidence, while Consumer Defensive and Energy are less favored. At the industry level, Capital Markets, Entertainment, Internet Content, and Software Infrastructure are preferred over Tobacco, Packaged Foods, and Lodging.
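The effect size reported here is the usual one-way ANOVA ratio of between-group to total sum of squares. A minimal sketch of that computation follows, assuming firm-level confidence scores grouped by sector or industry label; the function name and the toy groups are hypothetical.

```python
def eta_squared(groups):
    """groups: dict mapping a label (e.g. sector) to a list of confidence scores.

    Returns SS_between / SS_total, the share of variance explained by group membership.
    """
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total


# Toy data: two well-separated groups give a large effect size.
toy_effect = eta_squared({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
```

A value near 0 means group labels explain almost none of the variation in confidence, while a value near 1 means almost all of it lies between groups.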

Cross-Context Stability and Sectoral Anchoring

Analysis of within-firm dispersion in confidence scores across prompt categories reveals pervasive anchoring effects, with sectoral ordering highly consistent across models. Technology exhibits the highest within-sector dispersion (SD and MAD), indicating lower cross-context stability and greater heterogeneity in LLM preferences. Sectors such as Consumer Discretionary, Industrial, and Financial display tighter stability.
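A within-firm dispersion measure of this kind might be computed as in the sketch below. Several details are assumptions: MAD is taken to mean the median absolute deviation (the summary does not expand the acronym), confidences are mapped to the logit scale before measuring spread, and the prompt-category names and values are illustrative.

```python
import math
import statistics


def logit(p, eps=1e-6):
    """Map a probability to the logit scale, clipping to avoid infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def dispersion(conf_by_category):
    """conf_by_category: dict mapping prompt category -> confidence in (0, 1) for one firm."""
    z = [logit(p) for p in conf_by_category.values()]
    med = statistics.median(z)
    return {
        "sd": statistics.stdev(z),  # sample standard deviation across categories
        "mad": statistics.median([abs(x - med) for x in z]),  # median absolute deviation
    }


# A firm scored identically in every context versus one whose score moves with the prompt.
stable = dispersion({"fundamentals": 0.6, "technicals": 0.6, "risk": 0.6, "growth": 0.6})
variable = dispersion({"fundamentals": 0.9, "technicals": 0.5, "risk": 0.3, "growth": 0.7})
```

Low values on both measures correspond to the anchoring described above, where a firm receives similar confidence regardless of the financial context in the prompt.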

Figure 2: (Left) Effect sizes across models and factors; (Middle, Right) Heatmap of cross-context stability across sectors and models, showing sectoral anchoring and variability.

Model scale influences stability: Qwen2.5-32B displays the greatest context sensitivity, suggesting that larger models adapt more flexibly to different financial contexts, while smaller models anchor more tightly. This is consistent with scaling-law effects, where increased model capacity enables more nuanced context-dependent behavior.

Alignment with Empirical Financial Metrics

Under category-specific prompts, LLM ranking preferences align most strongly with fundamental metrics, particularly free cash flow, which shows significant positive correlations across all models. Technical metrics such as average trading volume also exhibit moderate positive associations, confirming model sensitivity to market activity. Risk features (e.g., beta, volatility) display negative or weak correlations, indicating higher LLM confidence for lower-risk firms. The strength and direction of these relationships vary by model, with no clear advantage in scale or architecture.
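Rank alignment of this kind is typically measured with a rank correlation such as Spearman's. The following is a minimal standard-library sketch, computing Spearman's coefficient as the Pearson correlation of average ranks; the helper names and the toy inputs are hypothetical and do not come from the paper's data.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: confidence that rises monotonically with a fundamental metric.
rho = spearman([0.2, 0.4, 0.5, 0.9], [10.0, 20.0, 30.0, 400.0])
```

Working on ranks makes the measure insensitive to the heavy right skew typical of size-related features such as free cash flow.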

These findings indicate that, when explicitly guided by context-specific prompts, LLMs can partially ground their preferences in relevant empirical financial data. However, the alignment is incomplete and variable, underscoring the influence of representation bias and the need for targeted calibration.

Implications and Future Directions

The evidence demonstrates that open-source Qwen LLMs internalize economically meaningful financial structures but are also shaped by representation biases favoring firm size, visibility, and sector-specific priors. For practical deployment in financial decision support, model governance should incorporate:

  • Bias Calibration: Adjust outputs to mitigate size and sector biases, especially in portfolio and risk management applications.
  • Category-Specific Prompting: Employ targeted prompts and post-hoc consistency checks to enhance reliability and fairness.
  • Stability Diagnostics: Monitor dispersion measures (SD/MAD on the logit scale) alongside performance metrics to assess model robustness.
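As one illustration of what a sector-level calibration step could look like, the sketch below subtracts each sector's mean confidence from its member firms. This is an assumed, deliberately simple adjustment and not a method proposed in the paper; the function name, tickers, and sector labels are hypothetical.

```python
from collections import defaultdict


def sector_neutralize(confidence, sector):
    """Subtract each sector's mean confidence so scores are comparable across sectors.

    confidence: dict ticker -> confidence score
    sector:     dict ticker -> sector label
    """
    by_sector = defaultdict(list)
    for ticker, score in confidence.items():
        by_sector[sector[ticker]].append(score)
    sector_mean = {s: sum(v) / len(v) for s, v in by_sector.items()}
    return {t: score - sector_mean[sector[t]] for t, score in confidence.items()}


# Toy input: the model favors the "Tech" names overall before adjustment.
adjusted = sector_neutralize(
    {"T1": 0.9, "T2": 0.7, "E1": 0.4, "E2": 0.2},
    {"T1": "Tech", "T2": "Tech", "E1": "Energy", "E2": "Energy"},
)
```

After the adjustment, the remaining differences reflect only within-sector preferences; a size adjustment would need a separate step, such as regressing confidence on log market capitalization.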

The current analysis is limited to a specific universe of U.S. firms and pre-specified feature sets; results may differ with broader sampling or in non-U.S. markets. Correlations are descriptive, not causal, and out-of-sample trading performance is not evaluated. Future research should explore debiasing pipelines, counterfactual and mechanistic explanations of category priors, and the interplay of model scale versus architecture under controlled data curation.

Conclusion

This paper provides a systematic, multi-method assessment of representation bias in open-source financial LLMs, revealing that firm scale, valuation, and market structure are the primary drivers of model confidence, with sector and industry effects exerting substantial influence. LLMs exhibit partial grounding in empirical financial metrics under category-specific prompts but remain susceptible to biases that may compromise fairness and reliability in high-stakes investment applications. Sector-aware calibration, prompt engineering, and stability diagnostics are recommended for safe and effective deployment. The findings motivate further research into bias mitigation, model interpretability, and the development of governance frameworks for financial AI.
