NIFTY Financial News Headlines Dataset (2405.09747v1)

Published 16 May 2024 in q-fin.CP and cs.LG

Abstract: We introduce and make publicly available the NIFTY Financial News Headlines dataset, designed to facilitate and advance research in financial market forecasting using LLMs. This dataset comprises two distinct versions tailored for different modeling approaches: (i) NIFTY-LM, which targets supervised fine-tuning (SFT) of LLMs with an auto-regressive, causal language-modeling objective, and (ii) NIFTY-RL, formatted specifically for alignment methods (like reinforcement learning from human feedback (RLHF)) to align LLMs via rejection sampling and reward modeling. Each dataset version provides curated, high-quality data incorporating comprehensive metadata, market indices, and deduplicated financial news headlines systematically filtered and ranked to suit modern LLM frameworks. We also include experiments demonstrating some applications of the dataset in tasks like stock price movement prediction and the role of LLM embeddings in information acquisition/richness. The NIFTY dataset, along with utilities (like systematically truncating a prompt's context length), is available on Hugging Face at https://huggingface.co/datasets/raeidsaqur/NIFTY.
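The abstract mentions utilities for "systematically truncating a prompt's context length." As an illustration only (not the paper's actual utility, whose interface is not described here), a minimal sketch of such a helper might trim a headline context to a budget while preserving the most recent news, assuming headlines are ordered oldest to newest and using a character budget as a stand-in for a token budget:

```python
# Hypothetical sketch: truncate a prompt's news-headline context to fit a
# budget, dropping the oldest headlines first so the most recent news is kept.
# The real NIFTY utilities may operate on tokens and differ in interface.

def truncate_context(headlines, budget):
    """Keep the newest headlines whose joined length fits within `budget`."""
    kept = []
    total = 0
    # Walk newest-first (headlines are assumed ordered oldest -> newest).
    for headline in reversed(headlines):
        cost = len(headline) + (1 if kept else 0)  # +1 for newline separator
        if total + cost > budget:
            break
        kept.append(headline)
        total += cost
    return "\n".join(reversed(kept))  # restore chronological order

if __name__ == "__main__":
    news = ["old headline A", "mid headline B", "new headline C"]
    print(truncate_context(news, 30))
```

Dropping oldest-first is one reasonable policy for market-forecasting prompts, where recency of news typically matters most; a token-aware variant would substitute a tokenizer's length count for `len`.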
