Astock: A New Dataset and Automated Stock Trading based on Stock-specific News Analyzing Model (2206.06606v1)

Published 14 Jun 2022 in cs.CL and cs.LG

Abstract: Natural Language Processing (NLP) demonstrates a great potential to support financial decision-making by analyzing the text from social media or news outlets. In this work, we build a platform to study the NLP-aided stock auto-trading algorithms systematically. In contrast to the previous work, our platform is characterized by three features: (1) We provide financial news for each specific stock. (2) We provide various stock factors for each stock. (3) We evaluate performance from more financial-relevant metrics. Such a design allows us to develop and evaluate NLP-aided stock auto-trading algorithms in a more realistic setting. In addition to designing an evaluation platform and dataset collection, we also made a technical contribution by proposing a system to automatically learn a good feature representation from various input information. The key to our algorithm is a method called Semantic Role Labeling Pooling (SRLP), which leverages Semantic Role Labeling (SRL) to create a compact representation of each news paragraph. Based on SRLP, we further incorporate other stock factors to make the final prediction. In addition, we propose a self-supervised learning strategy based on SRLP to enhance the out-of-distribution generalization performance of our system. Through our experimental study, we show that the proposed method achieves better performance and outperforms all the baselines' annualized rate of return as well as the maximum drawdown of the CSI300 index and XIN9 index on real trading. Our Astock dataset and code are available at https://github.com/JinanZou/Astock.

An Analysis of Astock: A Novel Dataset and Automated Trading Model

The paper "Astock: A New Dataset and Automated Stock Trading based on Stock-specific News Analyzing Model" presents a systematic study of NLP techniques applied to stock trading. It introduces the Astock dataset, which pairs stock-specific news with stock factors, and evaluates trading algorithms with finance-relevant metrics rather than classification accuracy alone.

Key Contributions

The work makes three primary contributions: a dataset that links each stock to its own news and stock factors, a semantic role labeling pooling (SRLP) mechanism, and a self-supervised learning strategy built on SRLP. Together these target text-based stock prediction, specifically for the Chinese A-shares market.

Semantic Role Labeling Pooling (SRLP)

SRLP is the paper's main technical contribution for distilling information from financial news. It builds a compact representation of each news text using semantic role labeling (SRL), which identifies the verb (V), proto-agent (A0), and proto-patient (A1) of a sentence. Embeddings for these roles are taken from a pre-trained language model, so that the representation focuses on the news event itself (who did what to whom) and its potential impact on stock prices.
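The idea of pooling over semantic roles can be illustrated with a minimal sketch. It assumes that token embeddings and role spans are already available, and it uses plain mean pooling followed by concatenation; both choices, along with the names `mean_pool` and `srl_pool`, are illustrative assumptions and not the authors' exact architecture.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def srl_pool(token_embeddings, role_spans, roles=("A0", "V", "A1")):
    """Build one compact vector from per-role pooled embeddings.

    token_embeddings: list of per-token vectors (e.g. from a pre-trained encoder)
    role_spans: maps a role name to a (start, end) token range, end exclusive
    The concatenation order A0, V, A1 is an assumption made for this sketch.
    """
    pooled = []
    for role in roles:
        start, end = role_spans[role]
        pooled.extend(mean_pool(token_embeddings[start:end]))
    return pooled


# Toy input: six tokens with 2-dimensional embeddings.
embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
              [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
spans = {"A0": (0, 2), "V": (2, 3), "A1": (3, 6)}
representation = srl_pool(embeddings, spans)
```

The output has a fixed length of three times the embedding dimension regardless of how long the news text is, which is what makes the representation compact.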

Self-Supervised Learning

The paper adds a self-supervised learning strategy on top of SRLP. The model is trained on cloze-style tasks in which a semantic role is masked and must be predicted from the remaining ones. According to the authors, this improves out-of-distribution generalization, which matters for live trading because future market conditions can diverge significantly from those seen during training.
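A cloze-style example over semantic roles can be constructed roughly as follows. This is a sketch under stated assumptions: the function name `mask_role`, the `[MASK]` token, and masking at the token level are hypothetical choices for illustration, and the paper's actual objective may operate on role embeddings instead.

```python
def mask_role(tokens, role_spans, role, mask_token="[MASK]"):
    """Hide every token of one semantic role and return it as the prediction target.

    tokens: list of token strings
    role_spans: maps a role name to a (start, end) token range, end exclusive
    Returns (masked_tokens, target_tokens).
    """
    start, end = role_spans[role]
    masked = tokens[:start] + [mask_token] * (end - start) + tokens[end:]
    target = tokens[start:end]
    return masked, target


tokens = ["The", "company", "raised", "its", "profit", "forecast"]
spans = {"A0": (0, 2), "V": (2, 3), "A1": (3, 6)}

# Mask the verb; a model would be trained to recover it from A0 and A1.
masked_tokens, target_tokens = mask_role(tokens, spans, "V")
```

Because the labels come from the text itself, training pairs like this need no manual annotation beyond the SRL parse.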

Empirical Evaluation

The empirical evaluation covers both in-distribution and out-of-distribution tests for stock movement classification. Notably, the SRLP model combined with stock factors reaches 66.89% accuracy, higher than the baselines, which include strong pre-trained models such as RoBERTa WWM Ext.

Real-world applicability is assessed through backtesting on stock data. The paper reports the annualized rate of return, maximum drawdown, and Sharpe ratio, and finds that the approach outperforms the XIN9 and CSI300 index benchmarks over a test period in 2021.
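The three backtesting metrics can be computed from a series of daily returns as in the sketch below. It uses the standard textbook definitions, and assumes 252 trading days per year, geometric compounding, and a zero risk-free rate by default; the paper's exact conventions may differ, and the function names are chosen only for this illustration.

```python
import math


def annualized_return(daily_returns, periods_per_year=252):
    """Compound the daily returns, then scale the growth to a one-year horizon."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth ** (periods_per_year / len(daily_returns)) - 1.0


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the equity curve, as a fraction of the peak."""
    value = peak = 1.0
    worst = 0.0
    for r in daily_returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free_rate / periods_per_year for r in daily_returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


returns = [0.10, -0.50, 0.20]   # toy daily returns, not data from the paper
drawdown = max_drawdown(returns)
```

Reporting drawdown and Sharpe ratio alongside return matters because a strategy can post a high return while exposing the trader to losses that would be unacceptable in practice.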

Implications and Future Prospects

The proposed dataset and model show how financial text can be combined with quantitative stock data in an actionable trading framework. Practically, this supports better-informed decisions for traders who rely on data-driven approaches. Theoretically, combining SRLP with self-supervised learning points future research toward NLP applications in finance that go beyond sentiment analysis and address predictive tasks over historical and real-time data.

Future research avenues could explore the expansion of this framework to other financial markets and seek optimizations in the model’s architecture to further enhance processing speed and accuracy. Additionally, incorporating more granular sentiment analysis and integrating alternative sources of market data could improve predictive capabilities.

In summary, the authors present a robust NLP-based trading system that is tested rigorously for empirical performance. While this paper lays a strong foundation for future enhancements in automated trading systems, the practical deployment of such models will require continuous adaptations to the ever-evolving dynamics of global financial markets.

Authors (6)
  1. Jinan Zou (6 papers)
  2. Haiyao Cao (5 papers)
  3. Lingqiao Liu (114 papers)
  4. Yuhao Lin (10 papers)
  5. Ehsan Abbasnejad (59 papers)
  6. Javen Qinfeng Shi (34 papers)
Citations (13)