Chinese Social Media Stance Detection Dataset
- The Chinese social media stance detection dataset is a large-scale, expert-verified corpus tailored for zero-shot, multi-target stance detection in open-domain texts.
- It employs the DGTA framework with integrated and two-stage fine-tuning strategies using LLMs and LoRA adaptations to handle dynamic target–stance pairs.
- Multi-dimensional evaluation metrics and detailed benchmarking reveal the dataset’s efficacy in advancing automated stance analysis in Chinese social media.
A Chinese social media stance detection dataset is a large-scale, high-quality corpus specifically constructed for modeling and benchmarking zero-shot, multi-target stance detection in open-domain Chinese social media text. Developed as part of the Dynamic Target Generation and Multi-Target Adaptation (DGTA) framework, it enables models to extract an arbitrary and previously unseen set of target–stance pairs from input social posts, where neither targets nor their count are provided in advance. The dataset comprises over 70,000 expert-verified entries curated from users on Weibo, annotated for both target identification and stance classification, and supports detailed, multi-dimensional evaluation for a new class of stance detection methodologies (Li et al., 27 Jan 2026).
1. Formal Problem Definition
Zero-shot stance detection in the wild with DGTA requires models to process a social media post $x$ and output a set of target–stance pairs:

$$f(x) = \{(t_i, s_i)\}_{i=1}^{n}$$

Here, $n$ varies per example, the targets $t_i$ are natural-language phrases, and each stance $s_i$ is drawn from a fixed label set $\mathcal{S}$ (e.g., favor, against, neutral). Unlike classical stance detection tasks, DGTA does not assume any predefined target set or fixed target count. This open-world, generative formulation directly models the complex, dynamic nature of real social media discourse, where stance may be expressed toward multiple or previously unencountered entities.
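The variable-length output structure can be illustrated with a minimal sketch; the data model, label set, and the `target: stance; …` serialization format below are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass

# Stance label set assumed for illustration; the paper's exact labels may differ.
STANCES = {"Favor", "Against", "Neutral"}

@dataclass(frozen=True)
class TargetStance:
    target: str   # free-form natural-language phrase
    stance: str   # one of STANCES

def parse_pairs(generated: str) -> list[TargetStance]:
    """Parse a generated sequence like 'economy: Favor; censorship: Against'
    into a variable-length list of target-stance pairs."""
    pairs = []
    for chunk in generated.split(";"):
        if ":" not in chunk:
            continue
        target, stance = chunk.rsplit(":", 1)
        target, stance = target.strip(), stance.strip()
        if target and stance in STANCES:
            pairs.append(TargetStance(target, stance))
    return pairs
```

The key point the sketch captures is that neither the targets nor their count are fixed in advance: the list may be empty, a singleton, or arbitrarily long.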
2. Dataset Creation, Annotation, and Validation Protocol
The dataset was constructed by curating posts from 240 Weibo users across 36 domains, yielding an initial collection of 125,000 posts. Stringent cleaning via regular expressions and Unicode normalization reduced this to 107,000 posts. Annotation used an LLM-ensemble methodology: three LLMs (GLM4-9B, Qwen2.5-7B, Llama3-8B) generated candidate target and stance labels, and only posts where at least two models agreed on both target spans and stances were retained, ensuring high LLM-consensus quality. A secondary DeepSeek-V3 pass provided further scoring, followed by human verification from eight annotators. The final dataset achieves strong inter-annotator reliability (Fleiss's $\kappa$), comprising 70,931 entries.
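The majority-agreement retention rule can be sketched as follows. This is a simplified version, assuming each model emits a set of (target, stance) pairs per post and requiring two models to agree on the identical set; the paper's actual criterion operates on target spans and stances:

```python
from collections import Counter

def consensus_keep(model_outputs, min_agree=2):
    """Keep a post if at least `min_agree` of the annotating models
    produced the identical set of (target, stance) pairs.
    `model_outputs`: one set of (target, stance) tuples per model."""
    counts = Counter(frozenset(o) for o in model_outputs)
    # The most common annotation must be shared by enough models.
    return counts.most_common(1)[0][1] >= min_agree

# Three models annotate one post; two agree, so the post is retained.
a = {("vaccines", "Favor")}
b = {("vaccines", "Favor")}
c = {("vaccines", "Against")}
```

Filtering on full agreement between independent annotators is what drives the initial 107,000 posts down toward the final verified set.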
Data splitting into training, development, and test sets adheres to an 8:1:1 ratio. Test set construction is explicitly stratified by the joint (target-count, stance) distribution across seven independent 1,000-sample draws (fixed seeds), supporting robust, granular performance evaluation across diverse stance-expression scenarios.
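A stratified draw of this kind can be sketched with the standard library; the stratification key below (target count plus the first pair's stance) is a simplifying assumption standing in for the paper's joint (target-count, stance) distribution:

```python
import random
from collections import defaultdict

def stratified_draw(entries, n, seed):
    """Draw `n` examples whose joint (target_count, stance) distribution
    roughly mirrors `entries`, with a fixed seed for reproducibility.
    Each entry is a dict with 'pairs' = list of (target, stance)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in entries:
        # Simplification: the first pair's stance stands in for the
        # example's full stance profile.
        key = (len(e["pairs"]), e["pairs"][0][1])
        strata[key].append(e)
    draw = []
    for key, bucket in sorted(strata.items()):
        # Allocate slots proportionally to stratum size.
        k = max(1, round(n * len(bucket) / len(entries)))
        draw.extend(rng.sample(bucket, min(k, len(bucket))))
    return draw[:n]
```

Running this seven times with seven fixed seeds would produce independent but reproducible test draws, as the protocol describes.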
3. Target and Stance Evaluation Metrics
Evaluation is explicitly multi-dimensional:
- Target-Identification C-Score: A composite metric aggregating BERTScore (semantic similarity), BLEU (n-gram overlap), and ROUGE-L (longest common subsequence) under fixed weights, with the weighted sum further scaled by recall. A target is deemed correctly identified only when BERTScore, BLEU, ROUGE-L, recall, and the resulting C-Score each meet their respective thresholds.
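The composite score and threshold check can be sketched as below. The weights and thresholds are placeholders for illustration; the paper fixes its own values, which are not restated here:

```python
def c_score(bertscore, bleu, rouge_l, recall, weights=(0.5, 0.25, 0.25)):
    """Composite target-identification score: a weighted sum of semantic
    (BERTScore) and surface (BLEU, ROUGE-L) similarity, scaled by recall.
    The default weights are placeholders, not the paper's values."""
    w_bert, w_bleu, w_rouge = weights
    return (w_bert * bertscore + w_bleu * bleu + w_rouge * rouge_l) * recall

def target_correct(bertscore, bleu, rouge_l, recall, thresholds):
    """A target counts as correctly identified only if every component
    metric and the composite clear their (paper-specified) thresholds."""
    t_bert, t_bleu, t_rouge, t_rec, t_c = thresholds
    return (bertscore >= t_bert and bleu >= t_bleu and rouge_l >= t_rouge
            and recall >= t_rec
            and c_score(bertscore, bleu, rouge_l, recall) >= t_c)
```

Scaling by recall penalizes models that identify one target well but miss the others in a multi-target post.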
- Stance Detection Metrics: Precision, recall, and $F_1$ are computed for stance assignment, contingent on correct target identification. These metrics isolate stance-labeling performance from target-extraction ability.
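The conditioning on correct target identification can be sketched as follows; here target matching is simplified to exact string match, whereas the paper matches targets via its C-Score criterion:

```python
def stance_prf(gold, pred):
    """Precision/recall/F1 for stance labels, counted only on targets the
    model actually identified (exact-match simplification).
    gold, pred: dicts mapping target -> stance."""
    matched = set(gold) & set(pred)
    tp = sum(1 for t in matched if gold[t] == pred[t])
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A spurious extracted target thus hurts precision, and a missed gold target hurts recall, even if every matched target's stance is labeled correctly.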
4. Architecture and Fine-Tuning Strategies
All models are LoRA-adapted variants of pre-trained LLMs (e.g., Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-7B). The dataset supports two principal fine-tuning strategies:
- Integrated Fine-Tuning: A single LLM is instruction-fine-tuned end-to-end to jointly extract the entire target–stance pair sequence, mapping a post $x$ directly to $\{(t_i, s_i)\}_{i=1}^{n}$. The loss function is standard token-level cross-entropy over the output sequence.
- Two-Stage Fine-Tuning: Distinct LLMs are separately fine-tuned for target extraction and stance classification. The target extractor is optimized for sequence generation over targets only, while the stance classifier receives the post and an extracted target to predict its stance, each with an independent cross-entropy objective. During inference, the two models are composed sequentially.
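The sequential composition at inference time can be sketched with stub functions; real systems would call the two LoRA-tuned LLMs, and the stubs' keyword heuristics are purely hypothetical:

```python
def extract_targets(post: str) -> list[str]:
    """Stage 1 stub: a real system would generate targets with the
    fine-tuned target-extraction LLM."""
    return ["economic policy"] if "economy" in post else []

def classify_stance(post: str, target: str) -> str:
    """Stage 2 stub: a real system would prompt the fine-tuned stance
    classifier with (post, target)."""
    return "Favor" if "support" in post else "Against"

def two_stage(post: str) -> list[tuple[str, str]]:
    """Two-stage inference: extract targets, then classify each one."""
    return [(t, classify_stance(post, t)) for t in extract_targets(post)]
```

The design trade-off is that the two-stage pipeline lets each model specialize, but stage-2 errors compound on stage-1 mistakes, whereas integrated fine-tuning optimizes both decisions jointly.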
LoRA fine-tuning uses a fixed adapter rank and learning rate, a batch size of 16, a maximum sequence length of 512, and 3 epochs.
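A setup of this kind might be expressed with the Hugging Face PEFT library as below. This is a hedged config fragment: the rank, alpha, dropout, and target modules are placeholder assumptions, since the text does not restate the paper's exact values:

```python
# Hypothetical PEFT configuration mirroring the described setup.
# Rank, alpha, dropout, and target modules are placeholders.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                   # LoRA rank (placeholder)
    lora_alpha=16,                         # scaling factor (placeholder)
    lora_dropout=0.05,                     # placeholder
    target_modules=["q_proj", "v_proj"],   # common choice for Qwen-style LLMs
    task_type="CAUSAL_LM",
)
# The training loop (not shown) would then use batch size 16, a maximum
# sequence length of 512, and 3 epochs, as stated in the text.
```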
5. Baselines, Prompted LLMs, and Experimental Protocol
Pre-trained baselines include fine-tuned mT5 for target extraction, BERT for stance classification, a RoBERTa-CRF → RoBERTa-large pipeline, and end-to-end mT5 with constrained decoding. Prompted LLMs such as DeepSeek-V3, GLM4-9B, GPT-4o, and Llama3-8B are evaluated with and without chain-of-thought reasoning. Evaluation reports results as the mean and standard deviation over the seven test draws, with 95% confidence intervals.
6. Quantitative Benchmarking and Performance Analysis
Fine-tuned LLMs substantially surpass both pre-trained and prompted LLM baselines. Summary results include:
| Model/Strategy | Target C-Score (%) | Stance F1 (%) |
|---|---|---|
| Qwen2.5-7B (two-stage) | 66.99 | -- |
| DeepSeek-R1-Distill-Qwen-7B (integrated) | -- | 79.26 |
Prompting with chain-of-thought delivers a 4–7 percentage-point gain for models such as GLM4-9B and Qwen2.5-7B in both C-Score and stance F1. Increasing the target count per post degrades C-Score from ~69% (single-target) to ~54% (multi-target), showing that task difficulty scales with the number of targets. Cases with implicit or highly abstract targets yield the lowest C-Score (~52%).
7. Critical Appraisal, Limitations, and Prospective Directions
The Chinese social media stance detection dataset and DGTA protocol present several unique strengths: open-world stance modeling unconstrained by predefined target sets, a robust LLM/human-centric annotation pipeline, multi-dimensional evaluation, and empirical demonstration that instruction-finetuned LLMs with reasoning distillation deliver strong zero-shot, multi-target stance extraction. Limitations include persistent challenges for highly abstract targets, semantic fragmentation (e.g., over-splitting complex entities into spurious subtargets), incomplete representation of rare or long-tail topics, and computational constraints introduced by LoRA hyperparameters, affecting large-scale reproducibility.
Suggested extensions—directly motivated by observed weaknesses—include retrieval-augmented generation for grounding of abstract targets, contrastive multi-target objectives to enhance semantic cohesion, expansion to multilingual or cross-platform corpora, and exploration of joint end-to-end multitask losses to unify target and stance learning signals.
The released dataset and LoRA-tuned LLM checkpoints establish a foundational benchmark for future zero-shot, multi-target stance detection research in open-world settings (Li et al., 27 Jan 2026).