BANSData: Diverse Datasets for NLP, Finance & Genomics
- BANSData is a collection of unique, domain-specific datasets spanning Bengali news summarization, Polish offensive speech, financial bankruptcy prediction, and Bayesian genomic network learning.
- It provides standardized benchmarks and diverse preprocessing protocols, supporting methods ranging from pointer-generator summarization models and moderator-agreement analyses to logistic regression and Bayesian network approaches.
- Researchers benefit from its well-defined splits, clear methodological transparency, and actionable insights that support reproducible studies across NLP, finance, and genomics.
BANSData refers to several distinct, domain-specific datasets in natural language processing, financial machine learning, and computational genomics. Despite the shared nomenclature, these datasets serve orthogonal research communities and tasks: (1) Bengali document summarization, (2) Polish offensive content moderation, (3) accounting-based bankruptcy prediction, and (4) multilayer genomic network modeling. Each instantiation is characterized by unique construction protocols, statistical properties, and research applications.
1. Bengali News Summarization Corpus ("BANSData") (Dhar et al., 2021)
The Bengali "BANSData" corpus is a central benchmarking dataset for abstractive text summarization in Bangla. It consists of 19,096 document–summary pairs, with each entry being a news article paired with a single human-authored summary, drawn from previously published online news corpora in Bengali. The dataset’s principal competitors are much smaller (e.g., BNLPC with 200 articles and {XL}-Sum with 10,126 articles).
No explicit length statistics (mean or variance of document/summary) are provided, though the format is headline and multi-sentence summary per news article. The BANSData corpus was not accompanied by a new “crawl-and-clean” pipeline for this use; basic NLP preprocessing such as tokenization and sentence-boundary detection is inferred as necessary for seq2seq modeling, but there is no mention of advanced filtering for non-Bengali tokens, advertisements, or HTML artifacts.
Dataset partitioning follows standard practice: 70% train, 20% validation, 10% test. A fixed-size vocabulary of 50,000 word tokens is maintained, with a pointer-generator mechanism employed to enable the copying of out-of-vocabulary tokens directly from source texts. Each decoding step computes a generation probability $p_{\text{gen}} \in [0, 1]$, yielding a final word distribution

$$P(w) = p_{\text{gen}} \, P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i \,:\, w_i = w} a_i,$$

where $a$ is the attention distribution on the source sequence, supporting native OOV handling.
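As a concrete illustration, the pointer-generator mixture can be computed over an extended vocabulary that appends source-side OOV tokens after the fixed vocabulary. The sizes and values below are toy examples, not drawn from the paper:

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, extended_vocab_size):
    """Mix the generator's vocabulary distribution with the copy
    (attention) distribution, as in a pointer-generator decoder.

    p_gen: scalar in [0, 1], probability of generating from the vocabulary.
    p_vocab: (V,) distribution over the fixed vocabulary.
    attention: (T,) attention weights over the T source tokens.
    src_ids: (T,) ids of source tokens in the extended vocabulary,
             so OOV source words occupy ids >= V.
    """
    dist = np.zeros(extended_vocab_size)
    dist[: len(p_vocab)] = p_gen * p_vocab
    # Scatter-add the copy probabilities onto the source-token ids;
    # np.add.at accumulates over repeated ids.
    np.add.at(dist, src_ids, (1.0 - p_gen) * attention)
    return dist

# Toy example: vocabulary of 4 words, source of 3 tokens, one OOV (id 4).
p_vocab = np.array([0.4, 0.3, 0.2, 0.1])
attention = np.array([0.5, 0.3, 0.2])
src_ids = np.array([1, 4, 1])  # token id 4 is out-of-vocabulary
dist = final_distribution(0.7, p_vocab, attention, src_ids, 5)
```

Because copy probability mass lands directly on source-token ids, an OOV word (id 4 here) receives nonzero probability even though it is outside the 50,000-word vocabulary.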
For model benchmarking, the dominant approach is a hybrid pointer-generator RNN with coverage. The best model configuration—embedding dimension 128, hidden state 512, max encoder steps 400, decoder steps 100, learning rate 0.15—was selected via validation set performance among five random initializations. On the 10% held-out test set, the model achieved ROUGE-1 F = 0.66, ROUGE-2 F = 0.41, and ROUGE-L ≈ 0.38, outperforming attention-only and Bi-LSTM extractive baselines on the same corpus.
BANSData is publicly available, though licensing is only implicitly for research use; a distinct download link is not specified in the referenced publication.
2. BAN-PL ("BANSData") Offensive Polish Social Media Corpus (Kołos et al., 2023)
The term "BANSData" is also used in the BAN-PL dataset—the largest publicly available corpus of harmful Polish social media content, constructed in cooperation with Wykop.pl. With 691,662 text items (exactly 345,831 harmful and 345,831 neutral), the corpus captures real-world moderation events from January 2019 to April 2023.
The harmful class comprises posts and comments banned by moderators for incitement, insults, violence, or broader "inappropriate" content, following a multi-stage pipeline: user reports, majority vote by five moderators (rarely 3:2 splits), and a two-level appeals process. The neutral set is a uniform random sample of unreported, publicly visible posts remaining after 48 hours.
Anonymization leverages PrivMasker (spaCy) for generic PII, PolDeepNer2 NER for Polish, regex-based URL/user matching, and custom dictionaries. Released data is JSONL-formatted with associated fields for class, text, anonymization tags, and metadata.
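A minimal sketch of the regex stage of such an anonymization pipeline; the patterns and placeholder tags here are illustrative, and the released pipeline additionally applies NER-based masking (PolDeepNer2) and custom dictionaries:

```python
import re

# Illustrative patterns only, not the released pipeline's exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")

def mask_text(text: str) -> str:
    """Replace URLs and @-mentions with placeholder tags."""
    text = URL_RE.sub("{URL}", text)
    text = USER_RE.sub("{USERNAME}", text)
    return text

masked = mask_text("@jan zobacz https://example.com/wpis teraz")
# masked == "{USERNAME} zobacz {URL} teraz"
```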
Length statistics (tokens per post or comment) for harmful/neutral classes are detailed:
| Class | Count | Mean | Std | 25% | 50% | 75% |
|---|---|---|---|---|---|---|
| Harmful total | 345,831 | 34.9 | 98.5 | 10 | 17 | 35 |
| Neutral | 345,831 | 40.9 | 55.9 | 13 | 24 | 47 |
Vocabulary size is not enumerated but exhibits high lexical diversity driven by slang, neologisms, and obfuscations, particularly in the harmful class. To support downstream tasks, preprocessing scripts include whitespace normalization, profanity-unmasking (using a dictionary, heuristics, and FastText lookup), and planned morphological normalization.
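A minimal sketch of whitespace normalization plus dictionary-based profanity unmasking, with a hypothetical two-entry dictionary standing in for the released resources (which also use heuristics and FastText lookup):

```python
import re

# Hypothetical obfuscation dictionary for illustration only.
UNMASK_DICT = {"k*rwa": "kurwa", "d*bil": "debil"}

def normalize(text: str) -> str:
    """Collapse whitespace runs and restore dictionary-known
    obfuscated profanities to their canonical forms."""
    text = re.sub(r"\s+", " ", text).strip()
    for obfuscated, plain in UNMASK_DICT.items():
        text = text.replace(obfuscated, plain)
    return text
```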
A 24,000-item subset (balanced, manually verified) with scripts is available under a permissive license (CC BY 4.0). Reported agreement among moderators re-annotating 134 external harmful tweets yields multi-rater κ ≈ 0.59, reflecting nontrivial annotation variance due to distinctions between internal policy and academic definitions.
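For readers wanting to reproduce such multi-rater agreement figures, a compact Fleiss' kappa implementation (not the authors' code) over per-item category counts:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (n_items, n_categories) matrix of per-item
    category counts; each row sums to the number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]
    # Per-item agreement P_i, then mean observed agreement P_bar.
    p_i = (counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1; agreement at chance level yields 0, so a value near 0.59 indicates moderate but clearly imperfect consensus.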
BAN-PL is recommended for binary classification (offensiveness), subclass detection, profane language identification under adversarial obfuscations, bias quantification, and sociolectal-morphosyntactic analyses. Principal limitations are domain specificity (a user base skewing male and aged 18–45, with idiosyncratic community humor), internal-policy bias, and historical topical skew across 2019–2023.
3. BANSData Financial Ratios Dataset for Bankruptcy Prediction (Wang et al., 2024)
In the bankruptcy modeling literature, "BANSData" denotes a large, publicly available accounting-based bankruptcy-prediction dataset, as surveyed by Lombardo et al. and in "Datasets for Advanced Bankruptcy Prediction: A Survey and Taxonomy" (Wang et al., 2024). The dataset consists solely of firm-level financial ratios drawn from SEC filings, with no textual, market, or relational features.
Specifications include 78,682 firm-year records (covering 8,262 unique U.S. public companies) from 1999 to 2018. The established splits are: train 1999–2011, validation 2012–2014, test 2015–2018. Each row is a firm-fiscal year, with strict deduplication and no missing or imputed values.
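The fixed chronological windows can be applied with a simple filter; the record layout here (a dict with a `fiscal_year` key) is an assumption for illustration:

```python
def chronological_split(records, year_key="fiscal_year"):
    """Partition firm-year records into the fixed train/val/test
    windows used by the survey (1999-2011 / 2012-2014 / 2015-2018)."""
    train = [r for r in records if 1999 <= r[year_key] <= 2011]
    val = [r for r in records if 2012 <= r[year_key] <= 2014]
    test = [r for r in records if 2015 <= r[year_key] <= 2018]
    return train, val, test

records = [{"fiscal_year": y} for y in (1999, 2005, 2013, 2016, 2018)]
train, val, test = chronological_split(records)
```

Splitting strictly by year, rather than randomly, is what prevents temporal leakage: no post-2011 information can influence a model evaluated on 2015–2018 firms.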
Feature set includes 18 canonical financial ratios such as working capital/total assets, retained earnings/total assets, EBITDA/total assets, market value equity/book value total debt, sales/total assets, current ratio, quick ratio, debt ratio, return on assets, and additional leverage/liquidity indicators.
Dataset quality metrics (from Table 4) are:
| Metric | Value |
|---|---|
| Bankruptcy Rate | 6.63% (5,220/78,682) |
| Missing Values | 0 |
| Duplicate Record Rate | 0% |
| Data Volume | 78,682 obs., 18 features |
| Data Noise | Low (labels from Ch.7/11) |
Class imbalance (≈1:14, bankrupt to non-bankrupt) necessitates remedies such as SMOTE or cost-sensitive learning. Survivorship and listing biases (publicly traded, U.S.-specific) restrict generalizability to private firms or non-U.S. markets. The dataset supports logistic regression (L1/L2), Random Forest, and XGBoost; benchmarks report AUC 0.88–0.92 for RF/XGBoost and 0.82–0.85 for logistic regression. Deep learning offers marginal gains absent non-accounting features.
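To illustrate the cost-sensitive remedy, the sketch below fits a weighted logistic regression by plain gradient descent on synthetic imbalanced data, using the standard "balanced" class-weight formula n / (n_classes * n_c); the data and hyperparameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the ~1:14 imbalance: 1,400 healthy vs 100 bankrupt.
X = np.vstack([rng.normal(0, 1, (1400, 2)), rng.normal(2.0, 1, (100, 2))])
y = np.concatenate([np.zeros(1400), np.ones(100)])

# "Balanced" per-class weights: n / (2 * n_c), a common cost-sensitive
# alternative to SMOTE-style oversampling.
n = len(y)
w = np.where(y == 1, n / (2 * y.sum()), n / (2 * (n - y.sum())))

# Weighted logistic regression via gradient descent on the weighted
# log-loss (intercept prepended as a column of ones).
Xb = np.hstack([np.ones((n, 1)), X])
beta = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1 / (1 + np.exp(-Xb @ beta))
    beta -= 0.1 * (Xb.T @ (w * (p - y))) / n

recall = ((Xb @ beta > 0) & (y == 1)).sum() / y.sum()
```

Without the weights, the decision boundary would be pulled toward the majority class and minority recall would suffer; the reweighting restores a roughly balanced operating point.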
Feature informativeness is evaluated by Information Value (IV), Mean Decrease Impurity (MDI), and chi-squared tests, as described in the survey. BANSData offers a robust, clean baseline for accounting-ratio-based bankruptcy prediction, with strict chronological partitioning to prevent temporal leakage.
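A compact quantile-binned Information Value computation, as one might implement the IV screening mentioned above; the binning scheme and smoothing constant are assumptions, not the survey's exact procedure:

```python
import numpy as np

def information_value(feature, target, n_bins=5):
    """Information Value of a continuous feature for a binary target,
    over quantile bins: IV = sum (good% - bad%) * log(good% / bad%)."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    eps = 1e-6  # guard against log(0) in sparse bins
    iv = 0.0
    for b in range(n_bins):
        mask = bins == b
        good = max(((target == 0) & mask).sum() / (target == 0).sum(), eps)
        bad = max(((target == 1) & mask).sum() / (target == 1).sum(), eps)
        iv += (good - bad) * np.log(good / bad)
    return iv

# Illustrative check: an informative feature vs. pure noise.
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(500), np.ones(500)])
x_strong = y + rng.normal(0, 0.5, 1000)
x_noise = rng.normal(0, 1, 1000)
```

By the usual rule of thumb, IV above roughly 0.3 indicates a strong predictor, so ratios such as retained earnings/total assets can be ranked against weaker indicators before modeling.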
4. Simulated and Real Multi-Omic "BANSData" for Bayesian Genomic Network Learning (Ha et al., 2020)
Within the context of Bayesian Structure Learning in multi-layered genomic networks, "BANSData" refers to both simulated benchmarks and real cancer genomics compendia that support the BANS (Bayesian Node-wise Selection) algorithm under the multi-layer Gaussian graphical model (mlGGM) (Ha et al., 2020).
Simulated Data Construction
- Variables: $p$ in total, partitioned into ordered layers $1, \dots, L$.
- Edge sampling: within-layer edges are undirected, drawn per pair as Bernoulli($\pi$); between consecutive layers, directed edges are likewise sampled as Bernoulli($\pi$).
- Adjacency matrix: encodes the undirected (within-layer) and directed (between-layer) connections per construction.
- Edge weights: off-diagonal precision entries and between-layer regression coefficients are sampled i.i.d. from a uniform distribution bounded away from zero.
- Diagonal elements ensure positive-definiteness through diagonal dominance: $\Omega_{jj} = \sum_{k \ne j} |\Omega_{jk}| + \delta$, $\delta > 0$.
- Covariance: $\Sigma = \Omega^{-1}$.
- Data: $n$ samples drawn i.i.d. from $\mathcal{N}(0, \Sigma)$, for multiple scenarios varying the sample size (e.g., 200), the number of variables, and the layer dimensions (e.g., 10), as specified.
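The construction above can be sketched in NumPy for the within-layer (undirected) part; the edge probability, weight range, diagonal constant, and sizes are assumed illustrative values, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
p, edge_prob = 20, 0.1  # illustrative dimensions and sparsity

# Symmetric edge support from per-pair Bernoulli draws.
upper = np.triu(rng.random((p, p)) < edge_prob, k=1)
support = upper | upper.T

# Off-diagonal weights on the support, symmetrized, then diagonal
# dominance to guarantee a positive-definite precision matrix.
omega = support * rng.uniform(0.5, 1.0, (p, p))
omega = (omega + omega.T) / 2
np.fill_diagonal(omega, np.abs(omega).sum(axis=1) + 0.1)

sigma = np.linalg.inv(omega)  # covariance Sigma = Omega^{-1}
X = rng.multivariate_normal(np.zeros(p), sigma, size=100)  # n = 100 draws
```

Diagonal dominance (each diagonal entry exceeding the absolute row sum of the off-diagonals) is a standard sufficient condition for positive definiteness, which is why the construction adds the row sums plus a small constant to the diagonal.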
Real TCGA Multi-Omic Data
- Seven cancer types (e.g., LUAD, LUSC, COAD) and four platforms (CNA, DNA methylation, mRNA-seq, RPPA protein expression).
- Selection for each pathway: restrict to genes of pathway interest; retain only samples with data present for all platforms; log-transform as needed; center each feature to mean zero.
- The resulting data matrix encodes genes (columns) across four platforms (layers).
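The log-transform and centering steps can be sketched as follows; the pseudocount is an assumption, not a value taken from the paper:

```python
import numpy as np

def preprocess(X, log_transform=False):
    """Optionally log-transform (with a pseudocount) and center each
    feature to mean zero, per the TCGA preparation described above."""
    X = np.asarray(X, dtype=float)
    if log_transform:
        X = np.log2(X + 1.0)  # pseudocount of 1 avoids log(0)
    return X - X.mean(axis=0)
```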
Model Formulation
- Observations: $x \in \mathbb{R}^p$, partitioned into $L$ layers.
- The mlGGM assumes $x \sim \mathcal{N}(0, \Sigma)$ with precision matrix $\Omega = \Sigma^{-1}$.
- Undirected edges (within layer) and directed edges (between consecutive layers) are modeled with edge-specific spike-and-slab priors.
- The likelihood factorizes node-wise under the layer ordering: each variable $x_j$ is Gaussian given its within-layer neighbors and its directed parents in earlier layers, so edge selection can proceed one node at a time (the node-wise selection in BANS).
- Edge selection via Gibbs sampling yields posterior inclusion probabilities for edges, thresholded by Bayesian FDR or median probability rules.
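A small sketch of Bayesian-FDR thresholding of posterior inclusion probabilities (the median-probability alternative is simply `ppi > 0.5`); the selection rule follows the usual cumulative-mean formulation, which may differ in detail from the paper's:

```python
import numpy as np

def bayesian_fdr_threshold(ppi, alpha=0.10):
    """Select edges from posterior inclusion probabilities so that the
    estimated Bayesian FDR (mean posterior probability of a false
    inclusion among selected edges) stays at or below alpha."""
    ppi = np.asarray(ppi, dtype=float)
    order = np.argsort(ppi)[::-1]  # most probable edges first
    # Running estimate of the FDR over the top-k prefix.
    fdr = np.cumsum(1.0 - ppi[order]) / np.arange(1, len(ppi) + 1)
    k = np.searchsorted(fdr, alpha, side="right")  # largest prefix with FDR <= alpha
    selected = np.zeros(len(ppi), dtype=bool)
    selected[order[:k]] = True
    return selected
```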
This BANSData enables benchmarking and real-world inference for multi-layer network reconstruction in genomics.
5. Comparative Overview
| Context | Domain | Size / Features | Composition / Construction | Licensing / Access |
|---|---|---|---|---|
| Bengali Summarization | Text/NLP | 19,096 doc–summary pairs | News; 1 summary per article; prior corpora | Public, research use (implicit) |
| Polish Offensive Speech | Social/NLP | 691,662 text items (half harmful/neutral) | Wykop.pl moderation, 5-moderator review, balance | CC BY 4.0 (partial release) |
| Bankruptcy Prediction | Finance/ML | 78,682 firm-years, 18 ratios | SEC filings, US firms, no missing entries | Freely available, no restrictions |
| Genomic Network ML | Genomics | Simulated benchmarks + TCGA multi-omics | Benchmarks: ER random graphs; Real: TCGA cancer | TCGA (public); sim benchmark open |
This comparison highlights domain differences and dataset-specific collection paradigms, underscoring the many-to-one naming collision ("BANSData") across otherwise unrelated research fields.
6. Research Impact and Usage Recommendations
Each instantiation of BANSData is recognized as a reference dataset for its corresponding domain:
- Bengali summarization: the gold standard for evaluating generative LLMs in Bangla, supporting advancements in pointer-generator and coverage mechanisms, with clear test splits and reproducible baselines (Dhar et al., 2021).
- Polish harmful content detection: the scale and rigor of BAN-PL address critical gaps in Polish-language content moderation and support robust assessment of harm-detection architectures, including adversarial and bias-sensitive methods (Kołos et al., 2023).
- Bankruptcy forecasting: BANSData underpins methodologically sound, temporally consistent comparative evaluation of accounting-ratio bankruptcy prediction pipelines; it is ideally suited for interpretability-focused and ensemble modeling (Wang et al., 2024).
- Genomic network learning: BANSData in simulation and real TCGA contexts is essential for benchmarking Bayesian graphical models in integrative omics, enabling formal inference in high-dimensional multi-layer systems (Ha et al., 2020).
Users should be aware of the potential for confusion due to homonymous dataset naming across these distinct subfields, as well as domain-specific limitations including community and regulator bias (BAN-PL), scope restriction to publicly traded firms (finance), and preprocessing/annotation protocols (all domains).