
CONLIT: 21st-Century Anglophone Fiction Corpus

Updated 21 November 2025
  • The CONLIT dataset is a corpus of 21st-century Anglophone fiction that provides detailed metadata and narrative features for quantitative literary analysis.
  • It supports rigorous analysis of genre distinctions and gender effects, enabling comparative studies between genre and literary fiction through hand-tagged metadata.
  • The dataset facilitates examination of narrative structure, character metrics, and stylistic features using advanced statistical methodologies and preprocessing techniques.

CONLIT is a corpus representing twenty-first-century Anglophone fiction, constructed to support large-scale computational analysis of contemporary literary production, with a specific focus on formal and institutional theories of genre, genre fiction, and literary prestige. It provides comprehensive, hand-curated metadata and narrative features for 2,754 English-language novels published between 2001 and 2021, and is designed for rigorous quantitative study of literary and generic boundaries in the modern Anglo-American literary field (Johnson, 13 Nov 2025).

1. Definition, Origin, and Purpose

CONLIT (Contemporary Literature) was assembled and released by Andrew Piper and collaborators with the explicit purpose of enabling computational research on questions central to literary studies, such as the distinction between genre fiction and literary fiction in both formal and institutional terms. Works are selected to permit both comparative (genre-genre, genre-literary) and intersectional (gender by genre) investigations. All works are in English and were published between 2001 and 2021. Hand tagging by genre, together with “prizelists” metadata indexing major literary awards, provides a corpus that is explicitly structured to facilitate both the operationalization and the critique of contemporary genre categories (Johnson, 13 Nov 2025).

2. Corpus Composition and Structure

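The full CONLIT corpus contains 2,754 novels. The genre and literary fiction subsets used for comparative analysis (852 novels in total) are distributed as shown in the table below (genre counts are after deduplication by genre tagging):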

| Category | Count | Female authors | Male authors |
|---|---|---|---|
| Science Fiction (SF) | 222 | 72 | 150 |
| Mystery (MY) | 231 | 117 | 110 |
| Romance (ROM) | 208 | 205 | 2 (two removed in some analyses) |
| Combined Genre (All) | 661 | 394 | 262 |
| Literary Fiction | 191 | 80 | 111 |

Assignment to “literary fiction” is based on appearance in major prize lists (Faulkner, National Book Award, Man Booker, Giller), with the exclusion of the Governor General’s Award, resulting in 191 “literary fiction” novels from an initial pool of 258 prize-listed works [(Johnson, 13 Nov 2025), Table 1].

Gender annotation is binary, and books with unspecified author gender (N=5) are excluded from gender-partitioned analyses.

3. Metadata Features and Text-Level Representation

CONLIT’s records are organized around two broad levels: item-level bibliographic metadata, and a suite of narrative and linguistic features. Item-level metadata encompasses unique identifiers, author name, title, publication year (within 2001–2021), manual genre tags (SF, mystery, romance, prize), and binary author gender.

Text-level features, as summarized in Table 2 of (Johnson, 13 Nov 2025), capture:

  • Narrative Structure:
    • Average Speed (pace)
    • Minimum Speed (distance)
    • Circuitousness (narrative non-linearity)
    • Topical Heterogeneity (semantic spread)
    • Event Count (estimated diegetic events)
  • Character Metrics:
    • Total Characters (threshold: named ≥ 5 occurrences)
    • Protagonist Concentration (fraction of mentions by main character)
    • Probability of First-Person POV (binary encoding, intermediate cases dropped)
  • Language and Readability:
    • Token Count (words plus punctuation)
    • Average Word Length
    • Average Sentence Length
    • Tuldava Score (reading-difficulty measure)
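
The two count-based character metrics above can be illustrated with a minimal sketch. It assumes character mentions have already been extracted and resolved to names by some upstream tool, treats the most frequently mentioned character as the protagonist, and divides by all mentions; the function name `character_metrics` and these choices are illustrative assumptions, not the corpus authors' pipeline.

```python
from collections import Counter


def character_metrics(mentions, min_occurrences=5):
    """Compute two character metrics from a list of resolved mentions.

    `mentions` holds one character name per mention in the text.
    The input format and the protagonist heuristic are assumptions.
    """
    counts = Counter(mentions)
    if not counts:
        return {"total_characters": 0, "protagonist_concentration": 0.0}
    # Count only characters named at least `min_occurrences` times.
    total_characters = sum(1 for c in counts.values() if c >= min_occurrences)
    # Assumption: the protagonist is the most frequently mentioned character.
    _, top_count = counts.most_common(1)[0]
    concentration = top_count / sum(counts.values())
    return {
        "total_characters": total_characters,
        "protagonist_concentration": concentration,
    }
```

Whether the denominator should include characters below the five-occurrence threshold is not stated in the description above, so that choice may need adjusting.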

For lexical analyses, each novel includes a unigram frequency vector over a vocabulary of 10,000 terms (selected by presence in ≥18% of texts). Static semantic representations use fastText embeddings (Common Crawl), aggregated per book by TF–IDF weighted averaging and L2 normalization, for use in stylistic and semantic distance computations [(Johnson, 13 Nov 2025), Section 4.3].
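
The per-book aggregation can be sketched as follows. This is a minimal illustration that assumes raw term counts as the TF component and `log(N / df)` as the IDF component; the exact weighting variant is not specified above, and `book_vector` and its parameter names are hypothetical.

```python
import math


def book_vector(term_counts, doc_freq, n_docs, embeddings):
    """TF-IDF weighted mean of word vectors, scaled to unit L2 norm.

    term_counts: {term: count in this book}
    doc_freq:    {term: number of books containing the term}
    embeddings:  {term: list of floats}, e.g. loaded from fastText vectors
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, tf in term_counts.items():
        if term not in embeddings or term not in doc_freq:
            continue  # skip terms without a vector or a document frequency
        weight = tf * math.log(n_docs / doc_freq[term])  # assumed TF-IDF form
        weight_sum += weight
        for i, value in enumerate(embeddings[term]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no usable terms: return the zero vector
    mean = [t / weight_sum for t in total]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean] if norm > 0 else mean
```

With unit-norm book vectors, cosine similarity between two books reduces to a plain dot product, which is convenient for the distance computations mentioned above.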

4. Preprocessing, Cleaning, and Feature Engineering

Initial corpus preprocessing consisted of deduplication (removing four books tagged concurrently as SF and mystery) and filtering for “literary fiction” by excluding works shortlisted only for the Governor General’s Award. Author gender is encoded as binary; books with missing author gender, and in some analyses the two male-authored romances, are dropped where tests require adequate group sizes or approximate normality [(Johnson, 13 Nov 2025), Section 4.1, Table 3].
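
The two filtering steps can be sketched as below. The record layout (dicts with `genres` and `prizes` lists), the tag strings, and the prize label are hypothetical; the actual column names in the distributed metadata may differ.

```python
def drop_dual_tagged(books, pair=("SF", "MY")):
    """Remove books carrying both genre tags in `pair` (deduplication step)."""
    return [b for b in books if not set(pair) <= set(b["genres"])]


def literary_subset(books, excluded="Governor General's Award"):
    """Keep books listed for at least one prize other than `excluded`."""
    return [b for b in books if set(b["prizes"]) - {excluded}]


# Hypothetical records for illustration only.
sample = [
    {"id": 1, "genres": ["SF"], "prizes": []},
    {"id": 2, "genres": ["SF", "MY"], "prizes": []},
    {"id": 3, "genres": [], "prizes": ["Giller"]},
    {"id": 4, "genres": [], "prizes": ["Governor General's Award"]},
]
genre_books = drop_dual_tagged([b for b in sample if b["genres"]])
literary_books = literary_subset(sample)
```

A book listed for the excluded award and for another prize is retained, matching the "shortlisted only for" wording.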

Text-level feature engineering includes:

  • Enforcement of strict binarity on probability of first-person narration (22 books near p=0.5 excluded from related statistical tests).
  • Z-score normalization of continuous features (regression and PCA workflows), except for binary variables.
  • Multicollinearity diagnostics via Variance Inflation Factors (VIF); features with VIF > 10 (topical heterogeneity, minimum speed, total characters, Tuldava score) are excluded in reduced regression models [(Johnson, 13 Nov 2025), Section 4.2; Appendix C].
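
The three steps above can be sketched with the standard library alone. The cutoffs used for "near p=0.5" (0.4 and 0.6) are illustrative assumptions, since the exact exclusion band is not given here. VIFs are computed as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R²) from regressing each feature on the others.

```python
from statistics import mean, stdev


def binarize_first_person(p, low=0.4, high=0.6):
    """Return 1 or 0, or None for intermediate cases (assumed cutoffs)."""
    if p >= high:
        return 1
    if p <= low:
        return 0
    return None


def zscore(values):
    """Standardize a continuous feature to mean 0 and sample SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation_matrix(columns):
    """Pearson correlations between feature columns of equal length."""
    z = [zscore(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Variance inflation factor for each feature column."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]
```

Features whose VIF exceeds 10 would then be dropped before fitting the reduced regression models.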

5. Statistical and Analytical Methodologies

CONLIT is expressly tailored for quantitative literary analysis and features several layers of statistical methodology:

  • Welch’s ANOVA:

One-way tests with unequal variances to assess groupwise differences in narrative features across genres and gender splits. The test statistic, as used, is:

$$F_W = \frac{\sum_{j=1}^k w_j(\bar X_j - \bar X)^2/(k-1)}{1 + \tfrac{2(k-2)}{k^2-1}\sum_{j=1}^k \frac{1}{n_j - 1}\left(1 - \frac{w_j}{W}\right)^2}$$

where $w_j = n_j / s_j^2$, $W = \sum_{j=1}^k w_j$, and $\bar X = \sum_{j=1}^k w_j \bar X_j / W$ is the weighted grand mean, with usual notation [(Johnson, 13 Nov 2025), Section 4.1].
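
As a worked illustration, Welch's statistic can be computed directly from group samples. The sketch below assumes the standard form of the test, in which each weight is divided by the sum of weights W inside the correction term and the grand mean is weighted; `welch_anova` is an illustrative name rather than a library function.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    groups: list of lists of numeric observations, one list per group.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    w = [nj / variance(g) for nj, g in zip(n, groups)]  # w_j = n_j / s_j^2
    w_total = sum(w)
    grand = sum(wj * mj for wj, mj in zip(w, means)) / w_total
    numerator = sum(wj * (mj - grand) ** 2 for wj, mj in zip(w, means)) / (k - 1)
    lam = sum((1 - wj / w_total) ** 2 / (nj - 1) for wj, nj in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)  # approximate denominator degrees of freedom
    return numerator / denominator, df1, df2
```

With two groups the statistic reduces to the square of Welch's t, which gives a convenient sanity check.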

  • Logistic Regression:

Binary outcome predicting “literary” classification as a function of standardized narrative features and author gender, both with and without gender-feature interaction terms:

$$\log\!\left(\frac{P(\mathrm{Literary})}{1-P(\mathrm{Literary})}\right) = \beta_0 + \delta\,G + \sum_{i=1}^p \beta_i X_i + \sum_{i=1}^p \gamma_i (X_i \times G)$$

with $G=1$ for female authors [(Johnson, 13 Nov 2025), Section 4.2]. Likelihood-ratio tests establish the significance of included interaction terms.
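
To make the role of the interaction terms concrete, the sketch below evaluates the linear predictor for one book and computes the likelihood-ratio statistic from two fitted log-likelihoods. The coefficient values used in any call are hypothetical; fitting is left to a statistics package.

```python
import math


def literary_log_odds(x, g, beta0, delta, beta, gamma):
    """Linear predictor with main effects and gender-by-feature interactions.

    x: standardized narrative features; g: 1 for female authors, else 0.
    """
    main = sum(b * xi for b, xi in zip(beta, x))
    interaction = sum(c * xi * g for c, xi in zip(gamma, x))
    return beta0 + delta * g + main + interaction


def literary_probability(x, g, beta0, delta, beta, gamma):
    """Inverse-logit of the linear predictor."""
    z = literary_log_odds(x, g, beta0, delta, beta, gamma)
    return 1.0 / (1.0 + math.exp(-z))


def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """2 * (ll_full - ll_reduced); compare against chi-square with p df."""
    return 2.0 * (loglik_full - loglik_reduced)
```

For male authors (G = 0) the interaction sum vanishes, so each γ_i measures how much the slope of feature i differs for female authors.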

  • Multivariate Distance Tests:

Stylistic and semantic differences are assessed using within-group dispersion (PERMDISP), the nonparametric $W_d^*$ and $T^2_w$ statistics, and effect size via:

$$\omega^2 = \frac{\mathrm{SS}_\mathrm{between} - \mathrm{df}_\mathrm{between}\,\hat\sigma^2}{\mathrm{SS}_\mathrm{total} + \hat\sigma^2}$$

where $\hat\sigma^2 = \mathrm{SS}_\mathrm{within}/\mathrm{df}_\mathrm{within}$ [(Johnson, 13 Nov 2025), Appendix D].
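
The effect-size formula can be illustrated on a single numeric feature. This sketch is a univariate simplification: in the multivariate setting described above the sums of squares would be derived from pairwise distances, which is not shown here.

```python
from statistics import mean


def omega_squared(groups):
    """Omega-squared effect size for a one-way layout (univariate sketch)."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ss_total = ss_between + ss_within
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    sigma2 = ss_within / df_within  # pooled within-group variance estimate
    return (ss_between - df_between * sigma2) / (ss_total + sigma2)
```

Unlike eta-squared, this estimator subtracts the variance expected by chance from the between-group term, so it can be slightly negative when groups barely differ.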

Principal component analysis is employed for visualization of stylistic and semantic feature distances (see Figure 1 of (Johnson, 13 Nov 2025)).

6. Access, Data Representations, and Limitations

CONLIT is distributed as metadata tables (CSV or TSV) and auxiliary feature files. For each novel, these provide bibliographic records, genre tags, award lists, and manually coded author gender. Unigram counts (CSV format) and static embedding vectors are included for each book. No full text is distributed; analysis relies on the extracted features and the semantic or stylistic vectors.

Identified limitations include:

  • Genre Scope: Restricted to science fiction, mystery, and romance; not comprehensive for all genre fiction subgenres.
  • Text Access: Full texts are excluded; only feature vectors and lexical data are supplied.
  • Feature Validation: Some narrative metrics are validated primarily on non-contemporary corpora.
  • Gender Annotation: Author gender coded as binary based on perceived information; this excludes non-binary and possibly misclassifies pseudonymous authors.
  • Romance Author Gender Skew: Amazon-derived romance selection yields predominance of female authors, possibly exaggerating gender effects.
  • Literary Label: Prize-based definition of “literary fiction” may omit works perceived as literary outside major prize circuits [(Johnson, 13 Nov 2025), Section 6.3].

7. Research Applications and Impact

CONLIT provides a platform for analysis of genre and literary fiction at scale, supporting empirical tests of formalist and institutional claims about literary categorization. Applications demonstrated include:

  • Assessment of statistically significant formal markers distinguishing genres, with particular attention to the moderating effects of author gender on narrative feature/literary status relationships (Johnson, 13 Nov 2025).
  • Regression and dispersion-based modeling of stylistic and semantic boundaries among genre and literary fiction categories.
  • Critical analysis of how gendered authorship affects both the formal characteristics and the institutional gatekeeping of literary classification.

A plausible implication is that CONLIT’s corpus architecture and metadata framework can be extended to cover additional genres, further award mechanisms, and the interplay of demographic and formal literary features.
