
Bidirectional Readability Assessment Mechanism

Updated 3 December 2025
  • Bidirectional readability assessment mechanisms are techniques that integrate context from both directions to accurately gauge text complexity.
  • They fuse deep language models, handcrafted linguistic features, and interactive feedback through architectures like BERT fusion and hierarchical neural ranking.
  • Empirical evaluations show notable improvements in F1, accuracy, and QWK metrics across English, Filipino, and Chinese, demonstrating their effectiveness in diverse and low-resource settings.

A bidirectional readability assessment mechanism refers to architectures and workflows capable of leveraging contextual information in both directions within texts to assess their reading difficulty, often by jointly modeling sentence-level and document-level signals and enabling the interactive propagation of feedback. Recent research highlights approaches that fuse bidirectional deep language models, handcrafted linguistic features, hierarchical neural ranking, and iterative human–machine feedback loops, substantially outperforming unidirectional or surface-formula methods in English, Filipino, and Chinese, and enabling effective application in low-resource settings.

1. Core Principles of Bidirectionality in Readability Assessment

Bidirectional mechanisms employ models where information flows in both forward and backward directions: at the token level through bidirectional encoders, at the hierarchical level by aggregating and redistributing sentence/document scores, and at the workflow level via interactive feedback incorporation. Architectures such as BERT and Bi-LSTM/transformer stacks realize this by encoding tokens or sentence units with simultaneous left-and-right context, which captures syntactic, semantic, and discourse dependencies (e.g., subordinate clauses, referential cohesion, center-embedded structures).
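The token-level idea can be illustrated with a minimal sketch: each position concatenates a left-to-right running summary with a right-to-left one, so every token representation "sees" both contexts. This is a toy stand-in (cumulative sums over 1-d embeddings), not a real Bi-LSTM or BERT encoder.

```python
def bidirectional_encode(embeddings):
    """Toy bidirectional encoding: embeddings is a list of 1-d token values."""
    n = len(embeddings)
    fwd, acc = [], 0.0
    for x in embeddings:              # forward pass: accumulate left context
        acc += x
        fwd.append(acc)
    bwd, acc = [0.0] * n, 0.0
    for i in range(n - 1, -1, -1):    # backward pass: accumulate right context
        acc += embeddings[i]
        bwd[i] = acc
    # h_i = [forward_i ; backward_i], analogous to concatenating the two
    # directional hidden states of a Bi-LSTM
    return [(f, b) for f, b in zip(fwd, bwd)]

print(bidirectional_encode([1.0, 2.0, 3.0]))
# → [(1.0, 6.0), (3.0, 5.0), (6.0, 3.0)]
```

Note that even the first token's representation already encodes the whole right context via the backward pass, which is exactly what unidirectional models lack.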

A corollary is that bidirectional context models internalize a broader spectrum of linguistic complexity compared to classical readability formulas or unidirectional models, allowing accurate discrimination of intermediate/advanced levels and supporting robust label propagation within long documents or evolving test suites (Imperial, 2021; Zheng et al., 26 Nov 2025; Delgado-Pérez et al., 13 Jan 2024).

2. Architectural Strategies

Three primary architectures define state-of-the-art bidirectional readability assessment:

  • BERT fusion pipelines: A pre-trained, frozen BERT encoder produces sentence/document embeddings via mean pooling of its final hidden layer (excluding the [CLS] and [SEP] tokens). These embeddings ($e \in \mathbb{R}^{768}$) are concatenated with traditional linguistic feature vectors ($f \in \mathbb{R}^{F}$; $F = 54$–$155$ depending on language and corpus), yielding joint representations $x = [e; f] \in \mathbb{R}^{768+F}$ for downstream classification (Imperial, 2021).
  • Hierarchical bidirectional neural ranking: Long documents $S = \{s_1, \ldots, s_n\}$ pass through word-embedding layers followed by Bi-LSTM stacks, generating contextual representations $h_{i,j}^t = [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}] \in \mathbb{R}^{2d}$. Attention mechanisms aggregate token-level signals into sentence vectors, which traverse transformer blocks before being aggregated via self-attention into document-level vectors. Document readability is predicted, propagated back to induce sentence labels, and these sentence labels are looped forward to further refine document predictions (Zheng et al., 26 Nov 2025).
  • Interactive evolutionary workflows: In software test generation, genetic algorithms (e.g., DynaMOSA in InterEvo-TR) periodically solicit human feedback on the readability of candidate minimized test cases. Scores are stored and used to bias the search toward readable patterns, implementing a feedback loop: GA generates tests → human rates them → ratings seed search preferences → readable suite construction (Delgado-Pérez et al., 13 Jan 2024).
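The first architecture above can be sketched concretely: mean-pool final-layer token vectors while dropping [CLS]/[SEP], then concatenate the handcrafted feature vector to form $x = [e; f]$. This is a hedged illustration with plain Python lists and invented toy values; a real pipeline would obtain `hidden` from a BERT model (e.g., via the `transformers` library) and use numpy arrays.

```python
def mean_pool(hidden_states, tokens, special=("[CLS]", "[SEP]")):
    """Average final-layer vectors over non-special tokens."""
    kept = [h for h, t in zip(hidden_states, tokens) if t not in special]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def fuse(embedding, features):
    """Joint representation x = [e; f] via direct concatenation."""
    return embedding + features

# Toy stand-ins: 2-d "BERT" states and one handcrafted feature value.
tokens = ["[CLS]", "the", "cat", "[SEP]"]
hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
x = fuse(mean_pool(hidden, tokens), [0.5])
print(x)  # → [2.0, 3.0, 0.5]
```

The special-token vectors (here the `[9.0, 9.0]` rows) are excluded from pooling, matching the mean-pooling protocol described above.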

3. Feature Engineering and Fusion

Handcrafted linguistic features enrich model representations beyond what deep encoders capture, particularly in low-resource languages. For English, feature sets (F≈155) include lexical diversity, syntactic tree metrics, morphosyntactic counts, and psycholinguistic norms. Filipino corpora utilize surface, orthographic, syllable, morphological, and basic POS-derived features (F≈54). Feature extraction follows domain protocols (Vajjala & Meurers for English, Imperial & Ong for Filipino) (Imperial, 2021).

Fusion occurs via direct concatenation: joint vectors $x = [e; f]$ enable classifiers (logistic regression, SVM, random forest) to leverage both all-purpose semantic/syntactic cues (from BERT/Bi-LSTM) and engineered predictors of text complexity. Ablation studies confirm that bidirectional embeddings alone can sometimes substitute for explicit linguistic features, with the largest F1 gains observed in low-resource scenarios (Imperial, 2021).
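A downstream classifier on fused vectors can be sketched as follows, using a toy nearest-centroid classifier in place of the logistic regression / SVM / random forest used in the cited work. All data here is invented for illustration; each vector is a fused $x = [e; f]$.

```python
def fit_centroids(X, y):
    """Compute one mean vector per readability label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        dim = len(rows[0])
        centroids[label] = [sum(r[i] for r in rows) / len(rows)
                            for i in range(dim)]
    return centroids

def predict(centroids, x):
    """Assign the label of the nearest centroid (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda lab: dist2(centroids[lab], x))

# Fused vectors: [embedding dims ... ; handcrafted feature]
X = [[0.1, 0.2, 5.0], [0.2, 0.1, 4.0], [0.9, 0.8, 20.0], [0.8, 0.9, 22.0]]
y = ["easy", "easy", "hard", "hard"]
model = fit_centroids(X, y)
print(predict(model, [0.85, 0.85, 21.0]))  # → hard
```

The point of the fusion is visible even here: both the embedding dimensions and the engineered feature contribute to the distance that decides the label.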

4. Hierarchical Aggregation and Label Propagation

Hierarchical ranking neural mechanisms apply bidirectional context modeling at word, sentence, and document levels. After encoding sentences, aggregated vectors $U^s = [u_1^s, \ldots, u_n^s]$ are compressed into document vectors $d$ with source2token attention. Document readability is classified via $r = \mathrm{softmax}(W^d d + b^d)$, but a reverse leg propagates document scores back to derive unsupervised sentence-level labels, using multi-head difficulty embeddings and attention-weighted reconstructions. KL divergence enforces consistency between the supervised and unsupervised flows, with the combined loss ensuring bidirectional synergy (Zheng et al., 26 Nov 2025).
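The document head and the KL-consistency idea can be sketched in a few lines: classify the document vector with $\mathrm{softmax}(W^d d + b^d)$, then penalize divergence between the supervised distribution and the unsupervised (reverse-leg) estimate. Shapes and values below are illustrative only; `r_unsup` is a stand-in for the reconstruction the cited mechanism derives.

```python
import math

def softmax(z):
    m = max(z)                                   # stabilize the exponentials
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def linear(W, d, b):
    """Compute W d + b for a list-of-lists matrix W."""
    return [sum(wij * dj for wij, dj in zip(row, d)) + bi
            for row, bi in zip(W, b)]

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

d = [0.2, 1.0]                                   # toy document vector
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]         # 3 readability grades
b = [0.0, 0.0, 0.0]
r_sup = softmax(linear(W, d, b))                 # supervised distribution
r_unsup = [0.2, 0.5, 0.3]                        # stand-in reverse-leg estimate
loss = kl(r_sup, r_unsup)                        # consistency term in the loss
print(round(sum(r_sup), 6), loss >= 0)           # → 1.0 True
```

In training, this consistency term would be added to the supervised classification loss so the two flows reinforce each other.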

Pairwise (ranking) heads model the strict ordinal structure of readability grades by learning document-level label differences across all pairs, which informs voting schemes for final label assignment. This approach yields notable performance increases in both English and Chinese datasets across accuracy, weighted F1, and QWK metrics (Zheng et al., 26 Nov 2025).
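A simplified sketch of the pairwise-ranking idea: the head predicts a signed grade difference for each pair, and comparisons against anchor documents of known grade are turned into votes for the final label. The voting details here are an assumption for illustration, simplified from the cited mechanism.

```python
def vote_label(pairwise_diff, anchors, grades):
    """pairwise_diff(a): predicted (target_grade - anchor_grade) for anchor a."""
    votes = {g: 0 for g in grades}
    for anchor, anchor_grade in anchors:
        implied = anchor_grade + pairwise_diff(anchor)   # implied target grade
        nearest = min(grades, key=lambda g: abs(g - implied))
        votes[nearest] += 1                              # one vote per pair
    return max(votes, key=votes.get)

# Invented predictions: the target document vs. three anchors of known grade.
anchors = [("docA", 1), ("docB", 2), ("docC", 3)]
diffs = {"docA": +1.1, "docB": +0.2, "docC": -0.9}
label = vote_label(lambda a: diffs[a], anchors, grades=[1, 2, 3, 4])
print(label)  # → 2
```

Because every anchor's comparison implies roughly the same grade, the vote is unanimous here; in practice disagreements across pairs are what the voting scheme resolves.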

5. Interactive and Bidirectional Human–Machine Feedback Loops

In interactive evolutionary test generation (InterEvo-TR), the bidirectional mechanism is realized by direct tester involvement. The algorithm schedules interaction moments according to global coverage progress and fixed generation intervals. At each pause, multiple diverse targets are selected; minimized test cases (unique and unseen) are presented as code for human scoring. Scores are stored in the Readability Archive and top-rated candidates populate the Preference Archive. Subsequent genetic algorithm reproduction can draw from these archives with a tunable probability, causing the search to converge on test suites that balance coverage and human-perceived readability (Delgado-Pérez et al., 13 Jan 2024).
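The archive mechanics can be sketched as follows. This is a hedged illustration of the loop described above, not InterEvo-TR's actual implementation: human scores populate a readability archive, the top-rated tests form a preference archive, and parent selection draws from it with a tunable probability; the genetic operators themselves are elided.

```python
import random

def record_score(readability_archive, test_id, score):
    """Store a human readability rating for a presented test case."""
    readability_archive[test_id] = score

def preference_archive(readability_archive, top_k=2):
    """Keep the top-rated test cases as seeds for reproduction."""
    ranked = sorted(readability_archive, key=readability_archive.get,
                    reverse=True)
    return ranked[:top_k]

def pick_parent(population, preferred, p_pref, rng):
    """With probability p_pref, draw from the human-preferred archive."""
    if preferred and rng.random() < p_pref:
        return rng.choice(preferred)
    return rng.choice(population)

archive = {}
for tid, score in [("t1", 4), ("t2", 1), ("t3", 5)]:
    record_score(archive, tid, score)
pref = preference_archive(archive)
print(pref)  # → ['t3', 't1']
rng = random.Random(0)
parent = pick_parent(["t1", "t2", "t3", "t4"], pref, p_pref=0.5, rng=rng)
print(parent in {"t1", "t2", "t3", "t4"})  # → True
```

Raising `p_pref` strengthens the bias toward human-preferred patterns, which is the tunable-probability knob mentioned above.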

Empirical studies show significant gains in perceived readability, with 90% of testers rating interactive suites higher than automated benchmarks. The feedback loop is fully bidirectional: not only is model output evaluated, but human preference guides future generations, leading to superior final outcomes. This paradigm addresses both skepticism of automated test comprehensibility and offers granular customization for tester engagement.

6. Datasets, Evaluation Protocols, and Empirical Performance

Bidirectional readability mechanisms have been evaluated on a spectrum of corpora.

Texts are normalized, tokenized, and featurized as dictated by language/corpus. Evaluation employs 5-fold cross-validation, with weighted F1 as the principal metric to manage class imbalance. Document/sentence neural approaches also utilize accuracy, adjacent accuracy, precision, recall, and QWK.
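Weighted F1, the principal metric above, averages per-class F1 scores weighted by each class's true frequency, which keeps minority readability levels from being drowned out. A stdlib-only sketch on invented labels (equivalent to scikit-learn's `f1_score(..., average="weighted")`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    total = 0.0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += f1 * support[c] / len(y_true)
    return total

y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 3]
print(round(weighted_f1(y_true, y_pred), 4))  # → 0.8333
```

In a class-imbalanced corpus, a classifier that ignores rare grades scores markedly lower on this metric than on plain accuracy, which is why it is preferred here.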

The BERT-fusion pipeline yields a 2.63–6.23% F1 improvement over linguistic features alone in English, and a 12.4% F1 gain for Filipino (feature-only baseline: ~0.389, fused: 0.554). Neural hierarchical ranking models achieve up to 89.5% accuracy and 94.1% QWK on English, and improvements from 26.5% to 48.9% accuracy in Chinese (QWK 85.1%) (Imperial, 2021; Zheng et al., 26 Nov 2025). Interactive evolutionary suites are preferred by raters in 74% of blinded comparisons, confirming the bidirectional effect (Delgado-Pérez et al., 13 Jan 2024).

7. Significance and Implications

Bidirectional readability assessment mechanisms occupy a central position in modern computational linguistics, software engineering, and applied education research. By enabling models to both encode nuanced two-way context and iteratively integrate human feedback, these systems demonstrably outperform static, unidirectional, or surface metrics. For low-resource languages, bidirectional embeddings can substantially reduce dependence on explicit linguistic annotations or advanced parsers, maintaining high performance via internalized syntactic and semantic patterns.

This suggests that future developments may further intertwine hierarchical neural modeling and interactive human-in-the-loop protocols, ensuring robust, context-aware, and user-acceptable readability assessment across diverse domains and language resources.
