CF-BiLSTM: Contextual Fusion in Bidirectional LSTMs
- CF-BiLSTM is a neural architecture that integrates bidirectional LSTM sequence encoding with early and late fusion of heterogeneous context sources.
- It employs attention mechanisms and numeric features to merge local (sentence-level) and global (document or spatial) contexts, enhancing performance in tasks like citation-worthiness and scene labeling.
- Empirical results demonstrate improved precision, recall, and F1 scores over baseline models, validating its applicability across diverse domains.
The Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) architecture refers to a class of neural models that combine bidirectional LSTM sequence encoding with context aggregation and cross-modal or document-level fusion mechanisms. These architectures integrate rich contextual information from multiple sources, such as spatial regions in multimodal data or document structure in NLP tasks, leveraging bidirectional processing and memory-based fusion to enhance prediction accuracy in complex structured prediction settings. Notable instantiations are found in citation-worthiness detection for scientific writing (Zeng et al., 2024) and RGB-D scene labeling (Li et al., 2016), each adapted to the specific distributional characteristics and context signals of its respective domain.
1. Architectural Overview and Variants
CF-BiLSTM architectures universally employ a multi-stage encoding and fusion pipeline. In the citation-worthiness task (Zeng et al., 2024), the model operates over scientific texts, encoding sentence- and section-level context with both character- and word-level embeddings. Target, previous, and next sentences, as well as section labels, are independently embedded and processed via a shared BiLSTM. Per-segment outputs are then aggregated with an attention mechanism to yield fixed-length representations. These are subsequently fused, along with numeric contextual features, through concatenation and fed to a multilayer perceptron (MLP) classifier.
In the RGB-D scene labeling context (Li et al., 2016), CF-BiLSTM is instantiated as LSTM-CF. Here, photometric (RGB) and depth features are independently processed with convolutional backbones, encoded vertically with BiLSTM layers, and fused horizontally across the 2D spatial domain by a BiLSTM fusion layer. The global fused context vector is ultimately concatenated with fine-grained convolutional features before pixel-wise labeling.
2. Core Sequence Encoding and Attention Pooling
All CF-BiLSTM instantiations utilize BiLSTM units for context-aware sequence encoding. The standard LSTM cell consists of the usual input, forget, and output gates and a memory cell:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t)$$

Bidirectionality is implemented via dual LSTMs traversing the sequence in opposite directions, with outputs concatenated at each position: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
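The cell update and bidirectional concatenation above can be sketched directly in numpy (parameter shapes and the gate-block ordering are illustrative conventions, not prescribed by either paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4d, n), U: (4d, d), b: (4d,), with gate
    blocks ordered [input, forget, output, candidate]."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:d])            # input gate
    f = sigmoid(z[d:2 * d])        # forget gate
    o = sigmoid(z[2 * d:3 * d])    # output gate
    g = np.tanh(z[3 * d:4 * d])    # candidate memory content
    c = f * c_prev + i * g         # memory cell update
    h = o * np.tanh(c)             # hidden state
    return h, c

def bilstm(xs, params_fwd, params_bwd, d):
    """Dual LSTMs traverse the sequence in opposite directions; the
    per-position outputs are concatenated into 2d-dimensional states."""
    def run(seq, params):
        h, c = np.zeros(d), np.zeros(d)
        out = []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]
```

Each output state has dimension $2d$, since forward and backward hidden states are concatenated.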
Pooling over BiLSTM hidden states is achieved through attention. For a sequence of hidden states $h_1, \dots, h_T$, the attention mechanism computes

$$\alpha_i = \frac{\exp(\mathrm{score}(h_i))}{\sum_{j=1}^{T} \exp(\mathrm{score}(h_j))}, \qquad s = \sum_{i=1}^{T} \alpha_i h_i.$$

Score functions include dot-product, cosine, and scaled dot-product. Empirically, cosine scoring performs best for citation-worthiness (Zeng et al., 2024). In the RGB-D variant, no explicit token-level attention is used; instead, full spatial context is captured via bidirectional propagation.
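A minimal sketch of this attention pooling, with the three score variants computed against a query vector `q` (the exact parameterization of the score function, e.g., a learned query, is an assumption here):

```python
import numpy as np

def attention_pool(H, q, score="cosine"):
    """Attention pooling over hidden states H (T, d) with query q (d,).
    Returns the attended summary vector and the attention weights."""
    if score == "dot":
        e = H @ q
    elif score == "scaled_dot":
        e = (H @ q) / np.sqrt(H.shape[1])
    else:  # cosine
        e = (H @ q) / (np.linalg.norm(H, axis=1) * np.linalg.norm(q) + 1e-9)
    a = np.exp(e - e.max())
    a = a / a.sum()                # softmax over sequence positions
    return a @ H, a                # weighted sum of hidden states
```

The weights form a distribution over positions, so the pooled vector is a convex combination of the hidden states.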
3. Contextual Fusion Strategies
A defining feature of CF-BiLSTM is early and/or late fusion of heterogeneous context sources:
- Textual Fusion (Zeng et al., 2024):
- The contextual representation is formed by concatenating the attended summaries of the previous sentence ($s_{\text{prev}}$), target sentence ($s_{\text{tgt}}$), next sentence ($s_{\text{next}}$), and section label ($s_{\text{sec}}$), resulting in $c = [s_{\text{prev}}; s_{\text{tgt}}; s_{\text{next}}; s_{\text{sec}}]$.
- Numeric features, including sentence/character lengths, neighbor citation flags, and cosine similarities between BiLSTM representations of adjacent sentences, are appended to $c$ before classification.
- RGB-D Fusion (Li et al., 2016):
- Vertical context encodings for the two modalities at each spatial position $(x, y)$ are concatenated as $[h^{\text{rgb}}_{x,y}; h^{\text{depth}}_{x,y}]$.
- Horizontal BiLSTM across each row fuses these modality-wise vectors into global 2D spatial contexts.
Fusion is data-driven: the recurrent fusion layer in LSTM-CF learns data-dependent merging, outperforming static concatenation.
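For the textual variant, the fused representation can be sketched as follows (names such as `fused_representation` and the exact feature ordering are illustrative, not taken from the paper's code):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def fused_representation(s_prev, s_tgt, s_next, s_sec,
                         tgt_text, prev_cited, next_cited):
    """Concatenate the attended segment summaries, then append numeric
    context features (lengths, neighbor citation flags, similarities)."""
    c = np.concatenate([s_prev, s_tgt, s_next, s_sec])
    numeric = np.array([
        len(tgt_text.split()),           # sentence length in words
        len(tgt_text),                   # sentence length in characters
        float(prev_cited),               # previous sentence cites something
        float(next_cited),               # next sentence cites something
        cosine(s_prev, s_tgt),           # similarity to previous sentence
        cosine(s_tgt, s_next),           # similarity to next sentence
    ])
    return np.concatenate([c, numeric])  # input to the MLP classifier
```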
4. Training Protocols and Data Regimes
Key training parameters for citation-worthiness (CF-BiLSTM) are summarized as follows (Zeng et al., 2024):
- Loss: cross-entropy with $L_2$ regularization.
- Optimizer: Adam, learning rate 0.001.
- Batch size: 64; dropout rate: 0.5 (BiLSTM outputs, MLP hidden layers).
- Early stopping based on validation $F_1$.
- Embeddings: GloVe (300-dim), trainable, concatenated with char-BiLSTM embeddings (30-dim).
- Data: ACL-ARC (10k papers, 1.2M sentences), PMOA-CITE (PubMed OA, 2M+ papers, 1M sampled sentences), maintaining natural class imbalance (1:4).
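The objective in the list above (cross-entropy plus an $L_2$ penalty; the coefficient `lam` below is illustrative, as the paper's value is not reproduced here) can be written as:

```python
import numpy as np

def loss_with_l2(logits, labels, params, lam=1e-4):
    """Mean cross-entropy over class logits plus an L2 penalty on the
    model parameters. logits: (B, C); labels: (B,) ints; params: list
    of weight arrays."""
    z = logits - logits.max(axis=1, keepdims=True)       # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    l2 = lam * sum((w ** 2).sum() for w in params)
    return ce + l2
```

With uniform logits over two classes and zero weights, the loss reduces to $\ln 2 \approx 0.693$.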
For LSTM-CF (Li et al., 2016):
- Loss: per-pixel softmax cross-entropy.
- Optimization: SGD with momentum 0.9 and weight decay.
- Learning rate schedule: separate rates for pretrained (VGG) layers and newly added layers.
- Batch size: 1 (GPU memory bound).
- Data: SUNRGBD, NYUDv2 for indoor scene labeling, multi-class segmentation.
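A compact sketch of the LSTM-CF computation pattern and its per-pixel objective. The leaky-integration row scan is a simplified stand-in for the horizontal BiLSTM fusion layer, and the `ignore` label for unannotated pixels is an assumption:

```python
import numpy as np

def fuse_rows(feat_rgb, feat_depth, alpha=0.5):
    """Concatenate per-pixel RGB and depth context vectors, then propagate
    along each row in both directions (leaky integration standing in for
    the horizontal BiLSTM). feat_*: (H, W, d); returns (H, W, 4d)."""
    H, W, d = feat_rgb.shape
    merged = np.concatenate([feat_rgb, feat_depth], axis=-1)  # (H, W, 2d)
    fwd, bwd = np.zeros_like(merged), np.zeros_like(merged)
    for y in range(H):
        state = np.zeros(2 * d)
        for x in range(W):                       # left-to-right pass
            state = alpha * state + (1 - alpha) * merged[y, x]
            fwd[y, x] = state
        state = np.zeros(2 * d)
        for x in reversed(range(W)):             # right-to-left pass
            state = alpha * state + (1 - alpha) * merged[y, x]
            bwd[y, x] = state
    return np.concatenate([fwd, bwd], axis=-1)

def pixel_softmax_ce(logits, labels, ignore=255):
    """Per-pixel softmax cross-entropy over (H, W, C) logits and (H, W)
    integer labels; pixels marked `ignore` are excluded."""
    C = logits.shape[-1]
    z = logits.reshape(-1, C)
    y = labels.reshape(-1)
    keep = y != ignore
    z, y = z[keep], y[keep]
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()
```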
5. Empirical Results and Ablation Insights
Results for CF-BiLSTM in citation-worthiness prediction (Zeng et al., 2024):
| Model | Dataset | Precision | Recall | $F_1$ |
|---|---|---|---|---|
| CNN-w2v-update (Bonab et al. 2018) | ACL-ARC | — | — | 0.426 |
| Att-BiLSTM (no ctx) | ACL-ARC | 0.720 | 0.391 | 0.507 |
| Att-BiLSTM (no ctx) | PMOA-CITE | 0.883 | 0.795 | 0.837 |
| Contextual-Att-BiLSTM | PMOA-CITE | 0.907 | 0.811 | 0.856 |
- Contextual fusion yields a notable absolute $F_1$ gain (+0.019) on the large-scale PMOA-CITE dataset.
- Ablations confirm the necessity of both local (sentence-level) and global (section, document-structural) context, as well as numeric context features.
- Transfer learning across datasets is ineffective unless the model is trained jointly on both the source and target distributions.
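As a sanity check, the $F_1$ values in the table follow from the reported precision and recall via the harmonic mean:

```python
def f1(p, r):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(f1(0.720, 0.391), 3))  # 0.507 (Att-BiLSTM, ACL-ARC)
print(round(f1(0.883, 0.795), 3))  # 0.837 (Att-BiLSTM, PMOA-CITE)
print(round(f1(0.907, 0.811), 3))  # 0.856 (Contextual-Att-BiLSTM, PMOA-CITE)
```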
LSTM-CF (scene labeling) achieves 48.1% IoU on SUNRGBD (prior state-of-the-art: 45.9%) and 49.4% on NYUDv2 (vs. 43.8%). Ablating the fusion BiLSTM, modality encoders, or multi-scale context each degrades accuracy substantially (Li et al., 2016).
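The IoU figures above use the standard intersection-over-union metric for semantic segmentation; a minimal implementation over integer label maps:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:               # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

A perfect prediction scores 1.0; fully disjoint predictions score 0.0.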
6. Interpretability, Visualization, and Error Analysis
Citation-worthiness predictions are interpretable via:
- Elastic-net logistic regression (ENLR) and random forest (RF) baselines, which reveal section type tokens (e.g., "introduction," "background") and specific lexical cues as dominant features.
- Attention visualizations that correlate model focus with these significant features; e.g., frequent occurrence of "previously," "studies," or "reported" prompts citation predictions.
- Manual review of high-probability predictions reveals both source annotation errors (e.g., missing ref tags) and actual missing citations, relevant for automatic QA applications (Zeng et al., 2024).
A plausible implication is that CF-BiLSTM's attention and context mechanisms enable it to discover citation anomalies that elude human reviewers or template-based rule systems.
7. Impact, Applications, and Extensions
CF-BiLSTM architectures are foundational in domains where prediction requires integration of heterogeneous context and structured fusion of evidence:
- Citation-worthiness modeling: Enables automated QA of scientific manuscripts, aiding pre-submission and archival checks for citation omissions or errors (Zeng et al., 2024).
- Scene labeling: In semantic segmentation, joint vertical and horizontal context fusion with cross-modality integration sets a new performance baseline on RGB-D benchmarks (Li et al., 2016).
- Extensibility: The general contextual-fusion BiLSTM paradigm can be adapted to other domains where structured context signals (neighboring units, multiple data streams, explicit document structure) affect prediction. A plausible implication is applicability to tasks like discourse parsing or multimodal event detection.
The CF-BiLSTM family demonstrates that memory-based recurrent fusion of both local and global context systematically outperforms architectures limited to local context or post hoc feature concatenation, in both vision and language domains.