
Hierarchical Sentence Factorization

Updated 23 February 2026
  • Hierarchical sentence factorization is a method that decomposes complex sentences into nested semantic units reflecting their compositional syntax and semantics.
  • It employs diverse frameworks—including AMR-based parsing, rule-driven splits, and neural multi-head attention—to capture both coarse and fine-grained sentence structures.
  • This approach boosts NLP performance on tasks like semantic similarity, sentence simplification, and information extraction through explicit and interpretable representations.

Hierarchical sentence factorization refers to a class of techniques for decomposing complex sentences into multi-level structural representations that reflect the ordered, nested, and compositional properties of natural language semantics and syntax. These methods enable downstream algorithms—including both classical and deep learning architectures—to operate at multiple levels of abstraction (from predicate-argument structures to minimal atomic propositions) and to leverage explicit interrelations among constituents. Applications span sentence matching, semantic similarity, structural simplification, information extraction, discourse analysis, and compositional learning. Hierarchical factorization methods differ in their theoretical underpinnings, ranging from semantics-driven predicate-argument normalization using AMR (Abstract Meaning Representation), to syntactic/semantic splitting guided by hand-crafted rhetorical rules, to end-to-end neural architectures with learned hierarchical attention. Research in this area demonstrates measurable gains in both interpretability and empirical performance across a range of benchmarks (Liu et al., 2018, Niklaus et al., 2019, Pislar et al., 2020).

1. Formal Definitions and Representational Paradigms

Hierarchical sentence factorization can take several forms, depending on linguistic and computational motivations:

  • Predicate-argument hierarchical trees: Sentences are factored into a rooted, ordered tree $T^f_S$ of maximum depth $D$, where each node $N$ at depth $d$ encodes a nonempty semantic unit $U(N)$, typically a reordered subsequence of the original tokens. The root unit $U(N_0)$ forms a normalized predicate-argument structure over all content tokens, recursively decomposed at lower levels while maintaining coverage of the input sentence through all root-to-leaf paths. The factorization is characterized formally by a tuple $(V, E, \text{idx}, U)$, describing the set of nodes, child links, canonical indices, and semantic unit strings, respectively (Liu et al., 2018).
  • Semantic hierarchy of simplified sentences: Complex English sentences are transformed into a shallow discourse tree $D(S) = (V, E)$, where $V$ is a set of node-sentences (each a minimal, stand-alone proposition) and $E$ encodes directed, labeled links specifying rhetorical relations (e.g., Attribution, Contrast, Temporal). Each node is assigned a "core" or "context" label, identifying its information status (primary or supplementary). The resultant structure is a two-layer semantic hierarchy preserving both content and the semantic linkage among essential and ancillary information (Niklaus et al., 2019).
  • Neural attention-based compositionality: Hierarchical factorization is induced dynamically within multi-head attention networks such as MHAL, where each head softly isolates different compositional aspects (e.g., phrase, label, or role) in a sentence. The architecture jointly encodes sentence and token-level representations, encouraging consistency between fine-grained (token) and coarse-grained (sentence) decisions—thus, factorizing sentence meaning into partial summaries at multiple levels (Pislar et al., 2020).
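
The tree-shaped paradigm can be pictured with a small data structure. The following is a minimal sketch, not the representation used by Liu et al.: the class name `FactorNode`, the helper `units_at_depth`, and the example sentence are hypothetical, and it assumes that the unit of an internal node is the concatenation of the leaf tokens beneath it in left-to-right order. Canonical indices and depth-completion steps are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FactorNode:
    """One node N of a factorization tree (hypothetical illustration)."""
    token: str = ""                                   # set on leaves only
    children: List["FactorNode"] = field(default_factory=list)

    def unit(self) -> List[str]:
        """Semantic unit U(N): leaf tokens under N, depth-first, left to right."""
        if not self.children:
            return [self.token]
        tokens: List[str] = []
        for child in self.children:
            tokens.extend(child.unit())
        return tokens


def units_at_depth(root: FactorNode, depth: int) -> List[str]:
    """Return the unit strings of all nodes at a given depth (root is depth 0)."""
    level = [root]
    for _ in range(depth):
        level = [child for node in level for child in node.children]
    return [" ".join(node.unit()) for node in level]


# Illustrative predicate-first reordering of "The boy said the dog chased the cat".
root = FactorNode(children=[
    FactorNode("said"),
    FactorNode("boy"),
    FactorNode(children=[FactorNode("chased"), FactorNode("dog"), FactorNode("cat")]),
])

for d in range(3):
    print(d, units_at_depth(root, d))
```

Each depth gives a different granularity of the same sentence: the root holds the whole reordered sentence, and deeper levels hold progressively smaller units.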

2. Algorithmic Frameworks for Construction

Approaches for constructing hierarchical sentence representations diverge according to their linguistic emphasis and degree of supervision:

  • AMR-based predicate-argument trees: Factorization begins with parsing and aligning the input sentence $S$ to its AMR graph, which is then converted into a purified tree $T^p$ via token alignment and node pruning. A deterministic index transformation maps $T^p$ into the desired factorization tree $T^f_S$, followed by completion steps to ensure uniform depth and branching. The semantic unit $U(N)$ at each non-leaf node is constructed via depth-first traversal and concatenation of its leaf tokens in predicate-argument order (Liu et al., 2018).
  • Rule-based semantic splitting: The approach utilizes a pipeline of 35 hand-crafted transformation rules, matched in sequence against the phrase-structure parse of an input sentence. On each recursive pass, applicable rules extract sub-spans based on Tregex patterns, generate minimal sentences via rephrasing templates, and assign rhetorical relations and core/context status. The recursion halts when no rules fire, yielding a shallow tree of atomic, interrelated sentences embedded within a minimal core/context discourse structure (Niklaus et al., 2019).
  • Neural multi-head attention models: MHAL and similar architectures instantiate factorization algorithmically by projecting token-level BiLSTM encodings into multiple head-specific keys, queries, and values. Raw evidence scores for each head at each position are computed (e.g., $a_i^h = q^h \cdot k_i^h$), yielding both token-level (via softmax over heads) and sentence-level (via head-specific attention pooling) outputs. Joint losses align and reinforce both levels, enforcing soft hierarchical decomposition during end-to-end neural training (Pislar et al., 2020).
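
The evidence-score computation in the last item can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the MHAL implementation: the function names are hypothetical, the keys, queries, and values are passed in directly instead of being projected from BiLSTM states, and the sentence-level pooling uses a plain softmax over positions, which is an assumption about the normalization.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def head_scores(queries: List[Vector], keys: List[List[Vector]]) -> List[List[float]]:
    """scores[h][i] = q^h . k_i^h for head h and token position i."""
    return [[dot(q, k) for k in head_keys] for q, head_keys in zip(queries, keys)]


def token_distributions(scores: List[List[float]]) -> List[List[float]]:
    """Token-level output: for each position, a softmax over heads."""
    n_tokens = len(scores[0])
    return [softmax([head[i] for head in scores]) for i in range(n_tokens)]


def sentence_vectors(scores: List[List[float]],
                     values: List[List[Vector]]) -> List[Vector]:
    """Sentence-level output: one attention-pooled value vector per head."""
    pooled = []
    for head, head_values in zip(scores, values):
        weights = softmax(head)                  # assumed normalization
        dim = len(head_values[0])
        pooled.append([sum(w * v[d] for w, v in zip(weights, head_values))
                       for d in range(dim)])
    return pooled


# Two heads, three tokens, two-dimensional toy vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
        [[0.0, 1.0], [0.0, 3.0], [1.0, 1.0]]]
scores = head_scores(queries, keys)
token_probs = token_distributions(scores)
sent_vecs = sentence_vectors(scores, keys)       # keys reused as values here
```

The same score matrix feeds both outputs, which is what lets token-level and sentence-level supervision inform each other.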

3. Order- and Structure-Aware Metrics for Semantic Matching

Hierarchical representations enable sophisticated, structure-preserving similarity and matching operations:

  • Ordered Word Mover’s Distance (OWMD): OWMD is a penalized optimal transport metric operating between root-level predicate-argument representations of sentence pairs. It minimizes a transport cost $\langle T, D \rangle$, where $T$ is the transport plan and $D$ the pairwise word-distance matrix, subject to constraints on mass movement. Deviation of the plan from the diagonal (i.e., from the shared word order) is penalized through an Inverse Difference Moment term $I(T)$ and a KL-divergence to a positional prior $P$. Efficient Sinkhorn iterations solve for $T$, yielding a similarity measure that is sensitive both to lexical semantics and to the argument-predicate order (Liu et al., 2018). OWMD achieves higher Pearson and Spearman correlation coefficients than baseline methods (BoW, word2vec, vanilla WMD) on sentence similarity benchmarks.
  • Semantic hierarchy alignment: In Discourse Tree–based models, evaluation of the alignment between simplified core/context nodes and rhetorical relations provides a basis for downstream information extraction and semantic role labeling. Metrics such as SARI, SAMSA, and human structural simplicity ratings are sensitive to accurate segmentation and preservation of informative interrelations (Niklaus et al., 2019).
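
To show how an order-aware transport distance behaves, here is a rough sketch in the spirit of OWMD. Several details are assumptions that the text above does not specify: the objective is taken to be $\langle T, D \rangle - \lambda_1 I(T) + \lambda_2 \mathrm{KL}(T \,\|\, P)$ with $I(T) = \sum_{i,j} T_{ij} / ((i/M - j/N)^2 + 1)$, the prior $P$ is a Gaussian in the distance to the diagonal, word weights are uniform, and the parameter values are arbitrary. The function name `owmd_sketch` is hypothetical. Under these assumptions the problem reduces to Sinkhorn scaling of a single kernel matrix.

```python
import math
from typing import List, Tuple

Vector = List[float]


def euclid(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def owmd_sketch(x: List[Vector], y: List[Vector], lam1: float = 1.0,
                lam2: float = 0.5, sigma: float = 1.0,
                iters: int = 200) -> Tuple[float, List[List[float]]]:
    """Order-penalized transport cost between two word-vector sequences."""
    m, n = len(x), len(y)
    a = [1.0 / m] * m                            # uniform word weights (assumed)
    b = [1.0 / n] * n
    D = [[euclid(xi, yj) for yj in y] for xi in x]
    K = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            gap = (i + 1) / m - (j + 1) / n      # relative position difference
            idm = 1.0 / (gap * gap + 1.0)        # per-cell weight of I(T)
            dist = abs(gap) / math.sqrt(1.0 / m ** 2 + 1.0 / n ** 2)
            # Gaussian prior; its constant factor is absorbed by the scaling.
            prior = math.exp(-dist ** 2 / (2.0 * sigma ** 2))
            K[i][j] = prior * math.exp((lam1 * idm - D[i][j]) / lam2)
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):                       # Sinkhorn scaling
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    T = [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]
    cost = sum(T[i][j] * D[i][j] for i in range(m) for j in range(n))
    return cost, T


# Three toy word vectors; the second sentence reverses the word order.
words = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cost_same, plan_same = owmd_sketch(words, words)
cost_rev, plan_rev = owmd_sketch(words, list(reversed(words)))
print(cost_same < cost_rev)
```

An unordered transport distance would treat the reversed sentence as a perfect match because the same words are present. The order terms pull mass toward the diagonal, so the reversed pair ends up with a larger cost.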

4. Integration with Deep Learning Architectures

Hierarchical factorizations serve as inputs or inductive biases for both unsupervised and supervised models:

  • Multi-scale Siamese CNN/LSTM: At each scale (depth $d$), a fixed number of semantic units $U_{d,i}$ is embedded per sentence, yielding matrices $X_{S,d}$. Identical encoder networks process each granularity in parallel for a sentence pair, matched vectors $[\,|H_{1,d} - H_{2,d}|;\ H_{1,d} \odot H_{2,d};\ H_{1,d} + H_{2,d}\,]$ are formed, and outputs are stacked before final classification or regression. Only the factorized, reordered representations are used as model input; standard Siamese models are augmented solely at the representation level (Liu et al., 2018).
  • Multi-head joint classification (MHAL): Token and sentence supervision are blended via shared attention heads, with per-head token tag distributions and head-specific global sentence vectors. Auxiliary losses—including attention consistency and query diversity—further enforce orthogonality and consistency among compositional components. This modeling allows efficient transfer between label granularities, robust semi-supervised learning, and—in the absence of token-level supervision—zero-shot sequence labeling performance exceeding random baselines (Pislar et al., 2020).
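
The matching vector used in the multi-scale Siamese setup is simple to write down. The sketch below assumes the encoder outputs are already available as plain lists and that "stacking" means concatenating the per-depth matching vectors. Both function names are hypothetical.

```python
from typing import List

Vector = List[float]


def match_features(h1: Vector, h2: Vector) -> Vector:
    """Concatenate |h1 - h2|, h1 * h2 (element-wise), and h1 + h2."""
    abs_diff = [abs(a - b) for a, b in zip(h1, h2)]
    product = [a * b for a, b in zip(h1, h2)]
    total = [a + b for a, b in zip(h1, h2)]
    return abs_diff + product + total


def multi_scale_features(h1_by_depth: List[Vector],
                         h2_by_depth: List[Vector]) -> Vector:
    """Stack the matching vectors of every depth into one feature vector."""
    features: Vector = []
    for h1, h2 in zip(h1_by_depth, h2_by_depth):
        features.extend(match_features(h1, h2))
    return features


# Toy encoder outputs for two sentences at two depths.
sent1 = [[1.0, 2.0], [0.5, 0.5]]
sent2 = [[3.0, 1.0], [0.5, 1.5]]
stacked = multi_scale_features(sent1, sent2)
```

All three components are symmetric in the two inputs, so the resulting features do not depend on which sentence is listed first.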

5. Evaluation Methodologies and Empirical Outcomes

Empirical validation employs a range of sentence matching, simplification, and tagging benchmarks:

  • Semantic textual similarity and paraphrase identification: Benchmarks include STSbenchmark, SICK, MSRvid, and MSRP. On unsupervised similarity (Pearson/Spearman), OWMD outperforms Word2Vec/BoW/WMD (on STSbenchmark: OWMD $r = 0.61$, BoW $r = 0.57$, WMD $r = 0.42$). Multi-scale Siamese models incorporating hierarchical input representations achieve gains of 6–10 points in accuracy/F1 or Pearson/Spearman correlation relative to their baseline counterparts (MaLSTM, HCTI) (Liu et al., 2018).
  • Structural simplification: The DISSIM system achieves state-of-the-art performance on Wikilarge, Newsela, and WikiSplit, with best-in-class SARI and SAMSA scores. Human judgments also confirm increased structural simplicity without loss of grammaticality or meaning (Niklaus et al., 2019). When applied as a preprocessing step to Open Information Extraction, DISSIM leads to substantial precision and recall improvements—e.g., Stanford OIE achieves +346% precision and +52% recall on processed input.
  • Joint sequence and sentence tagging: The MHAL model, equipped with hierarchical factorization via multiple attention heads, systematically outperforms BiLSTM+attention and BiLSTM-CRF baselines for sentence and token-level classification. Semi-supervised and zero-shot experiments further demonstrate strong two-way transfer between compositional levels, with the model capable of non-trivial word-level predictions in the absence of direct token supervision (Pislar et al., 2020).
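
For reference, the Pearson and Spearman coefficients reported for similarity benchmarks can be computed as in the sketch below. It is a plain-Python illustration with hypothetical function names and made-up scores, not the evaluation script of any cited work. Tied values receive their average rank.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up gold similarity ratings and model scores for four sentence pairs.
gold = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
print(pearson(gold, predicted), spearman(gold, predicted))
```

Pearson measures linear agreement with the gold ratings, while Spearman only checks that the ranking is preserved. This is why both are commonly reported together.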

6. Interpretability, Scalability, and Applications

Hierarchical sentence factorization frameworks provide explicit, decomposable representations:

  • Outputs encode atomic propositions and logical or rhetorical interrelationships, supporting clear auditability and post-hoc analysis.
  • Rule-based discourse-tree models are efficient (approximately tens of milliseconds per sentence in practice), fully transparent, and maintain domain portability, requiring only parser adaptation for new domains (Niklaus et al., 2019).
  • Hierarchical neural models admit introspection via attention maps and analysis of head specialization, yielding insights into the dynamics of compositional representation learning (Pislar et al., 2020).
  • Applications span text matching, paraphrase detection, machine translation pre-processing, semantic role labeling, information extraction, summarization, and complex-to-simple sentence generation.

7. Research Trajectories and Open Directions

Open research directions include generalizing factorization to deeper or more flexible hierarchies, inducing unsupervised factor structures within neural attention, extending cross-modal factorization (e.g., for image-caption pairs), and investigating task-dependent hierarchical granularities. Comparative studies of linguistically motivated versus purely data-driven factorization remain a central question, as do developments in richly interpretable, scalable, and domain-adaptive hierarchical encodings (Liu et al., 2018, Niklaus et al., 2019, Pislar et al., 2020).
