MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

Published 10 Apr 2026 in cs.CL and cs.AI | (2604.08947v1)

Abstract: As LLMs become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces; neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce MuTSE (the project code and the demo have been made available for peer review at the following anonymized URL: https://osf.io/njs43/overview?view_only=4b4655789f484110a942ebb7788cdf2a), an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets. The system supports concurrent execution of $P \times M$ prompt-model permutations, generating a comprehensive comparison matrix in real time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic ($\lambda$), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.


Authors (4)

Summary

The paper presents a novel human-in-the-loop platform that enables real-time, parallel evaluation of text simplification models using an asynchronous architecture and hierarchical semantic alignment.
The methodology employs a tiered alignment engine combining transformer embeddings, TF-IDF, and positional bias to robustly map source and simplified sentences, enhancing evaluation precision.
The system offers comprehensive educational diagnostics including readability metrics and customizable annotations, facilitating reproducible research through exportable session data.

MuTSE: A Human-in-the-Loop Multi-Use Evaluator for Systematic Text Simplification Assessment

Introduction and Motivation

Text simplification, as utilized in educational and language-learning platforms, demands robust, multi-faceted evaluation frameworks to ensure outputs generated by LLMs align with varying learner proficiencies and pedagogical needs. Despite the significant progress in LLM-based simplification, the comparative evaluation of multiple models and prompt strategies remains fragmented and operationally inefficient. Existing methodologies, such as EASSE and TS-ANNO, focus on static metric benchmarking or post-hoc manual annotation, lacking true real-time, visual, and human-in-the-loop functionality for concurrent multi-model, multi-prompt analysis.

MuTSE addresses these deficiencies through an interactive platform facilitating real-time, parallelized comparative evaluations of $P \times M$ prompt-model pairs, with dense integration of semantic alignment visualizations and educational diagnostics. This approach bridges the gap between purely computational benchmarking and nuanced human-centered qualitative evaluation frameworks.

System Architecture: Asynchronous Concurrency for Throughput

MuTSE uses a decoupled asynchronous architecture built on FastAPI (backend) and Vue.js 3 (frontend) that orchestrates parallel inference tasks across locally and remotely hosted LLMs. Each text simplification request is split into independent, distributed processes, one per prompt-model permutation, with strict state isolation to prevent interference or race conditions during concurrent execution. Because the processes run concurrently, the time to complete a batch is bounded by its slowest process, which keeps throughput high for large evaluation matrices. Once all tasks complete, outputs, alignments, and metrics are aggregated for client-side visualization.
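
The fan-out pattern described above can be illustrated with a minimal Python sketch. It is not MuTSE's actual code: the `simplify` coroutine, the model and prompt names, and the sleep-based delays are all hypothetical stand-ins for real LLM calls, and the sketch uses coroutines in a single event loop rather than distributed processes.

```python
import asyncio
import itertools
import time


async def simplify(prompt: str, model: str, text: str) -> dict:
    """Hypothetical stand-in for one LLM call; it sleeps instead of calling a model."""
    delay = 0.05 if model == "model-a" else 0.1
    await asyncio.sleep(delay)
    # Each task builds its own result dict, so no mutable state is shared.
    return {"prompt": prompt, "model": model, "output": f"[{model}/{prompt}] {text}"}


async def run_matrix(prompts: list[str], models: list[str], text: str) -> dict:
    """Run every prompt-model permutation concurrently and collect the P x M matrix."""
    pairs = list(itertools.product(prompts, models))
    results = await asyncio.gather(*(simplify(p, m, text) for p, m in pairs))
    return {(r["prompt"], r["model"]): r["output"] for r in results}


start = time.perf_counter()
matrix = asyncio.run(run_matrix(["p1", "p2"], ["model-a", "model-b"], "Some text."))
elapsed = time.perf_counter() - start
# Four cells are produced in roughly the time of the slowest single call,
# not the sum of all four delays.
print(len(matrix), "cells in", round(elapsed, 2), "s")
```

With `asyncio.gather`, the batch finishes when the slowest coroutine finishes, which is the bound the paragraph above refers to.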

Tiered Semantic Alignment Engine

A central innovation is MuTSE's hierarchical semantic alignment engine for visually mapping source to simplified sentences. This engine employs a multi-tiered fallback mechanism:

Semantic Tier: Utilizes paraphrase-multilingual-MiniLM-L12-v2 for dense sentence-embedding and cosine similarity matrices.
Lexical Tier: Applies TF-IDF (including word and character n-grams) as a fallback when transformer embeddings are unreliable.
Positional Tier: Falls back to purely positional correspondence when both previous tiers yield indeterminate results.
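
A fallback chain of this kind might be structured as in the sketch below. Several parts are assumptions made for illustration: the source does not say when a tier counts as unreliable or indeterminate (here a tier signals this by returning `None`), the `semantic_tier` is a placeholder that does not load the embedding model, and plain token overlap stands in for TF-IDF in the lexical tier.

```python
def semantic_tier(src, tgt):
    """Placeholder: a real version would embed sentences and return cosine similarities.

    Returning None simulates the embedding model being unavailable.
    """
    return None


def lexical_tier(src, tgt):
    """Token-overlap (Jaccard) similarity, used here as a simple stand-in for TF-IDF."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    src_tok = [set(s.lower().split()) for s in src]
    tgt_tok = [set(t.lower().split()) for t in tgt]
    matrix = [[jaccard(a, b) for b in tgt_tok] for a in src_tok]
    # Assumed criterion: the tier is indeterminate if no pair shares any token.
    has_signal = any(v > 0 for row in matrix for v in row)
    return matrix if has_signal else None


def positional_tier(src, tgt):
    """Last resort: score pairs by how close their relative positions are."""
    def rel(k, size):
        return k / (size - 1) if size > 1 else 0.0

    return [[1.0 - abs(rel(i, len(src)) - rel(j, len(tgt)))
             for j in range(len(tgt))] for i in range(len(src))]


TIERS = [("semantic", semantic_tier), ("lexical", lexical_tier),
         ("positional", positional_tier)]


def align(src, tgt, tiers=TIERS):
    """Try each tier in order; return (tier_name, matrix) from the first that succeeds."""
    for name, tier in tiers:
        matrix = tier(src, tgt)
        if matrix is not None:
            return name, matrix
    raise ValueError("no tier produced a similarity matrix")


name, matrix = align(["The cat sat."], ["The cat sat down."])
print(name, matrix)
```

Returning the tier name alongside the matrix lets a caller show which tier produced a given alignment, which may matter when judging how much to trust it.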

A key challenge in alignment arises from the high risk of false positives using embedding-based similarity, especially when simplifications are not strictly monotonic. MuTSE introduces a tunable linearity bias ($\lambda$), penalizing alignments that violate relative sentence ordering, and thus enforcing near-monotonic correspondences critical for high-precision educational evaluation.

Figure 2: The effect of linearly increasing the positional penalty ($\lambda$), revealing how strict linearity resolves semantically spurious cross-alignments and enforces monotonic structure.

Integration of the linearity factor allows robust alignment even with low-dimensional embeddings, significantly reducing computational requirements and enabling efficient client-side recomputation. Alignment matrices with adjustable $\lambda$ provide analysts immediate, visual control over the trade-off between semantic closeness and structural monotonicity.
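
To make the trade-off concrete, the sketch below applies one plausible form of the penalty to a small similarity matrix. The exact formula MuTSE uses is not given here, so the form is an assumption: each score is reduced by $\lambda$ times the distance between the relative positions of the source and target sentences. The function names and the matrix values are illustrative only.

```python
def apply_linearity_bias(sim, lam):
    """Subtract lam * |relative position gap| from every similarity score.

    Assumed penalty form; the source does not state the exact formula.
    """
    n, m = len(sim), len(sim[0])

    def rel(k, size):
        return k / (size - 1) if size > 1 else 0.0

    return [[sim[i][j] - lam * abs(rel(i, n) - rel(j, m)) for j in range(m)]
            for i in range(n)]


def best_matches(sim):
    """For each source sentence, return the index of its highest-scoring target."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Illustrative scores: source 0 is slightly closer to target 2 than to target 0,
# which would produce a cross-alignment against sentence order.
similarity = [
    [0.80, 0.10, 0.85],
    [0.10, 0.90, 0.20],
    [0.30, 0.20, 0.70],
]

print(best_matches(apply_linearity_bias(similarity, 0.0)))  # [2, 1, 2]
print(best_matches(apply_linearity_bias(similarity, 0.2)))  # [0, 1, 2]
```

Because the penalty is a cheap element-wise operation on an already computed matrix, it can be reapplied whenever $\lambda$ changes without re-embedding any sentences.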

Educational Diagnostics, Readability, and Annotation

Beyond alignment, MuTSE incorporates a comprehensive real-time statistics module for linguistic and educational diagnostics. Key features include:

Readability Metrics: Automatic calculation of Flesch Reading Ease and Flesch-Kincaid Grade Level.
Lexical Statistics: Real-time word frequency, compression ratio, and sentence-length analysis, aiding immediate assessment of output accessibility.

Figure 1: The statistics module computes educational metrics (Flesch-Kincaid, Reading Ease) and diagnostics across multiple prompt-model outputs, informing prompt efficacy and model selection.
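
For reference, both readability metrics are fixed formulas over words per sentence and syllables per word. The sketch below computes them with the standard coefficients; the syllable counter is a rough vowel-group heuristic chosen for illustration, not necessarily what MuTSE uses, so scores may differ slightly from other tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "flesch_reading_ease":
            206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade":
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


print(readability("The cat sat on the mat."))
```

Higher Reading Ease means easier text, while a higher Grade Level means harder text, so a successful simplification should raise the first and lower the second.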

A configurable annotation suite lets practitioners define arbitrary scoring criteria and scales (binary, categorical, Likert, or continuous), each with its own weight in alignment scoring and performance aggregation. This addresses the lack of standardization in human evaluation protocols identified in recent NLG assessment literature, advancing both reproducibility and evaluator autonomy.
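
One way to combine criteria on different scales is to normalize each rating to the range 0 to 1 and take a weighted mean. The sketch below shows that approach as an assumption; the source does not specify MuTSE's aggregation rule, and the `Criterion` class, criterion names, and weights are hypothetical. Categorical labels would first need a mapping to numbers, which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One user-defined scoring criterion with a numeric scale and a weight."""
    name: str
    low: float
    high: float
    weight: float = 1.0

    def normalize(self, value: float) -> float:
        """Map a rating on [low, high] to [0, 1]."""
        if not self.low <= value <= self.high:
            raise ValueError(f"{self.name}: {value} is outside [{self.low}, {self.high}]")
        return (value - self.low) / (self.high - self.low)


def weighted_score(criteria, ratings) -> float:
    """Weighted mean of normalized ratings (assumed aggregation rule)."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * c.normalize(ratings[c.name]) for c in criteria) / total_weight


criteria = [
    Criterion("meaning_preserved", 0, 1, weight=2.0),  # binary
    Criterion("fluency", 1, 5),                        # Likert
    Criterion("simplicity", 0, 100),                   # continuous
]
ratings = {"meaning_preserved": 1, "fluency": 4, "simplicity": 50}
print(weighted_score(criteria, ratings))  # 0.8125
```

Normalizing before weighting keeps a 0 to 100 scale from dominating a binary one, so the weights alone express how much each criterion matters.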

Interactive Visualization and Comparative Workflow

The MuTSE frontend operationalizes multidimensional comparative analysis with a side-by-side visualization interface. Key workflow features:

Selective Column Filtering: Analysts dynamically toggle prompts and models, reducing cognitive overhead in high-dimensional comparisons.
Alignment Highlighting: Interactive traversal, where selecting a sentence in any column instantaneously highlights aligned sentences across all active columns, leveraging precomputed semantic correspondence.
Immediate Metric Feedback: Inline display of alignment scores, readability, and compression metrics per output column.

All evaluation sessions, including the full $P \times M$ matrix, alignment graphs, and human annotations, are exportable in structured JSON and CSV formats, supporting downstream corpus building, exploratory data analysis (EDA), and model fine-tuning. This export capability ensures that both qualitative and quantitative data generated during human-in-the-loop evaluations are readily accessible for reproducible NLP research.
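
As an illustration of what such an export could look like, the sketch below serializes a small session to both formats using only the Python standard library. The session structure and every field name in it are hypothetical; MuTSE's actual export schema is not described here.

```python
import csv
import io
import json

# Hypothetical session layout: one entry per prompt-model cell.
session = {
    "source": "The feline reposed upon the rug.",
    "cells": [
        {"prompt": "p1", "model": "model-a", "output": "The cat lay on the rug.",
         "alignment": [[0, 0]], "annotations": {"fluency": 4}},
        {"prompt": "p1", "model": "model-b", "output": "The cat sat on the mat.",
         "alignment": [[0, 0]], "annotations": {"fluency": 5}},
    ],
}


def to_json(data: dict) -> str:
    """Serialize the full nested session, including alignments and annotations."""
    return json.dumps(data, ensure_ascii=False, indent=2)


def to_csv(data: dict) -> str:
    """Flatten the session to one row per cell; nested annotations become JSON strings."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["prompt", "model", "output", "annotations"])
    writer.writeheader()
    for cell in data["cells"]:
        writer.writerow({
            "prompt": cell["prompt"],
            "model": cell["model"],
            "output": cell["output"],
            "annotations": json.dumps(cell["annotations"], sort_keys=True),
        })
    return buffer.getvalue()


print(to_csv(session))
```

JSON preserves the nested structure for reloading a session, while the flattened CSV is easier to open in spreadsheet or data-analysis tools.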

Limitations and Perspectives

Key constraints of the current MuTSE deployment are a local JSON persistence layer that does not scale to multi-user or institutional environments, and a reliance on Python and Node.js for end-user operation. Scaling to lab and classroom scenarios would require a database-backed architecture and streamlined deployment pipelines.

The modularity of the MuTSE alignment engine positions it for extension into cross-lingual and machine translation (MT) evaluation, which the multilingual embeddings and n-gram-based lexical fallback already support. Applying MuTSE’s methodology to machine translation and summarization, with appropriate recalibration of linearity constraints, would further generalize the platform for broader natural language generation assessment.

Conclusion

MuTSE represents a significant methodological advance in text simplification evaluation by uniting asynchronous multi-model inference, hierarchical semantic alignment, granular educational diagnostics, and customizable human annotation in a cohesive, accessible platform. This integration empowers both educators and NLP researchers to conduct reproducible, multidimensional comparative analyses at scale, while offering a foundation for systematic dataset expansion in natural language generation. Adaptability to emerging LLMs, model-agnostic metrics, and exportable session data position MuTSE as a cornerstone infrastructure for future educational and computational research in LLM-based text simplification and beyond.
