
PUCP-Metrix: Spanish Text Analysis Toolkit

Updated 28 November 2025
  • PUCP-Metrix is an open-source repository providing 182 linguistic metrics for Spanish text analysis, covering lexical, syntactic, semantic, cohesion, psycholinguistic, and readability dimensions.
  • Its Python-based, modular architecture seamlessly integrates with spaCy and supports batch processing for scalable research and production applications.
  • The toolkit addresses limitations of previous systems by enhancing interpretability and extending metric coverage, benefiting educational assessment, authorship analysis, and readability detection.

PUCP-Metrix is an open-source repository providing 182 linguistic metrics designed for comprehensive, fine-grained, and interpretable text analysis in Spanish. Developed to address the limitations of previous toolkits such as Coh-Metrix-Esp and MultiAzterTest, PUCP-Metrix spans multiple linguistic dimensions, including lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistic norms, and readability. Its extensible Python-based architecture, robust integration with spaCy, and modular analysis interface position it as a versatile resource for natural language processing research, educational assessment, authorship analysis, and other language technology applications (Luis et al., 21 Nov 2025).

1. Motivation, Objectives, and Scope

PUCP-Metrix was conceptualized in response to the growing dominance of end-to-end neural models within NLP, where interpretability and transparency in feature attribution remain major concerns. Handcrafted linguistic features are critical for applications demanding explainability, including educational evaluation, style profiling, AI-generated-text detection, and readability assessment. Existing Spanish NLP feature repositories are restricted in coverage and extensibility, omitting many modern psycholinguistic, cohesion, or complexity indices relevant for academic and industrial research. PUCP-Metrix aims to bridge these gaps by providing:

  • Broad metric coverage across lexical, syntactic, semantic, and psycholinguistic domains
  • Extensible and modular design, allowing the integration of user-defined metrics and resources
  • Compatibility with research and production pipelines through high-performance batch and parallel processing

Its core design principles are coverage, interpretability, and utility in transparency-sensitive NLP tasks (Luis et al., 21 Nov 2025).

2. Metric Inventory and Linguistic Dimensions

PUCP-Metrix organizes its 182 metrics into 13 categories, each targeting key linguistic properties. Core categories and representative metrics include:

| Category | Example Metric (Formula) | Feature Target |
| --- | --- | --- |
| Lexical Diversity | $\mathrm{TTR} = \frac{V}{N}$; $I = 10^4 \frac{\sum_i f_i^2 - N}{N^2}$ | Vocabulary richness |
| Syntactic Complexity | $\mathrm{SYNCLS2} = \frac{\#\,\text{sentences with 2 clauses}}{\text{total sentences}}$ | Clause density |
| Referential Cohesion | $\mathrm{CRFNO1} = \frac{|\mathrm{Nouns}(s_i)\cap \mathrm{Nouns}(s_{i+1})|}{|\mathrm{Nouns}(s_i)|}$ | Noun overlap |
| Semantic Cohesion | $\mathrm{SECLOSadj} = \frac{1}{S-1}\sum_{i=1}^{S-1}\cos(v(s_i),v(s_{i+1}))$ | Semantic flow |
| Psycholinguistics | $\mathrm{Concr} = \frac{1}{V}\sum_{i=1}^{V} c_i$ | Concreteness |
| Readability | $\mathrm{FS} = 206.835 - 62.3\,\frac{\text{syllables}}{\text{words}} - \frac{\text{words}}{\text{sentences}}$ | Readability |

Other metric sets include part-of-speech specific indices (noun/verb/adverb/adjective TTRs), content-word density, syntactic pattern density, connectives, word frequency, and indices quantifying the simplicity or information content of words. Psycholinguistic features draw from EsPal and Stadthagen-González et al. (2017). Readability metrics implement both classic and Spanish-adapted formulae, such as Flesch-Szigriszt and Szigriszt-Pazos Perspicuity Index (Luis et al., 21 Nov 2025).
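To make the table's formulas concrete, the following is a minimal sketch that re-implements three of them in plain Python. These are illustrative re-derivations from the formulas above, not PUCP-Metrix source code; they use naive whitespace tokenization, whereas the toolkit relies on spaCy's tokenizer and tagger.

```python
def ttr(tokens):
    """Type-token ratio: distinct types V over total tokens N."""
    return len(set(tokens)) / len(tokens)

def noun_overlap(nouns_a, nouns_b):
    """CRFNO1-style noun overlap between two adjacent sentences."""
    return len(set(nouns_a) & set(nouns_b)) / len(set(nouns_a))

def flesch_szigriszt(syllables, words, sentences):
    """FS = 206.835 - 62.3 * (syllables/words) - (words/sentences)."""
    return 206.835 - 62.3 * (syllables / words) - (words / sentences)

tokens = "el gato ve al gato".split()
print(ttr(tokens))                                # 4 types / 5 tokens = 0.8
print(noun_overlap(["gato", "casa"], ["gato"]))   # 1 shared / 2 nouns = 0.5
print(flesch_szigriszt(30, 20, 2))
```

Higher FS values indicate easier text; the 206.835 ceiling and the 62.3 syllable weight are the Spanish-adapted Flesch-Szigriszt constants.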

3. System Architecture and Extensibility

The repository is implemented as a Python package (iapucp-metrix), centered around an Analyzer class. Its architecture features:

  • Integration with spaCy for tokenization, POS tagging, and dependency parsing
  • Multiprocessing for scalable batch metric computation
  • Methods compute_metrics(texts, workers, batch_size) and compute_grouped_metrics(texts, groups) for comprehensive or category-specific feature extraction
  • Modular extensibility: new metrics are registered by subclassing Analyzer and defining decorated methods
  • Support for user-supplied lexicons or embeddings, accommodating adaptation to new domains or research needs

Dependencies include spaCy, NumPy, scikit-learn, and optionally transformers for certain neural baselines. All operations are batch-optimized for research and production-scale data (Luis et al., 21 Nov 2025).
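The decorator-based registration pattern described above can be sketched as follows. This is a hypothetical illustration of the general design, assuming invented names (`MiniAnalyzer`, `metric`, `_REGISTRY`); the actual class, decorator, and registry in iapucp-metrix may differ.

```python
# Hypothetical sketch of decorator-based metric registration.
_REGISTRY = {}

def metric(name):
    """Register a function as a named metric."""
    def wrap(fn):
        _REGISTRY[name] = fn
        return fn
    return wrap

class MiniAnalyzer:
    """Toy analyzer: applies every registered metric to each text."""
    def compute_metrics(self, texts):
        return [{name: fn(t) for name, fn in _REGISTRY.items()} for t in texts]

@metric("n_tokens")
def n_tokens(text):
    return len(text.split())

@metric("avg_word_len")
def avg_word_len(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words)

rows = MiniAnalyzer().compute_metrics(["un texto corto"])
print(rows[0])  # {'n_tokens': 3, 'avg_word_len': 4.0}
```

A registry of this kind is what lets user-defined metrics plug into the same batch-processing loop as the built-in ones without modifying the analyzer itself.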

4. Empirical Evaluation and Benchmarking

PUCP-Metrix undergoes quantitative evaluation on two major tasks: Automated Readability Assessment (ARA) and Machine-Generated Text Detection.

Automated Readability Assessment (ARA):

  • Datasets: CAES, Coh-Metrix-Esp, Kwiziq, HablaCultura (~32K texts, multi-label readability annotation)
  • Tasks: binary (“simple” vs. “complex”) and ternary (“basic”, “intermediate”, “advanced”) readability classification
  • Models: Logistic Regression, XGBoost, SVM, Random Forest using PUCP-Metrix metrics, compared against MultiAzter and a fine-tuned RoBERTa-BNE baseline
  • Results: XGBoost with PUCP-Metrix achieves F1 = 97.46 (2-label) and 96.72 (3-label), outperforming MultiAzter features and nearly matching RoBERTa-BNE (98.30, 98.13) (Luis et al., 21 Nov 2025)
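The experimental setup (a linear or tree-based classifier over a matrix of linguistic metrics) can be sketched as below. The data here is synthetic stand-in noise, not the paper's corpora, and the model is a plain logistic regression rather than the tuned XGBoost reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pretend each row is a document's metric vector (182 features in the
# real toolkit; 10 synthetic ones here).
X = rng.normal(size=(200, 10))
# Synthetic binary labels standing in for "simple" vs. "complex".
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
print(f1)
```

Because the metrics are interpretable, the fitted coefficients (`clf.coef_`) can be read directly as per-metric contributions to the readability decision, which is the transparency advantage the paper emphasizes over end-to-end neural baselines.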

Machine-Generated Text Detection:

  • Dataset: AuTexTification 2023 (52K documents, diverse domains)
  • Models and pipeline as above
  • Results: XGBoost with PUCP-Metrix F1 = 71.06, compared to 61.63 (MultiAzter), 66.61 (RoBERTa-BNE), and 70.77 (shared-task top model). Metrics excel in capturing cues linked to frequency, readability, and cohesion (Luis et al., 21 Nov 2025).

This suggests that handcrafted metrics remain empirically competitive and valuable for tasks requiring feature transparency or interpretability.

5. Representative Use Cases

Applications of PUCP-Metrix include:

  • Educational Assessment: Automated essay grading, CEFR-level classification, fine-grained readability prediction, and learner feedback.
  • Style and Authorship Analysis: Computational forensics, detection of AI-generated or plagiarized content, and domain register profiling.
  • Readability and Simplification: UX evaluation, publishing workflows, and development of simplification tools.
  • Linguistic Research: Analyses of cross-genre complexity, psycholinguistic property distributions, and cohesion/discourse structure studies (Luis et al., 21 Nov 2025).

A plausible implication is that the resource not only supports legacy feature-driven pipelines but can also be integrated into hybrid neural-symbolic frameworks.

6. Deployment, Limitations, and Future Directions

Deployment requires installing the Python package and downloading a spaCy language model:

pip install iapucp-metrix
python -m spacy download es_core_news_sm

Sample usage for metric extraction and grouping:

from iapucp_metrix.analyzer import Analyzer

analyzer = Analyzer()
texts = [
  "Este es un texto de ejemplo.",
  "La lingüística computacional estudia cómo los humanos procesan el lenguaje."
]
metrics_list = analyzer.compute_metrics(texts, workers=4, batch_size=2)

for text, m in zip(texts, metrics_list):
    print(f"Texto: {text}")
    print(f" – TTR: {m['LDTTRa']:.3f}")
    print(f" – Fernández-Huertas FS: {m['RDFHGL']:.2f}")
    print(f" – Concreteness (avg): {m['PSYC']:.2f}")
    print()

Identified limitations include:

  • Current tuning to European and Latin-American Spanish; adaptation to other varieties may require updated norms
  • Dependence on spaCy parsing quality; upstream errors affect metric accuracy

Future work is anticipated to address pragmatic and discourse-level metrics, integration with multilingual resources, and direct hybridization with pre-trained LLMs through metric-based embedding or finetuning (Luis et al., 21 Nov 2025).
