Pelican NLP: Reproducible Linguistic Processing
- Pelican NLP is a modular Python package that enforces reproducible, end-to-end linguistic data processing using LPDS standards.
- It features an automated CLI for configurable preprocessing, feature extraction, and metadata aggregation of both linguistic and acoustic data.
- Its unified workflow supports rigorous output provenance and promotes FAIR (Findable, Accessible, Interoperable, Reusable) practices in language research.
Pelican NLP (distributed as “pelican_nlp”) is a modular Python package designed to enable reproducible, end-to-end linguistic data processing workflows anchored in formal standardisation principles. Developed as a companion tool to the Language Processing Data Structure (LPDS), it enforces best practices in methodological transparency, data organisation, and feature extraction for quantitative linguistic analysis. pelican_nlp integrates automatic data discovery, configurable preprocessing, and standardised feature extraction (covering both advanced linguistic and acoustic metrics) with rigorous output provenance, providing a paradigm for FAIR (Findable, Accessible, Interoperable, Reusable) language data analysis (Pauli et al., 19 Nov 2025).
1. Motivation and Conceptual Foundations
Modern NLP research is challenged by methodological fragmentation: heterogeneous folder hierarchies, inconsistent file formats, and undocumented preprocessing steps produce irreproducible outcomes and undermine the comparability of analyses. Empirical studies show that, even when working with identical datasets, different analysts often reach divergent results attributable to unstandardised pipelines. pelican_nlp addresses these issues by providing not merely a toolchain, but a workflow centred on two intertwined pillars:
- LPDS Standard: A data organisation protocol inspired by the Brain Imaging Data Structure (BIDS), specifying folder layouts, filename entity conventions, and metadata schemas for linguistic research assets (audio, transcripts, task metadata).
- Reproducible Processing Pipeline: A Python package providing full-pipeline processing (ingestion, cleaning, feature extraction, aggregation) driven entirely by a declarative configuration file, ensuring all methodological choices and analytic steps are version-controlled and recoverable by downstream users (Pauli et al., 19 Nov 2025).
These design choices facilitate robust data management, multi-site collaboration, and full transparency—critical in clinical, behavioral, and large-scale linguistic projects.
2. System Architecture and Workflow
pelican_nlp’s logical pipeline is structured around modular, discoverable components and includes:
- Command-Line Interface (CLI): The main entry point, pelican-run, auto-discovers the project’s config.yml and LPDS input directory, orchestrating the workflow without custom scripting.
- Core Components:
- Loader: Scans the LPDS-compliant participant folders (e.g., participants/part-*/*_<suffix>.*), extracting entity tuples (part, ses, task, etc.) into a unified metadata DataFrame (sketched after this list).
- Preprocessing Pipeline: Configurable chain of modules inheriting from a Preprocessor base class. Supported routines include timestamp stripping, character and whitespace cleaning, lowercasing, speaker diarisation, and task-specific segmentation.
- Feature Extractors: Standardised interfaces to:
- Linguistic Features: Semantic embeddings (BERT, RoBERTa, Llama3 via Hugging Face; fastText via Gensim), cosine similarity metrics, and model logits.
- Acoustic Features: eGeMAPS and ComParE feature sets via openSMILE, prosodic metrics via Praat/Prosogram integration.
- Aggregators: Combine frame- or utterance-level features into participant- or group-level tables.
- Utilities: YAML configuration parser, LPDS entity parser, logging, and version stamping.
The typical data flow is: LPDS files ⟶ metadata index ⟶ sequential preprocessing ⟶ feature extraction ⟶ per-file derivatives ⟶ aggregate exports (Pauli et al., 19 Nov 2025).
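To make the Loader's entity-indexing step concrete, here is a minimal, hypothetical sketch; function and column names are illustrative, and the actual pelican_nlp API may differ:

```python
import re
from pathlib import Path

import pandas as pd

# Illustrative sketch of LPDS entity indexing, not pelican_nlp's actual Loader.
ENTITY_PATTERN = re.compile(r"(part|ses|task)-([A-Za-z0-9]+)")

def index_lpds_files(project_root: str) -> pd.DataFrame:
    """Collect one row per LPDS file, with its entity keys and functional suffix."""
    rows = []
    for path in Path(project_root, "participants").rglob("part-*_*.*"):
        if not path.is_file():
            continue
        entities = dict(ENTITY_PATTERN.findall(path.stem))
        # The token after the final underscore is the functional suffix,
        # e.g. "transcript" in part-01_ses-01_task-interview_transcript.txt.
        entities["suffix"] = path.stem.rsplit("_", 1)[-1]
        entities["path"] = str(path)
        rows.append(entities)
    return pd.DataFrame(rows, columns=["part", "ses", "task", "suffix", "path"])
```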
3. Integration with the Language Processing Data Structure (LPDS)
Strict adherence to LPDS is required. Data must reside under a fixed hierarchy:
```
project_root/
  participants/
    part-01/
      (ses-01/)
        interview/
          part-01_ses-01_task-interview_recording.wav
          part-01_ses-01_task-interview_transcript.txt
  dataset_description.json
  participants.tsv
  README
  CHANGES
```
File names are built from entity key-value pairs (e.g., part-01, task-interview, ses-01), appended with functional suffixes (_recording, _transcript, _embeddings, _features). By parsing these keys, pelican_nlp generates a unified metadata schema, ensuring uniform handling of raw and derivative files and enabling the exact lineage and provenance of extracted data to be tracked automatically (Pauli et al., 19 Nov 2025).
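Under this convention, a raw file maps to its derivative by swapping the functional suffix; the exact derivatives layout in the following sketch is an assumption for illustration:

```python
# Hypothetical suffix swap from a raw LPDS file to its derivative;
# the exact derivatives layout may differ from this sketch.
raw_name = "part-01_ses-01_task-interview_transcript.txt"
stem, _old_suffix = raw_name.rsplit(".", 1)[0].rsplit("_", 1)
derivative = f"derivatives/embeddings/{stem}_embeddings.csv"
print(derivative)
# derivatives/embeddings/part-01_ses-01_task-interview_embeddings.csv
```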
4. Configuration and Extensibility
The entire workflow is specified via a single YAML configuration file (config.yml). This file encodes all steps, parameters, and feature extraction settings, enabling fully version-controlled analyses without ad hoc scripting. A representative snippet is:
```yaml
input_dir: participants
output_dir: derivatives
preprocessing:
  clean_text:
    remove_timestamps: true
    strip_special_chars: true
    lowercase: true
    remove_punctuation: true
  speaker_diarisation:
    pattern: 'Speaker [AB]'
  fluency:
    enabled: true
features:
  embeddings:
    model: roberta-base
    aggregation: mean
  similarity:
    model: roberta-base
    window: sentence
    metric: cosine
  logits:
    model: roberta-base
  acoustic:
    openSMILE:
      config_file: 'egemaps.conf'
    prosogram:
      enabled: true
aggregation:
  across: [participants]
  metrics: [embeddings, similarity, logits, acoustic]
logging:
  level: INFO
  to_file: true
```
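Since PyYAML and jsonschema are part of the declared software stack (Section 8), a configuration like the above can be loaded and sanity-checked along the following lines; the schema shown is a minimal illustration, not pelican_nlp's own validator:

```python
import yaml
from jsonschema import validate

# Minimal sketch of config loading and validation; illustrative schema only.
with open("config.yml") as f:
    cfg = yaml.safe_load(f)

validate(cfg, {
    "type": "object",
    "required": ["input_dir", "output_dir", "preprocessing", "features"],
    "properties": {
        "input_dir": {"type": "string"},
        "output_dir": {"type": "string"},
    },
})
print(cfg["features"]["embeddings"]["model"])  # roberta-base
```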
Extensibility is achieved via a preprocessor and extractor base class architecture, allowing users to register new modules and plugins. Each module specifies its own schema for the YAML validator, guaranteeing parameter completeness and type safety. Plugins can be auto-discovered by placement in the pelican_nlp/plugins/ directory and are integrated via entry points in setup.py. This enables seamless incorporation of novel linguistic or acoustic feature extractors and custom preprocessing steps (Pauli et al., 19 Nov 2025).
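A custom module might then look as follows; the base class here is a stand-in, since pelican_nlp's actual Preprocessor interface and schema format are not reproduced in this summary:

```python
# Sketch of a custom preprocessing plugin. The base class below is a stand-in;
# pelican_nlp's actual Preprocessor interface and schema format may differ.
class Preprocessor:
    schema: dict = {}

    def process(self, text: str) -> str:
        raise NotImplementedError

class StripFillerWords(Preprocessor):
    """Drop common filler tokens before feature extraction."""
    schema = {"fillers": {"type": "array", "default": ["uh", "um", "erm"]}}

    def __init__(self, fillers=("uh", "um", "erm")):
        self.fillers = {f.lower() for f in fillers}

    def process(self, text: str) -> str:
        return " ".join(t for t in text.split() if t.lower() not in self.fillers)

print(StripFillerWords().process("uh the um pelican flew"))  # "the pelican flew"
```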
5. Feature Extraction: Linguistic and Acoustic Metrics
pelican_nlp supports comprehensive feature pipelines:
- Linguistic Features:
- Semantic Embeddings: Extraction of token-level, word-level, or utterance-level vectors via models such as BERT, RoBERTa, and Llama3. Aggregation via mean or max pooling.
- Semantic Similarity: Pairwise or windowed cosine similarity between embedding vectors, $\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$.
- Model Logits: Storage of pre-softmax activation vectors, supporting uncertainty quantification and custom calibration.
- Acoustic Features:
- openSMILE: Extraction of standard acoustic feature sets (eGeMAPS, ComParE), including jitter, shimmer, MFCCs, and formant measures.
- Prosogram (Praat): Calculation of prosodic events, mean pitch, pitch range, and tonal segmentation.
Feature outputs are written as LPDS-conformant files in the derivatives directory, each uniquely named by participant/task/model/metric, with full parameter and version provenance recorded in a JSON sidecar or CSV header (Pauli et al., 19 Nov 2025).
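Outside the pipeline, the linguistic metrics above can be approximated with a few lines of Hugging Face code; this is a standalone sketch of mean-pooled embeddings and sentence-window cosine similarity, not pelican_nlp's internal implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Standalone sketch of mean-pooled embeddings and windowed cosine similarity;
# pelican_nlp's internal extractor code may differ.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_tokens, 768)
    return hidden.mean(dim=1).squeeze(0)            # mean pooling over tokens

sents = ["The cat sat quietly.", "A feline was sitting.", "Stocks fell sharply."]
vecs = [embed(s) for s in sents]
for a, b in zip(vecs, vecs[1:]):  # adjacent sentence-window pairs
    print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```

The acoustic side maps onto the opensmile Python bindings in a similarly compact way (again a sketch, assuming a local audio.wav):

```python
import opensmile

# eGeMAPS functionals for one recording; returns a one-row pandas DataFrame.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("audio.wav")
print(features.shape)  # (1, 88) for eGeMAPSv02 functionals
```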
6. Output Structure and Reproducibility
All outputs are placed in project_root/derivatives/, segregated by feature type:
- preprocessing/: Cleaned or diarised transcripts.
- embeddings/, similarity/, logits/, acoustic/: Per-file, per-feature tables.
- aggregations/: Group-level or across-participant metrics.
- logs/: Execution summaries and processing logs.
Each derivative file incorporates machine-readable metadata detailing the pelican_nlp version, configuration hash, and dependencies, supporting full backward compatibility. Results are reproducible to the byte level by specifying the pelican_nlp package version (e.g., pelican-nlp==0.3.4). LPDS folder and naming conventions ensure clear lineage from raw to processed outputs (Pauli et al., 19 Nov 2025).
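The sidecar described above can be pictured as follows; the field names are illustrative rather than pelican_nlp's exact schema:

```python
import hashlib
import json
import platform

# Illustrative provenance sidecar; pelican_nlp's actual field names may differ.
with open("config.yml", "rb") as f:
    config_bytes = f.read()

sidecar = {
    "pelican_nlp_version": "0.3.4",
    "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
    "python_version": platform.python_version(),
}
with open("part-01_task-interview_embeddings.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```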
7. Demonstrated Applications and Evaluation
pelican_nlp modules have been adopted in three major projects:
- Uncertainty modelling in multimodal speech analysis across the psychosis spectrum (Rohanian et al. 2025): Logistic regression combining linguistic logits and acoustic metrics predicts clinical status.
- Semantic fluency in the TRUSTING project (Hüppi et al.): Extraction of fastText embeddings and semantic similarity to compare fluency metrics across categories (“animals” vs. “fruits”).
- Speech-based relapse prediction in psychosis (Ciampelli et al. 2025): Integration of openSMILE and Prosogram-derived features with explainable AI via Shapley values.
In all cases, pelican_nlp has reduced complex, multi-step analysis into a unified, fully reproducible execution, markedly streamlining collaborative research and data sharing (Pauli et al., 19 Nov 2025).
8. Installation and Technical Requirements
The minimal hardware specification is 16 GB RAM; a modern GPU with ≥16 GB VRAM is recommended for efficient transformer-based embedding extraction. The software stack includes Python ≥3.10, PyTorch, Transformers, Gensim, scikit-learn, openSMILE bindings, PyYAML, jsonschema, tqdm, and loguru. Installation follows a standard Conda and pip workflow:
```bash
conda create -n pelican-nlp python=3.10
conda activate pelican-nlp
conda install pip
pip install pelican-nlp==0.3.4
pelican-run --help
```
Linux (Ubuntu) is recommended; macOS and Windows are supported, with GPU acceleration caveats on non-Linux systems (Pauli et al., 19 Nov 2025).
In summary, pelican_nlp operationalises rigorous data and methodological standardisation for linguistic analysis, unifying preprocessing, feature extraction, and output aggregation within a single, extensible, and reproducible pipeline. Its adoption of LPDS maximises data interoperability, and its programmatic, configuration-driven architecture directly addresses key reproducibility challenges in computational linguistics and behavioral science (Pauli et al., 19 Nov 2025).