FT+KB Classifier Overview
- FT+KB classifier is a hybrid technique combining fine-tuned LLM embeddings with knowledge base entity representations to boost classification accuracy.
- It uses truncated SVD for dimensionality reduction and an AutoML pipeline to optimize feature extraction and model selection.
- Applied to both document and functional data, it demonstrates robust performance improvements even in few-shot and noisy settings.
The FT+KB classifier refers to a class of machine learning techniques integrating fine-tuned representations from large neural models (FT: Fine-Tuned, typically LLMs) with explicit knowledge bases (KB: Knowledge Base), or, alternatively, functional depth-based approaches leveraging both functional and kernel-based information. These classifiers combine heterogeneous feature sources to improve classification across data regimes, including document, tabular, and functional modalities.
1. Architectures and Formalization
Two distinct but rigorously grounded FT+KB classifier families have been advanced.
Document Classification (LLM+KB Fusion):
This architecture leverages both fine-tuned LLM embeddings and knowledge-base-derived entity embeddings for document representation (Koloski et al., 2024). Each input document is simultaneously processed:
- LLM Encoder: Each document $x$ is mapped to a dense embedding $\mathbf{e}_{\mathrm{LLM}}(x) \in \mathbb{R}^{d_{\mathrm{LLM}}}$; e.g., Angle, $4096$ (LLM2Vec-LLaMa3), $1536/3072$ (OpenAI).
- Entity Linking: An external entity linker (Babelfy) extracts entities $t_1, \dots, t_k$ from the document, each mapped to a KB embedding $\mathbf{e}_{\mathrm{KB}}(t_i)$ (often using RotatE embeddings of Wikidata).
- Entity Representation: The document-level entity embedding is the mean of the linked entity embeddings, $\mathbf{e}_{\mathrm{KB}}(x) = \frac{1}{k} \sum_{i=1}^{k} \mathbf{e}_{\mathrm{KB}}(t_i)$.
- Fusion: The two representations are concatenated into $\mathbf{e}(x) = [\mathbf{e}_{\mathrm{LLM}}(x); \mathbf{e}_{\mathrm{KB}}(x)]$.
- Dimensionality Reduction: Truncated SVD is applied for low-dimensional projection, $\mathbf{z}(x) = V_k^{\top} \mathbf{e}(x)$, where $V_k$ contains the top $k$ principal components.
- Classification: $\mathbf{z}(x)$ is input to an AutoML pipeline (TPOT), automatically searching and tuning among a broad class of estimators (logistic regression, random forests, SVMs, etc.), feature selectors, and preprocessors.
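A minimal sketch of this fusion pipeline in scikit-learn, assuming precomputed LLM and KB embedding matrices; a plain logistic regression stands in for the TPOT AutoML search:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

def fit_ftkb(e_llm, e_kb, y, k=64):
    """Concatenate LLM and KB document embeddings, project via truncated
    SVD, and fit a downstream classifier (stand-in for the TPOT search)."""
    X = np.hstack([e_llm, e_kb])              # [n_docs, d_llm + d_kb]
    svd = TruncatedSVD(n_components=k, random_state=0).fit(X)
    clf = LogisticRegression(max_iter=1000).fit(svd.transform(X), y)
    return svd, clf

def predict_ftkb(svd, clf, e_llm, e_kb):
    return clf.predict(svd.transform(np.hstack([e_llm, e_kb])))
```

In the actual pipeline the final estimator would be replaced by TPOT's `TPOTClassifier` (population 100, generations 100, 5-fold CV), which searches over estimators, feature selectors, and preprocessors instead of fixing a single model.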
Functional Data (Spatial Depth – WMD+KFSD):
For the classification of functional data (curves/functions), the FT+KB label designates the WMD+KFSD procedure, which employs kernelized functional spatial depth (Sguera et al., 2013):
- Functional Spatial Depth (FSD): $\mathrm{FSD}(x, P) = 1 - \left\| \mathbb{E}\!\left[ \frac{x - X}{\|x - X\|} \right] \right\|$ for a random function $X \sim P$ taking values in a Hilbert space $\mathbb{H}$, with the convention $\frac{x - X}{\|x - X\|} = 0$ for $x = X$.
- Kernelized Extension (KFSD): Functions are embedded into a feature space via a kernel $\kappa$ (usually Gaussian), and depth is computed as $\mathrm{KFSD}(x, P) = 1 - \left\| \mathbb{E}\!\left[ \frac{\phi(x) - \phi(X)}{\|\phi(x) - \phi(X)\|} \right] \right\|$, with $\phi$ the implicit feature mapping induced by $\kappa$; all norms and inner products reduce to kernel evaluations.
- Classification Rule: For labeled groups with distributions $P_g$, assign $x$ to the group with the largest within-group depth $\mathrm{KFSD}(x, P_g)$.
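Because feature-space norms and inner products reduce to kernel evaluations, the sample KFSD and the maximum-depth rule can be sketched in numpy on discretized curves (the Gaussian kernel and the bandwidth handling here are illustrative, not the authors' exact code):

```python
import numpy as np

def gauss(a, b, h):
    # κ(a, b) = exp(-||a - b||^2 / h^2) on discretized curves
    return np.exp(-np.sum((a - b) ** 2) / h ** 2)

def kfsd(x, sample, h):
    """Sample KFSD of curve x with respect to a group of curves."""
    n = len(sample)
    kxx = gauss(x, x, h)
    kxi = np.array([gauss(x, yi, h) for yi in sample])
    K = np.array([[gauss(yi, yj, h) for yj in sample] for yi in sample])
    d = np.sqrt(np.maximum(kxx + np.diag(K) - 2 * kxi, 0))  # ||φ(x)-φ(x_i)||
    ok = d > 1e-12                 # convention: terms with x = x_i drop out
    num = kxx + K - kxi[None, :] - kxi[:, None]  # <φ(x)-φ(x_i), φ(x)-φ(x_j)>
    ratio = np.zeros_like(num)
    valid = np.outer(ok, ok)
    ratio[valid] = num[valid] / np.outer(d, d)[valid]
    return 1.0 - np.sqrt(max(ratio.sum(), 0.0)) / n

def classify_by_depth(x, groups, h):
    """Assign x to the group (dict label -> curves) of largest depth."""
    return max(groups, key=lambda g: kfsd(x, groups[g], h))
```

A curve near the center of its group leaves the normalized feature-space differences pointing in many directions, so they cancel and the depth is large; an outlying curve has them all pointing the same way, giving low depth.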
2. Methodology and Implementation
LLM+KB Fusion Implementation:
- Embeddings: HuggingFace transformers, OpenAI API, GraphVite (KB lookup).
- Entity Linking: Babelfy API.
- Dimensionality Reduction: Scikit-learn SVD.
- Classification: TPOT, with hyperparameters (population=100, generations=100, 5-fold CV).
- Resource Footprint: Up to 1 hour runtime, 16 cores, 256 GB RAM per run.
Functional KFSD:
- Kernel Choice: Gaussian kernel with bandwidth selected as a percentile of pairwise curve distances; 7 candidate percentiles (15–85%).
- KFSD Computation: Matrix-based approach on Gram matrices for efficiency; see pseudocode for details (Sguera et al., 2013).
- Classifier Variants: Distance to Trimmed Mean (DTM), Weighted Average Distance (WAD), but WMD+KFSD consistently yields top performance.
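The percentile-based bandwidth grid can be sketched as follows; the seven specific percentiles are an illustrative choice within the stated 15–85% range:

```python
import numpy as np

def candidate_bandwidths(curves, pcts=(15, 25, 40, 50, 60, 75, 85)):
    """Bandwidth candidates as percentiles of pairwise curve distances."""
    n = len(curves)
    dists = [np.linalg.norm(curves[i] - curves[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.percentile(dists, pcts)
```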
3. Empirical Performance and Benchmarks
Document FT+KB:
- Datasets: Books, DVD, Music (binary sentiment), Hate speech, MLDoc (4-way), XGENRE (9-way).
- Baselines: Ridge-penalized classifiers on pure LLM embeddings.
- Results: Average accuracy gain +0.52% (significant under a Wilcoxon signed-rank test); largest gains with Angle (+2.25 pp), mxbai (+1.50 pp), and LLaMa3 (+0.63 pp); OpenAI embeddings decreased marginally on hate speech.
- Compression: Low-dimensional SVD projections generally suffice; on some datasets the compressed representation matches high-dimensional accuracy.
- Few-Shot: Maintains parity or superiority to text-only baselines with 1–50% training data.
| Dataset | Baseline (%) | FT+KB (%) |
|---|---|---|
| Books | 93.85 | 95.40 |
| DVD | 94.15 | 94.95 |
| Music | 91.65 | 94.25 |
| Hate | 79.06 | 81.62 |
| MLDoc | 95.42 | 95.90 |
| XGENRE | 53.67 | 59.19 |
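The few-shot evaluation above (training on 1–50% of the data) can be sketched with a simple protocol; the linear probe and the fraction grid are placeholders for the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def few_shot_curve(X, y, fracs=(0.01, 0.05, 0.1, 0.25, 0.5), seed=0):
    """Accuracy of a linear probe at increasing training-set fractions."""
    scores = {}
    for f in fracs:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=f, stratify=y, random_state=seed)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[f] = clf.score(X_te, y_te)
    return scores
```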
Functional FT+KB (WMD+KFSD):
- Simulations: Across multiple curve-generating processes (with/without outliers, linear/sinusoidal), WMD+KFSD matches or exceeds the global-depth and $k$-NN approaches, especially when group differences are subtle or the data are contaminated.
- Real Data:
- Growth: WMD+KFSD error 3.45% (T1), 2.16% (T2, leave-one-out) vs. $k$-NN 3.86% and 3.23%.
- Phoneme: WMD+KFSD error 19.3% (T1), 18.5% (T2) vs. $k$-NN 22.1% and 22.5%.
- Robustness: Cross-validated bandwidth selection for the kernel ensures adaptability to multimodal and noisy settings.
4. Theoretical and Algorithmic Properties
Statistical Properties:
- LLM+KB Fusion: Concatenation is a non-parametric, information-augmenting operation; there is no end-to-end gradient flow between the embeddings and the classifier. All classifier optimization is downstream of feature generation and SVD.
- WMD+KFSD: KFSD provides a local, kernel-sensitive depth that robustly distinguishes functional modalities, outperforming global depths especially under contamination or overlapping group structure.
Dimensionality Reduction:
- SVD Choice: SVD projects both LLM and KB representations jointly, exposing latent axes that maximize variance relevant for downstream AutoML.
- Bandwidth Selection in KFSD: The bandwidth is chosen from a discrete candidate set via cross-validation, with computational overhead manageable given the small sample sizes typical in FDA.
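A plausible sketch of the cross-validated bandwidth selection (leave-one-out over the candidate grid); `classify_fn` is passed in so the sketch stays independent of the particular depth-based rule:

```python
def select_bandwidth(curves, labels, candidates, classify_fn):
    """Pick the candidate bandwidth maximizing leave-one-out accuracy.

    classify_fn(x, train_curves, train_labels, h) -> predicted label.
    """
    best_h, best_acc = None, -1.0
    for h in candidates:
        hits = 0
        for i, (x, lab) in enumerate(zip(curves, labels)):
            tr_x = curves[:i] + curves[i + 1:]   # hold out curve i
            tr_y = labels[:i] + labels[i + 1:]
            hits += classify_fn(x, tr_x, tr_y, h) == lab
        acc = hits / len(curves)
        if acc > best_acc:
            best_h, best_acc = h, acc
    return best_h, best_acc
```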
5. Practical Implementation and Tooling
Codebases and Libraries:
- LLM+KB: Source code (bablfusion) is provided; relies on HuggingFace, OpenAI API, Babelfy API, GraphVite, scikit-learn, and TPOT (Koloski et al., 2024).
- KFSD: Functional depth and classifier construction as described in (Sguera et al., 2013); efficient algorithms rely on Gram matrix precomputation and vectorized operations.
Resource Requirements:
- LLM+KB: Multi-core CPUs and large RAM for embedding and SVD; execution time up to 1 h per experiment.
- WMD+KFSD: Modest CPU load; full simulations with bandwidth search run in under 2 s per sample.
6. Significance, Generalizations, and Outlook
The FT+KB paradigm demonstrates that explicit fusion of unstructured (LLM) and structured (KB or kernel-induced) information materially enhances classification outcomes across both textual and functional domains. Empirically, these approaches match or exceed strong baselines, often with lower feature dimensionality and with robust behavior in few-shot regimes.
A plausible implication is that the explicit insertion of external knowledge or local kernel structure, followed by informed dimensionality reduction, yields generalizable gains for heterogeneous data. For document classification, this involves grounding LLM representations via KB entities; in functional data, local kernel depth metrics offer discriminative power when group differences are marginal or noise is present.
The methodology is extensible, with potential for adaptation to multimodal or hierarchical KBs, and for integrating end-to-end differentiability as LLMs and graph-neural KB encoders become more tightly coupled. The FT+KB class thus represents a principled and empirically validated approach to robust, high-performance classification leveraging both dense and structured knowledge representations (Koloski et al., 2024; Sguera et al., 2013).