GLEN-Bench: Graph-Language Benchmark for Nutritional AI
- GLEN-Bench is a graph-language-based benchmark that integrates clinical, nutritional, and socioeconomic data into a heterogeneous knowledge graph.
- The benchmark supports tasks like opioid misuse risk detection, personalized food recommendation, and nutritional question answering using GNN, LLM, and hybrid models.
- Experimental results demonstrate that hybrid GNN+LLM pipelines outperform baselines while socioeconomic constraints improve recommendation realism.
GLEN-Bench is a graph-language-based benchmark that addresses the need for holistic, explainable, and constraint-aware computational frameworks in nutritional health assessment. It unifies NHANES clinical and demographic records, FNDDS food composition data, and USDA food-access metrics into a multi-relational heterogeneous knowledge graph that underpins three linked tasks: risk detection, personalized food recommendation, and nutritional question answering. The benchmark is designed to evaluate and compare models, especially graph neural networks, LLMs, and their hybrids, on tasks that require synthesizing complex dietary-health interactions and real-world socioeconomic constraints (Huang et al., 26 Jan 2026).
1. Knowledge Graph Structure and Data Integration
GLEN-Bench’s central data structure is a directed, typed knowledge graph whose nodes are partitioned into user, food, ingredient, category, dietary-habit, health-condition, nutrition-tag, price-tag, poverty-condition, and opioid-level types. Relations include user–consumes–food, user–has–condition, food–has–ingredient, poverty–maps–price, and user–opioid–level. Node features are built by feature concatenation or from BERT embeddings. The source datasets are integrated as follows:
- NHANES (2003–2020): Each participant yields a user node with demographic, laboratory, and socioeconomic features and edges to health, dietary-habit, poverty, and opioid-status nodes.
- FNDDS/WWEIA: Each food maps to a nutrient-profile node, edges to category/ingredient, and nutrition-tag (e.g., low_sodium) according to thresholds.
- USDA Purchase-to-Plate: Foods are tagged with discretized price-tiers and linked to poverty-constraints to model economic feasibility.
This unified graph enables multi-view, attribute-rich modeling of users, foods, and their contextual constraints, designed for compositional tasks beyond siloed dietary pattern mining or unconstrained recommendation.
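The typed graph described above can be illustrated with a minimal sketch in plain Python. The class `HeteroGraph`, the node-id scheme (`user:1`, `food:oatmeal`), the relation names, and both numeric thresholds are assumptions made for illustration; the source does not specify the actual thresholds or storage format.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical cut points, chosen only for illustration.
LOW_SODIUM_MG_PER_100G = 120.0
PRICE_TIER_BOUNDS = (0.25, 0.75)  # assumed USD per 100 g


@dataclass
class HeteroGraph:
    """Typed, directed multigraph: nodes carry a type, edges a relation."""
    node_type: dict = field(default_factory=dict)  # node id -> type name
    out_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_id, ntype):
        self.node_type[node_id] = ntype

    def add_edge(self, src, relation, dst):
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[src].append((relation, dst))

    def one_hop(self, node_id):
        """Outgoing 1-hop neighborhood as (src, relation, dst) triples."""
        return [(node_id, rel, dst) for rel, dst in self.out_edges.get(node_id, [])]


def nutrition_tags(nutrients):
    """Derive threshold-based tags from a nutrient profile (per 100 g)."""
    tags = []
    if nutrients.get("sodium_mg", float("inf")) <= LOW_SODIUM_MG_PER_100G:
        tags.append("low_sodium")
    return tags


def price_tier(price_per_100g):
    """Discretize a price into one of three coarse tiers."""
    low, high = PRICE_TIER_BOUNDS
    if price_per_100g < low:
        return "price_low"
    if price_per_100g < high:
        return "price_mid"
    return "price_high"


g = HeteroGraph()
g.add_node("user:1", "user")
g.add_node("food:oatmeal", "food")
g.add_node("cond:hypertension", "health_condition")
g.add_edge("user:1", "consumes", "food:oatmeal")
g.add_edge("user:1", "has", "cond:hypertension")

# Attach threshold-derived nutrition tags and a discretized price tier.
for tag in nutrition_tags({"sodium_mg": 4.0}):
    g.add_node("tag:" + tag, "nutrition_tag")
    g.add_edge("food:oatmeal", "has_tag", "tag:" + tag)
tier = price_tier(0.18)
g.add_node("price:" + tier, "price_tag")
g.add_edge("food:oatmeal", "has_price", "price:" + tier)
```

Calling `g.one_hop("user:1")` returns the two user-centered triples, the kind of local neighborhood a risk-detection model would take as input. A production pipeline would more likely use a dedicated heterogeneous-graph library rather than this dictionary-based structure.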
2. Benchmark Task Suite and Evaluation
GLEN-Bench defines three explicitly connected tasks:
Task 1: Opioid Misuse Risk Detection
- Input: 1-hop subgraph around a user node.
- Output: One of three labels: normal, recovered, or active.
- Objective: Multi-class cross-entropy.
- Evaluation: F1-macro, AUC (one-vs-rest), GMean, Accuracy.
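As a rough illustration of two of these metrics, the sketch below computes macro-F1 and GMean from label lists. It assumes GMean is the geometric mean of per-class recall, which is the common definition for imbalanced classification; the source does not spell out its exact formula. The function names and the toy labels are illustrative.

```python
import math


def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[t]["fn"] += 1
            counts[p]["fp"] += 1
    return counts


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    counts = per_class_counts(y_true, y_pred, labels)
    f1s = []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(labels)


def gmean(y_true, y_pred, labels):
    """Geometric mean of per-class recall (assumed definition)."""
    counts = per_class_counts(y_true, y_pred, labels)
    recalls = []
    for c in labels:
        support = counts[c]["tp"] + counts[c]["fn"]
        recalls.append(counts[c]["tp"] / support if support else 0.0)
    return math.prod(recalls) ** (1.0 / len(labels))


labels = ["normal", "recovered", "active"]
y_true = ["normal", "normal", "normal", "normal", "recovered", "active"]
y_pred = ["normal", "normal", "normal", "recovered", "recovered", "normal"]

f1 = macro_f1(y_true, y_pred, labels)
gm = gmean(y_true, y_pred, labels)
```

In this toy case the single active user is missed, so GMean collapses to zero even though four of six predictions are correct. This sensitivity to minority-class recall is why such metrics are reported alongside accuracy on imbalanced labels.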
Task 2: Personalized Food Recommendation
- Input: User context, candidate food nodes, heterogeneous subgraph.
- Output: Ranked top-k list of foods meeting nutritional and feasibility constraints.
- Objective: Bayesian Personalized Ranking (BPR) with penalty terms for violating user health-condition and economic-feasibility constraints.
- Evaluation: Recall@20, NDCG@20, H-Score@20 (fraction satisfying at least one required tag), PA@20 (poverty awareness), AvgTags@20.
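The BPR objective with constraint penalties might look like the following sketch for a single (positive, negative) item pair. The penalty form, which scales each item's sigmoid-squashed score by its number of violated constraints, and the weight `lam` are assumptions; the source states only that penalties are applied, not how they are parameterized.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bpr_with_penalty(pos_score, neg_score, pos_violations=0, neg_violations=0, lam=0.5):
    """BPR pairwise loss plus a hypothetical constraint penalty.

    The ranking term is -log(sigmoid(pos_score - neg_score)), written as
    softplus of the negated margin. The penalty (an assumed form) grows
    when items that violate health or price constraints get high scores.
    """
    ranking = softplus(-(pos_score - neg_score))
    penalty = pos_violations * sigmoid(pos_score) + neg_violations * sigmoid(neg_score)
    return ranking + lam * penalty


base = bpr_with_penalty(0.0, 0.0)
penalized = bpr_with_penalty(0.0, 0.0, neg_violations=1)
better_margin = bpr_with_penalty(2.0, 0.0)
```

With equal scores and no violations the loss equals log 2, it rises when a scored item violates a constraint, and it falls as the positive item is ranked further above the negative one.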
Task 3: Nutritional Question Answering
- Input: (User, Food) pair with serialized subgraph evidence.
- Output: Multi-label nutrition tags (support) and natural language explanation.
- Objective: Combined multi-label classification and sequence generation loss.
- Evaluation: Multi-label (Accuracy, F1, AUC); Generation (ROUGE-1/2/L, BLEU, BERTScore).
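One plausible reading of the combined objective is a weighted sum of a multi-label binary cross-entropy over nutrition tags and a token-level negative log-likelihood over the explanation. The sketch below assumes that form; the weight `gen_weight` and the function names are illustrative and are not taken from the source.

```python
import math


def multilabel_bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over independent tag predictions."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def sequence_nll(token_probs, eps=1e-12):
    """Mean negative log-likelihood of the reference explanation tokens."""
    return sum(-math.log(max(p, eps)) for p in token_probs) / len(token_probs)


def combined_loss(tag_probs, tag_targets, token_probs, gen_weight=1.0):
    """Classification loss plus weighted generation loss (assumed form)."""
    return multilabel_bce(tag_probs, tag_targets) + gen_weight * sequence_nll(token_probs)


# Two tags predicted at chance, explanation tokens predicted with certainty.
loss = combined_loss([0.5, 0.5], [1, 0], [1.0, 1.0])
```

Here the generation term vanishes and the loss reduces to log 2 from the classification term alone. In practice the two terms would come from the model's tag head and its language-modeling head, respectively.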
This task suite is designed to operationalize clinical relevance, economic realism, and interpretability in nutritional AI.
3. Model Architectures, Optimization, and Baselines
GLEN-Bench provides comprehensive coverage of modern graph and language modeling paradigms:
- Graph Neural Networks: GCN, GraphSAGE, GAT, RGCN, HGT, HAN architectures. Embedding dimensions: 256 (risk detection), 128 (recommendation). Multi-relation and meta-path aware variants (RGCN, HGT, HAN) preserve relational heterogeneity.
- LLMs: LLaMA-3.1-8B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, fine-tuned via structured prompting on serialized graphs.
- Hybrid (GNN+LLM) Pipelines: Inject learned graph embeddings or retrieved subgraph triples as structured evidence into LLMs (KAPING, ToG, G-Retriever, KAR).
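To make the idea of injecting retrieved triples as structured evidence concrete, here is a deliberately simplified sketch. It is not the KAPING, ToG, G-Retriever, or KAR algorithm: retrieval is reduced to an entity-match filter, and the prompt template, function names, and node ids are all hypothetical.

```python
def retrieve_triples(triples, entities, max_triples=20):
    """Keep triples whose head or tail is one of the query entities."""
    hits = [t for t in triples if t[0] in entities or t[2] in entities]
    return hits[:max_triples]


def serialize_triples(triples):
    """Render (head, relation, tail) triples as one line of text each."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)


def build_prompt(user_id, food_id, triples, question):
    """Assemble an LLM prompt from retrieved graph evidence (assumed template)."""
    evidence = serialize_triples(retrieve_triples(triples, {user_id, food_id}))
    return (
        "You are a nutrition assistant. Use only the graph evidence below.\n"
        f"Graph evidence:\n{evidence}\n"
        f"Question: {question}\n"
        "Answer:"
    )


triples = [
    ("user:1", "has", "cond:hypertension"),
    ("food:soup", "has_tag", "tag:high_sodium"),
    ("user:2", "consumes", "food:apple"),
]
prompt = build_prompt(
    "user:1", "food:soup", triples, "Is this food suitable for this user?"
)
```

The resulting prompt contains only the triples touching the queried user and food. Real retrieval-augmented pipelines typically rank candidate triples by embedding similarity or graph search rather than exact entity match.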
Training Protocols: Models are optimized with Adam, with tuned learning rates, regularization, and dropout. All models are trained for 500 epochs, with checkpoints selected on the validation set, using mixed precision on NVIDIA A100 hardware.
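Validation-based checkpoint selection can be sketched independently of any deep learning framework. In the sketch below, `train_step` and `validate` are hypothetical callables standing in for one epoch of optimization and one validation pass; the optimizer and mixed-precision details are omitted.

```python
import copy


def train_with_checkpoint_selection(state, train_step, validate, num_epochs=500):
    """Run a fixed number of epochs and keep the best-validating state.

    `train_step(state)` returns the updated state after one epoch;
    `validate(state)` returns a score where higher is better.
    """
    best_state = copy.deepcopy(state)
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(num_epochs):
        state = train_step(state)
        score = validate(state)
        if score > best_score:  # strict improvement keeps the earliest best
            best_state = copy.deepcopy(state)
            best_score = score
            best_epoch = epoch
    return best_state, best_score, best_epoch


# Toy problem: the validation score peaks when the counter reaches 10.
best_state, best_score, best_epoch = train_with_checkpoint_selection(
    {"step": 0},
    train_step=lambda s: {"step": s["step"] + 1},
    validate=lambda s: -((s["step"] - 10) ** 2),
    num_epochs=50,
)
```

Training continues past the peak, but the returned state is the one from the best validation epoch. This is the behavior that matters when a fixed 500-epoch budget overshoots the optimum.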
4. Experimental Findings and Performance Analysis
Risk Detection
Table: Macro-F1, AUC, GMean, and Accuracy, all in % (best result per column in bold):
| Method | F1_macro (%) | AUC (%) | GMean (%) | Acc (%) |
|---|---|---|---|---|
| MLP | 11.4 | 56.1 | 49.2 | 17.4 |
| GCN | 12.5 | 57.0 | 52.4 | 18.4 |
| GraphSAGE | 17.0 | 55.6 | 51.5 | 28.6 |
| GAT | 17.8 | 56.7 | 54.1 | 33.0 |
| RGCN | 19.2 | 58.9 | 54.3 | 33.1 |
| HGT | 21.6 | 57.4 | 50.8 | 39.8 |
| HAN | 25.9 | 70.6 | **65.1** | 47.4 |
| DeepSeek+Graph | **31.5** | **70.7** | 60.4 | **67.2** |
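The hybrid DeepSeek+Graph achieves the best macro-F1 (31.5) and accuracy (67.2) and a marginally higher AUC than HAN (70.7 vs. 70.6), while HAN attains the best GMean (65.1). Among the pure GNNs, the heterogeneity-aware HAN and HGT reach the highest macro-F1 and accuracy, which illustrates the benefit of relation-aware aggregation.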
Recommendation
| Method | Recall@20 | NDCG@20 | H-Score@20 | PA@20 | AvgTags@20 |
|---|---|---|---|---|---|
| GAT | 10.17 | 8.58 | 32.2 | 16.4 | 6.90 |
| NGCF | 12.93 | 7.98 | 33.9 | 17.6 | 7.09 |
| RecipeRec | 12.72 | 9.17 | 39.6 | 26.6 | 6.78 |
| HFRS-DA | 12.78 | 9.20 | 34.2 | 27.8 | 7.30 |
| MOPI-HFRS | 13.25 | 9.97 | 38.2 | 26.8 | 7.21 |
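The constraint-aware models lead on most metrics: HFRS-DA attains the highest PA@20 (27.8) and AvgTags@20 (7.30), and MOPI-HFRS the highest Recall@20 (13.25) and NDCG@20 (9.97), indicating improved economic realism alongside ranking quality. RecipeRec attains the highest H-Score@20 (39.6), with MOPI-HFRS close behind (38.2).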
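The ranking metrics in the table above can be approximated with the sketch below. Recall@k and NDCG@k follow their standard binary-relevance definitions, and H-Score@k follows the description given earlier (fraction of top-k items carrying at least one required tag). The PA@k function is an assumption, since the source does not define poverty awareness precisely: it is taken here as the fraction of top-k items whose price tier is affordable for the user.

```python
import math


def recall_at_k(ranked, relevant, k=20):
    """Share of the user's relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k=20):
    """Binary-relevance NDCG with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0


def h_score_at_k(ranked, item_tags, required_tags, k=20):
    """Fraction of top-k items carrying at least one required tag."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if item_tags.get(i, set()) & required_tags) / len(top)


def pa_at_k(ranked, item_price_tier, affordable_tiers, k=20):
    """Assumed definition: fraction of top-k items in an affordable tier."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if item_price_tier.get(i) in affordable_tiers) / len(top)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
item_tags = {"a": {"low_sodium"}, "b": set(), "c": {"high_fiber"}}
tiers = {"a": "price_low", "b": "price_high", "c": "price_low"}

recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
h_score = h_score_at_k(ranked, item_tags, {"low_sodium"}, k=3)
pa = pa_at_k(ranked, tiers, {"price_low"}, k=3)
```

In this toy example both relevant items are retrieved in the top three, one of three recommendations carries the required tag, and two of three fall in the affordable tier. The benchmark reports these quantities averaged over users at k = 20.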
Nutritional QA
| Method | F1_ML | AUC_ML | ROUGE-L | BLEU | BERTScore |
|---|---|---|---|---|---|
| Plain LLaMA3 | 0.486 | 0.734 | 0.600 | 0.363 | 0.944 |
| CoT-BAG | 0.529 | 0.770 | 0.657 | 0.421 | 0.952 |
| G-Retriever | 0.572 | 0.795 | 0.633 | 0.391 | 0.946 |
| KAR (GPT-4) | 0.492 | 0.737 | 0.669 | 0.455 | 0.946 |
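Retrieval-augmented pipelines lead on most metrics: G-Retriever achieves the best multi-label F1 (0.572) and AUC (0.795), and KAR (GPT-4) the best ROUGE-L (0.669) and BLEU (0.455). CoT-BAG attains the highest BERTScore (0.952). BERTScore values near 0.95 across all methods indicate explanations that are semantically close to the ground truth.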
Dietary-Health Correlations
NHANES-derived analyses confirm that opioid users exhibit a higher prevalence of poor dietary habits (e.g., excessive salt, frozen food) and comorbidities (sleep disorders, depression, obesity, hypertension). Detecting risk profiles thus requires models sensitive to these subtle, multi-view graph signals, a regime in which hybrid GNN+LLM architectures perform best.
5. Socioeconomic Constraints and Feasibility Modeling
The results indicate that economic context is critical for realistic, actionable dietary interventions:
- Enforcing PovertyCondition–PriceTag edges lifts PA@20 from 16.4% (GAT) to 27.8% (HFRS-DA).
- Modeling poverty/food-insecurity in the graph improves GMean and macro-F1 for the minority at-risk class in risk detection.
These structural design choices demonstrate the necessity of integrating affordability and access constraints for fair and feasible recommendation systems.
6. Design Limitations and Prospective Developments
Key open challenges and directions include:
- Confounding & Missingness: NHANES dietary recalls are susceptible to noise and reporting bias, necessitating robustness analyses (counterfactual perturbations, subgroup tests).
- Feasibility Granularity: Current price tags are coarse; future work should integrate geographic food availability, seasonality, preparation time, cultural and allergen preferences.
- Causal and Safety Evaluation: Evaluations should move beyond compliance metrics toward causal analysis, fidelity and calibration checks for fact-grounded and trustworthy explanations, and human-in-the-loop assessment.
- Task Expansion: The benchmark can be extended to tasks such as ingredient substitution, meal plan optimization under multidimensional resource constraints, adherence prediction, and food-drug interaction modeling.
By constructing a reproducible, extensible platform uniting clinical, nutritional, and economic signals under a shared graph-language protocol, GLEN-Bench advances the development and assessment of graph-LLMs for personalized, practical, and interpretable nutritional health support (Huang et al., 26 Jan 2026).