GLEN-Bench: Graph-Language Benchmark for Nutritional AI

Updated 2 February 2026

GLEN-Bench is a graph-language-based benchmark that integrates clinical, nutritional, and socioeconomic data into a heterogeneous knowledge graph.
The benchmark supports tasks like opioid misuse risk detection, personalized food recommendation, and nutritional question answering using GNN, LLM, and hybrid models.
Experimental results demonstrate that hybrid GNN+LLM pipelines outperform baselines while socioeconomic constraints improve recommendation realism.

GLEN-Bench is a graph-language-based benchmark that addresses the need for holistic, explainable, and constraint-aware computational frameworks in nutritional health assessment. It unifies NHANES clinical and demographic records, FNDDS food composition data, and USDA food-access metrics into a multi-relational heterogeneous knowledge graph that underpins three linked tasks: risk detection, personalized food recommendation, and nutritional question answering. The benchmark is designed to evaluate and compare models (especially graph neural networks, LLMs, and their hybrids) on tasks that require synthesizing complex dietary-health interactions and real-world socioeconomic constraints (Huang et al., 26 Jan 2026).

1. Knowledge Graph Structure and Data Integration

GLEN-Bench’s central data structure is a directed, typed knowledge graph $\mathcal G = (\mathcal V, \mathcal E, \mathcal T, \mathcal R)$ with node set $\mathcal V$, edge set $\mathcal E$, node types $\mathcal T$, and relation types $\mathcal R$. Nodes are partitioned into user, food, ingredient, category, dietary-habit, health-condition, nutrition-tag, price-tag, poverty-condition, and opioid-level types. Relations include links such as user–consumes–food, user–has–condition, food–has–ingredient, poverty–maps–price, and user–opioid–level. Node features $\mathbf x_v$ are constructed by attribute concatenation or from BERT embeddings. The three data sources are integrated as follows:

NHANES (2003–2020): Each participant yields a user node with demographic, laboratory, and socioeconomic features and edges to health, dietary-habit, poverty, and opioid-status nodes.
FNDDS/WWEIA: Each food maps to a nutrient-profile node with edges to category and ingredient nodes, and to nutrition-tag nodes (e.g., low_sodium) assigned by nutrient thresholds.
USDA Purchase-to-Plate: Foods are tagged with discretized price-tiers and linked to poverty-constraints to model economic feasibility.
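
The threshold-based tagging and price discretization described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual rules: the nutrient field names, the tag names other than low_sodium, and every numeric cut-off are hypothetical placeholders.

```python
from bisect import bisect_right

# Hypothetical per-100 g rules: tag -> (nutrient field, "max" or "min", limit).
# The benchmark's real cut-offs are not stated in this article.
TAG_RULES = {
    "low_sodium": ("sodium_mg", "max", 120.0),
    "high_protein": ("protein_g", "min", 10.0),
    "low_sugar": ("sugar_g", "max", 5.0),
}

def nutrition_tags(nutrients):
    """Return the sorted tags whose threshold rule the nutrient profile satisfies."""
    tags = []
    for tag, (field, kind, limit) in TAG_RULES.items():
        value = nutrients.get(field)
        if value is None:
            continue  # missing nutrient: emit no tag rather than guess
        if (kind == "max" and value <= limit) or (kind == "min" and value >= limit):
            tags.append(tag)
    return sorted(tags)

# Hypothetical price cut points (USD per 100 g) separating three tiers.
PRICE_CUTS = [0.25, 0.75]
PRICE_TIERS = ["price_low", "price_mid", "price_high"]

def price_tier(price_per_100g):
    """Map a continuous price to a discrete price-tag name."""
    return PRICE_TIERS[bisect_right(PRICE_CUTS, price_per_100g)]

example_tags = nutrition_tags({"sodium_mg": 80.0, "protein_g": 12.0, "sugar_g": 9.0})
example_tier = price_tier(0.50)
```

Each returned tag or tier would then become an edge from the food node to the corresponding nutrition-tag or price-tag node.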

This unified graph enables multi-view, attribute-rich modeling of users, foods, and their contextual constraints, designed for compositional tasks beyond siloed dietary pattern mining or unconstrained recommendation.
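
As a rough illustration of the typed, directed structure, the sketch below stores node types and relation-labeled edges in plain Python containers. The class name `HeteroGraph`, its methods, and the node identifiers are assumptions made for this example; the benchmark's own storage format is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class HeteroGraph:
    # node id -> node type (e.g. "user", "food", "health_condition")
    node_type: dict = field(default_factory=dict)
    # head node id -> list of (relation, tail node id)
    out_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_id, ntype):
        self.node_type[node_id] = ntype

    def add_edge(self, head, relation, tail):
        if head not in self.node_type or tail not in self.node_type:
            raise KeyError("both endpoints must be added as typed nodes first")
        self.out_edges[head].append((relation, tail))

    def one_hop(self, node_id):
        """Return the (head, relation, tail) triples leaving node_id."""
        return [(node_id, rel, tail) for rel, tail in self.out_edges.get(node_id, [])]

g = HeteroGraph()
for nid, ntype in [
    ("user:1", "user"),
    ("food:oatmeal", "food"),
    ("ingredient:oats", "ingredient"),
    ("cond:hypertension", "health_condition"),
]:
    g.add_node(nid, ntype)

g.add_edge("user:1", "consumes", "food:oatmeal")
g.add_edge("user:1", "has_condition", "cond:hypertension")
g.add_edge("food:oatmeal", "has_ingredient", "ingredient:oats")

user_triples = g.one_hop("user:1")
```

Here `one_hop` follows outgoing edges only; a 1-hop neighborhood that also includes incoming edges would need a reverse index.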

2. Benchmark Task Suite and Evaluation

GLEN-Bench defines three explicitly connected tasks:

Task 1: Opioid Misuse Risk Detection

Input: 1-hop subgraph around a user node.
Output: Label in $\{0,1,2\}$ (normal, recovered, active).
Objective: Multi-class cross-entropy.
Evaluation: F1-macro, AUC (one-vs-rest), GMean, Accuracy.
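
Two of the listed metrics can be computed directly from label lists, as in the sketch below. It assumes GMean denotes the geometric mean of per-class recall, a common convention for imbalanced classification; the article does not spell out the benchmark's exact definition.

```python
import math

def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1
            counts[t]["fn"] += 1
    return counts

def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores."""
    counts = per_class_counts(y_true, y_pred, labels)
    f1s = []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def gmean(y_true, y_pred, labels=(0, 1, 2)):
    """Geometric mean of per-class recall (assumed definition)."""
    counts = per_class_counts(y_true, y_pred, labels)
    recalls = []
    for c in labels:
        tp, fn = counts[c]["tp"], counts[c]["fn"]
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return math.prod(recalls) ** (1.0 / len(recalls))

# Labels: 0 = normal, 1 = recovered, 2 = active
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]
f1_value = macro_f1(y_true, y_pred)
gmean_value = gmean(y_true, y_pred)
```

Because GMean multiplies recalls, a single class with zero recall drives the score to zero, which is why it is informative when the at-risk classes are rare.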

Task 2: Personalized Food Recommendation

Input: User context, candidate food nodes, heterogeneous subgraph.
Output: Ranked top-$K$ list of foods meeting nutritional and feasibility constraints.
Objective: Bayesian Personalized Ranking (BPR) with constraint penalties for violating user-condition and economic feasibility.
Evaluation: Recall@20, NDCG@20, H-Score@20 (fraction satisfying at least one required tag), PA@20 (poverty awareness), AvgTags@20.
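
The ranking metrics can be illustrated with a short sketch. Recall@K and NDCG@K follow their standard binary-relevance definitions; `h_score_at_k` implements the parenthetical description above (fraction of the top-K satisfying at least one required tag), and its argument names are assumptions for this example.

```python
import math

def recall_at_k(ranked, relevant, k=20):
    """Fraction of relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=20):
    """Binary-relevance NDCG with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

def h_score_at_k(ranked, food_tags, required_tags, k=20):
    """Fraction of top-k foods carrying at least one required nutrition tag."""
    top = ranked[:k]
    if not top:
        return 0.0
    ok = sum(1 for food in top if food_tags.get(food, set()) & required_tags)
    return ok / len(top)

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
food_tags = {"a": {"low_sodium"}, "b": {"high_protein"}, "c": set()}

recall_3 = recall_at_k(ranked, relevant, k=3)
ndcg_3 = ndcg_at_k(ranked, relevant, k=3)
h_3 = h_score_at_k(ranked, food_tags, {"low_sodium"}, k=3)
```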

Task 3: Nutritional Question Answering

Input: (User, Food) pair with serialized subgraph evidence.
Output: Multi-label nutrition tags (support) and natural language explanation.
Objective: Combined multi-label classification and sequence generation loss.
Evaluation: Multi-label (Accuracy, F1, AUC); Generation (ROUGE-1/2/L, BLEU, BERTScore).
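
The sketch below shows one way to score both halves of this task. It assumes micro-averaged F1 and exact-match (subset) accuracy for the tag sets, and a plain ROUGE-L F1 over whitespace tokens with equal precision/recall weighting; the benchmark may use different averaging or tokenization.

```python
def multilabel_scores(true_sets, pred_sets):
    """Micro-averaged F1 and exact-match accuracy over predicted tag sets."""
    tp = fp = fn = exact = 0
    for t, p in zip(true_sets, pred_sets):
        tp += len(t & p)
        fp += len(p - t)
        fn += len(t - p)
        exact += int(t == p)
    denom = 2 * tp + fp + fn
    return {
        "micro_f1": 2 * tp / denom if denom else 0.0,
        "subset_accuracy": exact / len(true_sets) if true_sets else 0.0,
    }

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on whitespace tokens (simplified, equal P/R weighting)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

scores = multilabel_scores(
    [{"low_sodium", "high_fiber"}, {"low_sugar"}],
    [{"low_sodium"}, {"low_sugar", "low_fat"}],
)
rouge = rouge_l_f1(
    "low sodium supports blood pressure control",
    "low sodium helps blood pressure",
)
```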

This task suite is designed to operationalize clinical relevance, economic realism, and interpretability in nutritional AI.

3. Model Architectures, Optimization, and Baselines

GLEN-Bench provides comprehensive coverage of modern graph and language modeling paradigms:

Graph Neural Networks: GCN, GraphSAGE, GAT, RGCN, HGT, HAN architectures. Embedding dimensions: 256 (risk detection), 128 (recommendation). Multi-relation and meta-path aware variants (RGCN, HGT, HAN) preserve relational heterogeneity.
LLMs: LLaMA-3.1-8B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, fine-tuned via structured prompting on serialized graphs.
Hybrid (GNN+LLM) Pipelines: Inject learned graph embeddings or retrieved subgraph triples as structured evidence into LLMs (KAPING, ToG, G-Retriever, KAR).
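
Injecting retrieved triples as structured evidence can be sketched as simple prompt construction. The template, function names, and truncation limit below are hypothetical; they illustrate the general idea of prepending serialized triples to a question and are not the prompt format of KAPING, ToG, G-Retriever, or KAR.

```python
def serialize_triples(triples, max_triples=20):
    """Render (head, relation, tail) triples as one line each, truncated to a budget."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in triples[:max_triples])

def build_prompt(question, triples, max_triples=20):
    """Prepend serialized graph evidence to the question (hypothetical template)."""
    evidence = serialize_triples(triples, max_triples)
    return f"Graph evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"

triples = [
    ("user:1", "has_condition", "cond:hypertension"),
    ("food:oatmeal", "has_ingredient", "ingredient:oats"),
]
prompt = build_prompt("Is oatmeal a suitable choice for this user?", triples)
```

The `max_triples` budget stands in for whatever retrieval or ranking step decides which part of the subgraph fits in the model's context window.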

Training Protocols: Adam optimizer; tuned learning rates, regularization, dropout. All models are trained for 500 epochs with checkpoint selection based on validation sets, using mixed precision on NVIDIA A100 hardware.
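
Checkpoint selection on a validation set amounts to keeping the epoch with the best validation metric. The helper below is a minimal sketch; the function name, the choice of metric, and the tie-breaking rule (earliest epoch wins) are assumptions.

```python
def select_checkpoint(val_history, higher_is_better=True):
    """Pick the (epoch, metric) pair with the best validation metric.

    val_history is a list of (epoch, metric) tuples in training order.
    On ties, the earliest epoch is returned because max() keeps the first maximum.
    """
    if not val_history:
        raise ValueError("val_history must contain at least one entry")
    sign = 1.0 if higher_is_better else -1.0
    return max(val_history, key=lambda entry: sign * entry[1])

# Hypothetical validation macro-F1 recorded at four epochs.
history = [(1, 0.20), (2, 0.31), (3, 0.31), (4, 0.28)]
best_epoch, best_metric = select_checkpoint(history)
```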

4. Experimental Findings and Performance Analysis

Risk Detection

Table: Macro-F1, AUC, GMean, and Accuracy (all in %):

Method	F1_macro (%)	AUC (%)	GMean (%)	Acc (%)
MLP	11.4	56.1	49.2	17.4
GCN	12.5	57.0	52.4	18.4
GraphSAGE	17.0	55.6	51.5	28.6
GAT	17.8	56.7	54.1	33.0
RGCN	19.2	58.9	54.3	33.1
HGT	21.6	57.4	50.8	39.8
HAN	25.9	70.6	65.1	47.4
DeepSeek+Graph	31.5	70.7	60.4	67.2

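The hybrid DeepSeek+Graph achieves the best macro-F1 (31.5), AUC (70.7), and Accuracy (67.2), while HAN attains the highest GMean (65.1 vs. 60.4). Among the pure GNNs, HAN leads on every metric and HGT improves macro-F1 and Accuracy over GCN, GraphSAGE, GAT, and RGCN, which points to the benefit of relation-aware aggregation.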

Recommendation

Method	Recall@20	NDCG@20	H-Score@20	PA@20	AvgTags@20
GAT	10.17	8.58	32.2	16.4	6.90
NGCF	12.93	7.98	33.9	17.6	7.09
RecipeRec	12.72	9.17	39.6	26.6	6.78
HFRS-DA	12.78	9.20	34.2	27.8	7.30
MOPI-HFRS	13.25	9.97	38.2	26.8	7.21

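Constraint-aware models (HFRS-DA, MOPI-HFRS) achieve the highest PA@20 (27.8 and 26.8), indicating improved economic realism, and MOPI-HFRS also leads on Recall@20 and NDCG@20. On H-Score@20, RecipeRec scores highest (39.6), ahead of MOPI-HFRS (38.2) and HFRS-DA (34.2), so the gain in nutritional suitability is not uniform across constraint-aware models.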

Nutritional QA

Method	F1_ML	AUC_ML	ROUGE-L	BLEU	BERTScore
Plain LLaMA3	0.486	0.734	0.600	0.363	0.944
CoT-BAG	0.529	0.770	0.657	0.421	0.952
G-Retriever	0.572	0.795	0.633	0.391	0.946
KAR (GPT-4)	0.492	0.737	0.669	0.455	0.946

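Retrieval-augmented pipelines lead on most metrics: G-Retriever achieves the best multi-label scores (F1 0.572, AUC 0.795), while KAR (GPT-4) achieves the best ROUGE-L (0.669) and BLEU (0.455). CoT-BAG records the highest BERTScore (0.952), and KAR's multi-label scores are only marginally above those of plain LLaMA3. BERTScore is at or above 0.944 for every method, indicating explanations that are semantically close to the ground truth.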

Dietary-Health Correlations

NHANES-derived analyses indicate that opioid users exhibit a higher prevalence of poor dietary habits (e.g., excessive salt, frozen food) and of comorbidities (sleep disorders, depression, obesity, hypertension). Detecting risk profiles therefore requires models that are sensitive to these subtle, multi-view graph signals, a regime in which the hybrid GNN+LLM architecture performs best on most metrics.

5. Socioeconomic Constraints and Feasibility Modeling

The results indicate that incorporating economic context is critical for realistic, actionable dietary interventions:

Enforcing PovertyCondition–PriceTag edges lifts PA@20 from 16.4% (GAT) to 27.8% (HFRS-DA).
Modeling poverty/food-insecurity in the graph improves GMean and macro-F1 for the minority at-risk class in risk detection.

These structural design choices demonstrate the necessity of integrating affordability and access constraints for fair and feasible recommendation systems.
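
The constraint-penalized ranking objective described for Task 2 can be sketched for a single (user, positive food, negative food) triple. The standard BPR term is $-\log \sigma(s_{\text{pos}} - s_{\text{neg}})$; the penalty term, its weight `lam`, and the violation counts are assumptions chosen for illustration, since the exact penalty form is not given here.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bpr_loss(pos_score, neg_score):
    """Standard BPR term: -log sigmoid(pos - neg) == softplus(neg - pos)."""
    return softplus(neg_score - pos_score)

def constrained_bpr_loss(pos_score, neg_score, pos_violations, neg_violations, lam=0.5):
    """BPR plus a hypothetical penalty on confidently scored constraint violators.

    *_violations counts the health or affordability constraints an item breaks
    for this user; each violation is weighted by the item's sigmoid score.
    """
    penalty = pos_violations * sigmoid(pos_score) + neg_violations * sigmoid(neg_score)
    return bpr_loss(pos_score, neg_score) + lam * penalty

plain = constrained_bpr_loss(2.0, 0.0, pos_violations=0, neg_violations=0)
penalized = constrained_bpr_loss(2.0, 0.0, pos_violations=1, neg_violations=0)
```

Under this form, a food the user actually consumed but cannot afford still incurs a cost when it is scored highly, which is one way a model could trade raw recall for feasibility.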

6. Design Limitations and Prospective Developments

Key open challenges and directions include:

Confounding & Missingness: NHANES dietary recalls are susceptible to noise and reporting bias, necessitating robustness analyses (counterfactual perturbations, subgroup tests).
Feasibility Granularity: Current price tags are coarse; future work should integrate geographic food availability, seasonality, preparation time, cultural and allergen preferences.
Causal and Safety Evaluation: Evaluations should move beyond compliance metrics toward causal analysis, fidelity and calibration checks for fact-grounded and trustworthy explanations, and human-in-the-loop assessment.
Task Expansion: Extensible to tasks such as ingredient substitution, meal plan optimization under multidimensional resource constraints, adherence prediction, and food-drug interaction modeling.

By constructing a reproducible, extensible platform uniting clinical, nutritional, and economic signals under a shared graph-language protocol, GLEN-Bench advances the development and assessment of graph-LLMs for personalized, practical, and interpretable nutritional health support (Huang et al., 26 Jan 2026).


References (1)

GLEN-Bench: A Graph-Language based Benchmark for Nutritional Health (2026)
