- The paper introduces a unified retrieval-augmented multi-pose graph attention network that merges CLIP-based embeddings with FAISS-indexed knowledge to boost malnutrition screening accuracy.
- It employs multi-view image processing and graph attention layers for robust anthropometric estimation and effective minority-class handling.
- Empirical results show improved recall, AUC, and RMSE, highlighting strong cross-cohort generalizability and clinical deployment potential.
NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention for Malnutrition Screening
Technical Overview and Motivation
NutriScreener introduces a unified retrieval-augmented, multi-pose graph attention network (GAT) for malnutrition screening and pediatric anthropometric estimation from RGB images, addressing persistent challenges in algorithmic generalization, minority-class sensitivity, and real-world deployment. To counter the limitations of single-view image input, low-resource deployment constraints, and severe malnourishment class imbalance, the framework applies graph attention over CLIP-derived per-pose features, combined with context-aware fusion and nearest-neighbor retrieval augmentation. Each child subject is modeled as a graph over multiple anatomical views (frontal, lateral, selfie, back), with each node embedding extracted via a frozen CLIP encoder (RN50×64 variant) and enriched with age metadata. The global subject embedding is queried against a FAISS-indexed external Knowledge Base (KB), enabling class-boosted retrieval and flexible, population-aware adaptation.
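The KB lookup described above can be sketched without FAISS itself: a minimal NumPy stand-in that forms the global subject embedding (pose-average plus age), normalizes it, and retrieves the k most cosine-similar KB entries. All embeddings, labels, the age value, and the normalization scheme here are synthetic placeholders, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def global_embedding(pose_embs, age):
    """Pose-averaged CLIP embedding concatenated with age, L2-normalized
    so that a dot product equals cosine similarity."""
    g = np.concatenate([pose_embs.mean(axis=0), [age]])
    return g / np.linalg.norm(g)

# Synthetic stand-in for a FAISS-indexed KB of 500 stored subjects
kb = rng.standard_normal((500, 1025))
kb /= np.linalg.norm(kb, axis=1, keepdims=True)   # normalize for cosine search
kb_labels = rng.integers(0, 2, size=500)          # clinical labels (0/1)

query = global_embedding(rng.standard_normal((4, 1024)), age=30.0)
sims = kb @ query                                 # cosine similarity to all entries
k = 5
topk = np.argsort(-sims)[:k]                      # indices of k nearest neighbors
print(topk.shape, kb_labels[topk])
```

In practice FAISS replaces the brute-force `kb @ query` with an index structure, but the retrieval semantics are the same.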
Methodology
Multi-Pose Embedding and Graph Attention
For each subject, P multi-view images are processed through a frozen CLIP encoder to obtain 1024D semantic embeddings, concatenated with age to form 1025D node features. These serve as vertices in a fully connected undirected graph, with edges facilitating cross-pose relational reasoning. A two-layer GAT with multi-head self-attention models inter-view dependencies, enhancing robust detection of morphological cues indicative of malnutrition and supporting both binary classification (malnourished/healthy) and four-target anthropometric regression (Height, Weight, Mid-Upper Arm Circumference, Head Circumference).
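The node construction and attention step can be illustrated with a minimal NumPy sketch of a single GAT attention head over the fully connected pose graph. The paper uses a two-layer, multi-head GAT trained end to end; here all weights are random placeholders and only the shapes (1024-D CLIP embedding + age → 1025-D nodes) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for frozen CLIP embeddings of P = 4 poses
# (frontal, lateral, selfie, back)
P, D = 4, 1024
clip_embeddings = rng.standard_normal((P, D))
age_months = 30.0

# 1024-D semantic embedding + age scalar -> 1025-D node features
nodes = np.concatenate([clip_embeddings,
                        np.full((P, 1), age_months)], axis=1)   # (P, 1025)

def gat_layer(h, W, a, slope=0.2):
    """Single-head GAT layer on a fully connected graph:
    e_ij = LeakyReLU(a^T [Wh_i || Wh_j]), softmax over j, then aggregate."""
    z = h @ W                                       # (P, d_out) projected features
    d_out = z.shape[1]
    src = a[:d_out] @ z.T                           # a_1^T z_i for each node
    dst = a[d_out:] @ z.T                           # a_2^T z_j for each node
    e = src[:, None] + dst[None, :]                 # (P, P) attention logits
    e = np.where(e > 0, e, slope * e)               # LeakyReLU
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)       # softmax over neighbors
    return alpha @ z                                # attention-weighted aggregation

d_out = 256
W = rng.standard_normal((nodes.shape[1], d_out)) * 0.02
a = rng.standard_normal(2 * d_out) * 0.02
h1 = gat_layer(nodes, W, a)
print(nodes.shape, h1.shape)  # (4, 1025) (4, 256)
```

Because the pose graph is fully connected, every view attends to every other, which is what lets a frontal view borrow morphological evidence from a lateral one.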
Retrieval-Augmented Inference and Fusion
A global subject embedding (pose-average + age) queries the KB using FAISS, retrieving the k most similar cases with corresponding clinical labels. Retrieved neighbors are temperature-weighted via softmax, with malnourished samples upweighted by a class-specific boost factor. Retrieval predictions (classification and regression) are combined with GAT outputs using a context-aware MLP-driven fusion coefficient, adaptively shifting reliance between local graph attention and KB density. This mechanism remedies minority-class insensitivity and supports transfer to novel demographic or clinical cohorts.
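The weighting and fusion logic described above can be sketched as follows. The temperature, boost factor, similarity values, and the fixed fusion coefficient are illustrative assumptions; in the paper the coefficient is produced by a context-aware MLP.

```python
import numpy as np

def retrieval_prediction(sims, labels, boost=2.0, tau=0.1):
    """Temperature-softmax weights over k retrieved neighbors, with
    malnourished neighbors (label 1) upweighted by a class boost."""
    w = np.exp(sims / tau)
    w = w * np.where(labels == 1, boost, 1.0)   # class-specific boost
    w = w / w.sum()
    return float(w @ labels)                    # weighted vote in [0, 1]

def fused_output(gat_prob, retr_prob, fusion_coeff):
    """Convex combination of GAT and retrieval predictions; the coefficient
    is MLP-predicted in the paper, fixed here for illustration."""
    return fusion_coeff * retr_prob + (1.0 - fusion_coeff) * gat_prob

sims = np.array([0.92, 0.88, 0.80])   # cosine similarities (hypothetical)
labels = np.array([1, 0, 1])          # neighbors' clinical labels
p_retr = retrieval_prediction(sims, labels)
p = fused_output(gat_prob=0.35, retr_prob=p_retr, fusion_coeff=0.6)
print(round(p_retr, 3), round(p, 3))
```

Note how the boost pulls the retrieval vote toward the malnourished class even when healthy neighbors are similarly close, which is exactly the minority-class remedy the text describes.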
Training and Evaluation Protocol
End-to-end training utilizes binary cross-entropy and mean squared error objectives. Architecture ablation confirms the optimality of two GAT layers, eight attention heads, 0.1 dropout, and cosine distance for retrieval. Performance is largely insensitive to the fusion temperature, neighbor count, and class boost factor, confirming framework robustness. 4-fold cross-validation is applied on the largest pediatric dataset (Anthro Vision), supplemented by ARAN (Kurdish children) and CampusPose (collegiate subjects). Evaluation spans recall, F1, ROC AUC, calibration, mean absolute error, and decision-curve net benefit.
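The joint objective can be written as a single scalar loss per subject: BCE on the screening probability plus MSE over the four anthropometric targets. The balancing weight `lam` and all numeric values below are illustrative assumptions, not the paper's reported hyperparameters.

```python
import numpy as np

def multitask_loss(p_cls, y_cls, y_reg_hat, y_reg, lam=1.0):
    """Binary cross-entropy for malnourished/healthy classification plus
    mean squared error over the four regression targets; `lam` (assumed)
    balances the two terms."""
    eps = 1e-7
    bce = -(y_cls * np.log(p_cls + eps) + (1 - y_cls) * np.log(1 - p_cls + eps))
    mse = np.mean((y_reg_hat - y_reg) ** 2)
    return float(bce + lam * mse)

# One subject: targets are (Height cm, Weight kg, MUAC cm, Head Circ. cm);
# values are illustrative only.
loss = multitask_loss(p_cls=0.8, y_cls=1,
                      y_reg_hat=np.array([95.0, 13.2, 14.1, 47.0]),
                      y_reg=np.array([93.5, 12.8, 13.9, 47.5]))
print(round(loss, 4))
```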
Empirical Results and Claims
NutriScreener yields 0.79 recall, 0.82 AUC, and F1 of 0.66 for malnutrition classification, outperforming previous multitask models (DomainAdapt recall: 0.67, AUC: 0.55). Anthropometric prediction achieves RMSEs of 6.38 cm (Ht) and 5.32 kg (Wt), substantially improving upon baselines (DomainAdapt RMSE: 22.0 cm, 12.4 kg). Ablations show that multi-pose GATs alone boost AUC (0.82) but only retrieval-augmented fusion achieves high recall, demonstrating crucial minority-class modeling. Cross-cohort results confirm strong generalizability to new populations: fine-tuned CLIP features underperform compared with frozen pretraining, supporting the hypothesis that foundation model features encode transferable anthropometric cues unless over-adapted.
Knowledge Base Impact and Cohort Transfer
KB choice directly modulates recall and regression error: demographically matched KBs induce up to 25% recall improvement and 3.5 cm RMSE reduction. Even partial KB augmentation enhances minority-class sensitivity, while non-retrieval setups revert to non-augmented GAT performance. t-SNE analysis confirms that retrieval efficacy correlates with embedding overlap between target and KB domains.
Calibration, Reliability, and Deployment Readiness
Predicted probabilities are well-calibrated (ECE: 0.06, MCE: 0.26, Brier: 0.16), and decision-curve analysis shows nontrivial clinical net benefit (+0.15 at threshold T=0.3), equating to 15 extra correct screening decisions per 100 subjects versus extreme strategies. A user study (N=12 clinical experts, average experience 9.5 years) rates NutriScreener as highly accurate (4.3/5) and efficient (4.6/5), and the standalone app achieves acceptable inference latency on low-resource, CPU-only devices (<450 s per case, <822 MB peak RAM), confirming field deployability.
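The net-benefit figure follows from the standard decision-curve formula, NB = TP/N − (FP/N) · t/(1 − t). The confusion counts below are hypothetical, chosen only to show how a value of 0.15 at T = 0.3 translates into 15 net extra correct decisions per 100 subjects; they are not the paper's counts.

```python
def net_benefit(tp, fp, n, threshold):
    """Decision-curve net benefit at probability threshold t:
    true positives credited per subject, false positives penalized
    by the odds of the threshold, t / (1 - t)."""
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical counts that yield NB = 0.15 at T = 0.3
nb = net_benefit(tp=21, fp=14, n=100, threshold=0.3)
print(round(nb, 2))  # 0.15
```

At T = 0.3 each false positive "costs" 3/7 of a true positive, so 14 false positives offset 6 true positives, leaving a net 15 per 100.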
Implications and Future Directions
NutriScreener’s retrieval-augmented GAT framework demonstrates strong representational capacity for minority-class detection, cohort transfer, and anthropometric regression from RGB images, scaling malnutrition screening in low-resource settings. The adaptive fusion mechanism highlights the practical value of population-aware, knowledge-driven model architectures, especially for clinical triage under severe class imbalance. The explicit use of multi-view input and graph-based reasoning reveals direct benefits in modeling pose variability and context-specific morphological cues.
Practically, this methodology reduces reliance on laborious manual measurements and specialized hardware, enabling screening from routine images. The approach also opens new directions for retrieval-augmented architectures in other medical imaging settings, longitudinal health modeling, or real-time field applications.
Future work includes (1) extending diverse KB construction for broader demographic coverage, (2) augmenting interpretability with uncertainty quantification and visual cue localization, and (3) examining federated or privacy-preserving KB exchange to enable truly global, equitable malnutrition triage. The underexplored space of retrieval-guided fusion in graph-based vision architectures for clinical prediction remains ripe for further investigation.
Conclusion
NutriScreener effectively bridges algorithmic and deployment-level gaps for malnutrition screening under real-world constraints, yielding substantial performance, calibration, and operational gains over prior convolutional and domain-adaptive baselines. Its retrieval-augmented, multi-pose GAT pipeline provides robust generalization, minority-class sensitivity, and anthropometric accuracy across heterogeneous populations, demonstrating concrete technical progress toward scalable, accessible pediatric nutrition assessment. The framework sets new empirical standards for image-based screening, with sustained future relevance in AI-driven global health and clinical assistive tools.