FitPro: Modular AI Systems in Diverse Domains

Updated 2 July 2026

FitPro is a collection of independent AI frameworks that target varied domains such as computer vision, supervised language model fine-tuning, and interactive retrieval.
The exercise correction system uses a multi-stage CNN and precise pose error thresholds to deliver real-time feedback with a 1.2% error rate on key exercises.
Other systems under FitPro employ probability-guided token masking and hierarchical, cross-modal retrieval strategies to enhance model accuracy and user interaction.

FitPro refers to a set of independent, topically distinct research efforts that share an identical or nearly identical name but target divergent problem domains: computer vision, embodied AI, supervised LLM fine-tuning, and cross-modal interactive retrieval. Major “FitPro” systems include intelligent exercise-feedback systems for human pose correction (Chen et al., 2019), zero-shot open-world pedestrian retrieval (Luo et al., 20 Sep 2025), probability-masked supervised LLM fine-tuning strategies (Liu et al., 14 Jan 2026), and at-home AI training workflows built atop vision-language models for interactive exercise correction (Zuo et al., 10 Aug 2025). There is no unifying framework or direct methodological lineage connecting these works beyond the adopted title.

1. FitPro for Intelligent Exercise Correction

The earliest system formally titled FitPro is “Fitness Done Right (FDR),” a real-time intelligent personal trainer for exercise feedback (Chen et al., 2019). The objective is automatic live monitoring, error detection, and feedback during strength exercises (specifically plank and squat). The system executes a three-stage video-processing pipeline:

Stage 1: Keypoint Detection employs a two-branch, multi-stage CNN architecture (Cao et al. 2016, VGG-19 frontend) to output joint confidence maps $S^t$ and part affinity fields $L^t$ per frame.
Stage 2: Pose Recognition aggregates 17 detected joints into a 52-D keypoint vector and a 12-D joint-angle vector, comparing these to a 200-image, hand-annotated pose database via weighted Euclidean and angular distance ( $D(A,B)=d_E+\alpha d_A$ ).
Stage 3: Error Detection & Correction computes pose-specific structural features (e.g., plank straightness angle $\varphi$ ; squat knee bend angle $\psi$ and weight-distribution $\delta$ ), comparing these to reference thresholds (e.g., $\varphi > 165^\circ$ , $\psi$ within $[0.45\pi, 0.55\pi]$ , $\delta \in [0.8, 1)$ ). Corrective natural language cues are mapped to error types and overlaid live.
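The Stage 2 matching step can be sketched as a nearest-neighbour lookup under $D(A,B)=d_E+\alpha d_A$. This is a minimal illustration, not the authors' implementation: the function names, the database layout, the default `alpha`, and the choice of summed wrapped angle differences for $d_A$ are all assumptions made for the example.

```python
import math

def euclidean_distance(kp_a, kp_b):
    """d_E: Euclidean distance between two flattened keypoint vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(kp_a, kp_b)))

def angular_distance(ang_a, ang_b):
    """d_A (assumed form): sum of absolute joint-angle differences in radians,
    each wrapped into [0, pi]."""
    total = 0.0
    for a, b in zip(ang_a, ang_b):
        diff = abs(a - b) % (2.0 * math.pi)
        total += min(diff, 2.0 * math.pi - diff)
    return total

def pose_distance(kp_a, ang_a, kp_b, ang_b, alpha=1.0):
    """D(A, B) = d_E + alpha * d_A; alpha=1.0 is an illustrative default."""
    return euclidean_distance(kp_a, kp_b) + alpha * angular_distance(ang_a, ang_b)

def recognize_pose(query_kp, query_ang, database, alpha=1.0):
    """Return the label of the database entry closest to the query pose.

    `database` is a hypothetical list of dicts with keys
    'label', 'keypoints', and 'angles'.
    """
    best = min(
        database,
        key=lambda e: pose_distance(query_kp, query_ang,
                                    e["keypoints"], e["angles"], alpha),
    )
    return best["label"]

# Toy two-entry database with 4-D keypoints and one joint angle each.
toy_db = [
    {"label": "plank", "keypoints": [0.0, 0.0, 1.0, 0.0], "angles": [3.0]},
    {"label": "squat", "keypoints": [0.0, 0.0, 0.0, 1.0], "angles": [1.5]},
]
matched = recognize_pose([0.0, 0.0, 0.9, 0.1], [2.9], toy_db)
```

In the described system the vectors would be the 52-D keypoint and 12-D joint-angle features, and the database would hold the 200 annotated reference poses.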

The method reports a 1.2% error rate over 1,000 plank/squat recognition samples and runs in real time on modern GPUs, with error-detection thresholds empirically fixed at $\varphi > 165^\circ$, $\psi \in [0.45\pi, 0.55\pi]$, and $\delta \in [0.8, 1)$. The authors describe the complete system as robust, scalable, and hardware-agnostic (Chen et al., 2019).
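The threshold-based error detection of Stage 3 can be illustrated with a few comparisons. The sketch below uses the reference ranges quoted for $\varphi$, $\psi$, and $\delta$; the error codes, the cue wording, and the function names are hypothetical and stand in for whatever mapping the original system uses.

```python
import math

# Reference thresholds as quoted in the text.
PLANK_MIN_STRAIGHTNESS_DEG = 165.0                   # phi must exceed this
SQUAT_KNEE_RANGE = (0.45 * math.pi, 0.55 * math.pi)  # psi, closed interval
SQUAT_WEIGHT_RANGE = (0.8, 1.0)                      # delta in [0.8, 1)

# Hypothetical mapping from error codes to corrective cues.
CUES = {
    "plank_not_straight": "Keep your body in a straight line.",
    "squat_knee_angle": "Adjust your knee bend toward a right angle.",
    "squat_weight_distribution": "Shift your weight distribution.",
}

def check_plank(phi_deg):
    """Return error codes for a plank frame given the straightness angle in degrees."""
    return [] if phi_deg > PLANK_MIN_STRAIGHTNESS_DEG else ["plank_not_straight"]

def check_squat(psi_rad, delta):
    """Return error codes for a squat frame given knee angle (radians) and weight ratio."""
    errors = []
    lo, hi = SQUAT_KNEE_RANGE
    if not (lo <= psi_rad <= hi):
        errors.append("squat_knee_angle")
    w_lo, w_hi = SQUAT_WEIGHT_RANGE
    if not (w_lo <= delta < w_hi):  # upper bound is exclusive
        errors.append("squat_weight_distribution")
    return errors

def cues_for(errors):
    """Translate error codes into the text cues that would be overlaid on the video."""
    return [CUES[e] for e in errors]
```

A caller would run the relevant check per frame after pose recognition and overlay `cues_for(errors)` whenever the list is non-empty.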

2. FitPro as Probability-Guided Fine-Tuning for LLMs

A separate body of work refers to “ProFit” or “FitPro” as a technique for supervised fine-tuning of LLMs (Liu et al., 14 Jan 2026). The method addresses the “one-to-many” problem in instruction fine-tuning, where each prompt is paired with a single reference target even though many valid responses exist, by introducing probability-guided masking:

For each token position $t$ in the canonical response $y$, compute the autoregressive probability $p_t = p_\theta(y_t \mid x, y_{<t})$. Apply a stop-gradient binary mask $m_t = \mathbb{1}[\operatorname{sg}(p_t) \ge \tau]$ for a fixed threshold $\tau$.
Modify the loss:

$\mathcal{L} = -\sum_{t} m_t \log p_\theta(y_t \mid x, y_{<t})$

so that only “core” (high-confidence) reasoning steps drive gradients.
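The masking rule can be sketched in plain Python on a list of per-token probabilities. This is an illustration under stated assumptions: the function name, the unnormalized sum over kept tokens, and the threshold value `tau=0.5` are choices made for the example, not the paper's settings. In an autograd framework the mask would be computed from detached probabilities so that no gradient flows through the thresholding step.

```python
import math

def masked_sft_loss(token_probs, tau=0.5):
    """Negative log-likelihood restricted to tokens whose probability is >= tau.

    token_probs: model probabilities p_t of each reference token.
    tau: fixed threshold (illustrative value; must be > 0 so log is defined
         for every kept token).
    Returns (loss, mask), where mask[t] is 1.0 for kept tokens and 0.0 otherwise.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    mask = [1.0 if p >= tau else 0.0 for p in token_probs]
    # Only kept tokens contribute; low-probability tokens are skipped entirely.
    loss = -sum(math.log(p) for p, m in zip(token_probs, mask) if m == 1.0)
    return loss, mask

# Three reference tokens; the middle one falls below the threshold.
loss, mask = masked_sft_loss([0.9, 0.2, 0.6], tau=0.5)
```

With this input only the first and third tokens contribute, so the loss equals `-(log 0.9 + log 0.6)`, whereas standard SFT would also include the `-log 0.2` term.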

Extensive ablation studies demonstrate that this core-token masking improves accuracy by 5.5–16.8 percentage points across models (Qwen3, OLMo-2, Llama-3.1) and reasoning/QA/math benchmarks, outperforming standard SFT and entropy-regularized baselines. The technique is fully compatible with LoRA-based parameter updates and introduces negligible training overhead (on the order of 1–2%) (Liu et al., 14 Jan 2026).

3. FitPro: Zero-Shot Interactive Pedestrian Retrieval

Another independent instantiation is “FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World” (Luo et al., 20 Sep 2025). This solution targets open-scene text-based pedestrian retrieval (TPR) under no-domain adaptation constraints, supporting multi-turn cross-modal user interactions. The architecture comprises:

Feature Contrastive Decoding (FCD): Denoising and super-resolution of detected pedestrian patches, followed by prompt-guided contrastive generation of structured region descriptions by minimizing a contrastive loss across batch negatives.
Incremental Semantic Mining (ISM): Multi-turn fusion of user feedback and multi-view observations into a growing multi-relational knowledge graph for each pedestrian, supporting progressive semantic representation.
Query-aware Hierarchical Retrieval (QHR): A two-stage pipeline: initial fusion of text- and vision-based similarity scores, then node-level semantic re-ranking with dynamic weights tuned to query and modality confidence.
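The two-stage structure of QHR can be sketched as a coarse ranking followed by a re-ranking of the shortlist. The convex-combination fusion, the fixed weights `w_text` and `w_node`, the candidate fields, and the function names below are all assumptions for illustration; the described system tunes its weights dynamically from query and modality confidence, which this sketch does not model.

```python
def fused_similarity(text_sim, vision_sim, w_text=0.5):
    """Stage-1 score: assumed convex combination of text and vision similarity."""
    return w_text * text_sim + (1.0 - w_text) * vision_sim

def hierarchical_retrieve(candidates, top_k=2, w_text=0.5, w_node=0.5):
    """Rank candidates in two stages and return their ids.

    candidates: hypothetical list of dicts with keys
        'id', 'text_sim', 'vision_sim', 'node_sim'.
    Stage 1 keeps the top_k candidates by fused similarity.
    Stage 2 re-ranks that shortlist by mixing in a node-level semantic score.
    """
    def coarse(c):
        return fused_similarity(c["text_sim"], c["vision_sim"], w_text)

    shortlist = sorted(candidates, key=coarse, reverse=True)[:top_k]

    def fine(c):
        return (1.0 - w_node) * coarse(c) + w_node * c["node_sim"]

    return [c["id"] for c in sorted(shortlist, key=fine, reverse=True)]

gallery = [
    {"id": "a", "text_sim": 0.9, "vision_sim": 0.7, "node_sim": 0.2},
    {"id": "b", "text_sim": 0.6, "vision_sim": 0.8, "node_sim": 0.9},
    {"id": "c", "text_sim": 0.1, "vision_sim": 0.2, "node_sim": 1.0},
]
ranking = hierarchical_retrieve(gallery, top_k=2)
```

Here candidate `c` is dropped in the coarse stage despite its high node-level score, and `b` overtakes `a` after re-ranking, which shows why the second stage only refines, and cannot recover, what the first stage filtered out.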

Zero-shot evaluation on five standard datasets (CUHK-PEDES, RSTPReid, ICFG-PEDES, CUHK-SYSU-TBPS, PRW-TBPS) demonstrates substantial improvements in both standard and open-scene retrieval protocols (e.g., rank-1 gains over ChatReID and mAP gains over MACA in the open-scene setting). The system’s key innovations are its cross-scene generalization, semantic fusion for ambiguous feedback, and hierarchical precision-recall optimization. Limitations include heavy backbone requirements and the absence of noise-injected interactive benchmarks (Luo et al., 20 Sep 2025).

4. FitPro Extensions: Vision-Language Exercise Coaching Systems

Building on FormCoach (Zuo et al., 10 Aug 2025), FitPro also describes a modern at-home AI training system leveraging vision-language models (VLMs) for human-form correction:

The system pipeline includes real-time RGB capture, per-frame pose estimation (OpenPose/BlazePose), multimodal feature extraction (CNN+MLP), Transformer-based temporal fusion, and dual-encoder VLM inference.
Feedback generation uses a hybrid of CLIP-style encoders and fine-tuned LLM heads, outputting concise, imperative-form corrections (“Push your hips back and keep knees aligned over toes”).
Datasets comprise 1,700 expert-annotated user–reference video pairs over 22 strength and mobility exercises, annotated with action-oriented imperatives.
Rigorous evaluation combines automatic rubric-based metrics (precision, recall, actionability, hallucination) with human-like assessment (GPT-4.1 in zero-temperature mode).
The architecture optimizes for privacy (on-device pose estimation), sub-500ms latency, server-offload of VLM inference, and user-facing dashboards highlighting error trends and personalization cycles.
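The precision and recall components of the rubric-based evaluation can be illustrated with set overlap between predicted and reference corrections. The label names, the exact-match criterion, and the function name are assumptions for the example; the evaluation described above relies on an LLM judge rather than string matching.

```python
def rubric_precision_recall(predicted, reference):
    """Compare predicted correction labels against expert reference labels.

    predicted, reference: iterables of hypothetical correction identifiers.
    Returns (precision, recall); each is 0.0 when its denominator is empty.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall

# Two of three predicted corrections match; two reference items are missed.
precision, recall = rubric_precision_recall(
    ["hips_back", "knees_over_toes", "chin_up"],
    ["hips_back", "knees_over_toes", "brace_core", "heels_down"],
)
```

In this toy case the unmatched prediction (`chin_up`) lowers precision and the two missed reference items lower recall, mirroring how spurious and omitted feedback would be penalized.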

This instantiation re-implements and extends the FormCoach pipeline with FitPro branding for scalable, real-time corrective feedback and session analytics (Zuo et al., 10 Aug 2025).

5. Comparative Table of Major FitPro Systems

Domain	FitPro Function	Core Technical Contribution	Reference
Exercise Correction	Real-time pose evaluation (plank/squat)	Multi-stage CNN + distance-based error analysis	(Chen et al., 2019)
LLM Fine-Tuning	SFT with probability masking	Mask low-prob tokens to prevent overfitting	(Liu et al., 14 Jan 2026)
Pedestrian Retrieval	Zero-shot cross-modal TPR	FCD + ISM + QHR for open-scene, interactive TPR	(Luo et al., 20 Sep 2025)
At-home Form Coaching	VLM-powered feedback for exercises	Dual-encoder VLMs, real-time UI, expert datasets	(Zuo et al., 10 Aug 2025)

6. Limitations and Future Directions

Each FitPro incarnation is context-constrained and optimized for its task:

Exercise Correction: Error cases outside plank/squat or with atypical body morphology are not addressed. Camera-view correction is only partial. Extending to diverse exercise types mandates new pose error models and databases (Chen et al., 2019).
LLM Fine-Tuning: The static probability threshold is tuned for logic-intensive tasks and may be suboptimal for open-ended generative tasks. An adaptive scheduler or multi-reference extension is an open avenue (Liu et al., 14 Jan 2026).
Pedestrian Retrieval: Current benchmarks lack realistic noisy feedback and real-time model compression remains underexplored. Modeling human-in-the-loop dialog noise is a recommended trajectory (Luo et al., 20 Sep 2025).
Vision-Language Coaching: Human-level feedback remains a gap. Actionability and hallucination in feedback require continual evaluation by standardized rubrics and human raters (Zuo et al., 10 Aug 2025).

A plausible implication is that “FitPro” as a label frequently connotes functional, modular AI frameworks that exploit recent advances in perception, cross-modal fusion, or token-level statistical modeling to deliver feedback, retrieval, or supervision capabilities under realistic or open-world constraints.