LSP-YOLO: Lightweight Posture Recognition
- The paper introduces a novel single-stage architecture that unifies keypoint estimation and posture classification for efficient real-time performance.
- It leverages a compact Light-C3k2 backbone with partial convolution and parameter-free SimAM attention to substantially reduce computational cost.
- The model achieves state-of-the-art accuracy on desktop setups while maintaining practical deployability on low-power embedded devices.
LSP-YOLO is a lightweight, single-stage convolutional neural network architecture designed specifically for efficient sitting posture recognition on embedded edge devices. Developed as an end-to-end solution that unifies keypoint estimation and posture classification, LSP-YOLO targets real-time applications with severe constraints on computational resources, such as smart classrooms, rehabilitation platforms, and human–computer interfaces. Its core innovations include the introduction of a compact Light-C3k2 backbone featuring partial convolution and parameter-free SimAM attention, as well as a direct, pointwise keypoint-to-class mapping in the recognition head. LSP-YOLO achieves state-of-the-art classification accuracy, extremely high throughput on desktop hardware, and practical deployability on low-power processors (Li et al., 18 Nov 2025).
1. Model Architecture and Design Principles
LSP-YOLO builds on the backbone–neck–head paradigm established in YOLOv11-Pose, but with explicit fusion of pose estimation and posture classification in a single forward pass. The model structure is as follows:
- Backbone: A stack of convolutional and Light-C3k2 modules replaces the conventional C3k2 blocks. The Spatial Pyramid Pooling Fast (SPPF) module enhances the receptive field for robust context capture.
- Neck: A PANet-style multi-scale fusion merges shallow spatial with deep semantic features across three scales, enabling the network to capture multi-level information crucial for keypoint localization and pose inference.
- Recognition Head (LSP-Head): For each output grid cell, the head jointly predicts confidence, regresses 11 upper-body keypoints, and classifies posture via a 1×1 convolution mapping the keypoint vector to six class logits, followed by a softmax.
Module Pipeline (Simplified)
```
Input → [Conv + Light-C3k2]×n → SPPF
      → [Light-C3k2 + up/down-sampling fusion] (neck)
      → LSP-Head → {confidence, keypoints, class scores}
```
This single-stage approach eliminates the need for separate pose estimation pipelines, reducing both memory footprint and inference latency (Li et al., 18 Nov 2025).
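To make the unified head concrete, the sketch below slices one grid cell's raw prediction vector into the three fields the LSP-Head emits. The per-keypoint (x, y, visibility) encoding and the resulting 40-channel layout are illustrative assumptions, not details confirmed by the paper:

```python
import numpy as np

# Hypothetical per-cell output layout for a unified pose + classification head:
# 1 confidence + 11 keypoints × (x, y, visibility) + 6 class logits = 40 values.
N_KPT, N_CLS = 11, 6

def split_cell_output(raw):
    """Slice one grid cell's raw prediction vector into its three fields."""
    conf = raw[0]
    kpts = raw[1:1 + 3 * N_KPT].reshape(N_KPT, 3)   # (x, y, visibility) rows
    logits = raw[1 + 3 * N_KPT:]
    return conf, kpts, logits

raw = np.arange(1 + 3 * N_KPT + N_CLS, dtype=float)  # dummy 40-value cell output
conf, kpts, logits = split_cell_output(raw)
print(conf, kpts.shape, logits.shape)  # 0.0 (11, 3) (6,)
```

Because all three fields come from one tensor, a single forward pass suffices; there is no hand-off to a separate classifier.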
2. Light-C3k2 Block: Partial Convolution and SimAM
The core backbone module, Light-C3k2, combines an efficiency-oriented convolution scheme with parameter-free attention:
- Partial Convolution (PConv): Rather than convolving all channels, PConv applies standard convolution only to a fraction $r$ of the channels (set to $r = 0.5$), passing the remaining channels through unchanged. For a $k \times k$ kernel, the cost relative to a full convolution is

  $$\frac{\mathrm{FLOPs}_{\mathrm{PConv}}}{\mathrm{FLOPs}_{\mathrm{Conv}}} = \left(\frac{c_p}{c}\right)^{2} = r^{2},$$

  yielding a 75% reduction per 3×3 conv when $r = 1/2$.
- Similarity-Aware Activation Module (SimAM): A parameter-free attention mechanism, SimAM computes an importance energy for each neuron $t$ by minimizing an energy function whose closed-form solution is

  $$e_t^{*} = \frac{4\left(\hat{\sigma}^{2} + \lambda\right)}{\left(t - \hat{\mu}\right)^{2} + 2\hat{\sigma}^{2} + 2\lambda},$$

  where $\hat{\mu}$ and $\hat{\sigma}^{2}$ are the mean and variance of the other neurons in the same channel and $\lambda$ is a regularization constant, with attention score

  $$\tilde{X} = \operatorname{sigmoid}\!\left(\frac{1}{E}\right) \odot X,$$

  acting channel-wise and location-wise.
- Block Composition: Two Bottlenecks (k2), each starting with a PConv, followed by 1×1 convs for fusion, SimAM after each Bottleneck, and a residual connection encompassing both.
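The PConv cost ratio above can be checked numerically. The sketch below counts multiply-accumulate FLOPs for a square convolution under the standard $h \cdot w \cdot k^2 \cdot c_{\mathrm{in}} \cdot c_{\mathrm{out}}$ cost model; the feature-map dimensions are arbitrary example values:

```python
# PConv convolves only a fraction r of the c channels (and produces the same
# fraction of output channels), scaling the k×k convolution's FLOPs by r².
def conv_flops(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

h, w, k, c = 80, 80, 3, 256
full = conv_flops(h, w, k, c, c)
partial = conv_flops(h, w, k, c // 2, c // 2)  # PConv touches c/2 channels
print(1 - partial / full)  # 0.75
```

With $r = 0.5$ the ratio is $r^2 = 0.25$, i.e. the 75% per-layer reduction cited above.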
Light-C3k2 controls feature dimension via a width multiplier, maintaining representational power while substantially reducing GFLOPs and memory requirements (Li et al., 18 Nov 2025).
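As a concrete illustration of the parameter-free SimAM attention described above, the NumPy sketch below follows the published closed-form energy (the $\lambda$ value and feature-map shape are example choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simam(x, lam=1e-4):
    """SimAM attention over a (C, H, W) feature map: neurons that deviate
    most from their channel mean receive the highest inverse energy, and
    the feature map is reweighted by sigmoid of that inverse energy."""
    n = x.shape[1] * x.shape[2] - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    v = d.sum(axis=(1, 2), keepdims=True) / n      # per-channel variance
    e_inv = d / (4.0 * (v + lam)) + 0.5            # inverse energy per neuron
    return x * sigmoid(e_inv)

x = np.random.default_rng(0).normal(size=(8, 16, 16))
y = simam(x)
print(y.shape)  # (8, 16, 16)
```

Note that no learnable parameters are introduced, which is exactly why SimAM adds attention at near-zero cost in the Light-C3k2 block.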
3. Recognition Head and Losses
The LSP-Head handles all prediction targets per output grid cell:
- Keypoints-to-Class Mapping: The estimated keypoint vector is transformed by a 1×1 conv to obtain six posture logits $z_1, \dots, z_6$, with softmax normalization

  $$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{6} \exp(z_j)}.$$
- Intermediate Supervision: Keypoint accuracy is enforced with an Object Keypoint Similarity (OKS) loss prior to class mapping, ensuring features support both regression and classification.
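Because a 1×1 convolution applied at a single grid cell is just a linear map, the keypoint-to-class mapping can be sketched with a plain matrix product followed by a softmax. The weight shapes and the 33-dimensional keypoint encoding below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 33                       # e.g. 11 keypoints × (x, y, visibility)
W = rng.normal(size=(6, d))  # 1×1 conv kernel, viewed as a 6×d matrix
b = np.zeros(6)

def classify(kpt_vec):
    """Map a keypoint vector to six posture probabilities."""
    z = W @ kpt_vec + b              # six posture logits
    z = z - z.max()                  # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return p

p = classify(rng.normal(size=d))
print(round(p.sum(), 6))  # 1.0
```

The intermediate OKS supervision keeps the keypoint vector geometrically meaningful, so this small linear head is enough to separate the six postures.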
Loss Terms
| Term | Loss Function | Purpose |
|---|---|---|
| Confidence | $\mathcal{L}_{\mathrm{conf}}$ | Object presence |
| Keypoint | $\mathcal{L}_{\mathrm{kpt}}$ (OKS-based) | Keypoint accuracy |
| Classification | $\mathcal{L}_{\mathrm{cls}}$ | Posture class accuracy |
The aggregated loss is

$$\mathcal{L} = \lambda_{\mathrm{conf}} \mathcal{L}_{\mathrm{conf}} + \lambda_{\mathrm{kpt}} \mathcal{L}_{\mathrm{kpt}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}},$$

with fixed weighting coefficients $\lambda_{\mathrm{conf}}$, $\lambda_{\mathrm{kpt}}$, and $\lambda_{\mathrm{cls}}$ balancing the three terms (Li et al., 18 Nov 2025).
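The weighted aggregation of the three loss terms can be sketched in a few lines; the weight values below are placeholders, since the paper's actual coefficients are not reproduced here:

```python
# Weighted sum of the three training losses; the default weights are
# illustrative placeholders, not the paper's tuned values.
def total_loss(l_conf, l_kpt, l_cls, w_conf=1.0, w_kpt=1.0, w_cls=1.0):
    return w_conf * l_conf + w_kpt * l_kpt + w_cls * l_cls

print(total_loss(0.25, 0.5, 0.25))  # 1.0
```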
4. Dataset Construction and Augmentation
LSP-YOLO was trained and validated on a dedicated posture dataset with the following properties:
- Images: 5,000, annotated for six upper-body posture classes: Correct, LeanLeft, LeanRight, ChinSupport, OnDesk, HeadDown.
- Annotations: Each sample labeled with a bounding box (upper body), class, and 11 keypoints.
- Partitioning: 70% training, 15% validation, 15% testing.
- Augmentations: Random scaling, horizontal shift, HSV jitter, and random horizontal flips to bolster generalization (Li et al., 18 Nov 2025).
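A minimal sketch of the 70/15/15 partition described above (indices only; the paper's actual split files are not public here):

```python
import numpy as np

rng = np.random.default_rng(42)          # fixed seed for a reproducible split
idx = rng.permutation(5000)              # shuffle the 5,000 image indices
n_train, n_val = 5000 * 70 // 100, 5000 * 15 // 100
train = idx[:n_train]
val = idx[n_train:n_train + n_val]
test = idx[n_train + n_val:]
print(len(train), len(val), len(test))   # 3500 750 750
```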
5. Training Process and Inference Results
- Training: Conducted for 300 epochs with batch size 32, a learning rate annealed from an initial value of 0.01, and 640×640 input images, on an AMD EPYC 7742 system with dual RTX 3090 GPUs.
- Model Variants: The smallest variant, LSP-YOLO-n (depth multiplier 0.33), contains 1.9 M parameters and requires 4.2 GFLOPs per inference.
- Performance on PC: Achieves 251 fps and 94.2% precision with a model size of 1.9 MB.
- Embedded Deployment: On the SV830C + GC030A platform (0.5 TOPS, 64 MB RAM, 640×480@30 fps camera), the 8-bit quantized model yields:
- Preprocessing latency: 115 ms
- Inference latency: 255 ms (≈4 fps)
- Memory footprint: 22 MB
- Accuracy: 91.7%
- Model size: 2.2 MB
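The 8-bit deployment above relies on quantization; the sketch below shows symmetric per-tensor int8 quantization as one plausible scheme (the SV830C toolchain's exact method is not specified in the source):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a float weight tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(scale=0.1, size=1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.dtype, err < s)  # int8 True
```

Storing weights as int8 rather than float32 is what brings the on-device model size down to roughly a quarter of the floating-point footprint, at a small accuracy cost (94.2% → 91.7% here).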
LSP-YOLO thus demonstrates both real-time throughput on desktop and practical, memory-constrained inference on edge accelerators (Li et al., 18 Nov 2025).
6. Computational Efficiency and Deployment Considerations
The model’s efficiency arises from design innovations:
- GFLOPs Reduction: The combination of PConv (reducing the dominant convolutional cost to 25%) and SimAM (no extra parameters) leads to a total GFLOP reduction of approximately 15–20% compared to the baseline, with greater than 96% retention of classification precision.
- Edge Compatibility: With model sizes near 2 MB, LSP-YOLO fits comfortably within the flash and RAM budgets of microcontroller- to mid-range systems.
Even the smallest variant consistently delivers 250 fps on high-end GPUs and approximately 4 fps on low-power hardware, validating its suitability for embedded applications (Li et al., 18 Nov 2025).
7. Applications, Limitations, and Directions for Further Research
Use Cases:
- Multi-student posture monitoring in smart classrooms
- Remote rehabilitation and posture correction feedback systems
- Human–computer interfaces leveraging posture-based control signals
Identified Limitations:
- Lower limb occlusion constrains reliable full-body posture estimation; expansion to 3D or multi-view sensing is a prospective solution.
- The current design processes single frames; temporal fusion with video streams could augment robustness and consistency.
- Scene-level multi-person counting is not addressed; future research can focus on dynamic keypoint grouping.
- Incorporation of self-supervised pretraining on large, unlabeled posture datasets could improve robustness to real-world variation.
LSP-YOLO thus provides a high-efficiency, deployable baseline for posture recognition research and applications, with multiple avenues for extension in both accuracy and scope (Li et al., 18 Nov 2025).