LSP-YOLO: Lightweight Posture Recognition

Updated 24 November 2025
  • The paper introduces a novel single-stage architecture that unifies keypoint estimation and posture classification for efficient real-time performance.
  • It leverages a compact Light-C3k2 backbone with partial convolution and parameter-free SimAM attention to substantially reduce computational cost.
  • The model achieves state-of-the-art accuracy on desktop setups while maintaining practical deployability on low-power embedded devices.

LSP-YOLO is a lightweight, single-stage convolutional neural network architecture designed specifically for efficient sitting posture recognition on embedded edge devices. Developed as an end-to-end solution that unifies keypoint estimation and posture classification, LSP-YOLO targets real-time applications with severe constraints on computational resources, such as smart classrooms, rehabilitation platforms, and human–computer interfaces. Its core innovations include the introduction of a compact Light-C3k2 backbone featuring partial convolution and parameter-free SimAM attention, as well as a direct, pointwise keypoint-to-class mapping in the recognition head. LSP-YOLO achieves state-of-the-art classification accuracy, extremely high throughput on desktop hardware, and practical deployability on low-power processors (Li et al., 18 Nov 2025).

1. Model Architecture and Design Principles

LSP-YOLO builds on the backbone–neck–head paradigm established in YOLOv11-Pose, but with explicit fusion of pose estimation and posture classification in a single forward pass. The model structure is as follows:

  • Backbone: A stack of convolutional and Light-C3k2 modules replaces the conventional C3k2 blocks. The Spatial Pyramid Pooling Fast (SPPF) module enhances the receptive field for robust context capture.
  • Neck: A PANet-style multi-scale fusion merges shallow spatial with deep semantic features across three scales, enabling the network to capture multi-level information crucial for keypoint localization and pose inference.
  • Recognition Head (LSP-Head): For each output grid cell, the head jointly predicts confidence, regresses 11 upper-body keypoints, and classifies posture via a 1×1 convolution mapping the keypoint vector to six class logits, followed by a softmax.

Module Pipeline (Simplified)

Input → [Conv+Light-C3k2]×n → SPPF → [Light-C3k2 + up/down-sampling fusion] (neck) → LSP-Head
      → {confidence, keypoints, class-scores}

This single-stage approach eliminates the need for separate pose estimation pipelines, reducing both memory footprint and inference latency (Li et al., 18 Nov 2025).
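The single forward pass above can be sketched by splitting one head output tensor into its three prediction groups per grid cell. This is an illustrative reconstruction, not the authors' code; the channel layout (1 confidence channel, 11 keypoints with x/y offsets, 6 class logits) is an assumption for the sketch, not the exact tensor format.

```python
import numpy as np

# Hypothetical channel layout for the sketch: 1 conf + 11 kpts x (x, y) + 6 classes.
N_KPT, N_CLS = 11, 6
C = 1 + 2 * N_KPT + N_CLS  # 29 channels per grid cell

def split_head_output(raw):
    """raw: (C, H, W) feature map from the recognition head."""
    conf = 1 / (1 + np.exp(-raw[:1]))            # objectness via sigmoid
    kpts = raw[1:1 + 2 * N_KPT]                  # raw keypoint regressions
    logits = raw[1 + 2 * N_KPT:]                 # six posture class logits
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)     # softmax over the 6 classes
    return conf, kpts, probs

conf, kpts, probs = split_head_output(np.random.randn(C, 8, 8))
```

Because all three outputs come from one tensor, no separate pose-estimation pipeline (and no second network invocation) is needed at inference time.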

2. Light-C3k2 Block: Partial Convolution and SimAM

The core backbone module, Light-C3k2, merges efficiency-centric and attention mechanisms:

  • Partial Convolution (PConv): Rather than convolving all c channels, PConv applies standard convolution only to a fraction r of them (set to 0.5), passing the remaining channels through an identity. The relative computational saving for a k×k kernel is

\text{Relative FLOP reduction} = 1 - r^2

yielding a 75% reduction per 3×3 convolution when r = 0.5.

  • SimAM (Simple, Parameter-Free Attention Module): A parameter-free attention mechanism that computes an importance energy for each neuron t by minimizing

E(w_t, b_t) = \frac{1}{M-1} \sum_{i\neq t} [-1 - (w_t x_i + b_t)]^2 + [1 - (w_t t + b_t)]^2 + \lambda w_t^2

with attention score

a_t = \mathrm{sigmoid}(1/e_t^*)

acting channel-wise and location-wise.

  • Block Composition: Two bottleneck units (the "k2"), each beginning with a PConv followed by 1×1 convolutions for channel fusion; SimAM is applied after each bottleneck, and a residual connection spans both.

Light-C3k2 controls feature dimension via a width multiplier α ∈ {0.25, 0.5, 0.75, 1.0}, maintaining representation power while substantially reducing GFLOPs and memory requirements (Li et al., 18 Nov 2025).
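Both mechanisms can be sketched in a few lines of NumPy. This is a hedged reconstruction, not the authors' implementation: the inverse-energy expression follows the published closed-form minimizer of the SimAM energy above, with λ the same regularizer.

```python
import numpy as np

def pconv_flop_reduction(r: float) -> float:
    """Relative FLOP saving of PConv: only a fraction r of channels is
    convolved, so cost scales with r^2 (input and output widths both shrink)."""
    return 1 - r ** 2

def simam(x: np.ndarray, lam: float = 1e-4) -> np.ndarray:
    """Parameter-free SimAM attention on a (C, H, W) feature map.

    Uses the closed-form minimizer of E(w_t, b_t): per neuron, the inverse
    energy reduces to (t - mu)^2 / (4 * (var + lam)) + 0.5, and each
    activation is rescaled by a sigmoid of that score.
    """
    n = x.shape[1] * x.shape[2] - 1                  # M - 1 neurons per channel
    mu = x.mean(axis=(1, 2), keepdims=True)
    sq_dev = (x - mu) ** 2
    var = sq_dev.sum(axis=(1, 2), keepdims=True) / n
    inv_energy = sq_dev / (4 * (var + lam)) + 0.5
    return x * (1 / (1 + np.exp(-inv_energy)))       # sigmoid(1 / e_t*)

print(pconv_flop_reduction(0.5))  # 0.75 -> the 75% saving quoted above
```

Note that SimAM adds no learnable parameters, so the attention comes essentially for free on top of the PConv savings.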

3. Recognition Head and Losses

The LSP-Head handles all prediction targets per output grid cell:

  • Keypoints-to-Class Mapping: The estimated keypoint vector K̂ ∈ ℝ^D is transformed by a 1×1 convolution into six posture scores s_i, normalized by a softmax:

S = \mathrm{Conv}_{1\times1}(\hat K), \quad \hat p_i = \frac{e^{s_i}}{\sum_{j=1}^6 e^{s_j}}

  • Intermediate Supervision: Keypoint accuracy is enforced with an Object Keypoint Similarity (OKS) loss prior to class mapping, ensuring features support both regression and classification.
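A 1×1 convolution over a single D-dimensional keypoint vector is equivalent to one linear layer followed by a softmax, which the minimal sketch below makes explicit. The weights W and b are random placeholders, not learned parameters, and the dimensions are assumptions (D = 22 from 11 keypoints × two coordinates).

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_CLS = 22, 6  # assumed: 11 keypoints x (x, y); six posture classes
W = rng.standard_normal((N_CLS, D))  # placeholder for the learned 1x1 conv kernel
b = rng.standard_normal(N_CLS)       # placeholder bias

def classify_from_keypoints(k_hat):
    """Map a keypoint vector to posture-class probabilities: S = W @ K_hat + b,
    then a numerically stable softmax over the six scores."""
    s = W @ k_hat + b
    e = np.exp(s - s.max())
    return e / e.sum()

p = classify_from_keypoints(rng.standard_normal(D))
```

This pointwise mapping is what lets the classifier reuse the pose features directly, rather than running a second classification network on cropped images.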

Loss Terms

| Term | Loss Function | Purpose |
|---|---|---|
| Confidence | L_{\mathrm{conf}} = \sum_{n=1}^{N}\mathrm{BCE}(p_{\mathrm{conf}}^n, t_{\mathrm{conf}}^n) | Object presence |
| Keypoint | L_{\mathrm{oks}} = 1 - \frac{\sum_i \exp(-d_i^2/(2 s^2 k_i^2))\,\delta(v_i>0)}{\sum_i \delta(v_i>0)} | Keypoint accuracy |
| Classification | L_{\mathrm{cls}} = \sum_{n=1}^{N}\mathrm{BCE}(p_{\mathrm{cls}}^n, t_{\mathrm{cls}}^n) | Posture class accuracy |

The aggregated loss is

L_{\mathrm{sum}} = \alpha L_{\mathrm{conf}} + \beta L_{\mathrm{oks}} + \gamma L_{\mathrm{cls}}

with α = 1, β = 12, γ = 4 (Li et al., 18 Nov 2025).
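The aggregated loss can be sketched as follows; the OKS term implements the formula in the table above and the BCE terms are standard binary cross-entropy. All numeric inputs are toy values chosen for illustration, not real model predictions.

```python
import numpy as np

def oks_loss(d, k, s, vis):
    """OKS loss: d = per-keypoint distances, k = per-keypoint constants,
    s = object scale, vis = visibility flags (only visible keypoints count)."""
    sim = np.exp(-d ** 2 / (2 * s ** 2 * k ** 2))
    m = vis > 0
    return 1 - sim[m].sum() / m.sum()

def bce(p, t, eps=1e-7):
    """Binary cross-entropy, summed over elements."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).sum()

alpha, beta, gamma = 1.0, 12.0, 4.0  # loss weights from the paper
d = np.array([0.1, 0.3])             # toy keypoint distances
k = np.array([0.05, 0.05])           # toy per-keypoint constants
L = (alpha * bce(np.array([0.9]), np.array([1.0]))
     + beta * oks_loss(d, k, s=1.0, vis=np.ones(2))
     + gamma * bce(np.array([0.8, 0.1]), np.array([1.0, 0.0])))
```

The large β reflects that accurate keypoints are the prerequisite for the pointwise class mapping, so the keypoint term dominates training.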

4. Dataset Construction and Augmentation

LSP-YOLO was trained and validated on a dedicated posture dataset with the following properties:

  • Images: 5,000, annotated for six upper-body posture classes: Correct, LeanLeft, LeanRight, ChinSupport, OnDesk, HeadDown.
  • Annotations: Each sample labeled with a bounding box (upper body), class, and 11 keypoints.
  • Partitioning: 70% training, 15% validation, 15% testing.
  • Augmentations: Random scaling, horizontal shifts, HSV jitter, and random horizontal flips to bolster generalization (Li et al., 18 Nov 2025).
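For concreteness, the 70/15/15 partition of the 5,000 images works out to the split sizes below (a trivial arithmetic check, not from the paper's text).

```python
# Per-split image counts implied by the 70/15/15 partition of 5,000 images.
total = 5000
train = round(total * 0.70)
val = round(total * 0.15)
test = total - train - val
print(train, val, test)  # 3500 750 750
```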

5. Training Process and Inference Results

  • Training: Conducted for 300 epochs with batch size 32, learning rate annealed from 0.01 to 1e-4, image size 640×640, on an AMD EPYC 7742 with dual RTX 3090 GPUs.
  • Model Variants: The smallest, LSP-YOLO-n (α = 0.25, depth multiplier 0.33), contains 1.9 M parameters and requires 4.2 GFLOPs per inference.
  • Performance on PC: Achieves 251 fps and 94.2% precision with a model size of 1.9 MB.
  • Embedded Deployment: On the SV830C + GC030A platform (0.5 TOPS, 64 MB RAM, 640×480@30 fps camera), the 8-bit quantized model yields:
    • Preprocessing latency: 115 ms
    • Inference latency: 255 ms (≈4 fps)
    • Memory footprint: 22 MB
    • Accuracy: 91.7%
    • Model size: 2.2 MB
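A quick consistency check of the embedded figures: the 255 ms inference latency alone corresponds to just under 4 fps, matching the throughput quoted above (this ignores the 115 ms preprocessing stage, whose scheduling relative to inference the summary does not specify).

```python
# Frames per second implied by the quoted 255 ms embedded inference latency.
inference_ms = 255
fps = 1000 / inference_ms
print(round(fps, 2))  # 3.92
```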

LSP-YOLO thus demonstrates both real-time throughput on desktop and practical, memory-constrained inference on edge accelerators (Li et al., 18 Nov 2025).

6. Computational Efficiency and Deployment Considerations

The model’s efficiency arises from design innovations:

  • GFLOPs Reduction: The combination of PConv (reducing the dominant convolutional cost to 25%) and SimAM (no extra parameters) leads to a total GFLOP reduction of approximately 15–20% compared to the baseline, with greater than 96% retention of classification precision.
  • Edge Compatibility: With model sizes near 2 MB, LSP-YOLO fits comfortably within the flash and RAM budgets of microcontroller- to mid-range systems.

Even the smallest variant consistently delivers over 250 fps on high-end GPUs and approximately 4 fps on low-power hardware, validating its suitability for embedded applications (Li et al., 18 Nov 2025).

7. Applications, Limitations, and Directions for Further Research

Use Cases:

  • Multi-student posture monitoring in smart classrooms
  • Remote rehabilitation and posture correction feedback systems
  • Human–computer interfaces leveraging posture-based control signals

Identified Limitations:

  • Lower limb occlusion constrains reliable full-body posture estimation; expansion to 3D or multi-view sensing is a prospective solution.
  • The current design processes single frames; temporal fusion with video streams could augment robustness and consistency.
  • Scene-level multi-person counting is not addressed; future research can focus on dynamic keypoint grouping.
  • Incorporation of self-supervised pretraining on large, unlabeled posture datasets could improve robustness to real-world variation.

LSP-YOLO thus provides a high-efficiency, deployable baseline for posture recognition research and applications, with multiple avenues for extension in both accuracy and scope (Li et al., 18 Nov 2025).
