Unsupervised Spatial-Temporal Feature Enrichment and Fidelity Preservation Network for Skeleton based Action Recognition (2401.14034v1)

Published 25 Jan 2024 in cs.CV

Abstract: Unsupervised skeleton-based action recognition has made remarkable progress recently. However, existing unsupervised learning methods suffer from a severe overfitting problem, so small networks are used, which significantly reduces representation capability. To address this, the mechanism behind overfitting in unsupervised learning for skeleton-based action recognition is first investigated. It is observed that the skeleton is already a relatively high-level, low-dimensional feature, but it does not lie on the same manifold as the features needed for action recognition. Simply applying an existing unsupervised learning method therefore tends to produce features that discriminate between individual samples rather than between action classes, which results in overfitting. To solve this problem, this paper presents an Unsupervised spatial-temporal Feature Enrichment and Fidelity Preservation framework (U-FEFP) that generates rich distributed features containing all the information of the skeleton sequence. A spatial-temporal feature transformation subnetwork is developed, using a spatial-temporal graph convolutional network and a graph convolutional gated recurrent unit network as the basic feature extraction network. Unsupervised learning based on Bootstrap Your Own Latent (BYOL) is used to generate rich distributed features, and unsupervised pretext-task-based learning is used to preserve the information of the skeleton sequence. The two unsupervised learning schemes are combined in U-FEFP to produce robust and discriminative representations. Experimental results on three widely used benchmarks, namely NTU-RGB+D-60, NTU-RGB+D-120 and PKU-MMD, demonstrate that the proposed U-FEFP achieves the best performance compared with state-of-the-art unsupervised learning methods. t-SNE illustrations further validate that U-FEFP learns more discriminative features for unsupervised skeleton-based action recognition.
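
To make the combination of the two objectives in the abstract more concrete, the following is a minimal NumPy sketch of how a BYOL-style feature-enrichment loss might be added to a pretext-task loss. It is not the authors' implementation. The abstract does not specify the pretext task, so a mean-squared reconstruction of the input sequence is assumed here as a stand-in. The function names, the `weight` parameter, and the momentum value `tau` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Normalize each row to unit length (eps avoids division by zero).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(online_pred, target_proj):
    # BYOL regression loss: 2 - 2 * cosine similarity, averaged over the batch.
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))

def reconstruction_loss(reconstructed, original):
    # Assumed pretext task: mean-squared error between the reconstructed
    # and the original skeleton sequence (a stand-in, not from the paper).
    return float(np.mean((reconstructed - original) ** 2))

def ema_update(target_params, online_params, tau=0.99):
    # BYOL target network: exponential moving average of the online weights.
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

def combined_loss(online_pred, target_proj, reconstructed, original, weight=1.0):
    # Hypothetical joint objective: enrichment term plus weighted fidelity term.
    return (byol_loss(online_pred, target_proj)
            + weight * reconstruction_loss(reconstructed, original))
```

In this sketch the BYOL term pulls the online prediction toward the target projection of another augmented view, while the reconstruction term penalizes features that discard information about the input sequence. How the two terms are actually weighted and scheduled in U-FEFP is not stated in the abstract.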

Authors (6)
  1. Chuankun Li (6 papers)
  2. Shuai Li (295 papers)
  3. Yanbo Gao (10 papers)
  4. Ping Chen (123 papers)
  5. Jian Li (667 papers)
  6. Wanqing Li (53 papers)
