
LAC: Latent Action Composition for Skeleton-based Action Segmentation (2308.14500v4)

Published 28 Aug 2023 in cs.CV

Abstract: Skeleton-based action segmentation requires recognizing composable actions in untrimmed videos. Current approaches decouple this problem by first extracting local visual features from skeleton sequences and then processing them by a temporal model to classify frame-wise actions. However, their performances remain limited as the visual features cannot sufficiently express composable actions. In this context, we propose Latent Action Composition (LAC), a novel self-supervised framework aiming at learning from synthesized composable motions for skeleton-based action segmentation. LAC is composed of a novel generation module towards synthesizing new sequences. Specifically, we design a linear latent space in the generator to represent primitive motion. New composed motions can be synthesized by simply performing arithmetic operations on latent representations of multiple input skeleton sequences. LAC leverages such synthesized sequences, which have large diversity and complexity, for learning visual representations of skeletons in both sequence and frame spaces via contrastive learning. The resulting visual encoder has a high expressive power and can be effectively transferred onto action segmentation tasks by end-to-end fine-tuning without the need for additional temporal models. We conduct a study focusing on transfer-learning and we show that representations learned from pre-trained LAC outperform the state-of-the-art by a large margin on TSU, Charades, PKU-MMD datasets.
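The composition idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The encoder and decoder here are a random linear map with orthonormal columns standing in for a trained generator, and the tensor sizes, the function names (`encode`, `decode`, `compose`, `info_nce`), and the InfoNCE-style loss with temperature 0.07 are all assumptions made for illustration. The abstract states only that composed motions come from arithmetic on latent codes and that representations are learned contrastively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): frames, joints, coordinates per joint, latent dimension.
T, J, C, D = 16, 17, 2, 32

# Stand-in for a trained generator: a linear map with orthonormal columns,
# so that decoding is linear in the latent code and encode(decode(z)) == z.
Q, _ = np.linalg.qr(rng.standard_normal((T * J * C, D)))


def encode(seq):
    """Map a (T, J, C) skeleton sequence to a D-dimensional latent code."""
    return Q.T @ seq.reshape(-1)


def decode(z):
    """Map a D-dimensional latent code back to a (T, J, C) sequence."""
    return (Q @ z).reshape(T, J, C)


def compose(latents, weights=None):
    """Combine several latent codes by (weighted) addition."""
    latents = np.stack(latents)
    if weights is None:
        weights = np.ones(len(latents))
    return np.asarray(weights) @ latents


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: pull query toward positive, away from negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, p, n = normalize(query), normalize(positive), normalize(negatives)
    logits = np.concatenate([[q @ p], n @ q]) / temperature
    logits -= logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))


# Two input sequences (random placeholders for real skeleton data).
seq_a = rng.standard_normal((T, J, C))
seq_b = rng.standard_normal((T, J, C))

z_a, z_b = encode(seq_a), encode(seq_b)
z_ab = compose([z_a, z_b])   # latent arithmetic: z_a + z_b
composed = decode(z_ab)      # synthesized "composed" sequence

# Contrast the composed code against a slightly perturbed copy of itself
# (positive) and against unrelated random codes (negatives).
negatives = rng.standard_normal((8, D))
loss_close = info_nce(z_ab, z_ab + 0.01 * rng.standard_normal(D), negatives)
loss_far = info_nce(z_ab, -z_ab, negatives)
```

Because the toy decoder is linear, adding latent codes is equivalent to adding the decoded sequences. In a learned generator the decoder is generally nonlinear, so this equivalence would not be expected to hold exactly.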
