DiffAnt: Diffusion Models for Action Anticipation (2311.15991v1)
Abstract: Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow. This uncertainty becomes even larger when predicting far into the future. However, the majority of existing action anticipation models adhere to a deterministic approach, neglecting to account for future uncertainties. In this work, we rethink action anticipation from a generative view, employing diffusion models to capture different possible future actions. In this framework, future actions are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and subsequently transitioned into the action space. Extensive experiments on four benchmark datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action anticipation. Our code and trained models will be published on GitHub.
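The generation procedure described in the abstract (start from standard Gaussian noise in a latent space, denoise iteratively while conditioning on the observed video, then map the result into the action space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a standard DDPM-style reverse process with a linear noise schedule and a denoiser that predicts the clean latent, and it decodes by choosing the nearest action embedding. The names `sample_future_actions`, `toy_denoiser`, `cond`, and `action_emb` are hypothetical, and `toy_denoiser` is only a placeholder for a trained, video-conditioned network.

```python
import numpy as np


def sample_future_actions(denoise_fn, cond, action_emb, horizon,
                          num_steps=50, seed=0):
    """Draw one sequence of future action labels by reverse diffusion.

    denoise_fn(z_t, t, cond) is assumed to return an estimate of the clean
    latent z_0; action_emb has shape (num_actions, dim).
    """
    rng = np.random.default_rng(seed)

    # Assumed linear beta schedule, as commonly used with DDPM.
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    abar_prev = np.concatenate([[1.0], abar[:-1]])

    dim = action_emb.shape[1]
    # Start from standard Gaussian noise in the latent space.
    z = rng.standard_normal((horizon, dim))

    for t in range(num_steps - 1, -1, -1):
        z0_hat = denoise_fn(z, t, cond)  # conditioned on the observation

        # Mean and variance of the Gaussian posterior q(z_{t-1} | z_t, z_0).
        coef_z0 = np.sqrt(abar_prev[t]) * betas[t] / (1.0 - abar[t])
        coef_zt = np.sqrt(alphas[t]) * (1.0 - abar_prev[t]) / (1.0 - abar[t])
        mean = coef_z0 * z0_hat + coef_zt * z
        var = betas[t] * (1.0 - abar_prev[t]) / (1.0 - abar[t])

        z = mean + np.sqrt(var) * rng.standard_normal(z.shape)

    # Transition from latent space to action space: most similar embedding.
    logits = z @ action_emb.T
    return logits.argmax(axis=1)


def toy_denoiser(z_t, t, cond):
    """Hypothetical placeholder: ignores z_t and t and returns cond.

    A trained model would instead predict the clean latent from the noisy
    latent, the step index, and features of the observed video.
    """
    return cond


# Toy setup: 4 action classes with one-hot embeddings, 3 future steps.
action_emb = np.eye(4)
cond = action_emb[[2, 0, 1]]  # stand-in for encoded observed video
actions = sample_future_actions(toy_denoiser, cond, action_emb, horizon=3)
print(actions)
```

With a trained denoiser, calling the sampler with different `seed` values would yield different plausible futures for the same observation, which is the point of the generative view. The toy denoiser above is deterministic by construction, so it always decodes to the same sequence.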
Authors:
- Zeyun Zhong
- Chengzhi Wu
- Manuel Martin
- Michael Voit
- Juergen Gall
- Jürgen Beyerer