
End-to-End Spatio-Temporal Action Localisation with Video Transformers (2304.12160v1)

Published 24 Apr 2023 in cs.CV

Abstract: The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end, purely transformer-based model that directly ingests an input video and outputs tubelets: sequences of bounding boxes with the action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on individual frames or full tubelet annotations, and in both cases it predicts coherent tubelets as the output. Moreover, our end-to-end model requires no additional pre-processing in the form of proposals, or post-processing in terms of non-maximal suppression. We perform extensive ablation experiments, and significantly advance the state-of-the-art results on four different spatio-temporal action localisation benchmarks with both sparse keyframes and full tubelet annotations.
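To make the output format concrete, here is a minimal sketch of how a tubelet, as described in the abstract (one bounding box plus action classes per frame), might be represented. The `Tubelet` class, the `(x_min, y_min, x_max, y_max)` box convention, and the `sparse_supervision_mask` helper are all hypothetical names chosen for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed box convention (not from the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class Tubelet:
    """Hypothetical container: one box and one score vector per frame."""
    boxes: List[Box]
    class_scores: List[List[float]]

    def __post_init__(self) -> None:
        # A tubelet pairs every frame's box with that frame's class scores.
        if len(self.boxes) != len(self.class_scores):
            raise ValueError("boxes and class_scores must cover the same frames")

    def labels(self) -> List[int]:
        """Index of the highest-scoring action class at each frame."""
        return [max(range(len(s)), key=s.__getitem__) for s in self.class_scores]


def sparse_supervision_mask(num_frames: int, annotated: Sequence[int]) -> List[bool]:
    """Mark which frames carry ground-truth boxes (assumed illustration of
    sparse keyframe supervision versus full tubelet annotation)."""
    keyframes = set(annotated)
    return [t in keyframes for t in range(num_frames)]
```

Under this sketch, full tubelet annotation corresponds to a mask that is `True` for every frame, while sparse keyframe supervision marks only the annotated frames; the predicted tubelet covers all frames in both cases.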

Authors (8)
  1. Alexey Gritsenko (16 papers)
  2. Xuehan Xiong (17 papers)
  3. Josip Djolonga (21 papers)
  4. Mostafa Dehghani (64 papers)
  5. Chen Sun (187 papers)
  6. Mario Lučić (51 papers)
  7. Cordelia Schmid (206 papers)
  8. Anurag Arnab (56 papers)
Citations (10)