Emma
Summary:
The Vid2Seq model pre-trains a visual language model for dense video captioning on a large-scale dataset of narrated videos, using transcribed speech sentences and their timestamps as supervision for event boundaries and captions. The model achieves state-of-the-art performance on several dense video captioning benchmarks and could be used in various downstream tasks, such as video retrieval and summarization.
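
For readers unfamiliar with the task, here is a minimal Python sketch of what dense video captioning output looks like: a list of timestamped event segments, each paired with a caption. The Event class, the formatting helper, and the example predictions are purely illustrative assumptions, not part of Vid2Seq's actual codebase or output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One predicted event: a temporal segment plus its caption."""
    start_s: float  # event start time in seconds
    end_s: float    # event end time in seconds
    caption: str    # natural-language description of the segment


def format_dense_captions(events: List[Event]) -> str:
    """Render dense-captioning output as human-readable, timestamped lines."""
    lines = []
    for ev in sorted(events, key=lambda e: e.start_s):
        lines.append(f"[{ev.start_s:6.1f}s - {ev.end_s:6.1f}s] {ev.caption}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical predictions for a short cooking video (made up for illustration).
    predictions = [
        Event(0.0, 12.5, "A person dices an onion on a cutting board."),
        Event(12.5, 30.0, "The onion is added to a hot pan and stirred."),
        Event(30.0, 47.2, "The finished dish is plated and garnished."),
    ]
    print(format_dense_captions(predictions))
```

This just shows the shape of the task's output; the actual model jointly predicts the segment boundaries and the caption text from the video input.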