Vidchapters-7m: Video chapters at scale

A Yang, A Nagrani, I Laptev, J Sivic… - Advances in Neural …, 2024 - proceedings.neurips.cc
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …

Movienet: A holistic dataset for movie understanding

Q Huang, Y Xiong, A Rao, J Wang, D Lin - Computer Vision–ECCV 2020 …, 2020 - Springer
Recent years have seen remarkable advances in visual understanding. However, how to
understand a story-based long video with artistic styles, e.g., movie, remains challenging. In …

Efficient movie scene detection using state-space transformers

MM Islam, M Hasan, KS Athrey… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to distinguish between different movie scenes is critical for understanding the
storyline of a movie. However, accurately detecting movie scenes is often challenging as it …

Computational media intelligence: Human-centered machine analysis of media

K Somandepalli, T Guha, VR Martinez… - Proceedings of the …, 2021 - ieeexplore.ieee.org
Media is created by humans for humans to tell stories. There exists a natural and imminent
need for creating human-centered media analytics to illuminate the stories being told and to …

Cross-modal consensus network for weakly supervised temporal action localization

FT Hong, JC Feng, D Xu, Y Shan… - Proceedings of the 29th …, 2021 - dl.acm.org
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to
localize action instances in the given video with video-level categorical supervision …

Category-level 6d object pose estimation via cascaded relation and recurrent reconstruction networks

J Wang, K Chen, Q Dou - 2021 IEEE/RSJ International …, 2021 - ieeexplore.ieee.org
Category-level 6D pose estimation, aiming to predict the location and orientation of unseen
object instances, is fundamental to many scenarios such as robotic manipulation and …

Shot contrastive self-supervised learning for scene boundary detection

S Chen, X Nie, D Fan, D Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Scenes play a crucial role in breaking the storyline of movies and TV episodes into
semantically cohesive parts. However, given their complex temporal structure, finding scene …

Sep-stereo: Visually guided stereophonic audio generation by associating source separation

H Zhou, X Xu, D Lin, X Wang, Z Liu - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
Stereophonic audio is an indispensable ingredient to enhance human auditory experience.
Recent research has explored the usage of visual information as guidance to generate …

Towards global video scene segmentation with context-aware transformer

Y Yang, Y Huang, W Guo, B Xu, D Xia - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Videos such as movies or TV episodes usually need to divide the long storyline into
cohesive units, i.e., scenes, to facilitate the understanding of video semantics. The key …

Condensed movies: Story based retrieval with contextual embeddings

M Bain, A Nagrani, A Brown… - Proceedings of the …, 2020 - openaccess.thecvf.com
Our objective in this work is the long range understanding of the narrative structure of
movies. Instead of considering the entire movie, we propose to learn from the 'key scenes' of …