VidChapters-7M: Video chapters at scale
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …
MovieNet: A holistic dataset for movie understanding
Recent years have seen remarkable advances in visual understanding. However, how to
understand a story-based long video with artistic styles, e.g., a movie, remains challenging. In …
Efficient movie scene detection using state-space transformers
The ability to distinguish between different movie scenes is critical for understanding the
storyline of a movie. However, accurately detecting movie scenes is often challenging as it …
Computational media intelligence: Human-centered machine analysis of media
Media is created by humans for humans to tell stories. There exists a natural and imminent
need for creating human-centered media analytics to illuminate the stories being told and to …
Cross-modal consensus network for weakly supervised temporal action localization
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to
localize action instances in the given video with video-level categorical supervision …
Category-level 6D object pose estimation via cascaded relation and recurrent reconstruction networks
Category-level 6D pose estimation, aiming to predict the location and orientation of unseen
object instances, is fundamental to many scenarios such as robotic manipulation and …
Shot contrastive self-supervised learning for scene boundary detection
Scenes play a crucial role in breaking the storyline of movies and TV episodes into
semantically cohesive parts. However, given their complex temporal structure, finding scene …
Sep-stereo: Visually guided stereophonic audio generation by associating source separation
Stereophonic audio is an indispensable ingredient to enhance human auditory experience.
Recent research has explored the usage of visual information as guidance to generate …
Towards global video scene segmentation with context-aware transformer
Videos such as movies or TV episodes usually need to divide the long storyline into
cohesive units, i.e., scenes, to facilitate the understanding of video semantics. The key …
Condensed movies: Story based retrieval with contextual embeddings
Our objective in this work is the long range understanding of the narrative structure of
movies. Instead of considering the entire movie, we propose to learn from the 'key scenes' of …