Attention mechanisms in computer vision: A survey
Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …
ASM-Loc: Action-aware segment modeling for weakly-supervised temporal action localization
Weakly-supervised temporal action localization aims to recognize and localize action
segments in untrimmed videos given only video-level action labels for training. Without the …
Align and attend: Multimodal summarization with dual contrastive losses
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …
Chop & learn: Recognizing and generating object-state compositions
Recognizing and generating object-state compositions has been a challenging task,
especially when generalizing to unseen compositions. In this paper, we study the task of …
Towards scalable neural representation for diverse videos
Implicit neural representations (INR) have gained increasing attention in representing 3D
scenes and images, and have recently been applied to encode videos (e.g., NeRV, E-NeRV) …
OmniVid: A generative framework for universal video understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to
automatically detect objects or actions in a video and analyze their temporal evolution …
Efficient video transformers with spatial-temporal token selection
Video transformers have achieved impressive results on major video recognition
benchmarks, but they suffer from high computational cost. In this paper, we present …
MetaGait: Learning to learn an omni sample adaptive representation for gait recognition
Gait recognition, which aims at identifying individuals by their walking patterns, has recently
drawn increasing research attention. However, gait recognition still suffers from the conflicts …
Improving RGB-D salient object detection via modality-aware decoder
Most existing RGB-D salient object detection (SOD) methods focus primarily on cross-modal
and cross-level saliency fusion, which has proved to be efficient and effective …
Efficient spatio-temporal modeling methods for real-time violence recognition
Violence recognition is challenging since recognition must be performed on videos acquired
by many surveillance cameras at any time or place. It should make reliable detections in …