3D human pose estimation with spatio-temporal criss-cross attention
Recent transformer-based solutions have shown great success in 3D human pose
estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost …
Video-FocalNets: Spatio-temporal focal modulation for video action recognition
Recent video recognition models utilize Transformer models for long-range spatio-temporal
context modeling. Video transformer designs are based on self-attention that can model …
GSRFormer: Grounded situation recognition transformer with alternate semantic attention refinement
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of
images for "human-like" event understanding. Specifically, GSR task not only detects the …
AGPN: Action granularity pyramid network for video action recognition
Video action recognition is a fundamental task for video understanding. Action recognition in
complex spatio-temporal contexts generally requires fusing of different multi-granularity …
Emotion-prior awareness network for emotional video captioning
Emotional video captioning (EVC) is an emerging task to describe the factual content with
the inherent emotion expressed in a video. It is crucial for the EVC task to effectively …
In the eye of transformer: Global–local correlation for egocentric gaze estimation and beyond
Predicting human's gaze from egocentric videos serves as a critical role for human intention
understanding in daily activities. In this paper, we present the first transformer-based model …
Real-time semantic segmentation with parallel multiple views feature augmentation
Real-time semantic segmentation is essential for many practical applications, which utilizes
attention-based feature aggregation into lightweight structures to improve accuracy and …
In the eye of transformer: Global-local correlation for egocentric gaze estimation
In this paper, we present the first transformer-based model to address the challenging
problem of egocentric gaze estimation. We observe that the connection between the global …
DFIL: Deepfake incremental learning by exploiting domain-invariant forgery clues
The malicious use and widespread dissemination of deepfake pose a significant crisis of
trust. Current deepfake detection models can generally recognize forgery images by training …
FTCM: Frequency-temporal collaborative module for efficient 3D human pose estimation in video
Capturing cross-pose correlation from a sequence of frame-level 2D poses is essential for
3D human pose estimation (3D-HPE) in the video. Recent studies have shown the …