3D human pose estimation with spatio-temporal criss-cross attention

Z Tang, Z Qiu, Y Hao, R Hong… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recent transformer-based solutions have shown great success in 3D human pose
estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost …

Video-FocalNets: Spatio-temporal focal modulation for video action recognition

ST Wasim, MU Khattak, M Naseer… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent video recognition models utilize Transformer models for long-range spatio-temporal
context modeling. Video transformer designs are based on self-attention that can model …

GSRFormer: Grounded situation recognition transformer with alternate semantic attention refinement

ZQ Cheng, Q Dai, S Li, T Mitamura… - Proceedings of the 30th …, 2022 - dl.acm.org
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of
images for "human-like" event understanding. Specifically, GSR task not only detects the …

AGPN: Action granularity pyramid network for video action recognition

Y Chen, H Ge, Y Liu, X Cai… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Video action recognition is a fundamental task for video understanding. Action recognition in
complex spatio-temporal contexts generally requires fusing of different multi-granularity …

Emotion-prior awareness network for emotional video captioning

P Song, D Guo, X Yang, S Tang, E Yang… - Proceedings of the 31st …, 2023 - dl.acm.org
Emotional video captioning (EVC) is an emerging task to describe the factual content with
the inherent emotion expressed in a video. It is crucial for the EVC task to effectively …

In the eye of transformer: Global–local correlation for egocentric gaze estimation and beyond

B Lai, M Liu, F Ryan, JM Rehg - International Journal of Computer Vision, 2024 - Springer
Predicting human's gaze from egocentric videos serves as a critical role for human intention
understanding in daily activities. In this paper, we present the first transformer-based model …

Real-time semantic segmentation with parallel multiple views feature augmentation

JJ Qiao, ZQ Cheng, X Wu, W Li, J Zhang - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Real-time semantic segmentation is essential for many practical applications, which utilizes
attention-based feature aggregation into lightweight structures to improve accuracy and …

In the eye of transformer: Global-local correlation for egocentric gaze estimation

B Lai, M Liu, F Ryan, JM Rehg - arXiv preprint arXiv:2208.04464, 2022 - arxiv.org
In this paper, we present the first transformer-based model to address the challenging
problem of egocentric gaze estimation. We observe that the connection between the global …

DFIL: Deepfake incremental learning by exploiting domain-invariant forgery clues

K Pan, Y Yin, Y Wei, F Lin, Z Ba, Z Liu, Z Wang… - Proceedings of the 31st …, 2023 - dl.acm.org
The malicious use and widespread dissemination of deepfake pose a significant crisis of
trust. Current deepfake detection models can generally recognize forgery images by training …

FTCM: Frequency-temporal collaborative module for efficient 3D human pose estimation in video

Z Tang, Y Hao, J Li, R Hong - … on Circuits and Systems for Video …, 2023 - ieeexplore.ieee.org
Capturing cross-pose correlation from a sequence of frame-level 2D poses is essential for
3D human pose estimation (3D-HPE) in the video. Recent studies have shown the …