- Academic Search

3D human pose estimation with spatio-temporal criss-cross attention

Z Tang, Z Qiu, Y Hao, R Hong… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Recent transformer-based solutions have shown great success in 3D human pose
estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost …

Cited by 97 · All 5 versions

Video-FocalNets: Spatio-temporal focal modulation for video action recognition

ST Wasim, MU Khattak, M Naseer… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recent video recognition models utilize Transformer models for long-range spatio-temporal
context modeling. Video transformer designs are based on self-attention that can model …

Cited by 23 · All 7 versions

GSRFormer: Grounded situation recognition transformer with alternate semantic attention refinement

ZQ Cheng, Q Dai, S Li, T Mitamura… - Proceedings of the 30th …, 2022 - dl.acm.org

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of
images for "human-like" event understanding. Specifically, GSR task not only detects the …

Cited by 41 · All 6 versions

AGPN: Action granularity pyramid network for video action recognition

Y Chen, H Ge, Y Liu, X Cai… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Video action recognition is a fundamental task for video understanding. Action recognition in
complex spatio-temporal contexts generally requires fusing of different multi-granularity …

Cited by 35 · All 2 versions

Emotion-prior awareness network for emotional video captioning

P Song, D Guo, X Yang, S Tang, E Yang… - Proceedings of the 31st …, 2023 - dl.acm.org

Emotional video captioning (EVC) is an emerging task to describe the factual content with
the inherent emotion expressed in a video. It is crucial for the EVC task to effectively …

Cited by 15 · All 2 versions

In the eye of transformer: Global–local correlation for egocentric gaze estimation and beyond

B Lai, M Liu, F Ryan, JM Rehg - International Journal of Computer Vision, 2024 - Springer

Predicting human's gaze from egocentric videos serves as a critical role for human intention
understanding in daily activities. In this paper, we present the first transformer-based model …

Cited by 14 · All 8 versions

Real-time semantic segmentation with parallel multiple views feature augmentation

JJ Qiao, ZQ Cheng, X Wu, W Li, J Zhang - Proceedings of the 30th ACM …, 2022 - dl.acm.org

Real-time semantic segmentation is essential for many practical applications, which utilizes
attention-based feature aggregation into lightweight structures to improve accuracy and …

Cited by 16 · All 3 versions

In the eye of transformer: Global-local correlation for egocentric gaze estimation

B Lai, M Liu, F Ryan, JM Rehg - arXiv preprint arXiv:2208.04464, 2022 - arxiv.org

In this paper, we present the first transformer-based model to address the challenging
problem of egocentric gaze estimation. We observe that the connection between the global …

Cited by 18 · All 5 versions

DFIL: Deepfake incremental learning by exploiting domain-invariant forgery clues

K Pan, Y Yin, Y Wei, F Lin, Z Ba, Z Liu, Z Wang… - Proceedings of the 31st …, 2023 - dl.acm.org

The malicious use and widespread dissemination of deepfake pose a significant crisis of
trust. Current deepfake detection models can generally recognize forgery images by training …

Cited by 16 · All 4 versions

FTCM: Frequency-temporal collaborative module for efficient 3D human pose estimation in video

Z Tang, Y Hao, J Li, R Hong - … on Circuits and Systems for Video …, 2023 - ieeexplore.ieee.org

Capturing cross-pose correlation from a sequence of frame-level 2D poses is essential for
3D human pose estimation (3D-HPE) in the video. Recent studies have shown the …

Cited by 16 · All 2 versions

Group contextualization for video recognition