Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Deep learning-based human pose estimation: A survey

C Zheng, W Wu, C Chen, T Yang, S Zhu, J Shen… - ACM Computing …, 2023 - dl.acm.org
Human pose estimation aims to locate the human body parts and build human body
representation (e.g., body skeleton) from input data such as images and videos. It has drawn …

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Humans in 4D: Reconstructing and tracking humans with transformers

S Goel, G Pavlakos, J Rajasegaran… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present an approach to reconstruct humans and track them over time. At the core of our
approach, we propose a fully "transformerized" version of a network for human mesh …

3D human pose estimation with spatial and temporal transformers

C Zheng, S Zhu, M Mendieta, T Yang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Transformer architectures have become the model of choice in natural language processing
and are now being introduced into computer vision tasks such as image classification, object …

MHFormer: Multi-hypothesis transformer for 3D human pose estimation

W Li, H Liu, H Tang, P Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Estimating 3D human poses from monocular videos is a challenging task due to depth
ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting …

CLIFF: Carrying location information in full frames into human pose and shape estimation

Z Li, J Liu, Z Zhang, S Xu, Y Yan - European Conference on Computer …, 2022 - Springer
Top-down methods dominate the field of 3D human pose and shape estimation, because
they are decoupled from human detection and allow researchers to focus on the core …

MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video

J Zhang, Z Tu, J Yang, Y Chen… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Recent transformer-based solutions have been introduced to estimate 3D human pose from
2D keypoint sequence by considering body joints among all frames globally to learn spatio …