Google Scholar

Biformer: Vision transformer with bi-level routing attention

L Zhu, X Wang, Z Ke, W Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

As the core building block of vision transformers, attention is a powerful tool to capture
long-range dependency. However, such power comes at a cost: it incurs a huge computation …

Cited by 733

Chat-univi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Cited by 179

Effective whole-body pose estimation with two-stages distillation

Z Yang, A Zeng, C Yuan, Y Li - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an
image. This task is challenging due to multi-scale body parts, fine-grained localization for …

Cited by 143

Beyond appearance: a semantic controllable self-supervised learning framework for human-centric visual tasks

W Chen, X Xu, J Jia, H Luo, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com

Human-centric visual tasks have attracted increasing research attention due to their
widespread applications. In this paper, we aim to learn a general human representation from …

Cited by 112

Dynamic neural network structure: A review for its theories and applications

J Guo, CLP Chen, Z Liu, X Yang - IEEE Transactions on Neural …, 2024 - ieeexplore.ieee.org

The dynamic neural network (DNN), in contrast to the static counterpart, offers numerous
advantages, such as improved accuracy, efficiency, and interpretability. These benefits stem …

Cited by 10

Joint token pruning and squeezing towards more aggressive compression of vision transformers

S Wei, T Ye, S Zhang, Y Tang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Although vision transformers (ViTs) have shown promising results in various computer vision
tasks recently, their high computational cost limits their practical applications. Previous …

Cited by 72

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P …

… or merging tokens. It is an important but challenging task. Although recent …

Cited by 46

Hourglass tokenizer for efficient transformer-based 3D human pose estimation

W Li, M Liu, H Liu, P Wang, J Cai… - Proceedings of the …, 2024 - openaccess.thecvf.com

Transformers have been successfully applied in the field of video-based 3D human pose
estimation. However, the high computational costs of these video pose transformers (VPTs) …

Cited by 27

Not all tokens are equal: Human-centric visual analysis via token clustering transformer

Biformer: Vision transformer with bi-level routing attention
