Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

R Zhang, X Hu, B Li, S Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …

PiMAE: Point cloud and image interactive masked autoencoders for 3D object detection

A Chen, K Zhang, R Zhang, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked Autoencoders learn strong visual representations and achieve state-of-the-art
results in several independent modalities, yet very few works have addressed their …

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

CALIP: Zero-shot enhancement of CLIP with parameter-free attention

Z Guo, R Zhang, L Qiu, X Ma, X Miao, X He… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with promising zero-shot performance. To further improve its downstream …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

EDA: Explicit text-decoupling and dense alignment for 3D visual grounding

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D visual grounding aims to find the object within point clouds mentioned by free-
form natural language descriptions with rich semantic cues. However, existing methods …

ViewRefer: Grasp the multi-view knowledge for 3D visual grounding

Z Guo, Y Tang, R Zhang, D Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view
discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …

Deception detection from linguistic and physiological data streams using bimodal convolutional neural networks

P Li, M Abouelenien, R Mihalcea, Z Ding… - 2024 5th …, 2024 - ieeexplore.ieee.org
Deception detection is gaining increasing interest due to ethical and security concerns. This
paper explores the application of convolutional neural networks for the purpose of …

Dual modality prompt tuning for vision-language pre-trained model

Y **ng, Q Wu, D Cheng, S Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
With the emergence of large pretrained vision-language models such as CLIP, transferable
representations can be adapted to a wide range of downstream tasks via prompt tuning …

Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g., …