Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

R Zhang, X Hu, B Li, S Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …

PiMAE: Point cloud and image interactive masked autoencoders for 3D object detection

A Chen, K Zhang, R Zhang, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked Autoencoders learn strong visual representations and achieve state-of-the-art
results in several independent modalities, yet very few works have addressed their …

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

CALIP: Zero-shot enhancement of CLIP with parameter-free attention

Z Guo, R Zhang, L Qiu, X Ma, X Miao, X He… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with promising zero-shot performance. To further improve its downstream …

Binding touch to everything: Learning unified multimodal tactile representations

F Yang, C Feng, Z Chen, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …

EDA: Explicit text-decoupling and dense alignment for 3D visual grounding

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D visual grounding aims to find the object within point clouds mentioned by free-
form natural language descriptions with rich semantic cues. However, existing methods …

ViewRefer: Grasp the multi-view knowledge for 3D visual grounding

Z Guo, Y Tang, R Zhang, D Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view
discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …

Deception detection from linguistic and physiological data streams using bimodal convolutional neural networks

P Li, M Abouelenien, R Mihalcea, Z Ding… - 2024 5th …, 2024 - ieeexplore.ieee.org
Deception detection is gaining increasing interest due to ethical and security concerns. This
paper explores the application of convolutional neural networks for the purpose of …

Dual modality prompt tuning for vision-language pre-trained model

Y **ng, Q Wu, D Cheng, S Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
With the emergence of large pretrained vision-language models such as CLIP, transferable
representations can be adapted to a wide range of downstream tasks via prompt tuning …

Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g., …