Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …
PiMAE: Point cloud and image interactive masked autoencoders for 3D object detection
Masked Autoencoders learn strong visual representations and achieve state-of-the-art
results in several independent modalities, yet very few works have addressed their …
Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …
CALIP: Zero-shot enhancement of CLIP with parameter-free attention
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with promising zero-shot performance. To further improve its downstream …
Binding touch to everything: Learning unified multimodal tactile representations
The ability to associate touch with other modalities has huge implications for humans and
computational systems. However, multimodal learning with touch remains challenging due to …
EDA: Explicit text-decoupling and dense alignment for 3D visual grounding
3D visual grounding aims to find the object within point clouds mentioned by free-form
natural language descriptions with rich semantic cues. However, existing methods …
ViewRefer: Grasp the multi-view knowledge for 3D visual grounding
Understanding 3D scenes from multi-view inputs has been proven to alleviate the view
discrepancy issue in 3D visual grounding. However, existing methods normally neglect the …
Deception detection from linguistic and physiological data streams using bimodal convolutional neural networks
Deception detection is gaining increasing interest due to ethical and security concerns. This
paper explores the application of convolutional neural networks for the purpose of …
Dual modality prompt tuning for vision-language pre-trained model
With the emergence of large pretrained vision-language models such as CLIP, transferable
representations can be adapted to a wide range of downstream tasks via prompt tuning …
Deep Multimodal Data Fusion
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g., …