Factorized contrastive learning: Going beyond multi-view redundancy
In a wide range of multimodal tasks, contrastive learning has become a particularly
appealing approach since it can successfully learn representations from abundant …
Dime-fm: Distilling multimodal and efficient foundation models
Abstract Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and
Florence, are trained on large private datasets of image-caption pairs and achieve superior …
Learning on Multimodal Graphs: A Survey
Multimodal data pervades various domains, including healthcare, social media, and
transportation, where multimodal graphs play a pivotal role. Machine learning on multimodal …
Clip-cid: Efficient clip distillation via cluster-instance discrimination
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over
a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial …
Multi-modal Relation Distillation for Unified 3D Representation Learning
Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated
promising results by aligning heterogeneous features across 3D shapes and their …
Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception
Understanding vehicles in images is important for various applications such as intelligent
transportation and self-driving systems. Existing vehicle-centric works typically pre-train …
What to align in multimodal contrastive learning?
Humans perceive the world through multisensory integration, blending the information of
different modalities to adapt their behavior. Contrastive learning offers an appealing solution …
Foundations of Multisensory Artificial Intelligence
PP Liang - arxiv preprint arxiv:2404.18976, 2024 - arxiv.org
Building multisensory AI systems that learn from multiple sensory inputs such as text,
speech, video, real-world sensors, wearable devices, and medical data holds great promise …
Advancing Human Motion Recognition with SkeletonCLIP++: Weighted Video Feature Integration and Enhanced Contrastive Sample Discrimination
L Yuan, Z He, Q Wang, L Xu - Sensors, 2024 - mdpi.com
This paper introduces 'SkeletonCLIP++', an extension of our prior work in human action
recognition, emphasizing the use of semantic information beyond traditional label-based …
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based
data, enhancing its utility and expanding its applicability across diverse domains. While …