Factorized contrastive learning: Going beyond multi-view redundancy

PP Liang, Z Deng, MQ Ma, JY Zou… - Advances in …, 2023 - proceedings.neurips.cc
In a wide range of multimodal tasks, contrastive learning has become a particularly
appealing approach since it can successfully learn representations from abundant …

DIME-FM: Distilling Multimodal and Efficient Foundation Models

X Sun, P Zhang, P Zhang, H Shah… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and
Florence, are trained on large private datasets of image-caption pairs and achieve superior …

Learning on Multimodal Graphs: A Survey

C Peng, J He, F Xia - arXiv preprint arXiv:2402.05322, 2024 - arxiv.org
Multimodal data pervades various domains, including healthcare, social media, and
transportation, where multimodal graphs play a pivotal role. Machine learning on multimodal …

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

K Yang, T Gu, X An, H Jiang, X Dai, Z Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over
a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial …

Multi-modal Relation Distillation for Unified 3D Representation Learning

H Wang, Y Bao, P Pan, Z Li, X Liu, R Yang… - European Conference on …, 2024 - Springer
Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated
promising results by aligning heterogeneous features across 3D shapes and their …

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception

X Wang, W Wu, C Li, Z Zhao, Z Chen, Y Shi… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Understanding vehicles in images is important for various applications such as intelligent
transportation and self-driving systems. Existing vehicle-centric works typically pre-train …

What to align in multimodal contrastive learning?

B Dufumier, J Castillo-Navarro, D Tuia… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans perceive the world through multisensory integration, blending the information of
different modalities to adapt their behavior. Contrastive learning offers an appealing solution …

Foundations of Multisensory Artificial Intelligence

PP Liang - arXiv preprint arXiv:2404.18976, 2024 - arxiv.org
Building multisensory AI systems that learn from multiple sensory inputs such as text,
speech, video, real-world sensors, wearable devices, and medical data holds great promise …

Advancing Human Motion Recognition with SkeletonCLIP++: Weighted Video Feature Integration and Enhanced Contrastive Sample Discrimination

L Yuan, Z He, Q Wang, L Xu - Sensors, 2024 - mdpi.com
This paper introduces 'SkeletonCLIP++', an extension of our prior work in human action
recognition, emphasizing the use of semantic information beyond traditional label-based …

Expanding Event Modality Applications through a Robust CLIP-Based Encoder

S Jeong, H Chen, S Yun, S Cho, W Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based
data, enhancing its utility and expanding its applicability across diverse domains. While …