SAM-CLIP: Merging Vision Foundation Models Towards Semantic and Spatial Understanding
The landscape of publicly available vision foundation models (VFMs) such as CLIP and
SAM is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their …
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
W Huang, A Wu, Y Yang, X Luo, Y Yang, L Hu… - arXiv preprint arXiv…, 2024 - arxiv.org
CLIP is one of the most important multimodal foundational models today. What powers
CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of …
Advancing Multi-Modal Sensing Through Expandable Modality Alignment
Sensing technology is widely used for comprehending the physical world, with numerous
modalities explored in past decades. While there has been considerable work on multi …
SeFi-CD: A Semantic First Change Detection Paradigm That Can Detect Any Change You Want
L Zhao, Z Huang, Y Wang, C Peng, J Gan, H Li, C Hu - Remote Sensing, 2024 - mdpi.com
The existing change detection (CD) methods can be summarized as the visual-first change
detection (ViFi-CD) paradigm, which first extracts change features from visual differences …
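The visual-first versus semantic-first contrast this snippet draws is easier to see in a toy sketch. Everything below is illustrative rather than code from the paper: `visual_encoder`, `text_encoder`, `vifi_cd`, and `sefi_cd` are hypothetical stand-ins. The point is that a visual-first pipeline differences visual features and surfaces every change, while a semantic-first pipeline grounds the query first and compares only query-relevant evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

def visual_encoder(img):
    # Hypothetical stand-in for a real backbone: identity per-pixel features.
    return img.astype(np.float32)

def text_encoder(query):
    # Hypothetical stand-in: map a change query to a scalar relevance weight.
    return 1.0 if "building" in query else 0.0

def vifi_cd(img_t1, img_t2, threshold=0.5):
    """Visual-first CD: difference visual features, then threshold.
    Every visual change surfaces, whether or not it was asked for."""
    diff = np.abs(visual_encoder(img_t2) - visual_encoder(img_t1))
    return diff > threshold

def sefi_cd(img_t1, img_t2, query, threshold=0.5):
    """Semantic-first CD: ground the query in each image first (here a toy
    relevance weighting), then difference only query-relevant evidence."""
    w = text_encoder(query)
    diff = np.abs(w * visual_encoder(img_t2) - w * visual_encoder(img_t1))
    return diff > threshold

# Toy usage: one changed pixel, two different queries.
before = rng.random((4, 4))
after = before.copy()
after[1, 1] += 1.0                                   # the visual change
print(vifi_cd(before, after).sum())                  # 1: fires for any query
print(sefi_cd(before, after, "new road").sum())      # 0: query-irrelevant
print(sefi_cd(before, after, "new building").sum())  # 1: query-relevant
```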
LLM2CLIP: Extending the Capability Boundaries of CLIP through Large Language Models
CLIP is one of the most important multimodal foundational models today, aligning visual and
textual signals into a shared feature space using a simple contrastive learning loss on large …
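Both LLM2CLIP entries refer to CLIP's "simple contrastive learning loss" over image-text pairs. As a reference point, here is a minimal PyTorch sketch of that symmetric image-text InfoNCE objective; the function name `clip_contrastive_loss` and the random toy inputs are mine, but the formulation is the standard CLIP training objective.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_feats, text_feats: (N, D) tensors where row i of each is a
    matched image-text pair. Matched pairs are positives; every other
    row in the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (N, N) similarity matrix, scaled by the temperature.
    logits = image_feats @ text_feats.t() / temperature

    # The i-th image should match the i-th text and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```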
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
A Wu, Y Yang, X Luo, Y Yang, C Wang, L Hu… - NeurIPS 2024 Workshop … - openreview.net
CLIP is one of the most important foundational multimodal models today. It aligns image and
text modalities into a shared feature space by leveraging a simple contrastive learning loss …