- Academic Search

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by 139 · All 2 versions

[PDF] thecvf.com

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …

Cited by 637 · All 5 versions

[PDF] neurips.cc

Improving CLIP training with language rewrites

L Fan, D Krishnan, P Isola… - Advances in Neural …, 2023 - proceedings.neurips.cc

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective
and scalable methods for training transferable vision models using paired image and text …

Cited by 161 · All 6 versions

[PDF] arxiv.org

EVA-02: A visual representation for neon genesis

Y Fang, Q Sun, X Wang, T Huang, X Wang… - Image and Vision …, 2024 - Elsevier

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained
to reconstruct strong and robust language-aligned vision features via masked image …

Cited by 229 · All 3 versions

[PDF] arxiv.org

Foundation models in robotics: Applications, challenges, and the future

R Firoozi, J Tucker, S Tian… - … Journal of Robotics …, 2023 - journals.sagepub.com

We survey applications of pretrained foundation models in robotics. Traditional deep
learning models in robotics are trained on small datasets tailored for specific tasks, which …

Cited by 129 · All 2 versions

[PDF] neurips.cc

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2023 - proceedings.neurips.cc

Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Cited by 148 · All 7 versions

[PDF] usenix.org

Glaze: Protecting artists from style mimicry by Text-to-Image models

S Shan, J Cryan, E Wenger, H Zheng… - 32nd USENIX Security …, 2023 - usenix.org

Recent text-to-image diffusion models such as MidJourney and Stable Diffusion threaten to
displace many in the professional artist community. In particular, models can learn to mimic …

Cited by 206 · All 10 versions

[PDF] nowpublishers.com

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

Cited by 219 · All 6 versions

[PDF] thecvf.com

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com

The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

Cited by 175 · All 4 versions

[PDF] ieee.org

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org

Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Cited by 125 · All 3 versions

Scaling language-image pre-training via masking

Foundation Models Defining a New Era in Vision: a Survey and Outlook
