Μελετητής Google

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 136 Σχετικά άρθρα Όλες οι 2 εκδοχές

[Free GPT-4]

[PDF] oup.com

A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org

With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 156 Σχετικά άρθρα Όλες οι 7 εκδοχές

[Free GPT-4]

[PDF] arxiv.org

Videochat: Chat-centric video understanding

KC Li, Y He, Y Wang, Y Li, W Wang, P Luo… - ar** an end-to-end chat-centric video
understanding system, coined as VideoChat. It integrates video foundation models and …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 585 Σχετικά άρθρα Όλες οι 4 εκδοχές Προβολή ως HTML

[Free GPT-4]

[PDF] thecvf.com

Sigmoid loss for language image pre-training

X Zhai, B Mustafa, A Kolesnikov… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 620 Σχετικά άρθρα Όλες οι 5 εκδοχές Προβολή ως HTML

[Free GPT-4]

[PDF] ieee.org

End-to-end autonomous driving: Challenges and frontiers

L Chen, P Wu, K Chitta, B Jaeger… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

The autonomous driving community has witnessed a rapid growth in approaches that
embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 242 Σχετικά άρθρα Όλες οι 4 εκδοχές

[Free GPT-4]

[PDF] thecvf.com

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions which current multimodal systems largely struggle to imitate. In this work …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 198 Σχετικά άρθρα Όλες οι 3 εκδοχές Προβολή ως HTML

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 118 Σχετικά άρθρα Όλες οι 3 εκδοχές

[Free GPT-4]

[PDF] arxiv.org

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 172 Σχετικά άρθρα Όλες οι 4 εκδοχές Προβολή ως HTML

[Free GPT-4]

[PDF] arxiv.org

Cogvlm: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arxiv preprint arxiv …, 2023 - arxiv.org

We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 554 Σχετικά άρθρα Όλες οι 3 εκδοχές Προβολή ως HTML

[Free GPT-4]

[PDF] thecvf.com

Eyes wide shut? exploring the visual shortcomings of multimodal llms

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com

Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However the …

Αποθήκευση Παράθεση Γίνεται αναφορά σε 224 Σχετικά άρθρα Όλες οι 4 εκδοχές Προβολή ως HTML

Δημιουργία ειδοποίησης

Παράθεση

Σύνθετη αναζήτηση

Αποθηκεύτηκε στη Βιβλιοθήκη μου

Eva-clip: Improved training techniques for clip at scale

Foundation Models Defining a New Era in Vision: a Survey and Outlook

A Survey of Multimodel Large Language Models

Videochat: Chat-centric video understanding

Sigmoid loss for language image pre-training

End-to-end autonomous driving: Challenges and frontiers

Generative multimodal models are in-context learners

Internvideo2: Scaling foundation models for multimodal video understanding

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Cogvlm: Visual expert for pretrained language models

Eyes wide shut? exploring the visual shortcomings of multimodal llms