The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models have shown remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However, the …

Probing the 3D awareness of visual foundation models

M El Banani, A Raj, KK Maninis, A Kar… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …

VeCLIP: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, B Zhang, W Wu, H Bai… - European Conference on Computer Vision, 2024 - Springer
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …

LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model

D Muhtar, Z Li, F Gu, X Zhang, P Xiao - European Conference on Computer Vision, 2024 - Springer
The revolutionary capabilities of large language models (LLMs) have paved the way for
multimodal large language models (MLLMs) and fostered diverse applications across …

Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

M Deitke, C Clark, S Lee, R Tripathi, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …

The Neglected Tails in Vision-Language Models

S Parashar, Z Lin, T Liu, X Dong, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) excel in zero-shot recognition, but their performance varies
greatly across different visual concepts. For example, although CLIP achieves impressive …

ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation

M Lan, C Chen, Y Ke, X Wang, L Feng… - European Conference on Computer Vision, 2024 - Springer
Open-vocabulary semantic segmentation requires models to effectively integrate visual
representations with open-vocabulary semantic labels. While Contrastive Language-Image …