Survey of hallucination in natural language generation

Z Ji, N Lee, R Frieske, T Yu, D Su, Y Xu, E Ishii… - ACM Computing …, 2023 - dl.acm.org
Natural Language Generation (NLG) has improved exponentially in recent years thanks to
the development of sequence-to-sequence deep learning technologies such as Transformer …

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2023 - proceedings.neurips.cc
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

J Li, D Li, S Savarese, S Hoi - International conference on …, 2023 - proceedings.mlr.press
The cost of vision-and-language pre-training has become increasingly prohibitive due to
end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …
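
Since BLIP-2 keeps both the image encoder and the LLM frozen, a released checkpoint can be queried directly. Below is a minimal usage sketch assuming the Hugging Face transformers port of BLIP-2; the model name Salesforce/blip2-opt-2.7b and the local image path are assumptions, not part of the paper:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed Hugging Face port; the Q-Former bridging the frozen image
# encoder and frozen LLM is the only part that was trained.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image,
                   text="Question: what is in the picture? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```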

Flamingo: a visual language model for few-shot learning

JB Alayrac, J Donahue, P Luc… - Advances in neural …, 2022 - proceedings.neurips.cc
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …
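
Flamingo conditions a frozen language model on visual features through interleaved tanh-gated cross-attention layers, with the gate initialized to zero so the pretrained LM is unchanged at the start of training. A minimal PyTorch sketch of that gating mechanism (an illustration of the idea, not the paper's code):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at zero: tanh(0) = 0, so the frozen LM's behavior
        # is exactly preserved at initialization.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=vision, value=vision)
        return text + torch.tanh(self.gate) * attended

layer = GatedCrossAttention(dim=512)
text = torch.randn(2, 16, 512)    # (batch, text tokens, dim)
vision = torch.randn(2, 64, 512)  # (batch, visual tokens, dim)
print(layer(text, vision).shape)  # torch.Size([2, 16, 512])
```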

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
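
The underlying recipe is to translate the image into text and hand that text to a frozen LLM as a prompt. A hedged sketch of such a pipeline follows; all three functions are hypothetical placeholders standing in for the pretrained components, not the paper's released code:

```python
def caption_image(image_path: str) -> list[str]:
    # Placeholder: a pretrained captioner (e.g. BLIP) would produce
    # several diverse captions for the image here.
    return ["a dog catching a frisbee in a park"]

def build_prompt(captions: list[str], question: str) -> str:
    context = " ".join(captions)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

def answer_with_llm(prompt: str) -> str:
    # Placeholder for a call to any frozen LLM; no VQA training involved.
    return "a frisbee"

prompt = build_prompt(caption_image("photo.jpg"), "What is the dog catching?")
print(answer_with_llm(prompt))
```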

Distilling large vision-language model with out-of-distribution generalizability

X Li, Y Fang, M Liu, Z Ling, Z Tu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large vision-language models have achieved outstanding performance, but their size and
computational requirements make their deployment on resource-constrained devices and …
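
A sketch of the temperature-scaled distillation objective such work builds on; the paper's OOD-specific components are not reproduced here, so treat this as the generic baseline loss only:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 4.0) -> torch.Tensor:
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

student = torch.randn(8, 100)  # (batch, classes)
teacher = torch.randn(8, 100)
print(distillation_loss(student, teacher))
```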

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
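
Late interaction here refers to ColBERT-style MaxSim scoring: each query token embedding is matched against its best document token embedding, and the maxima are summed. A minimal sketch of that scoring primitive (the paper's multi-modal extensions are omitted):

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor,
                           doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # token-level similarity matrix
    return sim.max(dim=1).values.sum()   # best match per query token, summed

q = F.normalize(torch.randn(12, 128), dim=-1)
d = F.normalize(torch.randn(180, 128), dim=-1)
print(late_interaction_score(q, d))
```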

TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance

K Wu, H Peng, Z Zhou, B Xiao, M Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-
scale language-image pre-trained models. The method introduces two core techniques …
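
Affinity mimicking trains the student to reproduce the teacher's in-batch image-text similarity distributions. A hedged sketch of one plausible form of that loss; the temperature and symmetric KL are illustrative assumptions, not TinyCLIP's exact recipe:

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(img_s, txt_s, img_t, txt_t, tau: float = 0.07):
    # Inputs: (batch, dim) image/text embeddings for student (s) and teacher (t).
    sim_s = (F.normalize(img_s, dim=-1) @ F.normalize(txt_s, dim=-1).T) / tau
    sim_t = (F.normalize(img_t, dim=-1) @ F.normalize(txt_t, dim=-1).T) / tau
    # Match teacher and student affinity distributions in both directions.
    loss_i2t = F.kl_div(F.log_softmax(sim_s, dim=-1),
                        F.softmax(sim_t, dim=-1), reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(sim_s.T, dim=-1),
                        F.softmax(sim_t.T, dim=-1), reduction="batchmean")
    return (loss_i2t + loss_t2i) / 2

img_s, txt_s = torch.randn(16, 256), torch.randn(16, 256)  # small student
img_t, txt_t = torch.randn(16, 512), torch.randn(16, 512)  # large teacher
print(affinity_mimicking_loss(img_s, txt_s, img_t, txt_t))
```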

Language models are general-purpose interfaces

Y Hao, H Song, L Dong, S Huang, Z Chi… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have received much attention due to their effectiveness across a broad
range of downstream applications. Though there is a big convergence in terms of …

Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training

AMH Tiong, J Li, B Li, S Savarese, SCH Hoi - arXiv preprint arXiv …, 2022 - arxiv.org
Visual question answering (VQA) is a hallmark of vision and language reasoning and a
challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a …
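
The plug-and-play idea is to chain off-the-shelf pretrained models with no training at all. A hedged sketch of such a three-stage pipeline; the stage functions are hypothetical placeholders standing in for the pretrained components described in the paper:

```python
def match_image_regions(image_path: str, question: str) -> list[str]:
    # Placeholder for image-question matching that selects relevant patches.
    return ["region_top_left", "region_center"]

def caption_regions(regions: list[str]) -> list[str]:
    # Placeholder for a pretrained captioner run on the selected regions.
    return ["a red bicycle leaning on a wall", "a wall painted blue"]

def answer_question(captions: list[str], question: str) -> str:
    # Placeholder for a pretrained reading-comprehension QA model.
    return "red"

question = "What color is the bicycle?"
captions = caption_regions(match_image_regions("street.jpg", question))
print(answer_question(captions, question))
```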