- Academic Search

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 198; all 7 versions

[PDF] arxiv.org

Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning

Y Wang, W Chen, X Han, X Lin, H Zhao, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract
reasoning ability is the goal of next-generation AI. Recent advancements in Large Language …

Cited by 34; all 3 versions

[PDF] stableaiprompts.com

[PDF] The dawn of LMMs: Preliminary explorations with GPT-4V(ision)

Z Yang, L Li, K Lin, J Wang, CC Lin… - arXiv preprint arXiv …, 2023 - stableaiprompts.com

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory
skills, such as visual understanding, to achieve stronger generic intelligence. In this paper …

Cited by 588; all 4 versions

[PDF] thecvf.com

ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

Cited by 422; all 9 versions

[PDF] arxiv.org

MM-REACT: Prompting ChatGPT for multimodal reasoning and action

Z Yang, L Li, J Wang, K Lin, E Azarnasab… - arXiv preprint arXiv …, 2023 - arxiv.org

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision
experts to achieve multimodal reasoning and action. In this paper, we define and explore a …

Cited by 332; all 2 versions

[PDF] arxiv.org

Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org

The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Cited by 196; all 3 versions

[PDF] neurips.cc

Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2023 - proceedings.neurips.cc

Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Cited by 148; all 7 versions

[PDF] thecvf.com

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Cited by 72; all 5 versions

[PDF] thecvf.com

GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation

T Wu, G Yang, Z Li, K Zhang, Z Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Despite recent advances in text-to-3D generative methods there is a notable absence of
reliable evaluation metrics. Existing metrics usually focus on a single criterion each such as …

Cited by 71; all 6 versions

[PDF] neurips.cc

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc

Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …

Cited by 230; all 11 versions

Language models with image descriptors are strong few-shot video-language learners

Vision-language pre-training: Basics, recent advances, and future trends
