- Academic Search

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - Science China …, 2025 - Springer

For a long time, researchers have sought artificial intelligence (AI) that matches or exceeds
human intelligence. AI agents, which are artificial entities capable of sensing the …

Cited by: 723. All 4 versions.

[PDF] arxiv.org

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by: 134. All 2 versions.

[PDF] neurips.cc

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc

Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Cited by: 4995. All 15 versions.

[PDF] arxiv.org

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

S Liu, Z Zeng, T Ren, F Li, H Zhang, J Yang… - … on Computer Vision, 2024 - Springer

In this paper, we develop an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …

Cited by: 1580. All 4 versions.

[PDF] arxiv.org

Sharegpt4v: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - … on Computer Vision, 2024 - Springer

Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (e.g., data type, quality, and scale) of training data …

Cited by: 443. All 3 versions.

[PDF] thecvf.com

Image as a foreign language: Beit pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023 - openaccess.thecvf.com

A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

Cited by: 449. All 5 versions.

[PDF] thecvf.com

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

Cited by: 424. All 6 versions.

[PDF] thecvf.com

Vipergpt: Visual inference via python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

Cited by: 415. All 6 versions.

[PDF] arxiv.org

Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

Cited by: 604. All 2 versions.

[PDF] arxiv.org

Drivegpt4: Interpretable end-to-end autonomous driving via large language model

Z Xu, Y Zhang, E Xie, Z Zhao, Y Guo… - IEEE Robotics and …, 2024 - ieeexplore.ieee.org

Multimodal large language models (MLLMs) have emerged as a prominent area of interest
within the research community, given their proficiency in handling and reasoning with non …

Cited by: 241. All 5 versions.

Grounded language-image pre-training

The rise and potential of large language model based agents: A survey