- Academic Search

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org

Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Cited by 134 · All 2 versions

[PDF] nowpublishers.com

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 197 · All 7 versions

[PDF] neurips.cc

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc

Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Cited by 4995 · All 15 versions

[PDF] thecvf.com

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

Cited by 1745 · All 5 versions

[PDF] researchhub.com

[PDF] Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

J Bai, S Bai, S Yang, S Wang… - arxiv preprint …, 2023 - storage.prod.researchhub.com

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

Cited by 539

[PDF] thecvf.com

Scaling up gans for text-to-image synthesis

M Kang, JY Zhu, R Zhang, J Park… - Proceedings of the …, 2023 - openaccess.thecvf.com

The recent success of text-to-image synthesis has taken the world by storm and captured the
general public's imagination. From a technical standpoint, it also marked a drastic change in …

Cited by 531 · All 6 versions

[PDF] arxiv.org

Video-llava: Learning united visual representation by alignment before projection

B Lin, Y Ye, B Zhu, J Cui, M Ning, P Jin… - arxiv preprint arxiv …, 2023 - arxiv.org

The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

Cited by 428 · All 3 versions

[PDF] thecvf.com

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

Cited by 424 · All 6 versions

[PDF] arxiv.org

Sdxl: Improving latent diffusion models for high-resolution image synthesis

D Podell, Z English, K Lacey, A Blattmann… - arxiv preprint arxiv …, 2023 - arxiv.org

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to
previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone …

Cited by 1581 · All 4 versions

[PDF] arxiv.org

Yi: Open foundation models by 01.AI

A Young, B Chen, C Li, C Huang, G Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org

We introduce the Yi model family, a series of language and multimodal models that
demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and …

Cited by 343 · All 2 versions
