Google Scholar

GPT-4V(ision) is a generalist web agent, if grounded

B Zheng, B Gou, J Kil, H Sun, Y Su - arXiv preprint arXiv:2401.01614, 2024 - arxiv.org

The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, has been quickly expanding the capability boundaries of multimodal models …

Cited by 168

Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer

Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

Cited by 73

GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation

T Wu, G Yang, Z Li, K Zhang, Z Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Despite recent advances in text-to-3D generative methods there is a notable absence of
reliable evaluation metrics. Existing metrics usually focus on a single criterion each such as …

Cited by 72

Mantis: Interleaved multi-image instruction tuning

D Jiang, X He, H Zeng, C Wei, M Ku, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Large multimodal models (LMMs) have shown great results in single-image vision language
tasks. However, their abilities to solve multi-image visual language tasks is yet to be …

Cited by 77

GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation

A Yan, Z Yang, W Zhu, K Lin, L Li, J Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user
interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as …

Cited by 83

Evaluating GPT-4V (GPT-4 with vision) on detection of radiologic findings on chest radiographs

Y Zhou, H Ong, P Kennedy, CC Wu, J Kazam, K Hentel… - Radiology, 2024 - pubs.rsna.org

Background Generating radiologic findings from chest radiographs is pivotal in medical
image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with …

Cited by 33

UFO: A UI-focused agent for Windows OS interaction

C Zhang, L Li, S He, X Zhang, B Qiao, S Qin… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to
applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a …

Cited by 52

VIEScore: Towards explainable metrics for conditional image synthesis evaluation

M Ku, D Jiang, C Wei, X Yue, W Chen - arXiv preprint arXiv:2312.14867, 2023 - arxiv.org

In the rapidly advancing field of conditional image generation research, challenges such as
limited explainability lie in effectively evaluating the performance and capabilities of various …

Cited by 36

Sapiens: Foundation for human vision models

R Khirodkar, T Bagautdinov, J Martinez… - … on Computer Vision, 2024 - Springer

We present Sapiens, a family of models for four fundamental human-centric vision tasks–2D
pose estimation, body-part segmentation, depth estimation, and surface normal prediction …

Cited by 14

LLaVA-Critic: Learning to evaluate multimodal models

T Xiong, X Wang, D Guo, Q Ye, H Fan, Q Gu… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as
a generalist evaluator to assess performance across a wide range of multimodal tasks …

Cited by 23

GPT-4V(ision) as a generalist evaluator for vision-language tasks
