Google Scholar

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - ar** predictive
models for various vision-grounded language downstream tasks by providing rich …

Cited by 52

SAI3D: Segment any instance in 3D scenes

Y Yin, Y Liu, Y Xiao, D Cohen-Or… - Proceedings of the …, 2024 - openaccess.thecvf.com

Advancements in 3D instance segmentation have traditionally been tethered to the
availability of annotated datasets, limiting their application to a narrow spectrum of object …

Cited by 24

CLIP4STR: a simple baseline for scene text recognition with pre-trained vision-language model

S Zhao, R Quan, L Zhu, Y Yang - IEEE Transactions on Image …, 2024 - ieeexplore.ieee.org

Pre-trained vision-language models (VLMs) are the de-facto foundation models for various
downstream tasks. However, scene text recognition methods still prefer backbones pre …

Cited by 30

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …

Cited by 60

Guiding image captioning models toward more specific captions

S Kornblith, L Li, Z Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Image captioning is conventionally formulated as the task of generating captions that match
the conditional distribution of reference image-caption pairs. However, reference captions in …

Cited by 14

Fusing pre-trained language models with multimodal prompts through reinforcement learning

Y Yu, J Chung, H Yun, J Hessel… - Proceedings of the …, 2023 - openaccess.thecvf.com

Language models are capable of commonsense reasoning: while domain-specific
models can learn from explicit knowledge (e.g. commonsense graphs [6], ethical norms [25]) …

Cited by 17

Zero-shot visual relation detection via composite visual cues from large language models

L Li, J Xiao, G Chen, J Shao… - Advances in Neural …, 2024 - proceedings.neurips.cc

Pretrained vision-language models, such as CLIP, have demonstrated strong generalization
capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual …

Cited by 33

Fine-grained image captioning with CLIP reward

A survey on hallucination in large vision-language models
