ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

Cited by 430

DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models

J Cho, A Zala, M Bansal - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Recently, DALL-E, a multimodal transformer language model, and its variants including
diffusion models have shown high-quality text-to-image generation capabilities. However …

Cited by 170

DUET: Cross-modal semantic grounding for contrastive zero-shot learning

Z Chen, Y Huang, J Chen, Y Geng, W Zhang… - Proceedings of the …, 2023 - ojs.aaai.org

Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never
appeared during training. One of the most effective and widely used semantic information for …

Cited by 65

Neural-logic human-object interaction detection

L Li, J Wei, W Wang, Y Yang - Advances in Neural …, 2023 - proceedings.neurips.cc

The interaction decoder utilized in prevalent Transformer-based HOI detectors typically
accepts pre-composed human-object pairs as inputs. Though achieving remarkable …

Cited by 22

VQACL: A novel visual question answering continual learning setting

X Zhang, F Zhang, C Xu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com

Research on continual learning has recently led to a variety of work in unimodal community,
however little attention has been paid to multimodal tasks like visual question answering …

Cited by 30

Visually grounded language learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org

In recent years, several machine learning models have been proposed. They are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

Cited by 8

Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture

T Gupta, A Kamath, A Kembhavi… - Proceedings of the …, 2022 - openaccess.thecvf.com

Computer vision systems today are primarily N-purpose systems, designed and trained for a
predefined set of tasks. Adapting such systems to new tasks is challenging and often …

Cited by 97

Latent structure mining with contrastive modality fusion for multimedia recommendation

J Zhang, Y Zhu, Q Liu, M Zhang, S Wu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Multimedia contents are of predominance in the modern Web era. Recent years have
witnessed growing research interests in multimedia recommendation, which aims to predict …

Cited by 74

Reliable visual question answering: Abstain rather than answer incorrectly

S Whitehead, S Petryk, V Shakib, J Gonzalez… - … on Computer Vision, 2022 - Springer

Machine learning has advanced dramatically, narrowing the accuracy gap to
humans in multimodal tasks like visual question answering (VQA). However, while humans …

Cited by 52

Webly supervised concept expansion for general purpose vision models

A Kamath, C Clark, T Gupta, E Kolve, D Hoiem… - … on Computer Vision, 2022 - Springer

General Purpose Vision (GPV) systems are models that are designed to solve a
wide array of visual tasks without requiring architectural changes. Today, GPVs primarily …

Cited by 60

Separating skills and concepts for novel visual question answering

Vipergpt: Visual inference via python execution for reasoning
