ViperGPT: Visual inference via Python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models

J Cho, A Zala, M Bansal - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recently, DALL-E, a multimodal transformer language model, and its variants including
diffusion models have shown high-quality text-to-image generation capabilities. However …

DUET: Cross-modal semantic grounding for contrastive zero-shot learning

Z Chen, Y Huang, J Chen, Y Geng, W Zhang… - Proceedings of the …, 2023 - ojs.aaai.org
Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never
appeared during training. One of the most effective and widely used semantic information for …

Neural-logic human-object interaction detection

L Li, J Wei, W Wang, Y Yang - Advances in Neural …, 2023 - proceedings.neurips.cc
The interaction decoder utilized in prevalent Transformer-based HOI detectors typically
accepts pre-composed human-object pairs as inputs. Though achieving remarkable …

VQACL: A novel visual question answering continual learning setting

X Zhang, F Zhang, C Xu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Research on continual learning has recently led to a variety of work in unimodal community,
however little attention has been paid to multimodal tasks like visual question answering …

Visually grounded language learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org
In recent years, several machine learning models have been proposed. They are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture

T Gupta, A Kamath, A Kembhavi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Computer vision systems today are primarily N-purpose systems, designed and trained for a
predefined set of tasks. Adapting such systems to new tasks is challenging and often …

Latent structure mining with contrastive modality fusion for multimedia recommendation

J Zhang, Y Zhu, Q Liu, M Zhang, S Wu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Multimedia contents are of predominance in the modern Web era. Recent years have
witnessed growing research interests in multimedia recommendation, which aims to predict …

Reliable visual question answering: Abstain rather than answer incorrectly

S Whitehead, S Petryk, V Shakib, J Gonzalez… - … on Computer Vision, 2022 - Springer
Machine learning has advanced dramatically, narrowing the accuracy gap to
humans in multimodal tasks like visual question answering (VQA). However, while humans …

Webly supervised concept expansion for general purpose vision models

A Kamath, C Clark, T Gupta, E Kolve, D Hoiem… - … on Computer Vision, 2022 - Springer
General Purpose Vision (GPV) systems are models that are designed to solve a
wide array of visual tasks without requiring architectural changes. Today, GPVs primarily …