- Academic Search

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Cited by: 69

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Cited by: 71

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc

Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text allowing for numerous applications such as cross-modal retrieval …

Cited by: 42

Object-region video transformers

R Herzig, E Ben-Avraham… - Proceedings of the …, 2022 - openaccess.thecvf.com

Recently, video transformers have shown great success in video understanding, exceeding
CNN performance; yet existing video transformer models do not explicitly model objects …

Cited by: 99

Learning to generate scene graph from natural language supervision

Y Zhong, J Shi, J Yang, C Xu… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

Learning from image-text data has demonstrated recent success for many recognition tasks,
yet is currently limited to visual features or individual visual concepts such as objects. In this …

Cited by: 81

Incorporating structured representations into pretrained vision & language models using scene graphs

R Herzig, A Mendelson, L Karlinsky, A Arbelle… - arXiv preprint arXiv …, 2023 - arxiv.org

Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS)
performance in a variety of tasks. However, recent works have shown that even the best …

Cited by: 27

Linguistic structures as weak supervision for visual scene graph generation

K Ye, A Kovashka - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com

Prior work in scene graph generation requires categorical supervision at the level of triplets:
subjects and objects, and predicates that relate them, either with or without bounding box …

Cited by: 54

PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data

R Herzig, O Abramovich… - Proceedings of the …, 2024 - openaccess.thecvf.com

Action recognition models have achieved impressive results by incorporating scene-level
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …

Cited by: 17

FETA: Towards specializing foundational models for expert task applications

A Alfassy, A Arbelle, O Halimi… - Advances in …, 2022 - proceedings.neurips.cc

Foundational Models (FMs) have demonstrated unprecedented capabilities
including zero-shot learning, high fidelity data synthesis, and out of domain generalization …

Cited by: 17

Bringing image scene structure to video via frame-clip consistency of object tokens

E Ben Avraham, R Herzig… - Advances in …, 2022 - proceedings.neurips.cc

Recent action recognition models have achieved impressive results by integrating objects,
their locations and interactions. However, obtaining dense structured annotations for each …

Cited by: 16

Learning object detection from captions via textual scene attributes

Compositional chain-of-thought prompting for large multimodal models
