Machine-generated text: A comprehensive survey of threat models and detection methods
Machine-generated text is increasingly difficult to distinguish from text authored by humans.
Powerful open-source models are freely available, and user-friendly tools that democratize …
Align and attend: Multimodal summarization with dual contrastive losses
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …
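The snippet cuts off before any method detail; as a rough illustration of what a dual (symmetric) contrastive objective between modalities can look like, here is a generic InfoNCE-style sketch in PyTorch. The function name, temperature value, and pairing convention are all assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss aligning paired text/image embeddings.

    text_emb, image_emb: (batch, dim) tensors where row i of each is a pair.
    Illustrative alignment objective only, not the paper's exact formulation.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # "Dual": contrast text against images and images against text, then average.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2
```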
DocFormerv2: Local features for document understanding
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …
Learning attention propagation for compositional zero-shot learning
Compositional zero-shot learning aims to recognize unseen compositions of seen visual
primitives of object classes and their states. While all primitives (states and objects) are …
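The abstract is truncated before the approach, but the task setup itself can be sketched: score an image against composed (state, object) primitive embeddings, so that unseen pairs can still be ranked. The additive composition and all names below are placeholders; the surveyed paper instead learns attention propagation over primitives.

```python
import torch
import torch.nn as nn

class CompositionScorer(nn.Module):
    """Illustrative compositional zero-shot classifier: an image is scored
    against every (state, object) pair, including pairs unseen in training.
    Summing primitive embeddings is a placeholder composition operator."""

    def __init__(self, num_states, num_objects, dim):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, dim)
        self.object_emb = nn.Embedding(num_objects, dim)

    def forward(self, image_feat, state_ids, object_ids):
        # image_feat: (batch, dim); state_ids/object_ids: (num_pairs,)
        pair_emb = self.state_emb(state_ids) + self.object_emb(object_ids)
        return image_feat @ pair_emb.t()  # (batch, num_pairs) compatibility scores
```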
Separate and locate: Rethink the text in text-based visual question answering
Text-based Visual Question Answering (TextVQA) aims at answering questions about the
text in images. Most works in this field focus on designing network structures or pre-training …
PreSTU: Pre-training for scene-text understanding
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …
Filling in the blank: Rationale-augmented prompt tuning for TextVQA
Recently, generative text-based visual question answering (TextVQA) methods, which are
often based on language models, have exhibited impressive results and drawn increasing …
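The entry is cut off, but the title points to a cloze-style ("filling in the blank") prompt format for generative TextVQA. A hypothetical example of such a template follows; the field names and wording are invented for illustration, not the paper's actual prompt.

```python
def build_cloze_prompt(question: str, ocr_tokens: list[str]) -> str:
    """Hypothetical cloze-style TextVQA prompt: the recognized scene text is
    listed as context and the model fills in the blanked-out answer."""
    context = ", ".join(ocr_tokens)
    return (
        f"Scene text in the image: {context}\n"
        f"Question: {question}\n"
        "Answer: the answer is ____."
    )

print(build_cloze_prompt("What brand is the laptop?", ["ThinkPad", "Lenovo"]))
```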
Toward 3D spatial reasoning for human-like text-based visual question answering
Text-based Visual Question Answering (TextVQA) aims to produce correct answers to
questions about images containing multiple scene texts. In most cases, the texts naturally …
Prophet: Prompting large language models with complementary answer heuristics for knowledge-based visual question answering
Knowledge-based visual question answering (VQA) requires external knowledge beyond
the image to answer the question. Early studies retrieve required knowledge from explicit …
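Per the title, Prophet supplies an LLM with complementary answer heuristics, i.e., candidates produced by a vanilla VQA model. The sketch below shows one plausible way to format such a prompt; the exact template, field names, and score formatting are assumptions, not Prophet's published prompt.

```python
def build_prophet_style_prompt(question: str, caption: str,
                               candidates: list[tuple[str, float]]) -> str:
    """Illustrative knowledge-based VQA prompt that injects answer candidates
    (with confidences from a vanilla VQA model) as heuristics for an LLM.
    A plausible reconstruction, not the paper's exact template."""
    lines = [
        f"Context: {caption}",
        f"Question: {question}",
        "Candidates: " + ", ".join(f"{a} ({p:.2f})" for a, p in candidates),
        "Answer:",
    ]
    return "\n".join(lines)

prompt = build_prophet_style_prompt(
    "What sport can you use this for?",
    "a man holding a racket on a court",
    [("tennis", 0.91), ("badminton", 0.05)],
)
print(prompt)
```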
Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
Insufficient training data is a common barrier to effectively learning multimodal information
interactions and question semantics in existing medical Visual Question Answering (VQA) …
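The snippet names "multimodal augmented mixup" as the response to data scarcity; one generic reading is mixup applied jointly to paired image and question features. The sketch below shares a single Beta-sampled coefficient across modalities to keep pairs aligned, which is an assumption rather than the paper's exact scheme, and it does not attempt the asymmetric cross-modal attention part of the method.

```python
import torch

def multimodal_mixup(img_feat, txt_feat, labels, alpha=0.2):
    """Generic mixup over paired image/text features and one-hot float labels.
    A single Beta-sampled coefficient is shared across modalities so each
    mixed image/question pair stays aligned; illustrative only."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(img_feat.size(0))
    mixed_img = lam * img_feat + (1 - lam) * img_feat[perm]
    mixed_txt = lam * txt_feat + (1 - lam) * txt_feat[perm]
    mixed_lbl = lam * labels + (1 - lam) * labels[perm]
    return mixed_img, mixed_txt, mixed_lbl
```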