Machine-generated text: A comprehensive survey of threat models and detection methods

EN Crothers, N Japkowicz, HL Viktor - IEEE Access, 2023 - ieeexplore.ieee.org
Machine-generated text is increasingly difficult to distinguish from text authored by humans.
Powerful open-source models are freely available, and user-friendly tools that democratize …

Align and attend: Multimodal summarization with dual contrastive losses

B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …

DocFormerv2: Local features for document understanding

S Appalaraju, P Tang, Q Dong, N Sankaran… - Proceedings of the …, 2024 - ojs.aaai.org
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …

Learning attention propagation for compositional zero-shot learning

MGZA Khan, MF Naeem, L Van Gool… - Proceedings of the …, 2023 - openaccess.thecvf.com
Compositional zero-shot learning aims to recognize unseen compositions of seen visual
primitives of object classes and their states. While all primitives (states and objects) are …

Separate and locate: Rethink the text in text-based visual question answering

C Fang, J Li, L Li, C Ma, D Hu - … of the 31st ACM International Conference …, 2023 - dl.acm.org
Text-based Visual Question Answering (TextVQA) aims at answering questions about the
text in images. Most works in this field focus on designing network structures or pre-training …

PreSTU: Pre-training for scene-text understanding

J Kil, S Changpinyo, X Chen, H Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …

Filling in the blank: Rationale-augmented prompt tuning for TextVQA

G Zeng, Y Zhang, Y Zhou, B Fang, G Zhao… - Proceedings of the 31st …, 2023 - dl.acm.org
Recently, generative Text-based Visual Question Answering (TextVQA) methods, which are
often based on language models, have exhibited impressive results and drawn increasing …

Toward 3D spatial reasoning for human-like text-based visual question answering

H Li, J Huang, P Jin, G Song, Q Wu, J Chen - arXiv preprint arXiv …, 2022 - arxiv.org
Text-based Visual Question Answering (TextVQA) aims to produce correct answers for
given questions about the images with multiple scene texts. In most cases, the texts naturally …

Prophet: Prompting large language models with complementary answer heuristics for knowledge-based visual question answering

Z Yu, X Ouyang, Z Shao, M Wang, J Yu - arXiv preprint arXiv:2303.01903, 2023 - arxiv.org
Knowledge-based visual question answering (VQA) requires external knowledge beyond
the image to answer the question. Early studies retrieve required knowledge from explicit …

Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering

Y Li, Q Yang, FL Wang, LK Lee, Y Qu, T Hao - Artificial Intelligence in …, 2023 - Elsevier
Insufficient training data is a common barrier to effectively learning multimodal information
interactions and question semantics in existing medical Visual Question Answering (VQA) …