AdaShield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting

Y Wang, X Liu, Y Li, M Chen, C Xiao - European Conference on Computer …, 2024 - Springer
With the advent and widespread deployment of Multimodal Large Language Models
(MLLMs), the imperative to ensure their safety has become increasingly pronounced …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Gemini Pro defeated by GPT-4V: Evidence from education

GG Lee, E Latif, L Shi, X Zhai - arXiv preprint arXiv:2401.08660, 2023 - arxiv.org
This study compared the classification performance of Gemini Pro and GPT-4V in
educational settings. Employing visual question answering (VQA) techniques, the study …

GPT4Vis: what can GPT-4 do for zero-shot visual recognition?

W Wu, H Yao, M Zhang, Y Song, W Ouyang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper does not present a novel method. Instead, it delves into an essential, yet must-
know baseline in light of the latest advancements in Generative Artificial Intelligence …

VideoVista: A versatile benchmark for video understanding and reasoning

Y Li, X Chen, B Hu, L Wang, H Shi, M Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant breakthroughs in video analysis driven by the rapid development of large
multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to …

CoCoT: Contrastive chain-of-thought prompting for large multimodal models with multiple image inputs

D Zhang, J Yang, H Lyu, Z Jin, Y Yao, M Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
When exploring the development of Artificial General Intelligence (AGI), a critical task for
these models involves interpreting and processing information from multiple image inputs …

Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character

S Ma, W Luo, Y Wang, X Liu - arXiv preprint arXiv:2405.20773, 2024 - arxiv.org
With the advent and widespread deployment of Multimodal Large Language Models
(MLLMs), ensuring their safety has become increasingly critical. To achieve this objective, it …

FakingRecipe: Detecting fake news on short video platforms from the perspective of creative process

Y Bu, Q Sheng, J Cao, P Qi, D Wang, J Li - Proceedings of the 32nd …, 2024 - dl.acm.org
As short-form video-sharing platforms become a significant channel for news consumption,
fake news in short videos has emerged as a serious threat in the online information …

GPT4Ego: unleashing the potential of pre-trained models for zero-shot egocentric action recognition

G Dai, X Shu, W Wu, R Yan… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown
impressive performance in various visual recognition tasks. This advancement paves the …

Machine-generated text localization

Z Zhang, W Qin, BA Plummer - arXiv preprint arXiv:2402.11744, 2024 - arxiv.org
Machine-Generated Text (MGT) detection aims to identify a piece of text as machine or
human written. Prior work has primarily formulated MGT detection as a binary classification …