Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

VideoLLM-online: Online video large language model for streaming video

J Chen, Z Lv, S Wu, KQ Lin, C Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

Learning towards conversational AI: A survey

T Fu, S Gao, X Zhao, J Wen, R Yan - AI Open, 2022 - Elsevier
Recent years have witnessed a surge of interest in the field of open-domain dialogue.
Thanks to the rapid development of social media, large dialogue corpora from the Internet …

Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably)

Y Huang, J Lin, C Zhou, H Yang… - … conference on machine …, 2022 - proceedings.mlr.press
Despite the remarkable success of deep multi-modal learning in practice, it has not been
well-explained in theory. Recently, it has been observed that the best uni-modal network …

Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration

C Lyu, M Wu, L Wang, X Huang, B Liu, Z Du… - arXiv preprint arXiv …, 2023 - arxiv.org
Although instruction-tuned large language models (LLMs) have exhibited remarkable
capabilities across various NLP tasks, their effectiveness on other data modalities beyond …

What makes training multi-modal classification networks hard?

W Wang, D Tran, M Feiszli - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …

PLATO: Pre-trained dialogue generation model with discrete latent variable

S Bao, H He, F Wang, H Wu, H Wang - arXiv preprint arXiv:1910.07931, 2019 - arxiv.org
Pre-trained models have proven effective for a wide range of natural language
processing tasks. Inspired by this, we propose a novel dialogue generation pre-training …