A survey on generative AI and LLM for video generation, understanding, and streaming

P Zhou, L Wang, Z Liu, Y Hao, P Hui, S Tarkoma… - arxiv preprint arxiv …, 2024 - arxiv.org
This paper offers an insightful examination of how currently top-trending AI technologies, i.e.,
generative artificial intelligence (Generative AI) and large language models (LLMs), are …

Cap4video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …

Revisiting classifier: Transferring vision-language models for video recognition

W Wu, Z Sun, W Ouyang - Proceedings of the AAAI conference on …, 2023 - ojs.aaai.org
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is
an important topic in computer vision research. Along with the growth of computational …

Deep learning innovations in video classification: A survey on techniques and dataset evaluations

M Mao, A Lee, M Hong - Electronics, 2024 - mdpi.com
Video classification has achieved remarkable success in recent years, driven by advanced
deep learning models that automatically categorize video content. This paper provides a …

Ophnet: A large-scale video benchmark for ophthalmic surgical workflow understanding

M Hu, P Xia, L Wang, S Yan, F Tang, Z Xu… - … on Computer Vision, 2024 - Springer
Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery,
and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and …

Disentangling spatial and temporal learning for efficient image-to-video transfer learning

Z Qing, S Zhang, Z Huang, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, large-scale pre-trained language-image models like CLIP have shown
extraordinary capabilities for understanding spatial contents, but naively transferring such …

Lana: A language-capable navigator for instruction following and generation

X Wang, W Wang, J Shao… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recently, visual-language navigation (VLN)--entailing robot agents to follow navigation
instructions--has shown great advance. However, existing literature put most emphasis on …

Alternating gradient descent and mixture-of-experts for integrated multimodal perception

H Akbari, D Kondratyuk, Y Cui… - Advances in …, 2023 - proceedings.neurips.cc
We present Integrated Multimodal Perception (IMP), a simple and scalable
multimodal multi-task training and modeling approach. IMP integrates multimodal inputs …

GPT4Vis: what can GPT-4 do for zero-shot visual recognition?

W Wu, H Yao, M Zhang, Y Song, W Ouyang… - arxiv preprint arxiv …, 2023 - arxiv.org
This paper does not present a novel method. Instead, it delves into an essential, yet must-
know baseline in light of the latest advancements in Generative Artificial Intelligence …

What Can Simple Arithmetic Operations Do for Temporal Modeling?

W Wu, Y Song, Z Sun, J Wang, C Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Temporal modeling plays a crucial role in understanding video content. To tackle this
problem, previous studies built complicated temporal relations through time sequence …