A survey on generative AI and LLM for video generation, understanding, and streaming
This paper offers an insightful examination of how currently top-trending AI technologies, i.e.,
generative artificial intelligence (Generative AI) and large language models (LLMs), are …
Cap4Video: What can auxiliary captions do for text-video retrieval?
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
Revisiting classifier: Transferring vision-language models for video recognition
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is
an important topic in computer vision research. Along with the growth of computational …
Deep learning innovations in video classification: A survey on techniques and dataset evaluations
Video classification has achieved remarkable success in recent years, driven by advanced
deep learning models that automatically categorize video content. This paper provides a …
OphNet: A large-scale video benchmark for ophthalmic surgical workflow understanding
Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery,
and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and …
Disentangling spatial and temporal learning for efficient image-to-video transfer learning
Recently, large-scale pre-trained language-image models like CLIP have shown
extraordinary capabilities for understanding spatial contents, but naively transferring such …
LANA: A language-capable navigator for instruction following and generation
Recently, visual-language navigation (VLN)--entailing robot agents to follow navigation
instructions--has shown great advance. However, existing literature put most emphasis on …
Alternating gradient descent and mixture-of-experts for integrated multimodal perception
We present Integrated Multimodal Perception (IMP), a simple and scalable
multimodal multi-task training and modeling approach. IMP integrates multimodal inputs …
GPT4Vis: what can GPT-4 do for zero-shot visual recognition?
This paper does not present a novel method. Instead, it delves into an essential, yet must-
know baseline in light of the latest advancements in Generative Artificial Intelligence …
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Temporal modeling plays a crucial role in understanding video content. To tackle this
problem, previous studies built complicated temporal relations through time sequence …