Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
Neural prompt search
The size of vision models has grown exponentially over the last few years, especially after
the emergence of Vision Transformer. This has motivated the development of parameter …
V3Det: Vast vocabulary visual detection dataset
Recent advances in detecting arbitrary objects in the real world are trained and evaluated
on object detection datasets with a relatively restricted vocabulary. To facilitate the …
T-Rex2: Towards generic object detection via text-visual prompt synergy
We present T-Rex2, a highly practical model for open-set object detection. Previous open-
set object detection methods relying on text prompts effectively encapsulate the abstract …
Vision-RWKV: Efficient and scalable visual perception with RWKV-like architectures
Transformers have revolutionized computer vision and natural language processing, but
their high computational complexity limits their application in high-resolution image …
Benchmarking omni-vision representation through the lens of visual realms
Though impressive performance has been achieved in specific visual realms (e.g., faces,
dogs, and places), an omni-vision representation generalizing to many natural visual …
RADAM: Texture recognition through randomized aggregated encoding of deep activation maps
Texture analysis is a classical yet challenging task in computer vision for which deep neural
networks are actively being applied. Most approaches are based on building feature …
Octavius: Mitigating task interference in MLLMs via MoE
Zeren Chen, Ziqin Wang, Zhen …
Open long-tailed recognition in a dynamic world
Real-world data often exhibits a long-tailed and open-ended (i.e., with unseen classes)
distribution. A practical recognition system must balance between majority (head) and …
ChEF: A comprehensive evaluation framework for standardized assessment of multimodal large language models
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting
with visual content with myriad potential downstream tasks. However, even though a list of …