Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Neural prompt search

Y Zhang, K Zhou, Z Liu - arXiv preprint arXiv:2206.04673, 2022 - arxiv.org
The size of vision models has grown exponentially over the last few years, especially after
the emergence of Vision Transformer. This has motivated the development of parameter …

V3Det: Vast vocabulary visual detection dataset

J Wang, P Zhang, T Chu, Y Cao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in detecting arbitrary objects in the real world are trained and evaluated
on object detection datasets with a relatively restricted vocabulary. To facilitate the …

T-Rex2: Towards generic object detection via text-visual prompt synergy

Q Jiang, F Li, Z Zeng, T Ren, S Liu, L Zhang - European Conference on …, 2024 - Springer
We present T-Rex2, a highly practical model for open-set object detection. Previous open-
set object detection methods relying on text prompts effectively encapsulate the abstract …

Vision-RWKV: Efficient and scalable visual perception with RWKV-like architectures

Y Duan, W Wang, Z Chen, X Zhu, L Lu, T Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have revolutionized computer vision and natural language processing, but
their high computational complexity limits their application in high-resolution image …

Benchmarking omni-vision representation through the lens of visual realms

Y Zhang, Z Yin, J Shao, Z Liu - European Conference on Computer Vision, 2022 - Springer
Though impressive performance has been achieved in specific visual realms (e.g., faces,
dogs, and places), an omni-vision representation generalizing to many natural visual …

RADAM: Texture recognition through randomized aggregated encoding of deep activation maps

L Scabini, KM Zielinski, LC Ribas, WN Gonçalves… - Pattern Recognition, 2023 - Elsevier
Texture analysis is a classical yet challenging task in computer vision for which deep neural
networks are actively being applied. Most approaches are based on building feature …

Octavius: Mitigating task interference in MLLMs via MoE

Z Chen, Z Wang, Z Wang, H Liu, Z Yin, S Liu… - arXiv preprint arXiv …, 2023 - iclr.cc
Octavius: Mitigating Task Interference in MLLMs via MoE. Zeren Chen, Ziqin Wang, Zhen …

Open long-tailed recognition in a dynamic world

Z Liu, Z Miao, X Zhan, J Wang, B Gong… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Real-world data often exhibits a long-tailed and open-ended (i.e., with unseen classes)
distribution. A practical recognition system must balance between majority (head) and …

ChEF: A comprehensive evaluation framework for standardized assessment of multimodal large language models

Z Shi, Z Wang, H Fan, Z Yin, L Sheng, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting
with visual content with myriad potential downstream tasks. However, even though a list of …