The rise and potential of large language model based agents: A survey
For a long time, researchers have sought artificial intelligence (AI) that matches or exceeds
human intelligence. AI agents, which are artificial entities capable of sensing the …
A comprehensive survey on applications of transformers for deep learning tasks
Transformers are Deep Neural Networks (DNNs) that utilize a self-attention
mechanism to capture contextual relationships within sequential data. Unlike traditional …
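The self-attention mechanism this abstract refers to is compact enough to sketch. Below is a minimal single-head, scaled dot-product version in PyTorch; the function name, projection matrices, and toy dimensions are illustrative assumptions, not code from the survey.

```python
# Minimal scaled dot-product self-attention, the mechanism the survey
# describes. Names and shapes here are illustrative, not from any paper.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # pairwise similarity, scaled
    weights = F.softmax(scores, dim=-1)                    # each token attends over all tokens
    return weights @ v                                     # context-weighted mixture of values

x = torch.randn(2, 5, 16)                  # toy batch: 2 sequences of 5 tokens
w = [torch.randn(16, 16) for _ in range(3)]
out = self_attention(x, *w)                # (2, 5, 16)
```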
ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
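The abstract describes binding modalities into one shared embedding space via image-paired data. A minimal sketch of the image-anchored contrastive (InfoNCE-style) objective such alignment typically uses is below; the infonce helper, embedding size, and temperature are assumptions for illustration, not ImageBind's released code.

```python
# Sketch of image-anchored contrastive alignment: each modality is bound to
# images, so modalities never seen paired together still share one space.
import torch
import torch.nn.functional as F

def infonce(image_emb: torch.Tensor, other_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE between L2-normalized embeddings of paired samples."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / tau   # (N, N) similarity matrix
    targets = torch.arange(len(logits))        # i-th image pairs with i-th sample
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy paired batch: image/audio embeddings from hypothetical encoders.
loss = infonce(torch.randn(8, 32), torch.randn(8, 32))
```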
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
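The paper's core idea is to route cross-modal exchange through a small set of shared "fusion bottleneck" tokens rather than full pairwise attention. A toy single fusion layer along those lines is sketched below; the class name, the averaging rule for the bottleneck update, and the dimensions are illustrative assumptions, not the authors' implementation.

```python
# Toy bottleneck-fusion layer: modalities exchange information only through
# a few shared bottleneck tokens, not via full cross-attention.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_bottleneck: int = 4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d_model))
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        btl = self.bottleneck.expand(tokens_a.size(0), -1, -1)
        # Each modality self-attends over its own tokens plus the bottleneck.
        xa = torch.cat([tokens_a, btl], dim=1)
        xb = torch.cat([tokens_b, btl], dim=1)
        ya, _ = self.attn_a(xa, xa, xa)
        yb, _ = self.attn_b(xb, xb, xb)
        n = self.bottleneck.size(1)
        # Updated bottleneck tokens are pooled across modalities.
        new_btl = (ya[:, -n:] + yb[:, -n:]) / 2
        return ya[:, :-n], yb[:, :-n], new_btl

layer = BottleneckFusionLayer()
video, audio = torch.randn(2, 10, 64), torch.randn(2, 20, 64)
out_v, out_a, btl = layer(video, audio)
```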
InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
MERLOT Reserve: Neural script knowledge through vision and language and sound
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …
Masked autoencoders that listen
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to
self-supervised representation learning from audio spectrograms. Following the Transformer …
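The extension described is masking most patches of an audio spectrogram and training the model to reconstruct them. A minimal sketch of the patchify-and-mask step is below, assuming a log-mel input, a 16×16 patch grid, and an 80% mask ratio; all names and values are illustrative, not the paper's code.

```python
# Sketch of the masking step: a log-mel spectrogram is split into patches and
# most are hidden, so the encoder sees only the visible subset.
import torch

def mask_spectrogram_patches(spec: torch.Tensor, patch: int = 16, ratio: float = 0.8):
    """spec: (freq, time). Returns visible patches and the kept indices."""
    patches = spec.unfold(0, patch, patch).unfold(1, patch, patch)  # patch grid
    patches = patches.reshape(-1, patch * patch)                    # (n_patches, patch*patch)
    n_keep = int(len(patches) * (1 - ratio))
    keep = torch.randperm(len(patches))[:n_keep]                    # random subset survives
    return patches[keep], keep  # encoder runs on these; decoder reconstructs the rest

visible, idx = mask_spectrogram_patches(torch.randn(128, 1024))     # 1024-frame log-mel
```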
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …
SSAST: Self-supervised audio spectrogram transformer
Recently, neural networks based purely on self-attention, such as the Vision Transformer
(ViT), have been shown to outperform deep learning models constructed with convolutional …
WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …