AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

NeurIPS Datasets & Benchmarks Track 2025

*denotes equal contribution
The University of Texas at Dallas, Richardson, TX, USA

Abstract

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time is not yet fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur simultaneously in both audio and visual modalities, we introduce AVROBUSTBENCH, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. AVROBUSTBENCH comprises four audio-visual benchmark datasets, AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, and EPICKITCHENS-2C, each incorporating 75 bimodal audio-visual corruptions that are co-occurring and correlated. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, on VGGSOUND-2C and KINETICS-2C, online test-time adaptation (TTA) methods offer minimal performance improvements under bimodal corruptions. We further propose AV2C, a simple TTA approach that enables on-the-fly cross-modal fusion by penalizing high-entropy samples and achieves improvements on VGGSOUND-2C. We hope that AVROBUSTBENCH will steer the development of more effective and robust audio-visual TTA approaches.

Problem Overview and Motivation

Distributional shifts at test-time are inevitable, uncontrollable, and hamper learning and generalization. Even state-of-the-art audio-visual models still lack robustness to them! 🥹

Are SOTA audio-visual models robust to such co-occurring and correlated bimodal audio-visual corruptions? Can they yet be trusted in deployment?

[Figure] The top two panels correspond to VGGSOUND-2C and show recognition results, including a comparison across CLIP variants; the bottom panels depict performance drops in audio-visual segmentation (bottom-left) and sound-source separation (bottom-right). All results are on the respective test sets with our proposed bimodal audio-visual corruptions.
A rigorous robustness benchmark is needed for audio-visual learning problems!

AVROBUSTBENCH: Benchmark and Contributions

➤ Emulating real-world settings, we release realistic, co-occurring, and correlated audio-visual corruptions at test-time. We propose 15 bimodal audio-visual corruptions, each with 5 severity levels, categorized into Digital, Environmental, and Human-Related corruptions. Our proposed corruptions can be easily extended to any audio/speech-visual test set (see the sketch below).
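As a minimal sketch of the idea, assuming a Gaussian-noise corruption, the same severity level can drive noise injection in both modalities so the corruption co-occurs and is correlated. The function name and severity-to-noise mapping below are placeholders, not the released implementation:

```python
import numpy as np

# Placeholder severity-to-noise mapping; the released benchmark defines its
# own per-corruption parameters for the 5 severity levels.
SEVERITY_TO_SIGMA = {1: 0.04, 2: 0.06, 3: 0.09, 4: 0.13, 5: 0.18}

def gaussian_bimodal(frames: np.ndarray, waveform: np.ndarray, severity: int = 1):
    """Apply Gaussian noise to video frames (values in [0, 1]) and to the
    audio waveform (values in [-1, 1]) at the same severity level, so the
    corruption co-occurs, and is correlated, across both modalities."""
    sigma = SEVERITY_TO_SIGMA[severity]
    noisy_frames = np.clip(frames + np.random.normal(0.0, sigma, frames.shape), 0.0, 1.0)
    noisy_audio = np.clip(waveform + np.random.normal(0.0, sigma, waveform.shape), -1.0, 1.0)
    return noisy_frames, noisy_audio
```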

➤ We also release four benchmark datasets: AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, and EPICKITCHENS-2C, each containing 75 bimodal audio-visual corruptions applied on their respective source test sets.

Examples from AUDIOSET-2C
[Video gallery] Clean/source clip alongside the 15 bimodal corruptions: Gaussian, Impulse, Shot, Speckle, Compression, Snow, Frost, Spatter, Wind, Rain, Underwater, Concert, Smoke, Crowd, Interference.
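Since each dataset pairs these 15 corruptions with 5 severity levels, the 75 test conditions can be enumerated as below; the lowercase identifier spellings are assumptions for illustration, not the datasets' actual file names:

```python
# Hypothetical identifiers for the 15 corruptions listed above.
CORRUPTIONS = [
    "gaussian", "impulse", "shot", "speckle", "compression",
    "snow", "frost", "spatter", "wind", "rain",
    "underwater", "concert", "smoke", "crowd", "interference",
]
SEVERITIES = range(1, 6)  # severity levels 1-5

# 15 corruptions x 5 severities = 75 bimodal test conditions per dataset.
conditions = [(name, level) for name in CORRUPTIONS for level in SEVERITIES]
assert len(conditions) == 75
```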

➤ We propose a simple and effective online test-time adaptation (TTA) method, AV2C, to overcome bimodal distributional shifts. AV2C performs on-the-fly cross-modal fusion while penalizing high-entropy samples, enabling models to adapt at test-time; a minimal sketch of the core update follows.
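Below is a minimal PyTorch sketch in the spirit of AV2C, assuming simple logit-averaging fusion, a fixed entropy threshold, and norm-layer-only updates; the paper's exact fusion rule and penalty scheme may differ:

```python
import torch

def av2c_step(audio_logits, video_logits, optimizer, entropy_threshold=2.0):
    """One online adaptation step in the spirit of AV2C: fuse the two
    modalities' predictions, then minimize the entropy of the fused
    prediction while discarding high-entropy (unreliable) test samples.
    The averaging fusion and fixed threshold are illustrative choices."""
    fused_logits = (audio_logits + video_logits) / 2           # simple late fusion
    probs = fused_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # per-sample entropy
    keep = entropy < entropy_threshold                         # penalize/skip high-entropy samples
    if keep.any():
        loss = entropy[keep].mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # e.g., updates only normalization-layer affine parameters
    return fused_logits.detach()
```

In practice, the optimizer would be constructed over a small subset of parameters (commonly the normalization layers' affine weights, as in entropy-minimization TTA methods) so the model adapts online without full retraining.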

Analysis, Discussions and Takeaways

[Figure] Robustness results of SOTA audio-visual models under our bimodal corruptions.

We observe a significant gap between clean accuracy and performance under our bimodal AV corruptions (severity 5). We report top-1 accuracy, absolute robustness, and relative robustness averaged across corruptions. Models include both supervised and self-supervised variants.
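For reference, such summaries can be computed as in the sketch below, under one common convention (the paper's exact definitions of absolute and relative robustness may differ); `robustness_summary` is a hypothetical helper:

```python
def robustness_summary(clean_acc: float, corrupted_accs: list[float]):
    """Summarize robustness under one common convention: absolute
    robustness as the mean top-1 accuracy drop from clean to corrupted,
    and relative robustness as that drop normalized by clean accuracy."""
    mean_corrupted = sum(corrupted_accs) / len(corrupted_accs)
    absolute_drop = clean_acc - mean_corrupted  # percentage points lost
    relative_drop = absolute_drop / clean_acc   # fraction of clean accuracy lost
    return mean_corrupted, absolute_drop, relative_drop

# Example: clean top-1 of 60% falling to 35% on average across corruptions.
print(robustness_summary(60.0, [40.0, 35.0, 30.0]))  # (35.0, 25.0, 0.4166...)
```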

Citation

If you find our work useful, please consider citing it.

@inproceedings{maharana2025avrobustbench,
  title={AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time},
  author={Maharana, Sarthak Kumar and Kushwaha, Saksham Singh and Zhang, Baoming and Rodriguez, Adrian and Wei, Songtao and Tian, Yapeng and Guo, Yunhui},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}