ECCV 2026 · Malmö, Sweden 🇸🇪

Audio-Visual Continual Test-Time Adaptation without Forgetting

ECCV 2026 · ICML CATS Workshop 2026
¹The University of Texas at Dallas · ²Dolby Laboratories Inc.

Work done during an internship.

TL;DR

Existing audio-visual online test-time adaptation methods degrade under continual domain shifts. We show that the attention fusion layer transfers strongly across related corruptions — so instead of continually overwriting shared parameters (which leads to catastrophic forgetting), our proposed AVReCAP maintains a shared buffer of fusion-layer snapshots and selectively retrieves the best match using modality-specific input statistics. This source-free approach improves accuracy on unimodal and bimodal benchmarks while largely preserving source knowledge and minimizing catastrophic forgetting.

AVReCAP Framework

A memory-efficient buffer stores fusion-layer parameter snapshots; at each time step, KL-divergence over modality statistics guides selective retrieval before a single online adaptation step.

[Figure: AVReCAP framework illustration]

AVReCAP at time-step t: audio-visual inputs are summarized by modality-specific Gaussian statistics. A KL-divergence criterion compares the current batch against elements in shared buffer K. If a close match is found (within threshold τ), stored attention fusion parameters (W_Q, W_K, W_V) are retrieved, adapted with READ-style losses, and written back. Otherwise, a new buffer element is added and redundant entries are merged to respect memory budget η. Only the fusion layer is adapted; the audio and visual encoders remain frozen.
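
The retrieval criterion described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the names `gaussian_stats`, `kl_diag`, and `retrieve` are hypothetical, the covariance is assumed to be diagonal for brevity, and both the direction of the KL divergence (current batch relative to stored entry) and the summing of the audio and visual terms are assumptions.

```python
import math

def gaussian_stats(batch, eps=1e-6):
    """Per-dimension mean and variance of a batch (a list of feature vectors).

    Assumption: a diagonal covariance; eps keeps variances strictly positive.
    """
    n, d = len(batch), len(batch[0])
    mu = [sum(x[j] for x in batch) / n for j in range(d)]
    var = [sum((x[j] - mu[j]) ** 2 for x in batch) / n + eps for j in range(d)]
    return mu, var

def kl_diag(p, q):
    """KL(p || q) between diagonal Gaussians p = (mu_p, var_p), q = (mu_q, var_q)."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def retrieve(audio_stats, visual_stats, buffer, tau):
    """Return the index of the closest buffer entry, or None if none is within tau.

    Each buffer entry is a dict with "audio" and "visual" statistics.
    Assumption: the two modality divergences are simply added together.
    """
    best_idx, best_div = None, float("inf")
    for i, entry in enumerate(buffer):
        div = (kl_diag(audio_stats, entry["audio"])
               + kl_diag(visual_stats, entry["visual"]))
        if div < best_div:
            best_idx, best_div = i, div
    return best_idx if best_div <= tau else None
```

When `retrieve` returns an index, the fusion parameters stored with that entry would be loaded into the model before the adaptation step. A return value of `None` corresponds to the case where a new buffer element is created.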

The Challenge

In real-world deployment, audio-visual models encounter a sequence of evolving domains without task boundaries or source data. Continual parameter updates under bimodal shifts amplify error accumulation and catastrophic forgetting.

How AVReCAP Works

At each time step, AVReCAP summarizes incoming batches, retrieves compatible fusion parameters from a shared buffer, adapts online, and maintains a fixed memory budget.

Step | Stage | Description
1 | Summarize inputs | Compute modality-specific means and covariances (μ, Σ) for audio and visual inputs from the current test batch.
2 | Retrieve from buffer | Compare current statistics to stored buffer elements via KL divergence. If the best match is within threshold τ, retrieve the corresponding fusion parameters (W_Q, W_K, W_V) and plug them into the model.
3 | Adapt & update | Perform one online gradient step on the fusion layer. EMA-update the matched buffer entry's statistics and parameters.
4 | Expand or merge buffer | If no close match exists, add a new buffer element after adaptation. When the buffer exceeds budget η, merge the two most similar entries to preserve a compact memory footprint.
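
Steps 3 and 4 (the write-back after adaptation) can be sketched as follows. This is a hedged illustration rather than the paper's code: `ema` and `update_buffer` are hypothetical names, statistics and fusion parameters are represented as flat lists of floats, the momentum `m` is an assumed value, and merging two entries by plain averaging is an assumption, since the page does not state how merged entries are computed. The divergence is passed in as a callable, so a Gaussian KL or any other measure can be used.

```python
def ema(old, new, m):
    """Element-wise exponential moving average: m * old + (1 - m) * new."""
    return [m * o + (1.0 - m) * n for o, n in zip(old, new)]

def update_buffer(buffer, match, stats, params, divergence, eta, m=0.9):
    """Write adapted fusion parameters back into the buffer (modified in place).

    buffer:     list of {"stats": [...], "params": [...]} entries
    match:      index of the retrieved entry, or None if nothing was within tau
    stats:      flattened statistics of the current batch
    params:     flattened fusion parameters after the online adaptation step
    divergence: callable(stats_a, stats_b) -> float
    eta:        maximum number of buffer entries (assumed to be >= 1)
    m:          EMA momentum (assumed value, not taken from the paper)
    """
    if match is not None:
        # Step 3: EMA-update the matched entry's statistics and parameters.
        entry = buffer[match]
        entry["stats"] = ema(entry["stats"], stats, m)
        entry["params"] = ema(entry["params"], params, m)
        return buffer

    # Step 4: no close match, so store the adapted parameters as a new entry.
    buffer.append({"stats": list(stats), "params": list(params)})
    if len(buffer) > eta:
        # Over budget: find the two most similar entries and merge them.
        pairs = [
            (divergence(buffer[i]["stats"], buffer[j]["stats"]), i, j)
            for i in range(len(buffer))
            for j in range(i + 1, len(buffer))
        ]
        _, i, j = min(pairs)
        buffer[i] = {
            # Assumption: merge by averaging the two entries.
            "stats": ema(buffer[i]["stats"], buffer[j]["stats"], 0.5),
            "params": ema(buffer[i]["params"], buffer[j]["params"], 0.5),
        }
        del buffer[j]
    return buffer
```

Because at most one entry is added per time step and one merge follows whenever the budget is exceeded, the buffer never holds more than η entries after an update.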

Contributions

Source-free AV continual test-time adaptation study

Comprehensive evaluation under unimodal and bimodal corruptions in a strict source-free continual setting.

Fusion-layer transfer

We show intra- and cross-category transferability of attention fusion parameters, enabling selective retrieval rather than overwriting shared weights across domains.

Minimal forgetting

AVReCAP achieves state-of-the-art continual adaptation while incurring only a 2.9% drop in source accuracy (our measure of forgetting) on VGGSound, compared to 27.9% for READ.

BibTeX

If our work is of interest to you, consider citing it.

@inproceedings{maharana2026avrecap,
  title={Audio-Visual Continual Test-Time Adaptation without Forgetting},
  author={Maharana, Sarthak Kumar and Mehra, Akshay and Ramakrishna, Bhavya and Guo, Yunhui and Su, Guan-Ming},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}