ECCV 2026 · Malmö, Sweden 🇸🇪
Work done during an internship.
A memory-efficient buffer stores fusion-layer parameter snapshots; at each time step, KL-divergence over modality statistics guides selective retrieval before a single online adaptation step.
AVReCAP at time-step t: audio-visual inputs are summarized by modality-specific Gaussian statistics. A KL-divergence criterion compares the current batch against elements in shared buffer K. If a close match is found (within threshold τ), stored attention fusion parameters (WQ, WK, WV) are retrieved, adapted with READ-style losses, and written back. Otherwise, a new buffer element is added and redundant entries are merged to respect memory budget η. Only the fusion layer is adapted — audio and visual encoders remain frozen.
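For reference, the KL divergence between two Gaussians has a standard closed form. The notation below is an assumption made for illustration: $(\mu_t, \Sigma_t)$ denotes the statistics of the current batch for one modality, $(\mu_k, \Sigma_k)$ those of buffer element $k$, and $d$ the feature dimension. The direction of the divergence and the choice of full versus diagonal covariance are not specified in this summary, so this is one plausible form rather than the exact criterion used by AVReCAP.

$$
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \Sigma_t)\,\|\,\mathcal{N}(\mu_k, \Sigma_k)\big)
= \frac{1}{2}\left[\operatorname{tr}\!\big(\Sigma_k^{-1}\Sigma_t\big)
+ (\mu_k - \mu_t)^{\top}\Sigma_k^{-1}(\mu_k - \mu_t)
- d + \ln\frac{\det\Sigma_k}{\det\Sigma_t}\right]
$$

Under this reading, a buffer element would count as a close match when the divergence falls within the threshold τ.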
At each time step, AVReCAP summarizes incoming batches, retrieves compatible fusion parameters from a shared buffer, adapts online, and maintains a fixed memory budget.
| Step | Stage | Description |
|---|---|---|
| 1 | Summarize inputs | Compute modality-specific means and covariances (μ, Σ) for audio and visual inputs from the current test batch. |
| 2 | Retrieve from buffer | Compare current statistics to stored buffer elements via KL divergence. If the best match is within threshold τ, retrieve the corresponding fusion parameters (WQ, WK, WV) and plug them into the model. |
| 3 | Adapt & update | Perform one online gradient step on the fusion layer. EMA-update the matched buffer entry's statistics and parameters. |
| 4 | Expand or merge buffer | If no close match exists, add a new buffer element after adaptation. When the buffer exceeds budget η, merge the two most similar entries to preserve a compact memory footprint. |
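The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: covariances are taken as diagonal, the audio and visual KL terms are summed into one distance, merging averages the two closest entries, the EMA weight of 0.9 is arbitrary, and a flat list of floats stands in for the fusion parameters (WQ, WK, WV). The names `FusionBuffer`, `step`, and `adapt_fn` are hypothetical.

```python
import math

MODALITIES = ("audio", "visual")

def gaussian_stats(feats):
    """Step 1: per-dimension mean and variance of one modality's batch features."""
    n, d = len(feats), len(feats[0])
    mean = [sum(x[j] for x in feats) / n for j in range(d)]
    # Small constant keeps variances strictly positive (assumption).
    var = [sum((x[j] - mean[j]) ** 2 for x in feats) / n + 1e-6 for j in range(d)]
    return mean, var

def kl_diag(p, q):
    """KL(p || q) between diagonal Gaussians given as (mean, var) pairs."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(p[0], p[1], q[0], q[1])
    )

def mix(a, b, w):
    """Element-wise w * a + (1 - w) * b."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

def mix_stats(a, b, w):
    return {m: (mix(a[m][0], b[m][0], w), mix(a[m][1], b[m][1], w)) for m in MODALITIES}

class FusionBuffer:
    def __init__(self, tau, eta, ema=0.9):
        # tau: match threshold; eta: maximum number of entries (assumed >= 1).
        self.tau, self.eta, self.ema = tau, eta, ema
        self.entries = []  # each: {"stats": {modality: (mean, var)}, "params": [float, ...]}

    def distance(self, a, b):
        # Assumption: audio and visual KL terms are simply summed.
        return sum(kl_diag(a[m], b[m]) for m in MODALITIES)

    def retrieve(self, stats):
        """Step 2: index of the closest entry if it is within tau, else None."""
        if not self.entries:
            return None
        dists = [self.distance(stats, e["stats"]) for e in self.entries]
        best = min(range(len(dists)), key=dists.__getitem__)
        return best if dists[best] <= self.tau else None

    def write(self, idx, stats, params):
        if idx is not None:
            # Step 3: EMA-update the matched entry's statistics and parameters.
            e = self.entries[idx]
            e["stats"] = mix_stats(e["stats"], stats, self.ema)
            e["params"] = mix(e["params"], params, self.ema)
            return
        # Step 4: no close match, so add a new entry, then merge if over budget.
        self.entries.append({"stats": stats, "params": list(params)})
        if len(self.entries) > self.eta:
            self._merge_closest()

    def _merge_closest(self):
        n = len(self.entries)
        def sym(pair):
            a, b = self.entries[pair[0]]["stats"], self.entries[pair[1]]["stats"]
            return self.distance(a, b) + self.distance(b, a)
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)), key=sym)
        b = self.entries.pop(j)  # j > i, so index i is unaffected
        a = self.entries[i]
        # Assumption: merged entry is the plain average of the two.
        self.entries[i] = {"stats": mix_stats(a["stats"], b["stats"], 0.5),
                           "params": mix(a["params"], b["params"], 0.5)}

def step(buffer, stats, current_params, adapt_fn):
    """One time step: retrieve, adapt once, write back. Returns the adapted parameters."""
    idx = buffer.retrieve(stats)
    params = buffer.entries[idx]["params"] if idx is not None else current_params
    params = adapt_fn(params)  # placeholder for the single online gradient step
    buffer.write(idx, stats, params)
    return params
```

In use, `stats` would be built as `{"audio": gaussian_stats(audio_feats), "visual": gaussian_stats(visual_feats)}` for each test batch, and `adapt_fn` would wrap whatever single gradient step is applied to the fusion layer. The encoders never appear in the sketch, which mirrors the fact that only the fusion layer is adapted.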
Comprehensive evaluation under unimodal and bimodal corruptions in a strict source-free continual setting.
We show intra- and cross-category transferability of attention fusion parameters, enabling selective retrieval rather than overwriting shared weights across domains.
AVReCAP achieves state-of-the-art continual adaptation while losing only 2.9% source accuracy on VGGSound (our measure of forgetting), compared to a 27.9% drop for READ.
If our work is of interest to you, consider citing it.
@inproceedings{maharana2026avrecap,
title={Audio-Visual Continual Test-Time Adaptation without Forgetting},
author={Maharana, Sarthak Kumar and Mehra, Akshay and Ramakrishna, Bhavya and Guo, Yunhui and Su, Guan-Ming},
booktitle={European Conference on Computer Vision (ECCV)},
year={2026}
}