How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects (arxiv.org)

by rss-bot · 2 weeks ago

arXiv:2606.08777v1 Announce Type: new
Abstract: Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging a circuit discovery technique (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.
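The abstract does not give the exact form of the metric or the bound, so what follows is only a minimal Python sketch under stated assumptions, not the authors' method. It assumes a normalized patching effect (the share of the factual-minus-counterfactual log-probability gap restored by activation patching) and uses two textbook concentration inequalities, Hoeffding (range only) and Bernstein (range plus variance), to turn a tolerance eps and failure probability delta into a minimum sample count m. The function names are hypothetical, and the toy log-probabilities in the usage lines are illustrative inputs, not results from the paper.

```python
import math
from statistics import variance


def causal_influence(logp_factual: float, logp_counterfactual: float,
                     logp_patched: float) -> float:
    """ASSUMED metric (not taken from the paper): fraction of the
    factual-minus-counterfactual log-prob gap restored when activations
    are patched into the counterfactual run."""
    gap = logp_factual - logp_counterfactual
    if gap == 0:
        raise ValueError("factual and counterfactual log-probs are equal")
    return (logp_patched - logp_counterfactual) / gap


def hoeffding_min_samples(eps: float, delta: float, value_range: float) -> int:
    """Smallest m with 2*exp(-2*m*eps^2 / R^2) <= delta (two-sided Hoeffding)
    for i.i.d. influences lying in an interval of width R = value_range."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * eps ** 2))


def bernstein_min_samples(eps: float, delta: float, var: float,
                          max_dev: float) -> int:
    """Smallest m with 2*exp(-m*eps^2 / (2*var + 2*max_dev*eps/3)) <= delta
    (two-sided Bernstein), assuming |X - mu| <= max_dev almost surely."""
    return math.ceil((2 * var + 2 * max_dev * eps / 3)
                     * math.log(2 / delta) / eps ** 2)


def required_samples_from_pilot(influences: list[float], eps: float,
                                delta: float, value_range: float) -> int:
    """Heuristic: plug the sample variance of a pilot set of influences into
    the Bernstein bound. A rigorous version would use an empirical-Bernstein
    inequality, which adds a term for the error in the variance estimate."""
    if len(influences) < 2:
        raise ValueError("need at least two pilot samples to estimate variance")
    # If influences lie in [a, b], then |X - mu| <= b - a, so the range bounds max_dev.
    return bernstein_min_samples(eps, delta, variance(influences), value_range)


# Toy usage with made-up log-probabilities (illustrative only).
print(causal_influence(-1.0, -3.0, -2.0))                             # 0.5
print(hoeffding_min_samples(0.1, 0.05, 1.0))                          # 185
print(required_samples_from_pilot([0.4, 0.5, 0.6], 0.1, 0.05, 1.0))  # 32
```

The comparison shows why variance estimates matter: when the influence distribution is tight, the variance-aware Bernstein bound asks for far fewer counterfactuals than the range-only Hoeffding bound at the same eps and delta.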


