Counterfactual Reasoning for Robust Visual Question Answering
Overview
Visual question answering models often exploit language priors instead of grounding their predictions in image content.
This project develops a robust training framework that encourages models to distinguish between valid visual evidence and spurious correlations in the training data.
Method
The framework combines:
- Answer-contrastive regularization to improve discrimination between semantically related answer candidates.
- Gradient-discrepancy regularization to reduce reliance on biased question-answer patterns.
- Curriculum training to progressively introduce counterfactual and debiasing objectives.
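As a rough illustration of how the three components above could fit together, the sketch below ramps the auxiliary objectives in over training epochs and adds them to a base cross-entropy term. Everything in it is an assumption made for illustration and is not taken from the project: the function names, the linear ramp, the epoch values, and the hinge-style form of the answer-contrastive term are all hypothetical. The gradient-discrepancy term is treated as an already-computed scalar, because its exact form is not specified here.

```python
# Illustrative sketch only: names, schedule values, and loss forms are
# assumptions, not the project's actual implementation.

def ramp(epoch, start, length):
    """Linear weight that is 0 up to `start` and reaches 1 after `length` epochs."""
    progress = (epoch - start) / length
    return min(1.0, max(0.0, progress))


def curriculum_weights(epoch):
    """Hypothetical schedule: the contrastive term is introduced first,
    the gradient-discrepancy (debiasing) term later."""
    return {
        "contrastive": ramp(epoch, start=2, length=4),
        "grad_disc": ramp(epoch, start=6, length=4),
    }


def answer_contrastive(logits, target, related, margin=1.0):
    """Assumed hinge form: penalize semantically related wrong answers whose
    score comes within `margin` of the correct answer's score."""
    if not related:
        return 0.0
    penalties = [max(0.0, margin - (logits[target] - logits[j])) for j in related]
    return sum(penalties) / len(penalties)


def total_loss(epoch, ce, contrastive, grad_disc):
    """Base cross-entropy plus curriculum-weighted auxiliary terms.
    `grad_disc` is a placeholder scalar for the gradient-discrepancy penalty."""
    w = curriculum_weights(epoch)
    return ce + w["contrastive"] * contrastive + w["grad_disc"] * grad_disc


if __name__ == "__main__":
    # Show how the auxiliary weights grow over the first epochs.
    for epoch in range(0, 12, 2):
        print(epoch, curriculum_weights(epoch))
```

Under this schedule, early epochs optimize only the base answer loss, and each regularizer is phased in gradually so that it does not dominate before the model has learned a reasonable answer distribution.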
Results
The method reached 61.64% accuracy on VQA-CP v2 and narrowed the accuracy gap between VQA v2 and VQA-CP v2 to 1.16 percentage points.