Counterfactual Reasoning for Robust Visual Question Answering

Overview

Visual question answering models often exploit language priors instead of grounding their predictions in image content.

This project develops a robust training framework that encourages models to distinguish between valid visual evidence and spurious correlations in the training data.

Method

The framework combines:

Answer-contrastive regularization to improve discrimination between semantically related answer candidates.
Gradient-discrepancy regularization to reduce reliance on biased question-answer patterns.
Curriculum training to progressively introduce counterfactual and debiasing objectives.
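
The three components above can be sketched as terms in a single training loss. The summary does not give the exact formulation, so everything in this sketch is an assumption made for illustration: the softmax-based contrastive term, the gradient-alignment penalty computed against a hypothetical question-only branch, the linear ramp schedule, and every function name and hyperparameter value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_contrastive_loss(scores, target, temperature=1.0):
    """Assumed InfoNCE-style term: -log softmax(scores / T)[target].

    `scores` holds the model's scores for the correct answer and for
    semantically related distractor answers.
    """
    probs = softmax([s / temperature for s in scores])
    return -math.log(probs[target])


def ce_grad(logits, target):
    """Gradient of cross-entropy w.r.t. the logits: softmax(z) - onehot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def gradient_discrepancy_penalty(fused_logits, question_only_logits, target):
    """Assumed stand-in for the gradient-discrepancy regularizer.

    Penalizes positive cosine similarity between the loss gradient of the
    full (image + question) model and that of a hypothetical question-only
    branch, so the full model is discouraged from learning the same
    signal as the language-only shortcut.
    """
    g_f = ce_grad(fused_logits, target)
    g_q = ce_grad(question_only_logits, target)
    dot = sum(a * b for a, b in zip(g_f, g_q))
    norm = math.sqrt(sum(a * a for a in g_f)) * math.sqrt(sum(b * b for b in g_q))
    if norm == 0.0:
        return 0.0
    return max(0.0, dot / norm)


def curriculum_weight(epoch, start_epoch, ramp_epochs, max_weight):
    """Linear ramp: 0 before start_epoch, then rises to max_weight."""
    if epoch < start_epoch:
        return 0.0
    progress = min(1.0, (epoch - start_epoch + 1) / ramp_epochs)
    return max_weight * progress


def total_loss(base_loss, contrastive, grad_penalty, epoch):
    """Base VQA loss plus the two regularizers, phased in by epoch.

    The start epochs, ramp lengths, and maximum weights are placeholders,
    not values reported by the project.
    """
    w_c = curriculum_weight(epoch, start_epoch=2, ramp_epochs=4, max_weight=0.5)
    w_g = curriculum_weight(epoch, start_epoch=4, ramp_epochs=4, max_weight=0.1)
    return base_loss + w_c * contrastive + w_g * grad_penalty
```

Under this schedule the model trains on the base loss alone for the first epochs, and each auxiliary objective is introduced only after its start epoch. A real implementation would most likely compute the gradients with an automatic differentiation framework rather than the closed-form expression used here.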

Results

The method achieved 61.64% accuracy on VQA-CP v2 and narrowed the generalization gap between VQA-CP v2 and VQA v2 to 1.16 percentage points.

Resources

Source code