Describe Anything Model for Visual Question Answering on Text-Rich Images
Summary
This work investigates how region-level descriptions generated by the Describe Anything Model can support visual question answering on text-rich images.
The study evaluates the approach on six benchmark datasets covering documents, infographics, charts, and other visually structured content.
My Contribution
I led the experimental evaluation and validation of multiple vision-language models across six benchmark datasets.