Describe Anything Model for Visual Question Answering on Text-Rich Images

October 2025 | Yen-Linh Vu*, Dinh-Thang Duong*, Truong-Binh Duong, et al. | VisionDocs Workshop at ICCV 2025

#Vision-Language Models #Text-Rich Images #Visual Question Answering #Document Intelligence

Summary

This work investigates how region-level descriptions generated by the Describe Anything Model can support visual question answering on text-rich images.

The study evaluates the approach on six benchmark datasets covering documents, infographics, charts, and other visually structured content.

My Contribution

I led the comprehensive experimental evaluation and validation of multiple vision-language models across six benchmark datasets.

Resources

Paper
Code