Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Abhirama Subramanyam Penamakuri1, Manish Gupta2, Mithun Das Gupta2, Anand Mishra1

1Indian Institute of Technology Jodhpur 2Microsoft, India

IJCAI 2023

[Paper] [arxiv] [Slides] [Poster] [Short talk] [Data] [Code]


     

Abstract

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. In such a setting, a model must first retrieve the relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (RETVQA for short). RETVQA is distinctly different from, and more challenging than, the traditionally studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates a free-form, fluent answer. Further, we introduce the largest dataset in this space, namely RETVQA, which has the following salient features: a multi-image and retrieval requirement for VQA, metadata-independent questions over a pool of heterogeneous images, and a mix of classification-oriented and open-ended generative answers. Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed RETVQA dataset, and also outperforms state-of-the-art methods by 4.9% and 11.8% on the accuracy and fluency metrics, respectively, on the image segment of the publicly available WebQA dataset.

Highlights

  • We introduce RETVQA, the largest dataset in this space, with a multi-image and retrieval requirement for VQA.
  • We propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates a free-form, fluent answer.
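
The two-stage setting described above (retrieve relevant images from the pool, then answer from the retrieved images) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag-overlap scorer and the string-building generator are toy stand-ins for the relevance encoder and MI-BART, and every name here (`retrieve_relevant`, `retrieve_then_answer`, `toy_relevance`, `toy_generate`, the `tags` field) is hypothetical.

```python
from typing import Any, Callable, Dict, List

# Hypothetical image record: an id plus a set of descriptive tags.
Image = Dict[str, Any]


def retrieve_relevant(question: str,
                      pool: List[Image],
                      relevance: Callable[[str, Image], float],
                      top_k: int = 2) -> List[Image]:
    """Stage 1: rank every image in the pool by relevance, keep the top_k."""
    ranked = sorted(pool, key=lambda img: relevance(question, img), reverse=True)
    return ranked[:top_k]


def retrieve_then_answer(question: str,
                         pool: List[Image],
                         relevance: Callable[[str, Image], float],
                         generate: Callable[[str, List[Image]], str],
                         top_k: int = 2) -> str:
    """Stage 2: generate an answer from the question and the retrieved images only."""
    retrieved = retrieve_relevant(question, pool, relevance, top_k)
    return generate(question, retrieved)


def toy_relevance(question: str, image: Image) -> float:
    """Toy scorer (assumption): number of question words that match image tags."""
    words = set(question.lower().split())
    return float(len(words & image["tags"]))


def toy_generate(question: str, images: List[Image]) -> str:
    """Toy generator (assumption): reports which images the answer would use."""
    return "answered from images: " + ", ".join(img["id"] for img in images)


pool = [
    {"id": "a", "tags": {"cat", "sofa"}},
    {"id": "b", "tags": {"dog", "park"}},
    {"id": "c", "tags": {"cat", "garden"}},
]
question = "do the cat images show a sofa"
print(retrieve_then_answer(question, pool, toy_relevance, toy_generate))
```

Passing the scorer and generator in as callables keeps the retrieval and answering stages separable, so a learned relevance model or a multi-image generator could be swapped in without changing the pipeline function.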

Code and Data

  1. [Code]
  2. [Data]

Bibtex

Please cite our work as follows:

@inproceedings{retvqa,
  author       = {Abhirama Subramanyam Penamakuri and
                  Manish Gupta and
                  Mithun Das Gupta and
                  Anand Mishra},
  title        = {Answer Mining from a Pool of Images: Towards Retrieval-Based Visual
                  Question Answering},
  booktitle    = {IJCAI},
  publisher    = {ijcai.org},
  year         = {2023},
  url          = {https://doi.org/10.24963/ijcai.2023/146},
  doi          = {10.24963/ijcai.2023/146},
}

Acknowledgements

Abhirama S. Penamakuri is supported by the Prime Minister's Research Fellowship (PMRF), Ministry of Education, Government of India.
We thank Microsoft for supporting this work through the Microsoft Academic Partnership Grant (MAPG) 2021.