Empirical Benchmarking of Vision-Language Transformer Combinations for Visual Question Answering Tasks
Keywords:
VQA, Multimodal, Vision-Language Models, NLP, CV, Transformer

Abstract
Visual Question Answering (VQA) requires methods that combine visual and textual information to produce accurate answers. We conduct an empirical evaluation of six vision-language transformer combinations for VQA on the DAQUAR dataset. The experiments pair a Vision Transformer (ViT), a Data-efficient Image Transformer (DeiT), and a Swin Transformer as image encoders with BERT and RoBERTa as language models. Under a common experimental protocol, we track per-epoch validation accuracy and consider the computational efficiency of each combination. The peak accuracies are: ViT + BERT (65.50%), ViT + RoBERTa (68.47%), DeiT + BERT (61.15%), DeiT + RoBERTa (66.47%), Swin + BERT (73.42%), and Swin + RoBERTa (64.50%). Swin + BERT achieves the best result among the combinations tested, with a peak accuracy of 73.42%, illustrating the benefit of pairing hierarchical vision transformers with strong language understanding on VQA tasks. We hope this benchmark offers useful guidance to researchers selecting vision-language models under performance and resource constraints.
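To make the evaluated setup concrete, the following is a minimal sketch (not the authors' exact code) of one of the six combinations: a ViT image encoder and a BERT text encoder whose pooled outputs are concatenated and passed to a linear classifier over the DAQUAR answer vocabulary. The checkpoint names, fusion-by-concatenation head, and hidden size are illustrative assumptions; swapping in DeiT/Swin or RoBERTa checkpoints would reproduce the other pairings.

import torch
import torch.nn as nn
from transformers import ViTModel, BertModel

class VQAFusionClassifier(nn.Module):
    """Illustrative vision-language VQA classifier: ViT + BERT with late fusion."""

    def __init__(self, num_answers: int, hidden: int = 512):
        super().__init__()
        # Pretrained encoders; other ViT/DeiT/Swin and BERT/RoBERTa checkpoints
        # could be substituted to form the remaining combinations.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = self.vision.config.hidden_size + self.text.config.hidden_size
        # Treat VQA as classification over the answer vocabulary (an assumption).
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Pooled image and question representations, concatenated before classification.
        img = self.vision(pixel_values=pixel_values).pooler_output
        txt = self.text(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([img, txt], dim=-1))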
License
Copyright (c) 2026 DMPedia Lecture Notes in Computer Science & Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.