Empirical Benchmarking of Vision-Language Transformer Combinations for Visual Question Answering Tasks

Authors

  • Satyam Singh, Department of Computer Science & Engineering, Sharda University, Greater Noida, Uttar Pradesh, India
  • Aditya Sharma, Department of Computer Science & Engineering, Sharda University, Greater Noida, Uttar Pradesh, India
  • Smita Tiwari, Department of Computer Science & Engineering, Sharda University, Greater Noida, Uttar Pradesh, India

Keywords:

VQA, Multimodal, Vision-Language Models, NLP, CV, Transformer

Abstract

Visual Question Answering (VQA) requires methods that integrate visual and textual information to produce accurate answers. We conduct an empirical evaluation of six vision-language transformer combinations for VQA on the DAQUAR dataset. The experiments pair three image encoders, namely a Vision Transformer (ViT), a Data-efficient Image Transformer (DeiT), and a Swin Transformer, with two language models, BERT and RoBERTa. Under an identical experimental protocol, we track per-epoch validation accuracy and consider each combination's computational efficiency. The peak accuracies were: ViT + BERT (65.50%), ViT + RoBERTa (68.47%), DeiT + BERT (61.15%), DeiT + RoBERTa (66.47%), Swin + BERT (73.42%), and Swin + RoBERTa (64.50%). The Swin + BERT combination produced the best accuracy among those tested, peaking at 73.42%, which illustrates the potential of hierarchical vision transformers paired with strong language comprehension on VQA tasks. We hope this benchmark provides useful guidance to researchers selecting vision-language models under performance and resource constraints.
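The encoder-pairing setup described in the abstract can be sketched as a late-fusion classifier: an image embedding from a vision transformer and a question embedding from a language model are concatenated and passed through a classification head over a fixed answer vocabulary. The dimensions, answer-set size, and random stand-in embeddings below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
# a vision encoder (e.g. ViT/Swin) yields a 768-d image embedding,
# a language model (e.g. BERT/RoBERTa) yields a 768-d question embedding,
# and VQA is cast as classification over a fixed answer vocabulary.
IMG_DIM, TXT_DIM, NUM_ANSWERS = 768, 768, 582

def fuse_and_classify(img_emb, txt_emb, W, b):
    """Concatenate image and question embeddings, then apply a linear
    classification head followed by a softmax over candidate answers."""
    fused = np.concatenate([img_emb, txt_emb])      # (IMG_DIM + TXT_DIM,)
    logits = W @ fused + b                          # (NUM_ANSWERS,)
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    return exp / exp.sum()

# Random stand-ins for the pretrained encoders' pooled outputs.
img_emb = rng.standard_normal(IMG_DIM)
txt_emb = rng.standard_normal(TXT_DIM)
W = rng.standard_normal((NUM_ANSWERS, IMG_DIM + TXT_DIM)) * 0.01
b = np.zeros(NUM_ANSWERS)

probs = fuse_and_classify(img_emb, txt_emb, W, b)
print(probs.shape)  # (582,)
```

In practice the fusion head would be trained jointly (or on top of frozen encoders), and richer fusion schemes such as cross-attention are common; the concatenation above is only the simplest instance of the pattern.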


Published

13-03-2026

Conference Proceedings Volume

Section

Articles

How to Cite

Singh, S., Sharma, A., & Tiwari, S. (2026). Empirical Benchmarking of Vision-Language Transformer Combinations for Visual Question Answering Tasks. DMPedia Lecture Notes in Computer Science & Engineering, IMPACT26, 199-209. https://digitalmanuscriptpedia.com/conferences/index.php/DMP-LNCSE/article/view/144