Empirical Benchmarking of Vision-Language Transformer Combinations for Visual Question Answering Tasks
Keywords:
VQA, Multimodal, Vision-Language Models, NLP, CV, Transformer

Abstract
Visual Question Answering (VQA) requires methods that combine visual and textual information to produce accurate answers. We conduct an empirical evaluation of six vision-language transformer combinations for VQA on the DAQUAR dataset. The experiments pair a Vision Transformer (ViT), a Data-efficient Image Transformer (DeiT), and a Swin Transformer as image encoders with BERT and RoBERTa as language models. Under a common experimental protocol, we track per-epoch validation accuracy and consider the computational efficiency of each combination. The peak accuracies are: ViT + BERT (65.50%), ViT + RoBERTa (68.47%), DeiT + BERT (61.15%), DeiT + RoBERTa (66.47%), Swin + BERT (73.42%), and Swin + RoBERTa (64.50%). Swin + BERT achieves the best result among the combinations tested, with a peak accuracy of 73.42%, illustrating the benefit of pairing hierarchical vision transformers with strong language understanding on VQA tasks. We hope this benchmark offers useful guidance to researchers selecting vision-language models under performance and resource constraints.
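To make the evaluated setup concrete, the following is a minimal sketch (not the authors' exact code) of one of the six combinations: a ViT image encoder and a BERT text encoder whose pooled outputs are concatenated and passed to a linear classifier over the DAQUAR answer vocabulary. The checkpoint names, fusion-by-concatenation head, and hidden size are illustrative assumptions; swapping in DeiT/Swin or RoBERTa checkpoints would reproduce the other pairings.

import torch
import torch.nn as nn
from transformers import ViTModel, BertModel

class VQAFusionClassifier(nn.Module):
    """Illustrative vision-language VQA classifier: ViT + BERT with late fusion."""

    def __init__(self, num_answers: int, hidden: int = 512):
        super().__init__()
        # Pretrained encoders; other ViT/DeiT/Swin and BERT/RoBERTa checkpoints
        # could be substituted to form the remaining combinations.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = self.vision.config.hidden_size + self.text.config.hidden_size
        # Treat VQA as classification over the answer vocabulary (an assumption).
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Pooled image and question representations, concatenated before classification.
        img = self.vision(pixel_values=pixel_values).pooler_output
        txt = self.text(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([img, txt], dim=-1))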
License
Copyright (c) 2026 DMPedia Lecture Notes in Computer Science & Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.