Singh, Satyam, et al. “Empirical Benchmarking of Vision-Language Transformer Combinations for Visual Question Answering Tasks.” DMPedia Lecture Notes in Computer Science & Engineering, no. IMPACT26, Mar. 2026, pp. 199-0, https://digitalmanuscriptpedia.com/conferences/index.php/DMP-LNCSE/article/view/144.