Explainable Robustness Against Localized Malware Obfuscation: A Hybrid CNN-ViT Approach with Comparative Attention Analysis
DOI: https://doi.org/10.65890/race.v2i1.176

Keywords: Malware Classification, Vision Transformer, Convolutional Neural Network, Explainable AI, Obfuscation Robustness, Focal Loss, Spatial Dropout

Abstract
The rapid proliferation of obfuscated and zero-day malware variants poses a critical challenge to modern cybersecurity defenses. Traditional Convolutional Neural Network (CNN) classifiers, while effective on clean binary images, exhibit catastrophic vulnerability when adversaries apply localized packing or byte-level scrambling to evade detection. This paper proposes a hybrid deep learning architecture that combines a lightweight CNN feature extractor with a Vision Transformer (ViT) encoder to achieve explainable robustness against localized malware obfuscation. The CNN extracts hierarchical texture patterns from grayscale binary images, while the ViT models long-range spatial dependencies through multi-head self-attention across 196 image patches. Crucially, this approach injects Spatial Dropout (nn.Dropout2d) into the CNN layers and employs Focal Loss to handle severe class imbalance, forcing the model to learn from partially corrupted feature maps during training. Evaluated on the Malimg benchmark (9,339 samples, 25 families), the hybrid model achieves 93.80% accuracy with a Macro AUC of 0.9980. Under simulated obfuscation with 30% local byte noise, the hybrid architecture maintains 81.82% accuracy, while an equivalent pure CNN baseline using the same CNN feature extractor collapses to 52.62%, a gap of 29.20 percentage points. Through comparative attention heatmaps, this paper provides visual, interpretable evidence that while CNN activations scatter across noise artifacts under obfuscation, the ViT's self-attention dynamically shifts to focus on persistent global structural payloads, explaining why the architecture survives where CNNs fail.
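The components named in the abstract can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' code: the 196-token grid (14×14), the 25 Malimg classes, the use of `nn.Dropout2d`, Focal Loss, and 30% local byte noise come from the abstract, while all layer sizes, dropout rates, and the patch-noise scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Focal Loss (Lin et al., 2017): down-weights easy examples by (1 - p_t)^gamma."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)                      # probability of the true class
        return ((1 - pt) ** self.gamma * ce).mean()

class HybridCNNViT(nn.Module):
    """Lightweight CNN stem + transformer encoder over 196 spatial tokens."""
    def __init__(self, num_classes: int = 25, dim: int = 64):
        super().__init__()
        # Spatial Dropout (nn.Dropout2d) zeroes whole feature maps, so the
        # ViT must learn from partially corrupted CNN features during training.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Dropout2d(p=0.25),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Dropout2d(p=0.25),
        )
        self.pool = nn.AdaptiveAvgPool2d(14)     # 14 x 14 = 196 patch tokens
        self.pos = nn.Parameter(torch.zeros(1, 196, dim))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.vit = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 1, H, W) grayscale bytes
        f = self.pool(self.cnn(x))               # (B, dim, 14, 14)
        tokens = f.flatten(2).transpose(1, 2)    # (B, 196, dim)
        z = self.vit(tokens + self.pos)          # multi-head self-attention
        return self.head(z.mean(dim=1))          # mean-pool tokens -> logits

def add_local_byte_noise(x, frac: float = 0.30):
    """Simulated localized obfuscation (an assumed scheme): overwrite a random
    square patch covering ~`frac` of the image with uniform byte noise."""
    x = x.clone()
    _, _, h, w = x.shape
    s = int((frac * h * w) ** 0.5)
    r = torch.randint(0, h - s, (1,)).item()
    c = torch.randint(0, w - s, (1,)).item()
    x[:, :, r:r + s, c:c + s] = torch.rand(1, 1, s, s)
    return x

model = HybridCNNViT()
imgs = torch.rand(2, 1, 224, 224)
logits = model(add_local_byte_noise(imgs))
loss = FocalLoss()(logits, torch.zeros(2, dtype=torch.long))
print(logits.shape)                              # torch.Size([2, 25])
```

The key design choice the abstract argues for is visible here: corruption (feature-map dropout at train time, byte-noise patches at test time) damages local texture cues, but the self-attention layers can still aggregate the surviving tokens into a global representation.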
License
Copyright (c) 2026 Revolutionary Advances in Computing and Electronics: An International Journal

This work is licensed under a Creative Commons Attribution 4.0 International License.