SanskritBERT: Language-Specific Transformer Modelling for Classical Sanskrit Texts

Authors

  • Soumya Sharma, Department of Computer Science Engineering, Sharda School of Engineering and Technology, Sharda University, Greater Noida, India
  • Tanuj Saxena, Department of Computer Science Engineering, Sharda School of Engineering and Technology, Sharda University, Greater Noida, India
  • Kusum Lata, Department of Computer Science Engineering, Sharda School of Engineering and Technology, Sharda University, Greater Noida, India

Keywords:

Sanskrit NLP, BERT, Transformer Models, Morphologically Rich Languages, SentencePiece, Masked Language Modelling.

Abstract

Recent breakthroughs in transformer-based architectures such as BERT have revolutionised natural language processing (NLP) across many languages. However, low-resource and morphologically complex languages like Sanskrit remain poorly represented in large-scale pretrained models owing to scarce digital corpora, orthographic variation, and extensive compounding. This paper presents a fully custom Sanskrit BERT model trained from scratch on a curated corpus of over 21 million Sanskrit sentences written entirely in Devanagari script. To capture the morphological richness of the language, a SentencePiece Unigram tokeniser with a 64k subword vocabulary was built, and a lightweight 6-layer BERT architecture with 256-dimensional hidden states was used to balance performance and compute efficiency. Experimental results show that the model significantly outperforms multilingual baselines such as mBERT, IndicBERT, and MuRIL on masked language modelling test sets, achieving a Top-1 accuracy of 0.35, a Top-5 accuracy of 0.50, and a perplexity of 69.0. These outcomes confirm the merits of corpus-specific tokenisation and monolingual pretraining for morphologically rich classical languages. Future work will investigate scaling the model to larger architectures, incorporating richer subword representations, and fine-tuning for downstream Sanskrit NLP tasks such as word segmentation, translation, and semantic role labelling.
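The abstract reports Top-1 accuracy, Top-5 accuracy, and perplexity on masked language modelling test sets. As a minimal sketch of how such metrics are conventionally computed for masked-token prediction (the function name and toy Devanagari data below are illustrative, not taken from the paper), each masked position contributes a hit if the gold token ranks first (Top-1) or among the five highest-probability candidates (Top-5), and perplexity is the exponential of the mean negative log-likelihood assigned to the gold tokens:

```python
import math

def mlm_metrics(predictions, targets):
    """Compute Top-1/Top-5 accuracy and perplexity for masked-token
    predictions. `predictions` is a list of probability distributions
    (dicts mapping candidate token -> probability); `targets` holds the
    gold token for each masked position."""
    top1 = top5 = 0
    nll = 0.0  # accumulated negative log-likelihood of the gold tokens
    for probs, gold in zip(predictions, targets):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if ranked[0] == gold:
            top1 += 1
        if gold in ranked[:5]:
            top5 += 1
        # tiny floor avoids log(0) when the gold token got no probability mass
        nll -= math.log(probs.get(gold, 1e-12))
    n = len(targets)
    return top1 / n, top5 / n, math.exp(nll / n)

# Toy example with two masked positions.
preds = [
    {"रामः": 0.6, "सीता": 0.3, "गच्छति": 0.1},
    {"गच्छति": 0.2, "पठति": 0.5, "वदति": 0.3},
]
golds = ["रामः", "गच्छति"]
t1, t5, ppl = mlm_metrics(preds, golds)
```

On this toy input the first gold token is ranked first and the second only third, giving Top-1 = 0.5, Top-5 = 1.0, and a perplexity of exp((-ln 0.6 - ln 0.2) / 2) ≈ 2.89; the paper's reported figures (0.35 / 0.50 / 69.0) would come from the same computation over the full test set.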

Published

15-03-2026

How to Cite

Sharma, S. ., Saxena, T. ., & Lata, K. . (2026). SanskritBERT: Language-Specific Transformer Modelling for Classical Sanskrit Texts. DMPedia Lecture Notes in Multidisciplinary Research, IMPACT26, 1313-1323. https://digitalmanuscriptpedia.com/conferences/index.php/DMP-LNMR/article/view/170