SanskritBERT: Language-Specific Transformer Modelling for Classical Sanskrit Texts
Keywords:
Sanskrit NLP, BERT, Transformer Models, Morphologically Rich Languages, SentencePiece, Masked Language Modelling.

Abstract
Recent breakthroughs in transformer-based architectures, such as BERT, have revolutionised natural language processing (NLP) across many languages. However, low-resource and morphologically complex languages like Sanskrit remain poorly represented in large-scale pretrained models due to scarce digital corpora, orthographic variation, and heavy compounding. This paper presents a fully custom Sanskrit BERT model trained from scratch on a corpus of more than 21 million curated Sanskrit sentences written entirely in Devanagari script. To capture the morphological richness of the language, a SentencePiece Unigram tokeniser with a 64k subword vocabulary was built, and a lightweight 6-layer BERT architecture with 256-dimensional hidden states was used to balance performance and computational cost. Experimental results show that the model significantly outperforms multilingual baselines such as mBERT, IndicBERT, and MuRIL on masked language modelling test sets, achieving a Top-1 accuracy of 0.35, a Top-5 accuracy of 0.50, and a perplexity of 69.0. These outcomes confirm the merits of corpus-specific tokenisation and monolingual pretraining for morphologically rich classical languages. Future work will investigate scaling the model to larger architectures, incorporating more complex subword representations, and fine-tuning for downstream Sanskrit NLP tasks such as word segmentation, translation, and semantic role labelling.
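The abstract fixes only three architectural choices: a 64k SentencePiece Unigram vocabulary, 6 transformer layers, and 256-dimensional hidden states. The following Python sketch shows how such a setup could be assembled with the sentencepiece and Hugging Face transformers libraries; the corpus path, attention-head count, intermediate size, and maximum sequence length are illustrative assumptions, not values reported by the paper.

import sentencepiece as spm
from transformers import BertConfig, BertForMaskedLM

# 1) Train a SentencePiece Unigram tokeniser with a 64k subword vocabulary.
spm.SentencePieceTrainer.train(
    input="sanskrit_corpus.txt",        # hypothetical path to the Devanagari corpus
    model_prefix="sanskrit_unigram_64k",
    vocab_size=64000,
    model_type="unigram",
    character_coverage=1.0,             # retain full Devanagari character coverage
)

# 2) Configure a lightweight 6-layer BERT with 256-dimensional hidden states.
config = BertConfig(
    vocab_size=64000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,              # assumed; must divide hidden_size evenly
    intermediate_size=1024,             # assumed 4 x hidden_size
    max_position_embeddings=512,        # assumed
)
model = BertForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")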
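The reported metrics (Top-1 accuracy, Top-5 accuracy, perplexity) all follow from predictions at masked positions. The sketch below assumes a PyTorch model with a Hugging Face-style masked-LM output and batches whose labels are -100 at unmasked positions; the masking scheme and test data are not specified in the abstract, only the metric definitions are standard.

import math
import torch

def mlm_metrics(model, masked_batches):
    # masked_batches: iterable of dicts with input_ids, attention_mask, labels,
    # where labels are -100 everywhere except at masked positions.
    top1 = top5 = n_masked = 0
    total_nll = 0.0
    model.eval()
    with torch.no_grad():
        for batch in masked_batches:
            out = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
            logits = out.logits                           # (batch, seq_len, vocab)
            mask = batch["labels"] != -100                # masked positions only
            gold = batch["labels"][mask]                  # (n_masked,)
            preds = logits[mask].topk(5, dim=-1).indices  # (n_masked, 5)
            top1 += (preds[:, 0] == gold).sum().item()
            top5 += (preds == gold.unsqueeze(-1)).any(-1).sum().item()
            total_nll += out.loss.item() * mask.sum().item()  # mean NLL * token count
            n_masked += mask.sum().item()
    return {"top1": top1 / n_masked,
            "top5": top5 / n_masked,
            "perplexity": math.exp(total_nll / n_masked)}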
License
Copyright (c) 2026 DMPedia Lecture Notes in Multidisciplinary Research

This work is licensed under a Creative Commons Attribution 4.0 International License.