Research Article Open Access

Evaluating Machine Translation for Domain Specific Low-Resource Nepali-English Language Pairs: The Impact of Tokenization on Statistical and Neural Techniques

Amit Kumar Roy¹ and Bipul Syam Purkayastha¹
  • ¹ Department of Computer Science, Assam University, Silchar, Cachar, Assam, India

Abstract

In recent years, the field of Machine Translation (MT) has shifted significantly towards Neural Machine Translation (NMT) techniques, which have surpassed traditional Statistical Machine Translation (SMT) models in translation quality. Even so, the efficacy of these techniques can differ depending on the language pair under consideration. NMT often needs sizable parallel corpora to attain high translation accuracy, whereas SMT is somewhat more flexible in this regard. As a result, a benchmark system that offers adequate translation for languages with limited resources, such as Nepali, remains out of reach. This paper focuses on translating text with statistical and neural MT techniques for the under-resourced English-Nepali language pair. As part of the system development, we built an English-Nepali parallel corpus in the tourism domain. We explore the impact of different tokenization techniques on translation outcomes, and we analyze the performance of both approaches using the automatic evaluation metrics BLEU and TER. The paper aims to provide insight into the applicability of SMT and NMT to the under-resourced English-Nepali language pair under two popular tokenization techniques, and to determine the most effective approach for achieving accurate translations.

Journal of Computer Science
Volume 21 No. 12, 2025, 3041-3050

DOI: https://doi.org/10.3844/jcssp.2025.3041.3050

Submitted On: 8 April 2025
Published On: 24 January 2026

How to Cite: Roy, A. K. & Purkayastha, B. S. (2025). Evaluating Machine Translation for Domain Specific Low-Resource Nepali-English Language Pairs: The Impact of Tokenization on Statistical and Neural Techniques. Journal of Computer Science, 21(12), 3041-3050. https://doi.org/10.3844/jcssp.2025.3041.3050

Keywords

  • Statistical MT
  • Neural MT
  • Tokenization
  • SentencePiece
  • Low-Resource MT
  • Nepali Language