A Statistical Machine Translation Approach to Sinhala-Tamil Language Translation

Ruvan Weerasinghe


Data-driven approaches to Machine Translation have come to the fore of Language Processing research over the past decade. The relative success, in terms of robustness, of Example-Based and Statistical approaches has given rise to a new optimism and to the exploration of other data-driven approaches such as Maximum Entropy language modeling. Much of the work in the literature, however, reports on translation between languages within the European family of languages. This research is an attempt to cross this language family divide in order to compare the performance of these techniques on Asian languages. In particular, this work reports on Statistical Machine Translation experiments carried out between language pairs of the three major languages of Sri Lanka: Sinhala, Tamil and English. Results indicate that current models perform significantly better for the Sinhala-Tamil pair than for the English-Sinhala pair. This in turn appears to confirm the assertion that these techniques work better for languages that are not too distantly related to each other.