A Stochastic Part of Speech Tagger for Sinhala

Dulip Lakmal Herath, A. R. Weerasinghe


This paper presents the results of an experiment on part-of-speech (POS) tagging for Sinhala using Hidden Markov Models (HMMs) based on bi-gram probabilities. POS tagging is needed to resolve the syntactic ambiguities that exist in natural language texts. Two kinds of ambiguity are handled in the present work: Known Word Ambiguity and Unknown Word Ambiguity. An annotated corpus of Sinhala was built for HMM parameter estimation. A comprehensive POS tag set was designed and used for corpus annotation. The paper describes the process of developing the POS tagger and the related issues in the context of Sinhala. The tagger has shown interesting performance even under several constraints on the training data. The paper also makes suggestions for further improvements to achieve a higher level of accuracy.
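
As an illustration of the general technique named above, the following is a minimal sketch of a bi-gram HMM tagger with Viterbi decoding. It is not the authors' implementation. The toy corpus, the tag names (DET, NOUN, VERB), the function names, and the use of add-one smoothing to give unknown words a non-zero emission probability are all assumptions made for this example; the abstract does not state how the paper estimates these quantities or treats unknown words.

```python
import math
from collections import defaultdict


def train(tagged_sentences):
    """Count tag bi-grams and word/tag emissions from an annotated corpus."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability (an assumption, not the paper's method)."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1) / (total + n_outcomes))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence under the bi-gram HMM."""
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot reserved for unseen words
    best = [{}]  # best[i][t]: best log score of a path ending in tag t at i
    back = [{}]  # back[i][t]: previous tag on that best path
    for t in tags:
        best[0][t] = (log_prob(trans["<s>"], t, n_tags)
                      + log_prob(emit[t], words[0], n_words))
        back[0][t] = None
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_prob(trans[p], t, n_tags))
            best[i][t] = (best[i - 1][prev]
                          + log_prob(trans[prev], t, n_tags)
                          + log_prob(emit[t], words[i], n_words))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy English placeholder data, NOT the Sinhala corpus or tag set of the paper.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
known = viterbi(["the", "cat", "runs"], *model)
unknown = viterbi(["the", "bird", "sleeps"], *model)  # "bird" is unseen
```

In this sketch a known word is disambiguated by combining its emission counts with the tag bi-gram context, while an unseen word such as "bird" receives the same smoothed emission probability under every tag, so its tag is decided by the neighbouring tags alone.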

Citation Info:

In Conference Proceedings - 6th International Information Technology Conference on From Research to Reality, Infotel Lanka Society, Colombo, Sri Lanka, 29 Nov - 01 Dec 2004, pp. 17-22, ISBN 955-8974-01-3.