A Shallow Parser for Tamil

I. Ariaratnam, R. Weerasinghe, C. Liyanage

Abstract: This research builds a shallow parser that assigns a partial structure to Sri Lankan Tamil sentences in order to recover useful syntactic information from them. It combines a maximum-entropy-based part-of-speech (POS) tagger, which automatically labels each word in a sentence with the appropriate POS tag, with a rule-based chunker, which segments sentences into syntactically correlated word groups. This combination avoids the need for a large annotated corpus. To build it, we developed a tagset of 20 POS tags using expert input, manually annotated a corpus of approximately 12,500 words, and identified 390 chunk patterns for extracting chunks. Our POS tagger and chunker achieved promising F-measures of 81.72% and 78.3% respectively. The combined shallow parser achieves a lower F-measure of 66.6% owing to error propagation.
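To illustrate how a rule-based chunker can work over POS-tagged input, the following is a minimal sketch in Python. It is not the authors' implementation: the tag names (ADJ, NN, ADV, VB), the four patterns, the chunk labels, the greedy longest-match strategy, and the function name `chunk` are all assumptions made for illustration, and the placeholder tokens stand in for Tamil words.

```python
from typing import Dict, List, Tuple

TaggedWord = Tuple[str, str]  # (word, POS tag)

# Hypothetical chunk patterns: a sequence of POS tags mapped to a chunk label.
# These are illustrative only and are not taken from the paper's 390 patterns.
CHUNK_PATTERNS: Dict[Tuple[str, ...], str] = {
    ("ADJ", "NN"): "NP",
    ("NN",): "NP",
    ("ADV", "VB"): "VP",
    ("VB",): "VP",
}


def chunk(tagged: List[TaggedWord]) -> List[Tuple[str, List[str]]]:
    """Greedily group tagged words into chunks, preferring the longest pattern.

    Words whose tags match no pattern are emitted as single-word "O" chunks.
    """
    max_len = max(len(pattern) for pattern in CHUNK_PATTERNS)
    chunks: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(tagged):
        # Try the longest window first, then shrink it one tag at a time.
        for n in range(min(max_len, len(tagged) - i), 0, -1):
            window = tagged[i:i + n]
            tags = tuple(tag for _, tag in window)
            label = CHUNK_PATTERNS.get(tags)
            if label is not None:
                chunks.append((label, [word for word, _ in window]))
                i += n
                break
        else:
            # No pattern starts at this position: leave the word outside any chunk.
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks


if __name__ == "__main__":
    # Placeholder tokens; in practice the tags would come from the POS tagger.
    sentence = [("w1", "ADJ"), ("w2", "NN"), ("w3", "PUNC"), ("w4", "VB")]
    for label, words in chunk(sentence):
        print(label, words)
```

Because the chunker only sees tags, a single wrong tag from the tagger can prevent a pattern from matching, which is one way errors can propagate through a pipeline of this kind.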
