Disambiguation of Lithuanian Homographs Based on the Frequencies of Lexemes and Morphological Tags

Journal Title: Kalbu studijos / Studies about Languages - Year 2009, Vol 14, Issue 0

Abstract

Text-to-speech synthesis requires the input text to be stressed. The main problem is that existing stress-assignment algorithms for Lithuanian produce more than one possible stressing for some words (homographs). This work proposes a method based on the frequencies of occurrence of particular lexemes and morphological tags; such a method has not previously been used for Lithuanian. The frequencies were calculated from a text corpus of 1 million words that was stressed automatically and then corrected manually. Homographs are disambiguated by removing the less frequently used grammatical forms and lexemes. Additional problems arise because a single word can correspond to more than two grammatical forms; to handle this, a method based on the frequencies of pairs of grammatical forms is proposed. The frequencies of morphological tags were shown to play a more important role than the frequencies of lexemes. The proposed method disambiguates homographs with an accuracy of 85.01%. Although it does not use contextual information, its results are comparable with those achieved by the ID3 algorithm, which does use context.
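
The frequency-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Analysis` record, the function names, the placeholder tags and the toy corpus are all hypothetical, and the pairwise comparison of grammatical forms mentioned in the abstract is not reproduced. The sketch only assumes that each homograph comes with a list of candidate analyses and keeps the candidate whose morphological tag is most frequent in a hand-corrected corpus, using lexeme frequency as a tie-breaker.

```python
from collections import Counter
from typing import NamedTuple


class Analysis(NamedTuple):
    """One candidate reading of a word (hypothetical record layout)."""
    stressed: str  # stressed form produced by the stressing algorithm
    lexeme: str    # dictionary form the reading belongs to
    tag: str       # morphological tag of the reading


def count_frequencies(corpus):
    """Count tag and lexeme occurrences in a disambiguated corpus."""
    tag_freq = Counter(a.tag for a in corpus)
    lexeme_freq = Counter(a.lexeme for a in corpus)
    return tag_freq, lexeme_freq


def disambiguate(candidates, tag_freq, lexeme_freq):
    """Keep the candidate with the most frequent tag.

    Lexeme frequency is used only to break ties between tags
    (an assumption of this sketch). Unseen tags and lexemes count as 0.
    """
    return max(candidates,
               key=lambda a: (tag_freq[a.tag], lexeme_freq[a.lexeme]))


# Toy corpus with placeholder tokens; real counts would come from
# the manually corrected 1-million-word corpus.
corpus = [
    Analysis("w1", "lex_a", "NOUN"),
    Analysis("w2", "lex_b", "NOUN"),
    Analysis("w3", "lex_c", "VERB"),
]
tag_freq, lexeme_freq = count_frequencies(corpus)

# Two competing readings of the same written word.
candidates = [
    Analysis("form_1", "lex_c", "VERB"),
    Analysis("form_2", "lex_a", "NOUN"),
]
best = disambiguate(candidates, tag_freq, lexeme_freq)
print(best.stressed)
```

Ranking by tag frequency before lexeme frequency mirrors the abstract's finding that morphological tags matter more than lexemes. The exact way the two are combined in the paper is not stated in the abstract.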

Authors and Affiliations

Tomas Anbinderis, Pijus Kasparaitis

Keywords

text stressing, homographs, disambiguation, lexeme, morphological tag, speech synthesis

EP ID EP85952