Disambiguation of Lithuanian Homographs Based on the Frequencies of Lexemes and Morphological Tags
Journal Title: Kalbų studijos / Studies about Languages - Year 2009, Vol 14, Issue 0
Abstract
Text-to-speech synthesis requires that stress be assigned to the input text. The main problem is that existing stress-assignment algorithms for Lithuanian produce more than one stressing possibility for some words (homographs). This work proposes a method based on the frequencies of occurrence of lexemes and morphological tags; such a method has not previously been used for Lithuanian. The frequencies were calculated from a text corpus of 1 million words that was stressed automatically and then corrected manually. Homographs are disambiguated by removing the less frequently used grammatical forms and lexemes. An additional problem arises because a single word can correspond to more than two grammatical forms; for this case, a method based on the frequencies of pairs of grammatical forms is proposed. The frequencies of morphological tags were shown to play a more important role than the frequencies of lexemes. The proposed method disambiguates homographs with an accuracy of 85.01%. Although the method does not use contextual information, its results are comparable with those of the ID3 algorithm, which does use context.
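The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Analysis` record, the function names, and the toy data are all hypothetical, and ranking by tag frequency first with lexeme frequency as a tie-breaker is an assumption suggested only by the finding that tag frequencies matter more than lexeme frequencies. The pairwise comparison of grammatical forms mentioned in the abstract is not modelled here.

```python
from collections import Counter
from typing import NamedTuple


class Analysis(NamedTuple):
    """One candidate reading of a word (hypothetical structure)."""
    stressed: str  # stressed variant of the word form
    lexeme: str    # dictionary form the reading belongs to
    tag: str       # morphological tag of the reading


def count_frequencies(corpus):
    """Count how often each morphological tag and lexeme occurs in a
    manually corrected corpus, given as an iterable of Analysis records."""
    tag_freq = Counter()
    lexeme_freq = Counter()
    for analysis in corpus:
        tag_freq[analysis.tag] += 1
        lexeme_freq[analysis.lexeme] += 1
    return tag_freq, lexeme_freq


def disambiguate(candidates, tag_freq, lexeme_freq):
    """Keep the candidate with the most frequent tag; lexeme frequency
    only breaks ties (assumed ordering, see the note above)."""
    return max(
        candidates,
        key=lambda a: (tag_freq[a.tag], lexeme_freq[a.lexeme]),
    )


# Toy data with placeholder names, not real Lithuanian corpus counts.
toy_corpus = [
    Analysis("x1", "lex1", "TAG_A"),
    Analysis("x2", "lex2", "TAG_A"),
    Analysis("x3", "lex3", "TAG_B"),
    Analysis("x4", "lex3", "TAG_A"),
]
tag_freq, lexeme_freq = count_frequencies(toy_corpus)

# Two stressing possibilities for one written word (a homograph).
candidates = [
    Analysis("form_1", "lex1", "TAG_A"),
    Analysis("form_2", "lex3", "TAG_B"),
]
best = disambiguate(candidates, tag_freq, lexeme_freq)
print(best.stressed)
```

In this toy run the first candidate is kept because its tag is more frequent in the corpus, even though the second candidate's lexeme is the more frequent one. No neighbouring words are consulted, which matches the context-free character of the method described above.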
Authors and Affiliations
Tomas Anbinderis, Pijus Kasparaitis