An Efficient Document Categorization Approach for Turkish Based Texts
Journal Title: International Journal of Intelligent Systems and Applications in Engineering - Year 2015, Vol 3, Issue 1
Abstract
Because the rapid and uncontrolled growth of textual data makes it infeasible to classify all documents by human effort, automatic methods are used to organize the data. In this study, a support vector machine (SVM) classifier is used for text categorization. In text categorization applications, the text representation step can take a great deal of computation time because a very large number of terms must be weighted. So far, the solution adopted in the literature has been to use lexicons that contain fewer terms. However, it has been observed that such solutions reduce the accuracy of text classification. In this paper, the term-document matrix is constructed in a user-dependent way, according to the purpose of the classification. Since the number of terms is still relatively large, a hash table is used to search for terms efficiently. In this way, an efficient and rapid TF-IDF method is introduced to construct a weight matrix that represents the term-document relations, and a study is conducted on classifying Turkish news documents and Turkish columnists. With the proposed approach, the computation time required for the term-weighting process is reduced substantially; in addition, 99% accuracy is achieved in determining news categories and 98% accuracy in identifying columnists.
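The abstract does not give the authors' implementation, so the following is only a minimal sketch of the general idea: a hash table (here a Python `dict`) maps each term to a column index so that term lookup takes constant time on average, and the weights are computed with the common textbook formula tf × log(N / df). The function name `build_tfidf`, the sparse row layout, and the toy Turkish tokens are assumptions made for illustration, and the exact TF-IDF variant used in the paper may differ.

```python
import math
from collections import Counter


def build_tfidf(docs):
    """Build a sparse TF-IDF weight matrix from tokenized documents.

    Assumed weighting (not taken from the paper):
    tf = term count / document length, idf = log(N / document frequency).
    Returns (term_index, weights), where weights[i] maps column id -> weight.
    """
    term_index = {}  # hash table: term -> column id, O(1) average lookup
    counts = []      # per-document term counts

    for tokens in docs:
        c = Counter(tokens)
        counts.append(c)
        for term in c:
            if term not in term_index:
                term_index[term] = len(term_index)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        row = {}
        for term, n in c.items():
            tf = n / total
            idf = math.log(n_docs / df[term])
            row[term_index[term]] = tf * idf
        weights.append(row)
    return term_index, weights


# Toy corpus (hypothetical tokens, already tokenized).
docs = [
    ["ekonomi", "borsa", "dolar"],
    ["spor", "futbol", "gol", "gol"],
    ["ekonomi", "dolar", "faiz"],
]
term_index, weights = build_tfidf(docs)
```

Each row of `weights` is stored sparsely, so only the terms that occur in a document take up space. The resulting rows could then be passed to an SVM classifier, for example after conversion to a sparse matrix.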
Authors and Affiliations
Sevinç İlhan Omurca* | Kocaeli University, Faculty of Engineering, Computer Engineering Department, Umuttepe Campus, Kocaeli – 41380, Turkey
Semih Baş | Tubitak Marmara Research Center Technology Free Zone, IBTECH, Kocaeli – 41470, Turkey
Ekin Ekinci | Kocaeli University, Faculty of Engineering, Computer Engineering Department, Umuttepe Campus, Kocaeli – 41380, Turkey