NADA: New Arabic Dataset for Text Classification

Abstract

In recent years, Arabic Natural Language Processing, including text summarization, text simplification, text categorization, and other natural-language-related disciplines, has been attracting more researchers. Appropriate resources for Arabic Text Categorization are becoming essential to the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering operations. In addition, most of them are not organized according to standard classification methods, which produces unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for Text Categorization. The corpus is composed of two existing corpora, OSAC and DAA. The new corpus is preprocessed and filtered using recent state-of-the-art methods. It is also organized based on the Dewey Decimal Classification scheme and the Synthetic Minority Over-Sampling Technique (SMOTE). The experimental results show that NADA is an efficient dataset that is ready for use in Arabic Text Categorization.
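
The abstract does not describe how the Synthetic Minority Over-Sampling Technique was applied to the corpus. As a generic illustration only, the core idea of that technique (creating synthetic minority-class samples by interpolating between a sample and one of its nearest minority-class neighbours) might be sketched as follows. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for this example and are not taken from the paper.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and its neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for documents of an under-represented class.
minority = [[1.0, 0.0, 2.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
new_points = smote(minority, n_new=4)
```

Each synthetic vector lies on a line segment between two real minority samples, so the class grows without exact duplicates. In practice, a maintained library implementation would normally be preferred over a hand-written one.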

Authors and Affiliations

Nada Alalyani, Souad Larabi Marie-Sainte

  • EP ID EP393876
  • DOI 10.14569/IJACSA.2018.090928

How To Cite

Nada Alalyani, Souad Larabi Marie-Sainte (2018). NADA: New Arabic Dataset for Text Classification. International Journal of Advanced Computer Science & Applications, 9(9), 206-212. https://www.europub.co.uk/articles/-A-393876