A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition

Journal Title: Journal of ICT Research and Applications - Year 2017, Vol 11, Issue 2

Abstract

Document image analysis and recognition are important topics in the field of artificial intelligence. In this context, the availability of a database with good script samples is an important requirement for machine-learning processes. Many suitable databases exist for Latin and Asian languages, but there is a shortage of databases with Arabic samples. In this work, a new database of printed Arabic text is introduced. It adopts the new concept of collecting sub-words (PAWs) instead of whole words or individual character samples. These PAWs constitute all words in the Arabic language. The collected database consists of 83,056 PAW images extracted from approximately 550,000 different words. Each sample is presented in the database in five font types: Thuluth, Naskh, Andalusi, Typing Machine, and Kufi. In total, the database therefore consists of 415,280 images (83,056 PAWs in each of the five fonts). Moreover, ground truth information is included with each PAW image to describe its occurrence number, its occurrence frequency, and the positions and shapes of its characters. The paper also presents a statistical analysis of the frequency of each PAW in the Arabic language.
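
The abstract does not describe how the PAWs were extracted or counted. As a rough illustration only, the sketch below assumes the common definition in which a PAW ends after any letter that does not join to the letter following it, and it assumes undiacritized input text. The letter set and the function names `split_into_paws` and `paw_frequencies` are hypothetical choices made for this example, not taken from the paper.

```python
from collections import Counter

# Assumption of this sketch: letters that never join to the following letter
# (alef forms, dal, thal, reh, zain, waw forms, teh marbuta, alef maksura).
NON_CONNECTING = set(
    "\u0627\u0623\u0625\u0622"  # alef, alef+hamza above/below, alef madda
    "\u062f\u0630\u0631\u0632"  # dal, thal, reh, zain
    "\u0648\u0624\u0629\u0649"  # waw, waw+hamza, teh marbuta, alef maksura
)
HAMZA = "\u0621"  # stand-alone hamza joins to neither neighbour


def split_into_paws(word):
    """Split one undiacritized Arabic word into its connected sub-words."""
    paws, current = [], ""
    for ch in word:
        if ch == HAMZA:
            # Hamza closes the running PAW and forms a PAW of its own.
            if current:
                paws.append(current)
                current = ""
            paws.append(ch)
            continue
        current += ch
        if ch in NON_CONNECTING:
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


def paw_frequencies(words):
    """Map each PAW to (occurrence count, relative frequency) over a word list."""
    counts = Counter(p for w in words for p in split_into_paws(w))
    total = sum(counts.values())
    return {paw: (n, n / total) for paw, n in counts.items()}


if __name__ == "__main__":
    for paw, (count, freq) in paw_frequencies(["كتاب", "باب", "مدرسة"]).items():
        print(paw, count, round(freq, 3))
```

Under these assumptions, a word list yields a table of distinct PAWs with a count and a relative frequency for each, which is the kind of per-PAW statistic the ground truth is said to record. The database itself may use different rules and normalization.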

Authors and Affiliations

Bilal Bataineh

  • EP ID EP324697
  • DOI 10.5614/itbj.ict.res.appl.2017.11.2.6

How To Cite

Bilal Bataineh (2017). A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition. Journal of ICT Research and Applications, 11(2), 200-212. https://www.europub.co.uk/articles/-A-324697