Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment

Abstract

Documents are now being digitized in all fields to reduce paper usage. Modern scanners and cameras support the growth of multimedia data, especially documents stored as image files. Searching for a particular text in large-scale scanned document images is difficult when the text has not been extracted from the images. This research proposes a text extraction method for large-scale scanned document images using Google Vision OCR on the Hadoop architecture. The objects of the research are student thesis documents, comprising the cover page, the approval page, and the abstract. All documents are stored in the university's digital library. The extraction process begins by preparing an input folder containing the image documents (in JPEG format) in the Hadoop Distributed File System (HDFS) of Apache Hadoop, followed by reading each image document. The text of the image document is then extracted using Google Vision OCR to obtain a text document (in TXT format), and the result is saved to an output folder in HDFS. The same process is repeated for all documents in the folder. Test results show that the proposed method was able to extract all test documents successfully. The recognition process achieved 100% accuracy, and extraction was twice as fast as manual extraction. Google Vision OCR also showed better extraction performance than other OCR tools. The proposed automated extraction system can recognize text in large-scale image documents accurately and can be operated in a real-time environment.
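
The extraction loop described in the abstract (read each JPEG from an input folder, run OCR, save a TXT file to an output folder, repeat for all documents) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names `extract_folder` and `google_vision_ocr` are hypothetical, ordinary local directories stand in for the HDFS input and output folders, and the OCR adapter assumes the `google-cloud-vision` Python client with valid credentials.

```python
from pathlib import Path
from typing import Callable, List


def google_vision_ocr(image_bytes: bytes) -> str:
    """Hypothetical adapter: OCR one image via the google-cloud-vision client."""
    # Imported lazily so the rest of the sketch runs without the package installed.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def extract_folder(
    input_dir: Path, output_dir: Path, ocr: Callable[[bytes], str]
) -> List[Path]:
    """Run OCR over every JPEG in input_dir and write one TXT file per image.

    Assumption: input_dir and output_dir are plain directories standing in
    for the HDFS input and output folders mentioned in the abstract.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(input_dir.iterdir()):
        # Skip anything that is not a JPEG image document.
        if image_path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        text = ocr(image_path.read_bytes())
        out_path = output_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Passing the OCR step in as a callable keeps the folder loop independent of any particular OCR service, so `extract_folder(src, dst, google_vision_ocr)` and a call with a stub function for testing share the same code path.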

Authors and Affiliations

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty

  • EP ID EP417614
  • DOI 10.14569/IJACSA.2018.091117

How To Cite

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma, Hustinawaty Hustinawaty (2018). Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment. International Journal of Advanced Computer Science & Applications, 9(11), 112-116. https://www.europub.co.uk/articles/-A-417614