Topic: Bilingual OCR System for Myanmar and English Scripts with Simultaneous Recogniti

IJSER Content Writer

Authors: Htwe Pa Pa Win, Phyo Thu Thu Khine, Khin Nwe Ni Tun
International Journal of Scientific & Engineering Research Volume 2, Issue 10, October-2011
ISSN 2229-5518

Abstract- The rapid growth of digital libraries worldwide raises many new challenges for document image analysis research and development. Digital libraries store a wide variety of document images, for example cultural, technical, or historical documents written in many languages, and this also demands advances in present-day digital image analysis systems. When a digital library holds science and technology documents, its OCR system must become bilingual, because most of these documents are written in Myanmar combined with English letters. This paper proposes a bilingual OCR system that simultaneously recognizes printed English and Myanmar text, including a segmentation mechanism that handles the overlapping nature of Myanmar script. The effectiveness of the proposed mechanism is demonstrated by experimental results on segmentation accuracy, comparisons of feature extraction methods, and overall accuracy rates.

Index Terms- Bilingual OCR, Machine Printed, Myanmar-English Scripts, SVM

1   INTRODUCTION 
There is a considerable transformation from print-based formats to electronic formats thanks to advanced computing technology, which has had a profound impact on the dissemination of nearly all earlier publication formats in digital form over computer networks. Consequently, one of the important tasks in machine learning is the electronic reading of documents. Documents of all kinds, such as magazines, reports, and technical papers, can be converted to electronic form using a high-performance Optical Character Recognizer (OCR). Optical character recognition is a key enabling technology for creating indexed digital library content, and it is especially valuable for scripts for which there has been very little digital access [1], [2], [4].
Furthermore, when a digital library holds science and technology documents, its OCR system must become bilingual, because most of these documents are written in Myanmar combined with English letters. In this multilingual and multi-script world, OCR systems therefore need to recognize characters irrespective of the script in which they are written. In general, recognizing characters of different scripts in a single OCR module is difficult, because the features needed for character recognition depend on the structural properties, style, and nature of the writing, which generally differ from one script to another. For example, features used to recognize the English alphabet are in general not good for recognizing Chinese logograms [3].
Many OCR algorithms for English and the languages of other developed countries have been developed over the years for the paperless world, and they are available commercially or freely. However, these systems recognize only specific single scripts and cannot handle Myanmar script. Little effort has gone into OCR for the Myanmar language. In addition, there is no system that can recognize documents written in both Myanmar and English text. Therefore, a new system is proposed to recognize both scripts in such documents simultaneously.

2   NATURE OF MYANMAR SCRIPT
Myanmar (Burmese) belongs to the Tibeto-Burman language group; its script developed from the Mon script and descends from the Brahmi script of ancient South India. It is the official language of Myanmar, where over 35 million people speak it as their first language. Text is written horizontally from left to right. In Myanmar script, there is no distinction between upper-case and lower-case characters. The character set consists of 35 consonants (including ‘ႆ’ and ‘ဉ’), 8 vowel signs, 7 independent vowels, 5 combining marks, 6 symbols and punctuation marks, and 10 digits. Each word is formed by combining consonants, vowels, and various signs. There are more than 1881 glyphs in total, and the script contains many similar-looking characters (e.g., ယ and ဃ, ပ and ဎ, and so on). Myanmar characters are built from circular shapes, straight lines (horizontal, vertical, or slanted), and dots [11], [20].

3   RELATED WORK
Many researchers have proposed ways to implement various OCR systems [4, 5]. Feature extraction methods are discussed in [13-15], while [7-9] state that the SVM classifier can be used as an effective recognizer. Some of the existing techniques used in OCR for Myanmar script are presented in [10, 11]. To the best of our knowledge, a comprehensive study of the success rate, in terms of recognition accuracy, of an OCR system for printed Myanmar text is yet to be reported.

4   PROPOSED METHOD
Like other traditional OCR systems, the proposed system includes five processing steps, as shown in Fig. 1. Six different types of documents written in the Zawgyi-One font at font size 12 are used to test the system. They are scanned on a flatbed scanner at 300 dpi for digitization and then passed to the preprocessing steps.

4.1   PREPROCESSING
Preprocessing is a basic but crucial part of the OCR system, because the recognition accuracy of OCR systems depends greatly on the quality of the input text image. First, we convert the raw input image to grayscale and then denoise it using a low-pass Finite Impulse Response (FIR) filter. Next, we binarize the cleaned image into a bi-level image by setting all pixels below a threshold to zero and all pixels above that threshold to one. We find this threshold value using Otsu's method. Finally, we deskew the binarized image with the generalized Hough transform method.
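
The preprocessing chain described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: all function names are hypothetical, the 3x3 moving-average kernel is an assumed stand-in for the unspecified low-pass FIR filter, and the skew estimate uses a plain line Hough vote over candidate angles rather than the generalized Hough transform named in the paper. Images are nested Python lists so the sketch needs no external libraries; a real system would more likely use an image-processing library.

```python
import math

def to_grayscale(rgb):
    """Convert rows of (R, G, B) tuples to 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]

def box_filter(gray):
    """3x3 moving average (an assumed low-pass FIR kernel); edges are replicated."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += gray[yy][xx]
            out[y][x] = round(total / 9)
    return out

def otsu_threshold(gray):
    """Return the gray level t that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]            # pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0          # pixels with value > t
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray, t):
    """Pixels above the threshold become 1, all others 0."""
    return [[1 if v > t else 0 for v in row] for row in gray]

def estimate_skew_degrees(text_pixels, angles=None):
    """Simplified Hough vote: pick the angle whose best line holds most pixels.

    text_pixels is a list of (x, y) coordinates of text (dark) pixels.
    """
    if angles is None:
        angles = [a / 2 for a in range(-20, 21)]  # -10 to 10 degrees, 0.5 steps
    best_angle, best_votes = 0.0, -1
    for a in angles:
        theta = math.radians(a)
        votes = {}
        for x, y in text_pixels:
            # Offset of the pixel from a line tilted by `a` degrees
            rho = round(y * math.cos(theta) - x * math.sin(theta))
            votes[rho] = votes.get(rho, 0) + 1
        peak = max(votes.values(), default=0)
        if peak > best_votes:
            best_votes, best_angle = peak, a
    return best_angle
```

A page would pass through `to_grayscale`, `box_filter`, `otsu_threshold`, and `binarize` in that order; the coordinates of the zero-valued (text) pixels would then go to `estimate_skew_degrees`, and the image would be rotated back by the returned angle.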
