Topic: Building Bilingual Corpus based on Hybrid Approach for Myanmar-English Machine T


IJSER Content Writer

Authors: Khin Thandar Nwet, Khin Mar Soe, Ni Lar Thein
International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011
ISSN 2229-5518

Abstract—Word alignment in bilingual corpora has been an active research topic among Machine Translation research groups. In this paper, we describe an alignment system that aligns English-Myanmar texts at the word level in parallel sentences. Aligning translated segments with source segments is essential for building parallel corpora. Since word alignment research on the Myanmar and English languages is still in its infancy, this is not a trivial task for Myanmar-English text. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other. Thus, the main purpose of this system is to construct a word-aligned parallel corpus for use in Myanmar-English machine translation. The proposed approach is a combination of a corpus-based approach and a dictionary lookup approach. The corpus-based approach is based on the first three IBM models and the Expectation Maximization (EM) algorithm. For the dictionary lookup approach, the proposed system uses a bilingual Myanmar-English dictionary.
Index Terms— EM Algorithm, IBM Models, Machine Translation, Word-aligned Parallel Corpus, Natural Language Processing

1   INTRODUCTION                                                                      
Processing Myanmar text computationally is difficult because sentences are written as strings of Myanmar characters without spaces to indicate word boundaries. This causes problems for Machine Translation, Information Retrieval, Text Summarization and many other Natural Language Processing tasks. Bilingual word alignment is the first step of most current approaches to Statistical Machine Translation (SMT) [2]. The first stage of SMT is language modeling, for which one simple, very old, but still quite useful approach is n-gram modeling. Separate language models are built for the source language (SL) and the target language (TL). For this stage, monolingual corpora of the SL and the TL are required. The second stage is called translation modeling, and it includes the step of finding the word alignments induced over a sentence-aligned bilingual (parallel) corpus. This paper deals with the step of word alignment.
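As a rough illustration of the n-gram language modeling stage mentioned above, the following is a minimal sketch of a bigram model with add-one smoothing. The toy English corpus, the function names and the choice of add-one smoothing are assumptions made for the example; the paper does not say which n-gram order or smoothing method it uses.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])                  # every word that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # (previous word, word) pairs
    # Words that can be predicted: all corpus words plus the end marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, vocab

def bigram_prob(prev, word, model):
    """P(word | prev) with add-one (Laplace) smoothing."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sentence_prob(tokens, model):
    """Probability of a whole sentence as a product of bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        p *= bigram_prob(prev, word, model)
    return p

# Toy monolingual corpus (hypothetical data, for illustration only).
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "cat", "eats"]]
model = train_bigram_model(corpus)
fluent = sentence_prob(["the", "cat", "sleeps"], model)
scrambled = sentence_prob(["sleeps", "cat", "the"], model)
```

In this sketch the word order seen in training receives a higher probability than the scrambled order, which is the property an SMT system relies on when ranking candidate translations. For Myanmar, the text would first have to be segmented into words, since the script does not mark word boundaries.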
Corpora and other lexical resources are not yet widely available in Myanmar. Research in language technologies has therefore not progressed much. In this paper we describe our efforts in building an English-Myanmar aligned parallel corpus. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other.

Although parallel corpora are very useful resources for many natural language processing applications such as building machine translation systems, multilingual dictionaries and word sense disambiguation, they are not yet available for many languages of the world. The Myanmar language is no exception. Building a parallel corpus manually is a very tedious and time-consuming task. A good way to develop such a corpus is to start from available resources containing translations from the source language to the target language. A parallel corpus becomes very useful when the texts in the two languages are aligned. This system uses the IBM models to align the texts at the word level.
Many words in natural languages have multiple meanings. It is important to identify the correct sense of a word before we take up translation, query-based information retrieval, information extraction, question answering, etc. Recently, parallel corpora have been employed for detecting the correct sense of a word. Ng [7] proposed that if two languages are not closely related, different senses in the source language are likely to be translated differently in the target language. Parallel-corpus-based techniques for word sense disambiguation therefore work better when the two languages are dissimilar.
The remainder of the paper is organized as follows. Section 2 describes related work. The alignment model is presented in Section 3. Section 4 discusses the proposed alignment model. Section 5 gives an overview of the system. Section 6 presents experimental results. Finally, Section 7 presents the conclusion and future work.

2   RELATED WORK
A vast amount of research has been conducted on the alignment of parallel texts using various methodologies. G. Chinnappa and Anil Kumar Singh [6] proposed a Java implementation of an extended word alignment algorithm based on the IBM models. They were able to improve performance by introducing a similarity measure (the Dice coefficient) and by using a list of cognates and a morphological analyzer. Li and Chengqing Zong [11] addressed word alignment between sentences with different valid word orders by changing the order of the word sequences of the output hypotheses (called word reordering) so that the word order matches the alignment reference more exactly.
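To make the similarity measure mentioned above concrete, here is a minimal sketch of the Dice coefficient computed from sentence-level co-occurrence counts, Dice(e, f) = 2 C(e, f) / (C(e) + C(f)). It is not the implementation from [6]; the function name and the toy corpus (with German standing in for the foreign side) are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product

def dice_scores(sentence_pairs):
    """Dice coefficient for every co-occurring (source word, target word) pair.

    Counts are per sentence pair: a word is counted once per sentence it occurs in.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in sentence_pairs:
        src_types, tgt_types = set(src_tokens), set(tgt_tokens)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }

# Toy parallel corpus (hypothetical data; German is only a stand-in language).
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
scores = dice_scores(pairs)
```

Word pairs that always appear together, such as ("the", "das") here, score 1.0, while pairs that co-occur only by chance score lower. Such a score can be combined with a statistical aligner as an extra feature.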
The K-vec algorithm [13] makes use of word position and frequency features to find word correspondences using Euclidean distance. Ittycheriah and Roukos [8] proposed a maximum entropy word aligner for Arabic-English machine translation. Martin et al. [9] discussed word alignment for languages with scarce resources. Bing Xiang, Yonggang Deng and Bowen Zhou [1] proposed "Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages". They applied this approach to an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. Jamie Brunning, Adria de Gispert and William Byrne proposed "Context-Dependent Alignment Models for Statistical Machine Translation" [10]. These models lead to an improvement in alignment quality, and to an increase in translation quality when the alignments are used in Arabic-English and Chinese-English translation.
Most current SMT systems [14] use a generative model for word alignment such as the one implemented in the freely available tool GIZA++ [16]. GIZA++ is an implementation of the IBM alignment models [15]. These models treat word alignment as a hidden process, and maximize the probability of the observed (e, f) sentence pairs using the Expectation Maximization (EM) algorithm, where e and f are the source and the target sentences. In [4], the reported experiments on multiple corpora show that the augmented approach performs better than GIZA++ or NATools used individually for the task of English-Hindi word alignment. D. Wu (1994) [3] developed Chinese and English parallel corpora at the Department of Computer Science, University of Science and Technology, Clear Water Bay, Hong Kong. Two important methods are applied in that work. First, Gale's method is applied to Chinese and English, which shows that length-based methods give satisfactory results even between unrelated languages, a surprising result. Next, the effect of adding lexical cues to a length-based method is shown. According to these results, using lexical information increases alignment accuracy from 86% to 92%.
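The EM training behind the IBM models described above can be sketched for the simplest case, IBM Model 1. This is a minimal illustration, not GIZA++ and not the system proposed in this paper: it omits the NULL word, covers only Model 1 rather than the first three models, and uses a toy corpus with German standing in for the foreign side. All names are assumptions made for the example.

```python
from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    Each element of sentence_pairs is (e_tokens, f_tokens).
    The returned dict is keyed by (f, e).
    """
    f_vocab = {f for _, fs in sentence_pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        # E-step: distribute each f over the e words of its sentence pair.
        for es, fs in sentence_pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def viterbi_align(es, fs, t):
    """Link each f word to the e word with the highest t(f | e)."""
    return [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]

# Toy parallel corpus (hypothetical data, for illustration only).
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm_model1(pairs)
alignment = viterbi_align(["the", "house"], ["das", "Haus"], t)
```

Although no word links are given in the training data, the probabilities sharpen over the iterations because "das" co-occurs with "the" in two sentence pairs while "Haus" co-occurs with "house" only. Models 2 and 3 add position (distortion) and fertility parameters on top of this lexical model.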
"A hybrid approach to align sentences and words in English-Hindi parallel corpora" [12] presented an alignment system that aligns English-Hindi texts at the sentence and word level in parallel corpora. The authors described a simple sentence-length approach to sentence alignment and a hybrid, multi-feature approach to word alignment. They used regression techniques to learn parameters that characterize the relationship between the lengths of two sentences in parallel text. For word alignment, they used dictionary lookup as the primary technique, together with other methods such as local word grouping, transliteration similarity (edit distance) and a nearest-aligned-neighbors approach, to deal with many-to-many word alignment. Their experiments are based on the EMILLE (Enabling Minority Language Engineering) corpus. They obtained 99.09% accuracy for many-to-many sentence alignment, and 77% precision and 67.79% recall for many-to-many word alignment.
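The transliteration similarity feature mentioned above is based on edit distance. The following is a minimal sketch of Levenshtein distance and a normalized similarity score derived from it. It assumes both words have already been romanized into the same script; the function names and the normalization by the longer string are assumptions for illustration and are not taken from [12].

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def transliteration_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

distance = edit_distance("kitten", "sitting")
similarity = transliteration_similarity("kitten", "sitting")
```

A word aligner can treat a pair whose similarity exceeds some threshold as a likely transliteration, which helps with proper names that are missing from the bilingual dictionary. The threshold would need to be tuned on held-out data.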

3   ALIGNMENT MODEL
Aligning translated segments with source segments is essential for building parallel corpora. Alignment is a central issue in the construction and exploitation of parallel corpora. One of the central modeling problems in statistical machine translation (SMT) is alignment between parallel texts. The task of an alignment methodology is to identify translation equivalence between sentences, and between words and phrases within sentences. In most of the literature, alignment methods are categorized as either association approaches or estimation approaches (also called heuristic models and statistical models). Association approaches use string similarity measures, word order heuristics, or co-occurrence measures (e.g. mutual information scores).
The central distinction between statistical and heuristic approaches is that statistical approaches are based on well-founded probabilistic models while heuristic ones are not. Estimation approaches use probabilities estimated from parallel corpora, inspired by statistical machine translation, where the computation of word alignments is part of the computation of the translation model.
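As an example of the co-occurrence measures used by association approaches, the sketch below computes pointwise mutual information (PMI) between source and target words, with probabilities estimated as relative frequencies over sentence pairs. The function name and the toy corpus (German standing in for the foreign side) are assumptions for illustration only.

```python
import math
from collections import Counter
from itertools import product

def pmi_scores(sentence_pairs):
    """PMI(e, f) = log2( P(e, f) / (P(e) * P(f)) ) over sentence-pair co-occurrence."""
    n = len(sentence_pairs)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in sentence_pairs:
        src_types, tgt_types = set(src_tokens), set(tgt_tokens)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))
    return {
        (s, t): math.log2((c / n) / ((src_count[s] / n) * (tgt_count[t] / n)))
        for (s, t), c in pair_count.items()
    }

# Toy parallel corpus (hypothetical data, for illustration only).
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
pmi = pmi_scores(pairs)
```

A positive score means the two words co-occur more often than independence would predict, and a negative score means less often. Unlike the estimation approaches, this score is computed for each word pair in isolation, so nothing forces competing links for the same word to share a probability mass.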
