International Journal of Scientific and Engineering Research (IJSER)
Research Articles => Electronics => Topic started by: IJSER Content Writer on September 20, 2011, 06:22:35 am
Author : Khin Thandar Nwet, Khin Mar Soe, Ni Lar Thein
International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011
ISSN 2229-5518
Download Full Paper : PDF (http://www.ijser.org/onlineResearchPaperViewer.aspx?Building-Bilingual-Corpus-based-on-Hybrid-Approach-for-Myanmar-English-Machine-Translation.pdf)
Abstract—Word alignment in bilingual corpora has been an active research topic in the machine translation community. In this paper, we describe an alignment system that aligns English-Myanmar texts at the word level in parallel sentences. Aligning translated segments with source segments is essential for building parallel corpora. Since word alignment research on the Myanmar and English languages is still in its infancy, it is not a trivial task for Myanmar-English text. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other. Thus, the main purpose of this system is to construct a word-aligned parallel corpus for use in Myanmar-English machine translation. The proposed approach is a combination of a corpus-based approach and a dictionary lookup approach. The corpus-based approach is based on the first three IBM models and the Expectation Maximization (EM) algorithm. For the dictionary lookup approach, the proposed system uses a bilingual Myanmar-English dictionary.
Index Terms— EM Algorithm, IBM Models, Machine Translation, Word-aligned Parallel Corpus, Natural Language Processing
1 INTRODUCTION
PROCESSING Myanmar texts computationally is difficult because sentences in Myanmar texts are represented as strings of Myanmar characters without spaces to indicate word boundaries. This causes problems for Machine Translation, Information Retrieval, Text Summarization and many other Natural Language Processing tasks. Bilingual word alignment is the first step of most current approaches to Statistical Machine Translation or SMT [2]. An SMT system is typically built in two stages. The first stage is language modeling, for which one simple and very old but still quite useful approach is n-gram modeling. Separate language models are built for the source language (SL) and the target language (TL). For this stage, monolingual corpora of the SL and the TL are required. The second stage is called translation modeling and it includes the step of finding the word alignments induced over a sentence-aligned bilingual (parallel) corpus. This paper deals with the step of word alignment.
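As a hedged illustration of the n-gram modeling mentioned above (it is not part of the original paper), the following Python sketch estimates bigram probabilities by maximum likelihood from a toy monolingual corpus. The function name `train_bigram_model`, the boundary markers `<s>` and `</s>`, and the toy sentences are assumptions made for this example. It also assumes whitespace tokenization, which would not apply directly to unsegmented Myanmar text.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Return a function prob(word, prev) giving the MLE of P(word | prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts and ends are modeled.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Every token except the last one acts as a context for the next token.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(word, prev):
        # Unseen contexts get probability 0.0 (no smoothing in this sketch).
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Toy monolingual corpus (illustrative only).
toy_corpus = ["the cat sleeps", "the dog sleeps", "the cat eats"]
p = train_bigram_model(toy_corpus)
print(p("cat", "the"))     # "the" is followed by "cat" in 2 of its 3 occurrences
print(p("sleeps", "cat"))  # "cat" is followed by "sleeps" in 1 of its 2 occurrences
```

A practical language model would add smoothing so that unseen bigrams do not receive zero probability; that is omitted here to keep the sketch short.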
Corpora and other lexical resources are not yet widely available in Myanmar. Research in language technologies has therefore not progressed much. In this paper we describe our efforts in building an English-Myanmar aligned parallel corpus. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other.
Although parallel corpora are very useful resources for many natural language processing applications such as building machine translation systems, multilingual dictionaries and word sense disambiguation, they are not yet available for many languages of the world. The Myanmar language is no exception. Building a parallel corpus manually is a very tedious and time-consuming task. A good way to develop such a corpus is to start from available resources containing translations from the source language to the target language. A parallel corpus becomes very useful when the texts in the two languages are aligned. This system uses the IBM models to align the texts at the word level.
Many words in natural languages have multiple meanings. It is important to identify the correct sense of a word before we take up translation, query-based information retrieval, information extraction, question answering, etc. Recently, parallel corpora have been employed for detecting the correct sense of a word. Ng [7] proposed that if two languages are not closely related, different senses in the source language are likely to be translated differently in the target language. Parallel corpus based techniques for word sense disambiguation therefore work better when the two languages are dissimilar.
The remainder of the paper is organized as follows. Section 2 describes some related work. The alignment model is presented in Section 3. Section 4 discusses the proposed alignment model. In Section 5, we give an overview of the system. In Section 6, we present experimental results. Finally, Section 7 presents the conclusion and future work.
2 RELATED WORK
A vast amount of research has been conducted on the alignment of parallel texts with various methodologies. G. Chinnappa and Anil Kumar Singh [6] proposed a Java implementation of an extended word alignment algorithm based on the IBM models. They were able to improve the performance by introducing a similarity measure (Dice coefficient) and by using a list of cognates and a morphological analyzer. Li and Chengqing Zong [11] addressed word alignment between sentences with different valid word orders by changing the order of the word sequences (called word reordering) of the output hypotheses so that the word order more closely matches the alignment reference.
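To make the Dice coefficient mentioned above concrete, here is a minimal Python sketch, assuming the common co-occurrence form Dice(s, t) = 2 C(s, t) / (C(s) + C(t)), where counts are taken over sentence pairs. This is an illustration only and is not the implementation of [6]; the function name `dice_scores` and the English-German toy pairs are assumptions chosen because they are easy to read.

```python
from collections import Counter


def dice_scores(sentence_pairs):
    """Score every co-occurring (source word, target word) pair with Dice.

    Counts are per sentence pair: a word is counted once per sentence it
    appears in, however many times it repeats inside that sentence.
    """
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_types, tgt_types = set(src.split()), set(tgt.split())
        src_counts.update(src_types)
        tgt_counts.update(tgt_types)
        pair_counts.update((s, t) for s in src_types for t in tgt_types)
    return {
        (s, t): 2 * c / (src_counts[s] + tgt_counts[t])
        for (s, t), c in pair_counts.items()
    }


# Toy sentence-aligned corpus (illustrative, not from the paper).
toy_pairs = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]
scores = dice_scores(toy_pairs)
print(scores[("the", "das")])   # the two words always occur together
print(scores[("the", "haus")])  # they co-occur in only one of the pairs
```

Word pairs that always appear together score 1.0, while pairs that only sometimes co-occur score lower, which is why such a measure can help rank candidate links.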
The K-vec algorithm [13] makes use of word position and frequency features to find word correspondences using Euclidean distance. Ittycheriah and Roukos [8] proposed a maximum entropy word aligner for Arabic-English machine translation. Martin et al. [9] have discussed word alignment for languages with scarce resources. Bing Xiang, Yonggang Deng and Bowen Zhou [1] presented "Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages". This approach was applied to an English-to-Pashto translation task, combining the alignments obtained from syntactic reordering, stemming, and partial words. Jamie Brunning, Adria de Gispert and William Byrne proposed context-dependent alignment models for statistical machine translation [10]. These models lead to an improvement in alignment quality, and an increase in translation quality when the alignments are used in Arabic-English and Chinese-English translation.
Most current SMT systems [14] use a generative model for word alignment such as the one implemented in the freely available tool GIZA++ [16]. GIZA++ is an implementation of the IBM alignment models [15]. These models treat word alignment as a hidden process, and maximize the probability of the observed (e, f) sentence pairs using the Expectation Maximization (EM) algorithm, where e and f are the source and the target sentences. In [4], all the experiments conducted show that the augmented approach, on multiple corpora, performs better than using GIZA++ and NATools individually for the task of English-Hindi word alignment. D. Wu (1994) [3] developed Chinese and English parallel corpora in the Department of Computer Science at the University of Science and Technology in Clear Water Bay, Hong Kong. Two important methods are applied in that work. Firstly, Gale's method is applied to Chinese and English, which shows that length-based methods give satisfactory results even between unrelated languages, a surprising result. Next, it shows the effect of adding lexical cues to a length-based method. According to these results, using lexical information increases the accuracy of alignment from 86% to 92%.
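The EM training of the IBM models described above can be sketched for the simplest case, IBM Model 1. The following Python code is a minimal illustration under stated assumptions, not GIZA++ and not the authors' system: it omits the NULL word, uses a fixed number of iterations, and the names `train_ibm_model1` and `viterbi_alignment` as well as the English-German toy corpus are hypothetical choices for this example. Following the text, e denotes a source word and f a target word, and the table holds t(f | e).

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    sentence_pairs is a list of (source_tokens, target_tokens).
    Simplification: no NULL source word is added.
    """
    tgt_vocab = {f for _, tgt in sentence_pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Start from a uniform table over word pairs that co-occur in some sentence pair.
    t = {(f, e): uniform for src, tgt in sentence_pairs for e in src for f in tgt}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being generated by e
        total = defaultdict(float)  # expected count of e generating any word
        # E-step: distribute each target word over the source words of its pair.
        for src, tgt in sentence_pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts into probabilities.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def viterbi_alignment(t, src_tokens, tgt_tokens):
    """For each target word, return the index of its most probable source word."""
    return [
        max(range(len(src_tokens)), key=lambda i: t.get((f, src_tokens[i]), 0.0))
        for f in tgt_tokens
    ]


# Toy sentence-aligned corpus (illustrative, not from the paper).
toy_pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
t_table = train_ibm_model1(toy_pairs, iterations=10)
print(viterbi_alignment(t_table, ["the", "house"], ["das", "haus"]))
```

Although "house" co-occurs equally often with "das" and "haus" in this toy corpus, the EM iterations shift probability mass toward "haus", because "das" is better explained by "the" in the other sentence pair. The higher IBM models add position and fertility parameters on top of this lexical table.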
A hybrid approach to align sentences and words in English-Hindi parallel corpora [12] presented an alignment system that aligns English-Hindi texts at the sentence and word level in parallel corpora. They described a simple sentence length approach to sentence alignment and a hybrid, multi-feature approach to perform word alignment. They use regression techniques in order to learn parameters which characterize the relationship between the lengths of two sentences in parallel text. They used a multi-feature approach with dictionary lookup as a primary technique and other methods such as local word grouping, transliteration similarity (edit-distance) and a nearest aligned neighbors approach to deal with many-to-many word alignment. Their experiments are based on the EMILLE (Enabling Minority Language Engineering) corpus. They obtained 99.09% accuracy for many-to-many sentence alignment and 77% precision and 67.79% recall for many-to-many word alignment.
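A rough Python sketch of two of the features named above, dictionary lookup as the primary technique and edit-distance-based transliteration similarity as a fallback, is given below. It is an assumption-laden illustration and not the system of [12]: the function names, the similarity threshold of 0.5, the tiny dictionary and the romanized toy sentence are all hypothetical, and the similarity measure presumes both tokens are written in the same script (for example after romanization).

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def transliteration_similarity(a, b):
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def align_words(src_tokens, tgt_tokens, dictionary, threshold=0.5):
    """Link word pairs by dictionary lookup first, then by string similarity."""
    links = []
    for i, s in enumerate(src_tokens):
        for j, t in enumerate(tgt_tokens):
            if t in dictionary.get(s, ()):
                links.append((i, j, "dictionary"))
            elif transliteration_similarity(s, t) >= threshold:
                links.append((i, j, "transliteration"))
    return links


# Hypothetical bilingual dictionary and romanized toy sentence pair.
toy_dictionary = {"is": {"hai"}, "big": {"badi", "bada"}}
links = align_words(["delhi", "is", "big"], ["dilli", "badi", "hai"], toy_dictionary)
print(links)
```

Checking the dictionary before the string similarity reflects the description of dictionary lookup as the primary technique; the similarity fallback mainly helps with names and loanwords that a dictionary is unlikely to list.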
3 ALIGNMENT MODEL
Aligning translated segments with source segments is essential for building parallel corpora. Alignment is a central issue in the construction and exploitation of parallel corpora. One of the central modeling problems in statistical machine translation (SMT) is alignment between parallel texts. The task of an alignment method is to identify translation equivalence between sentences, and between words and phrases within sentences. In most of the literature, alignment methods are categorized as either association approaches or estimation approaches (also called heuristic models and statistical models, respectively). Association approaches use string similarity measures, word order heuristics, or co-occurrence measures (e.g. mutual information scores).
The central distinction between statistical and heuristic approaches is that statistical approaches are based on well-founded probabilistic models while heuristic ones are not. Estimation approaches use probabilities estimated from parallel corpora, inspired by statistical machine translation, where the computation of word alignments is part of the computation of the translation model.