International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011 1

ISSN 2229-5518

Automatic Reordering Rule Generation Based On Parallel Tagged Aligned Corpus for Myanmar-English Machine Translation

Thinn Thinn Wai, Tin Myat Htwe, Ni Lar Thein

AbstractReordering is important problem to be considered when translating between language pairs with different word orders. Myanmar is a verb final language and reordering is needed when it is translated into other languages which are different from Myanmar word order. In this paper, automatic reordering rule generation for Myanmar-English machine machine translation is presented. In order to generate reordering rules; Myanmar-English parallel tagged aligned corpus is firstly created. Then reordering rules are generated automatically by using the linguistic information from this parallel tagged aligned corpus. In this paper, function tag and part-of-speech tag reordering rule extraction algorithms are proposed to generate reordering rules automatically. These algorithms can be used for other language pairs which need reordering because these rules generation is only depend on part-of-speech tags and function tags.

Index TermsConstituent Analysis, English-Myanmar Machine translation, parallel tagged aligned corpus, Reordering, Syntactic Analysis,

1 INTRODUCTION

—————————— ——————————
he goal of statistical machine translation is to translate an input word sequence in the source language into a target language word sequence. In order to improve the translation process, it is poss-
ible to perform preprocessing steps before training and
translation in both source and target language sequence.
In machine translation, reordering is one of the major problems, since different languages have different word order requirements. When a Myanmar sentence is trans- lated into English sentence, the verb in the Myanmar sen- tence must be moved after the subject of the English sen- tence in order to obtain the correct word order. On a sub sentential level, Myanmar word order diverges from Eng- lish mostly within the noun phrase and verb phrase. Mo- reorver, there are many particles that support noun, ad- jective, and verb in Myanmar Language. They are subject marker particles, object marker particles, adjective sup- port particles and verb support particles. These particles do not exist in English and their missing can make the translation error. So, each particle is needed to move its respective places scuch as beside a noun, verb and so on. To allievate the tag missing, moving these particles to their respective places is essential. Without reordering, the particles can be far from their relative nouns, verbs and adjectives and the correct word order can‘t be ob- tained. In addition to this, the meaningful translation can‘t also be obtained. Therefore, reordering is necessary for translation from Myanmar language to English Lan- guage. In this work, corpus creation procedure and reor- dering rules generation procedures are proposed for Myanmar-English statistical machine translation.
The plan of this paper is as follows. In the next
section, related works which use reordering approaches
in a preprocessing step are reviewed. In Section 3, the significant differences of word order in English language and Myanmar language. Section 4 describes analysis steps and corpus creation. In Section 5, proposed reorder- ing rule extraction algorithm and reordering rules are explained in details. In the last two sections, the experi- ments are reported and then we conclude the experiments and discuss future work respectively.

2 RELATED WORK

Different approaches have been developed to deal with the word order problem. First approaches worked by constraining reordering at decoding time [7]. In [12], the alignment model introduced the restrictions in word order, which leads also to restrictions at decoding time. A comparison of these two approaches can be found in [2]. They have in common that they do not use any syntactic or lexical information; therefore they rely on a strong lan- guage model or on long phrases to get the right word order. Other approaches were introduced that use more linguistic knowledge, for example the use of bitext grammars that allow parsing the source and target lan- guage [13]. In [10], syntactic information was used to re rank the output of a translation system with the idea of accounting for different reordering at this stage. In [11], a lexicalized block-oriented reordering model is proposed that decides for a given phrase whether the next phrase should be oriented to its left or right.
The most recent and very promising approaches that
have been demonstrated reorder the source sentences based on rules learned from an aligned training corpus with a POS-tagged source side [8, 9, 20]. These rules are

IJSER © 2011

http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011 2

ISSN 2229-5518

then used to reorder the word sequence in the most likely way.
In our approach we follow the idea proposed in [20] of using a parallel training corpus with a tagged source side to extract rules which allow a reordering before the trans- lation task.

3 DIFFERENCES OF WORD ORDER BETWEEN

ENGLISH LANGUAGE AND MYANMAR LANGUAGE

When Myanmar sentence is translated to English sen- tence, many differences of word order can be found. In this section, significant word order differences; adjective movement, and adverb movement will be described.
Some adjectives (JJ) in noun chunk (NC) of Myanmar sentence are necessary to move before its relative noun (NN.Person) to obtain the correct English order. For ex- ample, when the Myanmar phrase ― ‖ is translated into the English phrase ―rich man‖, the adjective ― (JJ)‖ must be moved before its relative noun ―

(NN.Person)‖ . This can be seen in the Example (1). Example (1),


Myanmar is also modifier and adjunct proceeding language. Therefore, these adjuncts are needed to move after its relative verb to make the correct word order in English Sentence. When the Myanmar sentence ― ‖ is translated into the English sen-
tence ―He runs quickly.‖ the adverb of manner ―
must be moved behind its relative verb ‖ in order to fit
the correct English order. Such adverb movement can be
seen in the Example (2).
Example (2),

In this example, the verb particle pos tag (Sf.Dec) is al- so needed to move beside its relative verb not to miss the Myanmar word meaning. Therefore, the pos tag
―VB.Common ―and ‗Sf.Dec‖ in Myanmar phrase are combined to form only one tag ―VB.Common‖ in English phrase.
All of these above necessities, word reordering is
needed for Myanmar-Englsih statistical machine transla- tion.

4 CORPUS CREATION

Corpus creation steps are described in fig 1. For cor- pus creation, plain text corpus is used as a resource. For each sentence in the corpus, analysis process is carried
out by using Chunk-based Syntax Analyzer [23]. This Syntax Analyzer consists of two components; Chunker and Grammatical Function Tagger. In this analysis process, there are three main steps.
(1) Morpho-lexical analysis
(2) Constituent analysis and
(3) Syntax analysis
Morpho-lexical analysis and constituent analysis are accomplished by the chunker and syntax analysis is the role of grammatical function tagger.
Morpho-lexical analysis contains tokenization, word segmentation and part-of-speech tagging. In tokenization, input text is divided into units called tokens where each is either a word or something else which make the word boundary for each syllable.Then the tokenized sentence is processed again by initial segmentation with using For- ward Maximum Matching Algorithm [24]. Part-of-speech (POS) tagging marks up the words in the text with their corresponding part-of-speech such as noun, verb, and adjective and so on. For this POS tagging, Bigram Part-of- Speech Tagger [22] is used.
Constituent analysis consists of chunking and merg- ing some chunks that are necessary to merge. Chunking is done by generating CFG rules based on part-of-speech (POS) tags.
In syntax analysis, Grammatical function tagger
searches the functional relation between chunks based on dependency grammar by using Maximum likelihood Es- timation and then identifies the function of each chunk [23].
By aligning the analyzed text resulted from Analyzer,
parallel tagged aligned corpus is created. Our tagged align corpus format can be seen in fig 2.

Fig. 1. Corpus Creation Steps

As we can see in fig 2, ―Subj‖,‖Obj‖ , ―Verb‖ and ―Sf‖ are function tags of each chunk. ―NC‖,‖VC‖ and ―SFC‖ refer the relevant chunk type and ―PRN.Person‖,

IJSER © 2011

http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 9 3

ISSN 2229-5518

―NN.Objects‖, and ―Part.Type‖ are part of speech of each word. The numbers in the parentheses are alignment po- sition of function tags and part of speech tags. The first number before ―/‖ indicates the position of tags in source language and the number after ―/‖ indicates the position of tags in target language. Each chunk is separated by
―#‖.

5 REORDERING RULE EXTRACTION

By using the linguistic information from the corpus, two kinds of reordering rules are generated automatical- ly. They are function tags-based reordering rules and part-of-speech tags-based reordering rules. The former is generated for using in chunk-level reordering and the latter is for using word-level reordering. They are ex- tracted from corpus using the following rule extraction algorithms.

Thinn Thinn Wai, University of Computer Studies, Yangon, Myanmar, PH-095-09-5059987. E-mail: thin2wai@gmail.com

Ni Lar Thein , University of Computer Studies, Yangon, Myanmar. E-

5.1 Reordering Rule Generation

In order to generate reordering rules for English- Myanmar translation, there are two main cases that cause the same pattern with different reorderings. The first case is caused by active form translation and passive form translation. The second case is caused by different transla- tions.
According to the first case, reordering can be seen in

the following Example (4). Example (4),
(1)
mail: nilarthein@gmail.com

IJSER © 2011

http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011 4

ISSN 2229-5518


(2) In this example, the Myanmar Sentence"
―can be translated in two ways; active form and passive
form. According to active form translation, the translated English sentence is "Ma Ma wrote this letter.‖. In passive translation, this sentence is translated into "This letter is written by Ma Ma". According to active and passive translation, this Myanmar sentence has two reordering rules.

In the second case, some Myanmar sentences such as ―
―have two different transla-
tions such as ―A book is on the table‖ and ―There is a book on the table.‖ According to these two different tras- lation, two different reordering rules are obtained. This can be seen in Example (5),
Example (5),

(1)

POS‘s tags. Therefore, ―Subj, PObj, ObjP, Verb, Sf‖ is function rule pattern and ―NN.Person, JJ‖ is POS rule pattern. The string of numbers between ―#‖ and ―:‖is the position of source and target words and source word po- sition is divided by ―/‖ with target word position. For example, in the rhs of the first pos rule pattern―0/1, 1/ 0‖, the 1/0 means that the pos tag at the position 1,‖JJ‖ is move to the position 0. In this model, we used array structure to store the position and so the starting index is
0. Moreover, in the part-of-speech tag rule, the verb par-
ticle part-of-speech tag (Part.Support) is not in English and so there is no alignment. This tag missing can cause the error in translation and so this tag is needed to add beside its relative verb to allievate the translation error caused by tag missing. In this reordering rule generation, this problem is solved by combining soure tag positions which have same target alignment postions. This can be seen in the third part-of-speech tag rule described above. In this rule, the string after #, ―0+1/0‖ means that the words at position ―0‖and ―1‖ are move together into the position ―1‖ because they have the same target position .
The sequences ―Subj, PObj, ObjP, Verb, Sf‖ and
―VB.Common, Part.Support‖ are function and pos rule

p n

patterns (

1 ).The strings of numbers in between the

n

symbols ―#‖ and represent suggested reordering ( r1 ):

(2)


The generated reordering rule consists of two sides:
each integer after ―/‖, ri represents the new position of (the translation of) pi. The two numbers after the colon (:) are collected from training data and are respectively the number of times the rhs (reordering suggestion) of the rule observed and (inside brackets) the number of occur- rences of the rule pattern. The probability of each reorder- ing suggestion is computed as in (1).
the left-hand-side (lhs), which is a function tags or POS
tags pattern, and the right-hand-side (rhs), which corres-

P(r n / pn )  count (r1 )

ponds to a possible reordering of that pattern. Different rules can share the lhs: in such cases, the same pattern can be reordered in more than one way. Rules are weighted,
according to statistics extracted from training data. There

1 1

6 EXPERIMENTS

count ( pn )

(1)
are two kinds of reordering patterns: function tag-based, which define reordering at the clause and phrase level, and pos tag-based, which defines reordering at the word level. Let us consider the following examples:
• Rules using function tag
-Subj, PObj, ObjP, Verb, Sf#0/0, 1+2/2, 3+4/1:7(10)
-Subj, PObj, ObjP, Verb, Sf#0/2, 1+2/0, 3+4/1:3(10)
-Subj, PPla, PlaP, Verb, Sf#0/0, 1/3, 2/2, 3+4/1:4(10)
-Subj, PPla, PlaP, Verb, Sf#0/1, 1/3, 2/2, 3+4/0:6(10)
• Rules using pos tag
-NN.Person, JJ#0/1, 1/0:10(10)
-NN.Objects,CRD,Part.Type#0/1, 1+2/0:10(10)
-VB.Common, Part.Support#0+1/0:10(30)
-VB.Common, Part.Support#0/1, 1/0:20(30)
In the above rules, Subj, PObj, ObjP, Verb, Sf, PPla and PlaP are function tags and NN.Person, JJ, NN.Objects, CRD, Part.Type, VB.Common, and Part.Support are
These generated reordering rules are tested on the
Myanmar-English machine translation system. Our Expe-
riment shows that the use of reordering rules provide trans-
lation effectively. Moreover, these reordering rules can be used as a rule base for Myanmar-English machine transla- tion. Besides, they can be combined with mathematical models to create reordering model for Myanmar-English translation. By using these rules as an embedded compo- nent, Myanmar-English translation system can perform translation effectively and efficiently.

6.1 Accuracy of Reordering Rules

The purpose of this experiment was to see how many reordering rules are accurate when they are applied to the test set. The test set was obtained randomly from High School Myanmar Grammar book. The test set was split into two subsets:
 1000 simple sentences

IJSER © 2011

http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011 5

ISSN 2229-5518

 1000 compound sentences
After reordering the test set by using these generated reordering rules, the accuracy values of the reordering rules was collected for each subset on the test set. The accuracy values are given in percentage form. Human evaluation was used for evaluating how accurately the reordering rules are applied to the test set.
Table 1 shows the accuracies of the reordering rules for
each subset of English sentences on the test set. The experi- ment showed that the most common causes of errors of the reordering rules are incorrect part-of-speech tagging and function tagging. Moreover, descriptions of some function tags and pos tags are described in Table 2 and 3.

TABLE 1

ACCURACY OF REORDERING RULES

CRD

Cardinal Number

Part.Type

Particle of Cardinal Number

Part.Support

Support Particle

TABLE 3

SOME FUNCTION TAGS DESCRIPTIONS

English test subsets

Accuracy

Simple sentences

98.9%

Complex sentences

97.2%

7 CONCLUSION AND FUTURE WORK

This paper proposes automatic reordering rules genera- tion algorithms for Myanmar-English machine transla- tion.These algorithms use the parallel tagged aligned corpus as a resource. These rules are extracted based on the part-of- speech tags and function tags extracted from Chunk based Analyzer. These rule extraction algorithms can be used to reorder other language pairs those have their own analyzer because the input of these algorithm only depend on the result of Language Analyzer. In this work, rules are ex- tracted for simple sentences and complex sentences and my future work is to generate reordering rules for more complex Myanmar sentences and the to implement the novel Myanmar-English reodering model effectively.

TABLE 2

SOME PART-OF-SPEECH TAG DESCRIPTIONS

Part-of-speech Tag

Description

JJ

Adjective

NN.Person

Noun indicates person

PRN.Person

Personal pronoun

RB.Manner

Adverb of Manner

Sf.Dec

Declarative Sentence Final

VB.Common

Common Verb

NN.Objects

Noun indicates objects

IJSER © 2011

http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011 6

ISSN 2229-5518

Fig. 2. Parallel Tagged Aligned Corpus

REFERENCES

[1] C. Tillmann and H. Ney. 2002, ―Word reordering and DP beam search for statistical machine translation to appear in Computational Linguis- tics.,‖ Neurocomputing—Algorithms, Architectures and Applications, F. Fo- gelman-Soulie and J. Herault, eds., NATO ASI Series F68, Berlin: Sprin- ger-Verlag, pp. 227-236, 1989. (Book style with paper title and editor)

[2] R. Zens and H. Ney. 2003. A comparative study on reordering

constraints in statistical machine trans lation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol ume 1, pages 144–151, Sapporo, Japan.

[3] S. Vogel, F.J. Och, C. Tillmann, S. Nießen, H. Sawaf, and H.

Ney. 2000. Statistical methods for machine translation. InW.Wahlster, editor, Verbmobil: Foundations of Speech-to- Speech Translation, pages 377–393. Springer Verlag: Berlin, Heidelberg, New York.

[4] Y.Y. Wang and A. Waibel. 1997. Decoding algorithm in statis- tical translation. In Proc. 35th Annual Meeting of the Assoc. for Computational Linguistics, pages 366–372, Madrid, Spain, July.

[5] Ei Ei Han and Ni Lar Thein, "Morphological Synthesis For Myanmar Language", Proceeding of International Conference on Internet Information Retrieval, Korea, 2007.

[6] Yaser Al-Onaizan and Kishore Papineno. 2006. Distortion mod- els for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the

4th annual meeting of the ACL, pages 529–536, Sydney, Aus- tralia

[7] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra,1996. A

maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39.

[8] B. Chen, M. Cettolo, and M. Federico. 2006. Reordering rules for phrase-based statistical machine translation. In Int. Workshop on Spoken Language Translation Evaluation Campaign on Spoken Language Translation, pages 1–15.

[9] M. Popovic and H. Ney. 2006. POS-based word reorderings for

statistical machine translation. In Proc. of the 5th Int. Conf. on

Language Resources and Evaluation (LREC), page 1278, Genoa, Italy.

[10] L. Shen, A. Sarkar, and F. J. Och. 2004. Discriminative reranking for machine translation. In HLTNAACL 2004: Main Proc., page

177.

[11] C. Tillmann and T. Zhang. 2005. A localized prediction model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the As-soc. for Computational Linguistics (ACL), pages 557–564, Ann Arbor, MI.

[12] D. Wu. 1996. A polynomial-time algorithm for statistical ma- chine translation. Proc. 34th Annual Meeting of the Assoc. for Computational Linguistics, page 152.

[13] D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguis- tics, 23(3):377.

[14] Y. Zhang, R. Zens, and H. Ney. 2007. Chunk-Level Reordering

of Source Language Sentences with Automatically Learned Rules for Statistical Machine Translation. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL): Pro- ceedings of the Workshop on Syntax and Structure in Statistical Translation (SSST), pages 1–8, Rochester, NY.

[15] Myat Thuzar Tun and Ni Lar Thein, " English Syntax Analyzer for English-to-Myanmar Machine Translation", In proceedings of the Fifth International Conference on Computer Application, Myanmar, February, 8-9,2007.

[16] Myat Thuzar Tun, Tin Myat Htwe and Ni Lar Thein, "EMTM: An Effective Language Translation Model", In proceedings of International Conference on Internet Information Retrieval, Ko- rea, November 30, 2005.

[17] Shankar Kumar ―Local Phrase Reordering Models for Statistical

Machine Translation‖, Center for Language and Speech Processing, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, U.S.A.

[18] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L.

Mercer, ―The Mathematics of Statistical Machine Translation: Parameter Estimation,‖ Computational Linguistics, vol. 19(2), pp. 263–312, 1993.

[19] Kenji Yamada and Kevin Knight. 2000. A Syntax based Statis-

tical Translation Model. ACL 2000.

[20] Josep M. Crego and Jose B. Marino. 2006. Reordering Experi- ments for N-Gram-based SMT. In Spoken Language Technolo- gy Workshop, pages 242-245, Palm Beach, Aruba.

[21] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, ―BLEU: a Me- thod for Automatic Evaluation of Machine Translation‖, Asso- ciation for Computational Linguistics, 2002, pp. 311-318.

[22] Phyu Hnin Myint, Tin Myat Htwe and Ni Lar Thein. ―Bigram Part-of-Speech Tagger for Myanmar Language‖, Proceedings of International Conference on Information Communication and Management (ICICM 2011), October 14-16, 2011, Singapore.

[23] Win Win Thant,Tin Myat Htwe and Ni Lar Thein .― Syntactic Analysis of Myanmar Language‖, Proceedings of International Conference on Computer Applications (ICCA 2011), Yangon, Myanmar, May 5-6, 2011.

[24] Win Pa Pa and Ni Lar Thein. ―Myanmar Word Segmentation using Hybrid Approach.‖ In Proc. 7th International Conference for Computer Application. Yangon, Myanmar, May 5-6, 2009..

IJSER © 2011

http://www.ijser.org

International Journal of Scientific & Engineering Research Volume2, Issue 9, September-2011 7

ISSN 2229-5518

IJSER 2011

http/lwww l!ser o rg