Inte rnatio nal Jo urnal o f Sc ie ntific & Eng inee ring Re se arc h Vo lume 3, Issue 2 , Fe bruary -2012 1

ISSN 2229-5518

Web Mining Using Topic Sensitive Weighted

PageRank

Shesh Narayan Mishra, Alka Jaiswal, Asha Ambhaikar

Abs tractThe World Wide Web contains the large amount of inf ormation sources. While searching the w eb for particular topics, users usually f etch irrelevant and redundant inf ormation causing a w aste in user time and accessing time of the search engine. So narrow ing dow n this problem, user’s interests and needs from their behavior have become increasingly important. Web structure mining p l ays an eff ective role in this approach. Some page ranking algorithms Page Rank, Weighted Page Rank are commonly used in w eb structure min ing. The original PageRank algorith m search-query results independent of any particular search query. To yield more specif ic and accurate search results against a particular topic, w e proposed a new algorithm Topic Sensitive Weighted PageRan k based on w eb structure mining that w ill show the relevancy of the pages of a given topic is better determined, as compared to the existin g PageRank, Topic sensitive PageRank and Weighted PageRank algorithms. For ordinary keyw ord search queries, Topic Sensitive Weigted PageRan k scores w ill satisfy the topic of the query.

Inde x TermsWeb structure mining; Weighted Page Rank; Topic sensitive PageRank; TSWPR.

—————————— ——————————

1 INTRODUCTION

TODAY, the World W ide Web is the popula r a nd inte ractive me dium to dissemina te informa tion. The Web is huge , diverse a nd dyna mic. The We b conta ins vas t amount of informa tion
a nd provides a n access to it a t a ny place a t any time. The mos t of the pe ople use the interne t for retrie ving information. But mos t of the time , they ge ts lots of ins ignifica nt a nd irre le va nt document eve n a fter na vigating se vera l links . For re trie ving informa tion from the We b, We b mining techniques are use d.

2 WEB M INING OV ERV IEW

Web mining is a n a pplica tion of the data mining techniques to a utoma tica lly dis cover a nd e xtract knowle dge from the Web.
According to Kosa la e t a l [3], Web mining cons is ts of the fo l- lowing ta s ks :

Resource finding: the task of retrieving intended Web documents.

Informat ion select ion and pre-processing: a utoma tica lly se lecting a nd pre-process ing s pecific informa tion from re trie ve d Web re- s ources.

———— ——— ——— ——— ———

 Mr. S hesh Naray an Mishra is pre se ntly study ing in the final se me- ste r o f M. Tec h (Co mpute r Tec hno logy) at Rung ta Co llege o f Eng i- nee ring and Tec hno logy, Bhilai, C.G., India. Has obtaine d deg ree o f MCA fro m CSVTU, Bhilai Unive rsity in 2009. Re se arc h Inte rest in-

c ludes We b Mining Tec hniques

Alka Ja iswa l presently working a s Asst. Prof. in the Depa rtment of Infor-

ma tion Technology a t Rungta College of Engineering a nd Tec hnology, Bhi-

la i, C.G., India. Ha s been awa rded degree of M.Tech (Informa tion Security)

with honors from Na tional Institute of Technology, Bhopa l (M.P.), India in

2010. Resea rch Interest includes Da ta base Security, Da ta Mining. 1 Re- sea rch pa per in interna tiona l Journal has been published.

Prof. Asha Ambhaikar is currently working as Associate Professor in Com- puter Science Engineering in RCET Bhilai, India, PH -09229655211. E- mail: asha31.a@rediffmail.com

Generalizat ion: a utoma tica lly discovers ge nera l pa tterns at indi- vidua l Web s ites as we ll as across multiple s ites .

Analysis: va lida tion a nd/or inte rpre ta tion of the mine d pa t- terns .

The re a re three a reas of We b mining according to the usage of the Web da ta use d as input in the da ta mining process, na me ly, Web Conte nt Mining (W CM), Web Usage Mining (W UM) a nd Web Structure Mining (WSM).

Web conte nt usage mining, We b s tructure mining, a nd Web content mining. Web usage mining refe rs to the discove ry of user access pa tterns from Web usa ge logs . Web s tructure mining tries to dis cove r use ful knowle dge from the s tructure of hype rlinks which he lps to investiga te the node a nd connection s tructure of web s ites . According the type of web s tructura l data , we b structure mining ca n be divide d into two kinds 1) e xtra cting the docume nts from hype rlinks in the we b 2) a na lys is of the tree -like s tructure of page s tructure. Base d on the topology of the hype r links , we b s tructure mining will ca tegorize the web page a nd ge nera te the information,

IJSER © 2012

http :// www.ijser.org

Inte rnatio nal Jo urnal o f Sc ie ntific & Eng inee ring Re se arc h Vo lume 3, Issue 2 , Fe bruary -2012 2

ISSN 2229-5518

s uch as the s imila rity and mining is conce rne d with the re trieva l of informa tion from WWW into more s tructure d form a nd inde xing the informa tion to retrie ve it quickly. We b usage mining is the process of ide ntifying the brows ing pa tterns by a na lyzing the us- er’s naviga tiona l be ha vior. We b s tructure mining is to discover the mode l unde rlying the link s tructures of the Web pa ges, ca ta log the m a nd ge ne rate informa tion s uch as the s imila rity a nd re latio n- s hip betwee n the m, ta king a dva nta ge of the ir hype rlink topology. Web class ifica tion is s hown in Fig 1.

2.1 Web Content Mining (WCM)

Web Content Mining is the process of e xtracting use ful informa- tion from the conte nts of we b docume nts . The web documents ma y cons is ts of te xt, ima ges, a udio, vide o or s tructure d records like tables a nd lis ts . Mining ca n be a pplie d on the web documents as we ll the res ults pa ges produce d from a search engine . The re are two types of a pproach in content mining ca lle d age nt base d a p- proa ch a nd da ta base base d a pproa ch. The age nt base d a pproach conce ntra te on sea rching re leva nt informa tion us ing the cha racte- ris tics of a particular doma in to interpre t a nd organize the co l- lecte d informa tion. The data base a pproach is use d for re trie ving the se mi-s tructure da ta from the web.

2.2 Web Usage Mining (WUM)

Web Usa ge Mining is the process of e xtra cting use ful information from the secondary da ta derive d from the inte ractions of the user while s urfing on the Web. I t e xtracts data s tore d in serve r access
a nothe r page ma y be cons ide re d a s a vote. Howe ver, not only the numbe r of votes a page rece ives is cons idere d importa nt, but the “importa nce ” or the “re leva nce ” of the ones tha t cas t these votes as we ll.
As s ume a ny a rbitrary pa ge A ha s pa ges T1 to Tn point-
ing to it (incoming link). Pag eRank ca n be ca lcula te d by the follo w- ing.

PR(A) = (1- d) + d(PR(T1) /C(T1) + ...+ PR( Tn /C(Tn )) (1)

The pa rame ter d is a da mping factor, us ua lly se ts it to 0.85 (to s top the other pa ges ha ving too much influe nce , this tota l vote is “da mpe d down” by mult iplying it by 0.85). C(A) is de fine d as the numbe r of links going out of pa ge A. The PageRanks form a prob- a bility dis tribution ove r the Web pages , so the s um of a ll Web pa ges ’ PageRank will be one . Pag eRank ca n be ca lcula te d us ing a s imple ite ra tive a lgorithm, a nd corres ponds to the principa l e i- ge nvector of the norma lize d link ma trix of the We b.

3.2 Weighted PageRank

We npu Xing a nd Ali Ghorba ni [1] propose d a Weig hted Pag eRank (WPR) a lgorithm which is a n e xte ns ion of the PageRank a lgorithm. This a lgorithm ass igns a la rge r ra nk va lues to the more importa nt pa ges ra ther tha n dividing the ra nk va lue of a pa ge eve nly a mong its outgoing linke d pa ges . Ea ch outgoing link ge ts a va lue propo r- tiona l to its importa nce .
The importa nce is ass igne d in te rms of we ight va lues to the in-
logs, re ferrer logs , age nt logs, clie nt-s ide cookie s, user profile a nd
coming a nd outgoing links a nd a re de note d

W(m,n) as a nd

me ta da ta .

out

(m,n)

res pective ly. W in ,n)

as s hown in (2) is the we ight of

2.3 Web Structure Mining

The goa l of the Web Structure Mining is to ge nera te the s tructura l s umma ry about the Web s ite a nd We b pa ge. I t tries to discove r the link s tructure of the hype rlinks a t the inter -docume nt le ve l. Base d

link(m, n) ca lcula te d base d on the number of incoming links of

pa ge n a nd the numbe r of incoming links of a ll re fere nce pages of
pa ge m.
on the topology of the hype rlinks , Web Structure mining will ca te- gorize the Web pa ges a nd gene ra te the informa tion like s imilarity a nd re la tions hip be tween diffe re nt We b s ites . This type of mining ca n be performe d a t the docume nt le ve l (intra -pa ge) or a t the hype rlink le ve l (inte r-page ). I t is importa nt to unde rs ta nd the Web

W in

(m, n)

out

 In

 Ip

pR(m)

On

(2)
da ta s tructure for I nforma tion Re trieva l.

3 RELATED WORK

W(m, n)

Op

pR(m)

(3)

3.1 PageRank

Brin a nd Pa ge de ve lope d PageRank a lgorithm during the ir Ph D a t
W here a nd Ip a re the number of incoming links of pa ge n a nd
pa ge p res pe ctive ly. R(m) de notes the re fe rence pa ge lis t of page

out

(m,n)

is as s hown in (3) is the we ight of link(m, n) ca lcula te d
Sta nford Univers ity base d on the cita tion a na lys is . Pag eRank a lgo-
rithm is use d by the famous sea rch engine , Google . The y a pplie d
the cita tion a na lys is in Web sea rch by treating the incoming links
as cita tions to the Web pages . Howe ver, by s imply a pplying the cita tion a na lys is techniques to the diverse set of Web docume nts did not res ult in e fficie nt outcomes . The refore, PageRank provides a more a dva nce d wa y to compute the importa nce or re le va nce of
base d on the numbe r of outgoing links of page n a nd the number of outgoing links of a ll re fe re nce pa ges of m. W here On and Op a re the number of outgoing links of pa ge n a nd p res pective ly. The formula as propose d by We npu e t a l for the WPR is as s hown in (4) which is a modifica tion of the Pag eRank formula .
a Web pa ge tha n s imply counting the numbe r of pa ges tha t are

WPR(n)  (1  d)  d 

Win

Wout

(4)
lin king to it (ca lle d as “back links ”).
I f a back lin k comes from a n “importa nt” pa ge , the n tha t back lin k is give n a higher we ighting tha n those back links comes from non-importa nt pages . I n a s imple way, link from one pa ge to

IJSER © 2012

http :// www.ijser.org

mB(n)

WPR(m)

(m,n)

(m,n)

Inte rnatio nal Jo urnal o f Sc ie ntific & Eng inee ring Re se arc h Vo lume 3, Issue 2 , Fe bruary -2012 3

ISSN 2229-5518

3.3 Topic Sensitive PageRank

I n Topic Sens itive Pa ge Ra nk, seve ra l scores a re compute d: mul- tiple importa nce s cores for each pa ge under seve ra l to pics tha t form a compos ite Pa ge Rank s core for those pa ges ma tching the query. During the offline cra wling proce ss , 16 topic -se ns itive Pa- ge Ra nk vectors a re ge nera te d, us ing as a guide line the top-le ve l ca te gory from Ope n Directory Project (ODP). At que ry time, the s imila rity of the que ry is compa re d to each of these vectors or topics ; a nd s ubseque ntly, ins tea d of us ing a s ingle globa l ra nking ve ctor, the linea r combina tion of the topic -se ns itive vectors is
we ighe d us ing the s imila rity of the que ry to the topics . This me-
thod yie lds a very a ccura te se t of res ults re leva nt to the conte xt of
the pa rticula r query.

3.3.1 ODP Bia sing

The firs t s te p in this a pproa ch is performe d once during the o f- fline pre process ing of the Web cra wl. I t ge nera tes a se t of biase d Pa geRa nk ve ctors us ing a se t of bas is topics . The s pecific me tho- dology of crea ting the biase d Pa geRa nk vectors is to use the URLs lis te d be low the 16 top-le ve l cate gorie s in ODP. These then, con- s titute the pe rsona liza tion vectors . A s ingle , non-biase d Pa ge Rank ve ctor is a ls o compute d for compa rison purposes . Third, a set of
16 cla ss te rm-vectors is a lso compute d, which cons is ts of the terms in the doc ume nts be low ea ch of the 16 top-leve l ca tegories .

3.3.2 Query-Time Importance Score

The se cond s te p is pe rforme d a t que ry time . I f a que ry q was is- s ue d by highlighting the term q in some we b pa ge u, the n q (the conte xt of q) cons is ts of the terms in u. For ordina ry que ries not done in conte xt, le t q = q. us ing a unigra m la ngua ge mode l, with pa ra mete rs se t to the ir ma ximum -like lihood es tima tes, the class proba bilitie s for ea ch of the 16 top-leve l classes in ODP condi- tione d on q are compute d. Us ing a te xt inde x, the URLs for a ll docume nts conta ining the origina l que ry te rms q a re re trieve d. Fina lly, the que ry-se ns itive importa nce score of each of these re- trieve d URLs is compute d. The res ults a re the n ra nke d according to the compos ite score .

4 METHODOLOGY

We a re us ing We ighte d Page Ra nk to imple ment our Topic Se ns i- tive We ighte d Page Ra nk, we precompute importa nce scores o f- fline us ing We ighte d Pa ge Ra nk. For each pa ge, we compute mul- tiple importa nce s cores , with res pe ct to va rious topics . At time of query, the c ompos ite we ighte d page ra nk will be forme d by com- bining these importa nce s cores .

5 CONCLUSION

The ra pid prolife ra tion of World W ide We b has le d the web co n- te nt to increase treme ndous ly. He nce , there is a grea t re quire ment to ha ve a lgorithms tha t could lis t re leva nt we b pages accura te ly a nd e fficie ntly on the top of fe w pa ges . Mos tly sea rch e ngines use d Page Ra nk, We ighte d Pa geRa nk but us e rs may not ge t re-
quire d documents eas ily. W ith a vie w to resolve the e xis ting prob- le ms , a ne w a lgorithm ca lle d Topic Se ns itive We ighte d Pa ge Rank has bee n propose d which e mploys We b Structure mining. This a lgorithm will improve the orde r of web pa ges in the res ult lis t s o tha t use r ma y get the mos t re leva nt pa ges eas ily.

REFERENCES

[1] W. Xing and Ali Gho rbani, “We ig hte d Page Rank A lgo rithm”, Proc. ofthe Second Annua l Conference on Communica tion Networks a nd Services Resea rch (CNSR ’04), IEEE, 2004.

[2] Tahe r H. Have liwala. To pic -Se nsitiv e Page Rank: A Co nte xt-S e nsitive Ranking Algo rithm fo r We b Se arc h. IEEE Transac tio ns o n Kno wle dge and Data Eng ine e ring , Vo l. 15, No4, July /Aug ust 2003,

784-796.

[3] R. Kosala, H. Bloc kee l, “We b Mining Re se arc h: A S urvey ”, S IGKDD Explo ratio ns, Ne wsle tte r o f the ACM S pec ial Inte re st Gro up o n Kno wle dge Discove ry and Data Mining Vol. 2, No . 1 pp 1 -15, 2000.

[4] N. Duhan, A. K. S harma and K. K. Bhati a, “Pag e Ranking Algo rithms:A S urvey , Proceedings of the IEEE Interna tiona l Conference on Advance Computing, 2009.

[5] M. G. da Go mes Jr. and Z.Go ng , “We b S truc ture Mining : An Intro duc tio n”, Proceedings of the IEEE Interna tiona l Conference on Informa tion Acquisition, 2005.

[6] A. Bro de r, R. Kumar, F Mag ho ul, P. Rag hav an, S. Rajago palan, R.

S tata, A. To mkins, J. Wie ne r, “Graph S truc ture in the We b”, Computer

Networks: The Interna tiona l Journal of Computer a nd telecommunica tions

Networking, Vo l. 33, Issue 1 -6, pp 309-320, 2000.

[7] X. Wang, T. Tao , J. T. S un, A. S hake ry and C. Zhai, “Diric hle tRank: So lv ing the Ze ro -One Gap Pro ble m o f Page Rank”. ACM Tra nsaction on Informa tion Systems, Vo l. 26, Issue 2, 2008.

[8] Z. Gyo ngy i and H. Garc ia-Mo lina, “We b S pam Taxo no my ”. Proc. of the First Interna tiona l Workshop on Adversa ria l Informa tion Retrieva l on the Web”, 2005.

[9] M. Bianc hini, M.. Go ri and F. Sc arse lli, “Inside Page Rank”. ACM Tra nsa ctions on Internet Technology, Vo l. 5, Issue 1, 2005

[10] C.. H. Q. Ding, X. He, P. Husbands, H. Zha and H. D. S imo n, “Page Rank: HITS and a Uni fie d Frame wo rk fo r Link Analysis”. Proc. of the 25th Annual Interna tional ACM SIGIR Conference on Resea rch a nd Development in Informa tion Retrieva l, 2002.

[11] J. Cho and S. Roy , “Impac t o f S e arc h Eng ine s o n Page Po pularity ”.

Proc. of the 13th Interna tiona l Conference on WWW, pp. 20-29, 2004.

[12] J. Cho , S . Roy and R. E. Adams, “Page Quality : In se arc h o f an unbiase d we b ranking ”. Proc. of ACM Interna tiona l Conference on Ma nagement of Da ta ”. Pp. 551-562, 2005.

[13] A. M. Zare h Bido ki and N. Yazdani, “Distanc e Rank: An inte llige nt ranking algo rithm fo r we b pages” Informa tion Processing a nd Ma nagement, Vo l 44, No . 2, pp. 877 -892, 2008.

[14] S. Chakrabarti, B. Do m, D. Gibso n, J. Kle inbe rg , R. Kumar, P.

Rag hav an, S . Rajago palan, A. To mkins, “Mining the Link S truc ture

o f the Wo rld Wide We b”, IEEE Computer Society Press, Vo l 32, Issue 8

pp. 60 – 67, 1999.

[15] L. Page , S. Brin, R. Mo twani, and T. Winog rad, “The Page rank Citatio n Ranking : Bring ing o rde r to the We b”. Technica l Report, Sta nford Digita l Libra ries S IDL-WP-1999-0120, 1999.

[16] S. Brin, L. Page , “The Anato my o f a Large Sc ale Hy pe rte xtual We b se arc h e ng ine ,” Computer Network a nd ISDN Systems, Vo l. 30, Issue 1 -

7, pp. 107-117, 1998.

[17] J. Ho u and Y. Zhang , “Effe c tive ly Finding Re lev ant We b Page s fro m Linkage Info rmatio n”, IEEE Tra nsactions on Knowledge a nd Da ta Engineering, Vol. 15, No. 4, 2003.

[18] J. De an and M. He nzinge r, “Finding Re late d Page s in the Wo rld

Wide We b”, Proc. Eight Int’l World Wide Web Conf., pp. 389-401, 1999.

[19] R. Kumar, P. Rag hav an, S. Rajago palan, D. S iv akumar, A. To mpkins and E. Upfal, “We b as a Graph”, Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Da ta ba se systems, 2000.

[20] R. Coo ley , B. Mo bashe r and J. S riv astav a, “Web Minig: Info rmatio n

and Patte rn Discove ry o n the Wo rld Wide We b”. Proceedings of the 9th

IJSER © 2012

http :// www.ijser.org

International Journal of Scientific & E ineerirg Research Volume 3, Isste 2, February-2012 4

ISSN 2229-5518

IEEE International Conference on Tools with Artificill Intelligence, pp.

(ICfAI'97), 1997.

(21] Slll"g Jin Kim and Sarg Ho Lee, "Anhnproved Computation of the PageRank Algorithm", Jn procees of t:l-e Europ;!an Confen!rce on Information Retrieval (ECIR), 2002.

(22] Ricardo Baeza-Yates and Emilio Davis ,"Web page ranking usirg link attributes" , In procees of the 13th international World Wide Web confen!rce on Alternate track papers & r:osters, PP328-329,

2004.

IJSER 2012

tttp'((ww .,, ser .rra