International Journal of Scientific & Engineering Research, Volume 3, Issue 2, February-2012

ISSN 2229-5518

Preprocessing in Web Usage Mining

Marathe Dagadu Mitharam

ABSTRACT - Web usage mining discovers the usage history of users who log in to web-based applications. It applies data mining techniques to extract useful information from server log files. It is the automatic discovery of patterns in clickstreams and associated data collected or generated as a result of user interactions with one or more Web sites.

Goal - Analysis of user interaction with various websites.

Web usage mining consists of the following phases:

1) Pre-processing

2) Pattern discovery

3) Pattern analysis

This paper describes the first phase in detail.

—————————— ——————————

1) INTRODUCTION:-

The process may involve pre-processing the original data, integrating data from multiple sources, and transforming the integrated data into a form suitable for input into specific data mining operations. Collectively, we refer to this process as data collection. After data collection comes preprocessing.
Preprocessing includes the fusion and synchronization of data from multiple log files, data cleaning, pageview identification, user identification, session identification (or sessionization), episode identification, and the integration of clickstream data with other data sources such as content or semantic information.

2) DATA PRE-PROCESSING:-

Fig. 1 shows the process of web usage mining. This section briefly discusses the pre-processing phase. Pre-processing works on web log files or other raw log files, and the web usage mining process is incomplete without it. We now examine some of the essential tasks in pre-processing. Data preprocessing depends on the server log file.

Format of Server Log File:-

IJSER © 2012

http://www.ijser.org


Stages shown in the figure, in order:
Web & Application Server
Data Pre-processing: Data Cleaning, Pageview Identification, User Identification, Sessionization, Path Completion
Usage Mining: Transaction Clustering, Pageview Clustering, Association Rules, Sequential Patterns
Pattern Analysis: Pattern Filtering, Aggregation, Characterization

Fig. 1 Web usage mining process

2.1) Data Fusion and Cleaning:-


In some cases, multiple servers are used to reduce the load on any particular server. Data fusion refers to the merging of log files from several Web and application servers. When a user's requests are served by multiple Web or application servers, data fusion merges the data so that later steps such as user identification and session identification can work on a single log.
Stages shown in the figure, in order:
Server Logs, Server Logs, Server Logs
Merge Log Files (Data Fusion)
User Identification, Session etc.

Fig. 2 Data Fusion
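The merging step can be sketched in Python as follows. This is only a minimal illustration, assuming that each server log is already sorted by time and that the server clocks are synchronized. The `LogEntry` type, the `fuse_logs` function, the encoding of time as minutes, and the way the sample requests are split across two servers are all hypothetical and not taken from the paper.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class LogEntry(NamedTuple):
    """One request in a server log (hypothetical record layout)."""
    time: int       # minutes since midnight (assumed encoding)
    ip: str
    url: str
    referrer: str
    agent: str


def fuse_logs(*server_logs: Iterable[LogEntry]) -> Iterator[LogEntry]:
    """Merge per-server logs, each already sorted by time, into one stream."""
    return heapq.merge(*server_logs, key=lambda entry: entry.time)


# Hypothetical split of a few requests across two servers.
server_a = [
    LogEntry(4, "192.168.100.101", "A", "-", "IE5; Win2k"),
    LogEntry(28, "192.168.100.101", "C", "B", "IE5; Win2k"),
]
server_b = [
    LogEntry(10, "192.168.100.101", "B", "A", "IE5; Win2k"),
    LogEntry(12, "192.168.100.102", "A", "-", "IE6;Xp"),
]

fused = list(fuse_logs(server_a, server_b))
```

After fusion, the requests of one user appear in a single time-ordered list even though they were served by different machines.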

Data cleaning mostly removes extraneous references to embedded objects that may not be important for the purpose of analysis, including references to style files, graphics, or sound files. Data cleaning is applied whenever a log entry provides no useful information for the analysis or data mining task. It also removes erroneous references.
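A simple way to remove references to embedded objects is to filter requests by file extension. The sketch below assumes this approach. The extension list, the function names and the sample URLs are illustrative assumptions, and a real site would need its own list.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed extensions for style files, graphics, and sound files.
EMBEDDED_EXTENSIONS = {".css", ".gif", ".jpg", ".jpeg", ".png", ".ico", ".wav", ".mp3"}


def is_page_request(url: str) -> bool:
    """Return True if the URL does not point to an embedded object."""
    path = urlsplit(url).path                      # drop any query string
    extension = posixpath.splitext(path)[1].lower()
    return extension not in EMBEDDED_EXTENSIONS


def clean(urls):
    """Keep only the requests that are relevant for usage analysis."""
    return [url for url in urls if is_page_request(url)]


# Hypothetical requested URLs.
requests = [
    "/index.html",
    "/style/site.css",
    "/img/logo.GIF",
    "/products?id=7",
    "/sound/intro.wav",
]
cleaned = clean(requests)
```

Only `/index.html` and `/products?id=7` remain in `cleaned`; the style, graphics, and sound references are dropped.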


2.2) User Identification:-

Web usage mining does not require prior knowledge of a user's history, because a user visits the server, or sends requests to it, more than once. Visiting more than once generates multiple sessions for each user. The sequence of a user's logged activities is also known as the user activity record. Users are identified by the IP address and user agent in the log files: when a client sends a request, the server writes a log entry, and the client also sends its user agent to the server along with the request.
Consider, for instance, the example of Fig. 3, which depicts a portion of a partly preprocessed log file (the time stamps are given as hours and minutes only).

The combination of IP address and user agent lets us find the users. Fig. 3 shows the IP address and user agent of each request, from which we identify the users.

TIME   IP                URL   REFF   AGENT
0:04   192.168.100.101   A     -      IE5; Win2k
0:10   192.168.100.101   B     A      IE5; Win2k
0:12   192.168.100.102   A     -      IE6;Xp
0:15   192.168.100.102   B     A      IE6;Xp
0:20   192.168.100.102   C     B      IE6;Xp
0:25   192.168.100.102   D     C      IE6;Xp
0:28   192.168.100.101   C     B      IE5; Win2k
0:33   192.168.100.101   D     C      IE5; Win2k
0:35   192.168.100.102   D     C      IE6;Xp

Fig. 3 Log file

Fig. User 1

Fig. User 2

Fig. 3 shows that IP 192.168.100.101 visits more than once, and so does 192.168.100.102. We identify each user by the combination of IP address and user agent.
In the figures above (User 1 and User 2), the users are separated by IP address and user agent.
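The grouping by IP address and user agent can be sketched as follows, using the rows of Fig. 3 as input. The function name `identify_users` and the tuple layout are assumptions made for this illustration.

```python
from collections import defaultdict

# (time, ip, url, referrer, agent) rows copied from Fig. 3.
LOG = [
    ("0:04", "192.168.100.101", "A", "-", "IE5; Win2k"),
    ("0:10", "192.168.100.101", "B", "A", "IE5; Win2k"),
    ("0:12", "192.168.100.102", "A", "-", "IE6;Xp"),
    ("0:15", "192.168.100.102", "B", "A", "IE6;Xp"),
    ("0:20", "192.168.100.102", "C", "B", "IE6;Xp"),
    ("0:25", "192.168.100.102", "D", "C", "IE6;Xp"),
    ("0:28", "192.168.100.101", "C", "B", "IE5; Win2k"),
    ("0:33", "192.168.100.101", "D", "C", "IE5; Win2k"),
    ("0:35", "192.168.100.102", "D", "C", "IE6;Xp"),
]


def identify_users(log):
    """Group log rows into user activity records keyed by (IP, user agent)."""
    users = defaultdict(list)
    for time, ip, url, referrer, agent in log:
        users[(ip, agent)].append((time, url, referrer))
    return dict(users)


users = identify_users(LOG)
user_1 = users[("192.168.100.101", "IE5; Win2k")]
user_2 = users[("192.168.100.102", "IE6;Xp")]
```

Here `user_1` holds the four requests of 192.168.100.101 and `user_2` the five requests of 192.168.100.102, each still in time order.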

2.3) SESSION:-

A session is a sequence of page views by a single user during a single visit. Sessionization splits the user activity record of each user in the log files into sessions, each showing the web pages a single user visited. In ASP or ASP.NET, the session object is used to manage the login status of a single user. The same idea is used in web usage mining to find how many sessions a single user creates on a website. Sessions are partitioned after user identification.
Sessions are captured in two ways:- 1) Time oriented
2) Structure oriented
Time Oriented:- The time-oriented approach depends on the time stamps (date and time of each request) in the server log file. It applies two rules: i) the difference between the first request and the last request of a session is <= 30 minutes; ii) the difference between one request and the next request is <= 10 minutes. Using these two rules we determine time-oriented sessions.
In Fig. User 2 above, the first request is at 0:12 and the last request at 0:35, so the difference between them is <= 30 minutes, and the difference between every pair of consecutive requests is <= 10 minutes; these requests therefore form one session. If the log is extended with further requests, the output changes.


e.g.

TIME   IP                URL   REFF   AGENT
0:12   192.168.100.102   A     -      IE6;Xp
0:15   192.168.100.102   B     A      IE6;Xp
0:20   192.168.100.102   C     B      IE6;Xp
0:25   192.168.100.102   D     C      IE6;Xp
0:35   192.168.100.102   D     C      IE6;Xp
0:45   192.168.100.102   E     D      IE6;Xp
0:49   192.168.100.102   F     C      IE6;Xp
0:55   192.168.100.102   G     F      IE6;Xp

Fig. 4 Log file
Applying the time-oriented rules to the log file in Fig. 4 generates two sessions: the request at 0:45 arrives more than 30 minutes after the first request at 0:12, so it starts a second session.

Session 1

Session 2 is shown as follows:

Session 2
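The two time-oriented rules can be sketched in Python as follows. This sketch assumes that both rules are checked together in a single pass and that a request violating either rule opens a new session. The function names and parameter names are illustrative, not part of the paper. The input is the list of time stamps from Fig. 4.

```python
def to_minutes(stamp: str) -> int:
    """Convert an 'H:MM' time stamp into minutes."""
    hours, minutes = stamp.split(":")
    return int(hours) * 60 + int(minutes)


def time_oriented_sessions(stamps, max_session=30, max_gap=10):
    """Split one user's time-ordered requests into sessions.

    A request stays in the current session only if it is at most
    `max_session` minutes after the session's first request and at most
    `max_gap` minutes after the previous request.
    """
    sessions = []
    for stamp in stamps:
        now = to_minutes(stamp)
        if sessions:
            current = sessions[-1]
            within_session = now - to_minutes(current[0]) <= max_session
            within_gap = now - to_minutes(current[-1]) <= max_gap
            if within_session and within_gap:
                current.append(stamp)
                continue
        sessions.append([stamp])   # start a new session
    return sessions


# Time stamps of user 192.168.100.102 from Fig. 4.
FIG4_TIMES = ["0:12", "0:15", "0:20", "0:25", "0:35", "0:45", "0:49", "0:55"]
sessions = time_oriented_sessions(FIG4_TIMES)
```

Under these assumptions the requests from 0:12 to 0:35 form the first session and the requests from 0:45 to 0:55 form the second, because 0:45 is 33 minutes after 0:12.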

Structure Oriented:- The structure-oriented approach uses the referrer field of the server log. A request is added to a session that is currently open for that user if its referrer was already visited within that session; otherwise the request starts a new session. As a result, a request may belong to more than one "open" constructed session.
e.g.
Fig. 5 Login status for 101 and 102 IP
Fig. 5 above illustrates a structure-oriented session: the session of user 192.168.100.102 is open for the time stamps 0:12 to 0:25 and 0:58, and it is considered one session.

TIME   IP                URL   REFF   AGENT
0:12   192.168.100.102   A     -      IE6;Xp
0:15   192.168.100.102   B     A      IE6;Xp


Session 1
Fig. 6 Example of a session with the structure-oriented approach.
The time-oriented approach generates two sessions, as below, because the difference between the first and last request is > 30 minutes.
Session 1
Session 2
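A referrer-based sessionizer can be sketched as follows. This is a minimal sketch with several assumptions. When more than one open session contains the referrer, the most recently opened one is chosen, a tie-break rule the paper does not state. In the sample data, only the rows at 0:12 and 0:15 come from Fig. 6; the rows at 0:20 and 0:25 mirror Fig. 4, and the row at 0:58 is hypothetical, since only its time stamp appears in the text.

```python
def referrer_sessions(requests):
    """Build sessions for one user from (time, url, referrer) rows in time order.

    A request joins a session in which its referrer was already visited;
    a request with no referrer ('-') or an unseen referrer opens a new session.
    """
    sessions = []
    for time, url, referrer in requests:
        target = None
        if referrer != "-":
            # Assumed tie-break: prefer the most recently opened session.
            for session in reversed(sessions):
                if any(seen_url == referrer for _, seen_url, _ in session):
                    target = session
                    break
        if target is None:
            target = []
            sessions.append(target)
        target.append((time, url, referrer))
    return sessions


requests = [
    ("0:12", "A", "-"),   # from Fig. 6
    ("0:15", "B", "A"),   # from Fig. 6
    ("0:20", "C", "B"),   # mirrors Fig. 4
    ("0:25", "D", "C"),   # mirrors Fig. 4
    ("0:58", "E", "D"),   # hypothetical row
]
sessions = referrer_sessions(requests)
```

With this input all five requests fall into a single session, because each referrer was visited earlier in that session, even though the last request is more than 30 minutes after the first.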

2.4) PATH COMPLETION:-

Path completion is also a preprocessing task. After the sessions are complete we start path completion, because this phase confirms how the user visited the web pages. Path completion depends mostly on the URL and REFF fields in the server log file. It also uses a graph model. The graph model represents some relation defined on Web pages (or the web), and each tree of the graph represents a web site. Each node in the tree represents a web page (HTML document); edges between trees represent the links between web sites, while edges between nodes inside the same tree represent links between documents at a web site.
Path completion also deals with missing references. A missing reference arises when the user backtracks: the page is served from the client-side cache, so the backtracking step is not stored in the server log file.
e.g.
URL   Reff
A     --
B     A
D     B
E     D
F     E
C     B


Then we draw the structure of the visit in Fig. 7.

Nodes shown in the figure: A, B, C, D, E, F

Fig. 7 Web site structure
Fig. 7 above shows the visited web pages as recorded in the server log file. The dotted arrow shows a backtrack, that is, a click on the back button; this information is not stored in the server log file but only on the client side. It is known as a missing reference.
URL   Reff
A     --
B     A
D     B
E     D
F     E
C     B

The chart above shows pages A to F visited link by link, but the last request goes from B to C. At that point the user went from F back to B as a backtrack, and this information is not stored in the server log file: the user clicked the back button, and that information is stored only on the client side.
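The missing backtrack can be inferred from the referrer field. The sketch below assumes that the user returned to the referring page by pressing the back button once per page, so the intermediate pages are inserted as well; the paper's figure shows only a single dotted arrow from F to B. The function name `complete_path` is illustrative. The input is the URL/Reff chart above, read as a final request for C with referrer B.

```python
NO_REFERRER = ("-", "--")


def complete_path(log):
    """Insert inferred back-button steps into a session's page sequence.

    `log` is a list of (url, referrer) pairs in request order.
    """
    path = []    # completed sequence of page views
    stack = []   # pages reachable with the back button, current page last
    for url, referrer in log:
        mismatch = bool(stack) and referrer not in NO_REFERRER and stack[-1] != referrer
        if mismatch and referrer in stack:
            # Assumed behaviour: one back-button click per page until the
            # referring page is the current page again.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])
        stack.append(url)
        path.append(url)
    return path


LOG = [("A", "--"), ("B", "A"), ("D", "B"), ("E", "D"), ("F", "E"), ("C", "B")]
completed = complete_path(LOG)
```

Under this assumption the completed path is A, B, D, E, F, E, D, B, C: the views of E, D and B after F are the cached pages that never reached the server log.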

3) CONCLUSION:-

This paper has described the preprocessing phase of web usage mining. The proposed methods were successfully tested on log files. Readers who want to examine user identification, session identification, and path completion can refer to this paper. The results obtained after the analysis were satisfactory and contained valuable information about the log files.

Author Name:- Dagadu Mitharam Marathe
R.C.Patel A.C.S. College, Shirpur,
Maharashtra (INDIA).
At/post- Thalner, Tal- Shirpur, Dist- Dhule (MS), India
