International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015

ISSN 2229-5518


Research Scholar, Computer Science, Research and Development Centre, Bharathiar University, Coimbatore, Tamil Nadu, India.

and

Assistant Professor, Department of MCA, Govt. Thirumagal Mills College, Vellore, Tamil Nadu, India.

Data mining is a tool that supports research and allows new assertions to be made by disclosing previously hidden details in large amounts of data. Database mining is concerned with developing fast and efficient algorithms that can deal with large volumes of data, because most mining algorithms perform computation over the entire database and these databases are often very large. Association rule mining has been a topic of active research in the field of data mining. It completes its task in two steps: discovering frequent sets of items in a database, and subsequently extracting rules from these frequent itemsets.

S.Kavitha is currently pursuing a Ph.D in Computer Science at Bharathiar University and is also working as Assistant Professor, Department of Computer Applications, SRM University, Chennai, Tamil Nadu, India. E-mail: kavitha.s@ktr.srmuniv.ac.in

Dr. K.V. ArulAnandam is an Assistant Professor in Computer Science at Govt. Thirumagal Mills College, India. E-mail: sathisivamkva@gmail.com

IJSER © 2015 http://www.ijser.org


Mobile agents are software robots (mobile code) that can move autonomously from node to node within a network, executing their operations and interacting with the host environment at each node they visit. Process migration, remote evaluation and mobile objects are the concepts underlying mobile agents, an improvement over Remote Procedure Call (RPC) for distributed programming. A mobile agent is sent to an information source together with a query, and can actively interact with the information source, for example to refine the query. Because the interaction occurs in one place, the communication load is reduced. Mobile agents conserve bandwidth, reduce latencies, allow robust remote interaction and provide scalability. For a large-scale distributed heterogeneous network environment, the mobile agent paradigm is considered better than the client-server paradigm.

The main challenges of parallel and distributed association rule mining include work-load balancing, synchronization, communication minimization, finding a good data layout, data decomposition and disk I/O minimization, which is especially important for distributed association rule mining. Mobile agent technology combined with parallel and distributed association rules is used to meet these challenges, and the improvements are shown with experimental results. This module contains the following steps:

i). Find all frequent itemsets for a predetermined support.

ii). Generate the association rules from the frequent itemsets from all sites.

iii). Calculate the elapsed time with different minimum supports for the different algorithms.

iv). Calculate the elapsed time with different database sizes for the different algorithms.

This paper is organized as follows. Related work on parallel and distributed algorithms is surveyed and presented in Section 2. Section 3 provides the research methodology of the Improved Distributed Data Mining Algorithm using mobile agents. Section 4 describes parallel frequent itemset mining. Section 5 presents mobile agent based distributed data mining. Section 6 discusses the notations used for measuring the performance of the DDM algorithms. Section 7 reports the experimental results. Finally, the work of the paper is summarized in Section 8.

In the distributed environment, many algorithms have been proposed to mine all association rules in distributed databases. DARM was the first algorithm proposed for shared-nothing parallel systems, where the datasets are horizontally partitioned among different sites using a parallel version of the Apriori algorithm. The Fast Distributed Algorithm (FDM) was proposed by Cheung et al. to mine rules from distributed data sets partitioned among different sites [3]. It prunes all infrequent local candidate sets with the help of the local support counts at each site. Then, each site broadcasts messages of candidate sets to the other sites and requests their support counts. Finally, it finds the globally frequent itemsets. The ODAM algorithm was proposed based on the concepts of the FDM algorithm [4]. It has the advantage of greater message optimization and performance enhancement over FDM. Distributed Decision Miner (DDM) was proposed by Assaf Schuster and his colleagues; it generates the rules that have confidence above the threshold level without generating a rule's exact confidence. The Distributed Sampling (D-Sampling) algorithm, proposed by Assaf et al., is an extension of sequential sampling to distributed databases [5][6].

Let DB be a database with D transactions. In a distributed system, there are n sites (S1, S2, …, Sn). The database is partitioned over the n sites into DB1, DB2, …, DBn, where DB = ⋃ DBi, i = 1 to n, and DBi ⋂ DBj = ∅ for i ≠ j. Let the size of the partition DBi be Di, for i = 1, …, n. The support of an itemset S is defined as the fraction of total transactions that contain S. An itemset is called frequent if its support is above a user-specified minimum support threshold t. Let X.sup and X.supi be the support counts of an itemset X in DB and DBi respectively. Let |DB| be the number of transactions in the global database DB and |DBi| be the number of transactions in the local database DBi. X.sup is called the global support count and X.supi the local support count of X at site Si. For a given minimum support threshold minsup, X is globally frequent if X.sup ≥ minsup × D; correspondingly, X is locally frequent at site Si if X.supi ≥ minsup × Di. GFI denotes the globally frequent itemsets in DB, and LFIi denotes the locally frequent itemsets in DBi. The goal of the parallel frequent itemsets mining algorithm is to find the globally frequent itemsets GFI.
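The definitions above can be made concrete with a short sketch; the toy partitions and the `support_count` helper below are illustrative, not from the paper. Local support counts X.supi are computed per partition and summed into the global count X.sup, then compared against minsup × D.

```python
# Illustrative sketch (toy partitions, names not from the paper): local
# support counts X.sup_i are computed per partition and summed into the
# global support count X.sup, then compared against minsup * D.

def support_count(partition, X):
    """Number of transactions in the partition containing every item of X."""
    X = set(X)
    return sum(1 for t in partition if X <= set(t))

DB1 = [{"a", "b"}, {"a", "c"}, {"b", "c"}]   # partition at site S1
DB2 = [{"a", "b", "c"}, {"a", "b"}]          # partition at site S2

X = {"a", "b"}
local = [support_count(DB1, X), support_count(DB2, X)]   # X.sup_i per site
X_sup = sum(local)                                       # global X.sup
D = len(DB1) + len(DB2)

minsup = 0.5
print(local, X_sup, X_sup >= minsup * D)  # [1, 2] 3 True -> X is globally frequent
```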

**Definition 1.** X is globally large if X.sup ≥ minsup × |DB|.

**Definition 2.** X is locally large at site Si if X.supi ≥ minsup × |DBi|.

**Lemma 1.** If an itemset X is globally large, there exists a site Si (1 ≤ i ≤ n) such that X and all its subsets are locally large at site Si.

**Proof:** If X is not locally large at any site, then X.supi < minsup × Di for all i = 1, …, n. Summing over all sites, X.sup < minsup × D, so X cannot be globally large. By contradiction, X must be locally large at some site Si. Consequently, all the subsets of X must also be locally large at Si.

**Lemma 2.** If the set of globally large itemsets is GFI and the set of locally large itemsets at site Si is LFIi, then GFI ⊆ ⋃ LFIi, i = 1 to n.

**Lemma 3.** For an itemset X, denote the set of sites at which X is large as X's frequent sites, and the set of sites at which X is not large as X's infrequent sites.

**Proof:** It is clear that X's frequent sites ⋃ X's infrequent sites = {S1, S2, …, Sn} and X's frequent sites ⋂ X's infrequent sites = ∅.

**Lemma 4.** If X.minsup ≥ minsup × |DB|, then X is a globally large itemset.

**Proof:** For X to be a globally large itemset, its support count must be greater than or equal to the minimum support threshold multiplied by the number of transactions in the given database.

**Lemma 5.** If X.maxsup < minsup × |DB|, then X cannot be a globally large itemset.

**Proof:** If even the maximum possible support count of X is less than the minimum support threshold multiplied by the number of transactions in the given database, X cannot satisfy the condition for being globally large.
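Lemma 1's contradiction argument can be checked numerically; the partition sizes and threshold in the sketch below are hypothetical values chosen only for illustration.

```python
# A numeric check of Lemma 1's contradiction argument; the partition sizes
# and threshold below are hypothetical. If X just misses the local threshold
# at every site, its global count must also miss the global threshold.
minsup = 0.4
D_i = [10, 20, 30]                              # partition sizes
D = sum(D_i)

X_sup_i = [int(minsup * d) - 1 for d in D_i]    # strictly below minsup * D_i
X_sup = sum(X_sup_i)                            # global count

assert X_sup < minsup * D                       # X cannot be globally large
print(X_sup)  # 21
```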

To improve frequent pattern mining performance, many researchers distribute the mining computation over more than one processor. The transactional database (D) is divided equally among the available processors (P). Each processor gets N/P transactions, where N and P are the total number of transactions in the database and the number of processors available, respectively. Each processor counts the frequency of each item x using its local data partition, and all worker processors send their local counts to the master processor. The master processor collects all such items and combines them to generate a global count. Taking advantage of the computation power of parallel machines significantly reduces the time required to scan the dataset and generate the frequent patterns [2]. The following steps show the parallel execution [2]:

1) Server starts processing by splitting the data.

2) Server sends the data and control to each node in distributed environment.


3) Each node generates patterns, i.e., frequent itemsets, as per their sets and the support given.

4) Each node also generates association rules from the frequent itemsets as per the confidence given.

5) The nodes return the control to the server.
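The five steps above can be sketched as follows; this is a minimal illustration in which Python threads stand in for the distributed nodes, and the function names and toy data are assumptions, not from the paper.

```python
# A minimal sketch of the five steps above, assuming Python threads stand in
# for the distributed nodes (function names and toy data are illustrative).
# The server splits the data, each node counts items in its slice, and the
# master merges the local counts into global counts.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_count(partition):
    """Node step: count each item's frequency in the local partition."""
    c = Counter()
    for transaction in partition:
        c.update(set(transaction))
    return c

transactions = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]  # toy database, N = 4
P = 2                                                       # number of nodes
slices = [transactions[i::P] for i in range(P)]             # ~N/P each

with ThreadPoolExecutor(max_workers=P) as pool:
    global_count = sum(pool.map(local_count, slices), Counter())  # master merge

print(global_count["a"])  # 3 -- 'a' appears in 3 transactions
```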

Mobile agent technology is an efficient tool for searching and retrieving information, and it also reduces the connection time, the communication time and the information retrieval time [1]. From the network perspective, mobile agent technology can reduce network traffic, overcome network latencies and enhance the robustness and fault-tolerance of distributed applications. Because of these advantages, mobile agent technology is applied in distributed data mining, and parallel concepts are used to speed up the mining process [9]. The basics of the DDM algorithm using mobile agents are:

To find the local knowledge from the distributed sites.

To integrate the local knowledge in order to find the global knowledge.

To check the quality of the global knowledge.

The aim of the algorithm is to reduce the scans of the database and decrease the number of candidate global frequent itemsets in the distributed database using mobile agents. In this way, the algorithm incurs less communication overhead and improves the efficiency of mining global frequent itemsets, as proved by the experimental results.

The following notations are used in the DDM algorithms.

DB - Database.

D - Number of transactions.

n - Number of sites (S1, S2, …, Sn).

DBi - Distributed data set at Si; DB = ⋃ DBi, i = 1 to n.

X.Sup - Support count of an itemset X in DB (global).

X.Supi - Support count of an itemset X in DBi (local).

Minsup - Minimum support threshold.

GFI - Global Frequent Itemset.

CGFI - Candidate Global Frequent Itemset.

X - Global frequent item iff X.Sup >= minsup * D.

X - Local frequent item iff X.Supi >= minsup * Di.

LFIi - Local Frequent Itemset at site i.

PGFI - Possible Global Frequent Itemsets: itemsets at site i that are not part of LFIi, but whose counts, when added at the central place, convert CGFI to GFI [7].

The goal of this algorithm is to minimize scans of the database and make the data easy to use and update. Moreover, it makes each processor process independently and decreases the number of candidate global frequent itemsets according to the relation between local frequent itemsets and global frequent itemsets. It has two steps, as follows [8]:

1. Mine Local Frequent Itemsets (LFI) at each site in parallel and send them to the central site to calculate the Global Frequent Itemsets.

2. The central site calculates the Candidate Global Frequent Itemsets (CGFI) and sends them to all sites. Each local site computes counts of the CGFI and sends them back to the central site to complete the process of finding the GFI.
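Step 1 might look like the sketch below, restricted to 1-itemsets for brevity; the function name and toy partition are illustrative assumptions, not the paper's implementation.

```python
# A hypothetical sketch of step 1: a site mines its Local Frequent Itemsets
# (restricted to 1-itemsets for brevity) before sending them to the central
# site. Function name and toy partition are illustrative, not from the paper.
from collections import Counter

def local_frequent_items(partition, minsup):
    """1-itemsets whose local support count meets minsup * Di."""
    counts = Counter(item for t in partition for item in set(t))
    threshold = minsup * len(partition)
    return {item for item, c in counts.items() if c >= threshold}

DB1 = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]      # toy partition, D1 = 4
print(sorted(local_frequent_items(DB1, minsup=0.5)))   # ['a', 'b', 'c']
```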


Input: Distributed data sets DBi, i = 1 to n; minsup; distributed sites' addresses.

Output: Global Frequent Itemset (GFI).

1. Send a Mining Agent (MA) to all sites with the support value.

2. Each cooperative MAi computes LFIi in parallel.

3. Send LFIi (i = 1 to n) to the central site.

4. Compute GFI & CGFI: GFI = ⋂ LFIi, CGFI = ⋃ LFIi - ⋂ LFIi, i = 1 to n.

5. Add an itemset to GFI if its count in the frequent sites is greater than or equal to the minimum support value.

For all X ∈ CGFI do

{

If Σ i=1 to n X.Supi ≥ minsup × D Then

{

GFI = GFI ⋃ {X}; CGFI = CGFI - {X};

}

}

6. Send CGFI to each site Si, i = 1 to n.

7. Compute counts of any itemset X (X ∈ CGFI) at the infrequent sites.

For i = 1 to n do

{

Search X at site Si;

Get X.Supi and send to the central site.

}

8. Send the count of each CGFI to the central site.

9. At the central site: compute global counts of each itemset in CGFI.


For all X ∈ CGFI do

{

If X.Sup = Σ i=1 to n X.Supi ≥ minsup × D then

{

GFI = GFI ⋃ {X};

}

}
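The central-site steps of the algorithm above can be sketched as follows; the site data, counts and names are hypothetical, and the code is a sketch of the technique rather than the paper's implementation. GFI starts as the itemsets locally frequent at every site, CGFI holds those frequent at only some sites, and a CGFI itemset is promoted once its summed global count reaches minsup × D.

```python
# A hedged sketch (site data and names are hypothetical) of the central-site
# steps of DDM Algorithm-1: GFI starts as the itemsets locally frequent at
# every site, CGFI as those frequent at only some sites, and a CGFI itemset
# is promoted to GFI once its summed global count reaches minsup * D.

# Locally frequent itemsets reported by n = 3 sites (toy data).
LFI = [
    {("a",), ("b",), ("a", "b")},
    {("a",), ("b",)},
    {("a",), ("b",), ("b", "c")},
]

GFI = set.intersection(*LFI)        # step 4: frequent at every site
CGFI = set.union(*LFI) - GFI        # step 4: frequent at some sites only

# Step 9: global support counts of the CGFI itemsets, summed over all sites
# (hypothetical counts), checked against minsup * D.
global_sup = {("a", "b"): 30, ("b", "c"): 12}
minsup, D = 0.25, 100

for X in list(CGFI):
    if global_sup[X] >= minsup * D:
        GFI.add(X)                  # promote to globally frequent
        CGFI.discard(X)

print(sorted(GFI))  # [('a',), ('a', 'b'), ('b',)]
```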

The basic objective is to reduce the time required to compute GFI. This algorithm performs two tasks in parallel:

1. Local sites send LFIs to central site and also to all their neighbors.

2. Calculation of GFI/CGFI at the central site and counting of CGFI at the local sites are done as overlapped operations. That is, local sites need not wait for the central site to send the CGFI. Thus the total time taken is reduced drastically.

Input: Distributed data sets DBi, i = 1 to n; minsup.

Output: Global Frequent Itemset (GFI).

1. Send a Mining Agent (MA) to all sites.

For i = 1 to n do

{

MA.send(Location = i, S = Support, Addresses of all distributed sites);

}

2. Each cooperative MAi computes LFIi in parallel.

3. Send LFIi to the central site and also to all its neighbors.

4. Compute GFI & CGFI at the central site.

GFI = ⋂ LFIi, i = 1 to n; CGFI = ⋃ LFIi - ⋂ LFIi, i = 1 to n.

5. Calculate PGFI and their counts at each site.

PGFIj = (All itemsets at site j) ⋂ LFIi, i = 1 to n, i ≠ j

6. Send the counts of PGFIi, i = 1 to n, to the central site from each infrequent site.

Note: Step 4 and Steps 5, 6 are performed in parallel.

7. Calculate GFI at the central site using the PGFI counts.

For all X ∈ CGFI do

{


If X.Sup = Σ i=1 to n X.Supi ≥ minsup × D then

{

GFI = GFI ⋃ {X};

}

}
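The PGFI step of the algorithm above can be sketched as follows; the helper name `pgfi` and the toy itemsets are assumptions. A site reports counts for itemsets that are not locally frequent there but are locally frequent at some other site, so the central site can finish the CGFI-to-GFI conversion without a second request round.

```python
# Illustrative sketch (the helper name and toy itemsets are assumptions) of
# the PGFI step: a site j reports counts for itemsets that are not locally
# frequent there, but are locally frequent at some other site, so the
# central site can finish the CGFI-to-GFI conversion without a second round.

def pgfi(candidate_itemsets, other_lfi):
    """Itemsets seen at this site that are locally frequent elsewhere."""
    frequent_elsewhere = set().union(*other_lfi)
    return candidate_itemsets & frequent_elsewhere

site_j_itemsets = {("a",), ("d",), ("b", "c")}    # all itemsets seen at site j
LFI_j = {("a",)}                                  # locally frequent at site j
other_LFI = [{("b", "c")}, {("a",), ("b", "c")}]  # LFI at the other sites

print(pgfi(site_j_itemsets - LFI_j, other_LFI))   # {('b', 'c')}
```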

The new algorithm performs the following tasks in parallel:

1. Mining the Local Frequent Itemsets (LFI), and the Possible Global Frequent Itemsets (PGFI) with counts for itemsets that are not part of the LFI, at each site in parallel, and sending them to the central site for the calculation of the Global Frequent Itemsets.

2. Calculating the Global Frequent Itemsets (GFI)/Candidate Global Frequent Itemsets (CGFI) and also finding the final Global Frequent Itemsets at the central site in parallel. The central site need not wait for the PGFI counts for the GFI calculation, so the total time taken is reduced [7].

Input: Distributed data sets DBi, i = 1 to n; minsup; distributed sites' addresses.

Output: Global Frequent Itemset (GFI).

1. Send a Mining Agent (MA) to all sites.

For i = 1 to n do

{

MA.send(Location = i, S = Support, Addresses of all distributed sites);

}

2. Each cooperative MAi computes LFIi in parallel, and PGFI with their counts at each site.

For i = 1 to n do

{

If locally frequent itemsets are present then compute LFIi

else

PGFIj = (All itemsets at site j) ⋂ LFIi, i = 1 to n, i ≠ j

}

3. Send LFIi to the central site, and also send the counts of PGFIi, i = 1 to n, to the central site from each infrequent site.

4. Compute GFI and CGFI at the central site. GFI = ⋂ LFIi, i = 1 to n; CGFI = ⋃ LFIi - ⋂ LFIi, i = 1 to n.

5. Calculate GFI at the central site using the PGFI counts.


For all X ∈ CGFI do

{

If X.Sup = Σ i=1 to n X.Supi ≥ minsup × D then

{

GFI = GFI ⋃ {X};

}

}

Note: Step 4 and Step 5 are performed in parallel.

DDM Algorithm-3 is improved because of the implementation of step-reduction methods and parallel programming when compared to the existing algorithms. This leads to a reduction of message passing among the nodes, which decreases both the execution time and the communication overhead.

The following notations are used for measuring the performance of the above DDM algorithms and also show the differentiation among the DDM algorithms [10].

Ts – Time required to send 'minsup' from the main site to all distributed sites.

Tc-LFI – Time required to calculate LFI at all distributed sites.

Ts-LFI-m – Time required to send LFI from all distributed sites to the central site.

Ts-LFI-n – Time required to send LFI from all distributed sites to their neighbors.

Tc-C/GFI-m – Time required to find GFI and CGFI at the central site.

Ts-CGFI-ds – Time required to send CGFI from the central site to all distributed sites.

Ts-c-CGFI-m – Time required to find the count of CGFI at each distributed site and send it to the central site.

Ts-PGFI-m – Time required to find and send PGFI and its count to the central site.

Tco-P/CGFI – Time required to convert CGFI to GFI using the counts of PGFI received from all distributed sites.

T1, T2 & T3 – Time required to find GFI at the central site using the existing DDM Algorithm-1, DDM Algorithm-2 and the proposed method respectively.

Total time required for calculating GFI using DDM algorithm-1, DDM algorithm-2 and DDM-Algorithm-3 are T1, T2 and T3 respectively.

T1 = Ts + Tc-LFI + Ts-LFI-m + Tc-C/GFI-m + Ts-CGFI-ds + Ts-c-CGFI-m + Tco-P/CGFI - (1)

T2 = Ts + Tc-LFI + Ts-LFI-m + Tc-C/GFI-m + Tco-P/CGFI - (2)

T3 = Ts + Tc-LFI + Ts-LFI-m + Tco-P/CGFI - (3)
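As a quick numeric illustration of the three equations, the sketch below plugs hypothetical component times into (1), (2) and (3); the values are assumptions chosen only to show how dropping synchronization steps shortens the total.

```python
# Hypothetical component times in milliseconds (assumed values, not
# measurements from the paper), plugged into equations (1)-(3).
Ts, Tc_LFI, Ts_LFI_m = 200, 1500, 400     # send minsup, mine LFI, send LFI
Tc_CGFI_m, Ts_CGFI_ds = 600, 300          # find GFI/CGFI, send CGFI out
Ts_c_CGFI_m, Tco_PCGFI = 500, 200         # count CGFI & return, convert via PGFI

T1 = Ts + Tc_LFI + Ts_LFI_m + Tc_CGFI_m + Ts_CGFI_ds + Ts_c_CGFI_m + Tco_PCGFI
T2 = Ts + Tc_LFI + Ts_LFI_m + Tc_CGFI_m + Tco_PCGFI
T3 = Ts + Tc_LFI + Ts_LFI_m + Tco_PCGFI

print(T1, T2, T3)  # 3700 2900 2300 -- each algorithm drops synchronization steps
assert T3 < T2 < T1
```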

From the above formulas, it is clear that T3 < T2 < T1, owing to fewer synchronization steps and more parallelism. It is evident that DDM Algorithm-3 requires less time to obtain the global knowledge.

These experimental results have been taken from real-time supermarket data sets using MATLAB programming. The following figures (1 and 2) show the comparison of the different algorithms in terms of speed and scale-up performance. As the percentage of the support threshold increases, the elapsed time in seconds of the algorithms decreases. DDMA-3 has a much lower elapsed time in seconds, as shown in Fig. 1, and its performance scalability is higher compared to the existing algorithms, as shown in Fig. 2. From this, DDMA-3 is an improved algorithm over DDMA-1 and DDMA-2.

Fig. 1 Comparison of different algorithms DDMA-1, DDMA-2 and DDMA-3

Fig. 2 Performance of Scale-Up

8. Conclusion

Association rule mining in distributed databases using mobile agent technology is an important aspect of the field of data mining. Our algorithm showed that it can reduce the computation overhead of discovering frequent itemsets in a distributed environment, and it is also scalable in terms of the required communication and computation costs. The algorithm can provide valuable information for decision-making, as demonstrated by the experimental results.

[1]. Manvi, S.S. and Venkataram, P. (2004). "Applications of agent technology in communication: a review". Computer Communications, pp. 1493-1508, Elsevier, Science Direct.

[2]. Agrawal, R. and Shafer, J. (1996). "Parallel mining of association rules". IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 962-969.

[3]. Cheung, D.W. et al. (1996). "A fast distributed algorithm for mining association rules". Proc. Parallel and Distributed Information Systems, IEEE CS Press, pp. 31-42.

[4]. Ashrafi, M.Z., Taniar, D. and Smith, K. (2004). "ODAM: An optimized distributed association rule mining algorithm". IEEE Distributed Systems Online, Vol. 5, No. 3.


[5]. Schuster, A. and Wolf, R. (2001). "Communication-efficient distributed mining of association rules". Proc. ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 473-484.

[6]. Schuster, A., Wolf, R. and Trock, D. (2005). "A high-performance distributed algorithm for mining association rules". Knowledge and Information Systems (KAIS) Journal, Vol. 7, No. 4.

[7]. Kavitha, S. and ArulAnandam, K.V. (2013). "Research on improved distributed data mining algorithm using mobile agent framework". International Journal of Scientific and Engineering Research (IJSER), Vol. 4, Issue 6, pp. 740-744, ISSN 2229-5518.

[8]. Wang, Y.-L., Li, Z.-Z. and Zhu, H.-P. (2003). "Mobile-agent-based distributed and incremental techniques for association rules". Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, IEEE.

[9]. Han, J., Pei, J., Yin, Y. and Mao, R. (2009). "Mining frequent patterns without candidate generation: a frequent-pattern tree approach". Data Mining and Knowledge Discovery, pp. 128-135.

[10]. Kulkarni, U.P., Desai, P.D., Ahmed, T., Vadavi, J.V. and Yardi, A.R. (2007). "Mobile agent based distributed mining". International Conference on Computational Intelligence and Multimedia Applications.
