Author Topic: Association Rule Mining on Distributed Data  (Read 2676 times)

IJSER Content Writer

Association Rule Mining on Distributed Data
« on: February 18, 2012, 01:50:32 am »
Author : Pallavi Dubey
International Journal of Scientific & Engineering Research Volume 3, Issue 1, January-2012
ISSN 2229-5518

Abstract - Applications that process large volumes of data face two major problems as the data grows: storing and managing it, and the time needed to process it. Distributed databases largely solve the first problem but make the second worse. Because this is an era of networking and communication, and people want to keep large data sets on networks, researchers are proposing various algorithms to increase the throughput of output data over distributed databases. In my research, I propose a new algorithm that processes large amounts of data at the various servers and collects on the client machine only as much of the processed data as the user requires. The data is kept in XML format, which allows further processing if needed.

A local copy of the searched data is provided to users who require it again. This makes it possible to set up a proxy server where frequently searched items are kept along with their access frequency. This not only provides fast access to the data but also maintains a list of frequently accessed data.

There are several methods for accessing data on the various servers, such as mobile agents, direct networked access, and client-server techniques. I have used a multithreaded environment to map the various distributed servers and collect data. At the server end, the Apriori algorithm is applied to the data, and its outputs are sent to the client. At the client, the data from the various servers is collected and a list of uncommon data is created, which is then converted into XML format. If the search is successful, the user may store the search locally or at the proxy server, which reduces the processing time of future searches for the same data. In this paper, an Optimized Distributed Association Rule Mining algorithm for geographically distributed data is used in a parallel and distributed environment so that it reduces communication costs. The response time is calculated in this environment using XML data.
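
The client-side flow described above (query several servers from threads, merge what they return, convert the result to XML) could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the server names, the `query_server` stand-in that reads from an in-memory dictionary instead of a network, the merge rule (summing support counts), and the XML element names. None of these come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for remote servers: each server name maps to the
# frequent item sets (with support counts) that the server mined locally.
SERVER_RESULTS = {
    "server_a": {("bread", "milk"): 3, ("milk",): 5},
    "server_b": {("bread", "milk"): 2, ("eggs",): 4},
}

def query_server(name):
    """Pretend to contact one server and return its locally mined item sets."""
    return SERVER_RESULTS[name]

def collect(servers):
    """Query all servers from worker threads and merge their support counts."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        for result in pool.map(query_server, servers):
            for itemset, count in result.items():
                # Assumed merge rule: add up the counts reported per item set.
                merged[itemset] = merged.get(itemset, 0) + count
    return merged

def to_xml(merged):
    """Serialize the merged item sets to an XML string (made-up element names)."""
    root = ET.Element("itemsets")
    for itemset, count in sorted(merged.items()):
        node = ET.SubElement(root, "itemset", support=str(count))
        for item in itemset:
            ET.SubElement(node, "item").text = item
    return ET.tostring(root, encoding="unicode")

merged = collect(["server_a", "server_b"])
xml_text = to_xml(merged)
```

A thread pool is used here only because the abstract mentions a multithreaded environment; in a real deployment `query_server` would perform network I/O, which is where threads pay off.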

Keywords - Association rules, Apriori algorithm, parallel and distributed data mining, Multiprocessing Environment, XML data, response time.

1. INTRODUCTION
Association rule mining (ARM) has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers. ARM is an undirected, or unsupervised, data mining technique that works on variable-length data and produces clear and understandable results. Two dominant approaches for utilizing multiple processors have emerged: distributed memory, in which each processor has a private memory, and shared memory, in which all processors access a common memory [5]. Shared memory architecture has many desirable properties. Each processor has direct and equal access to all memory in the system, and parallel programs are easy to implement on such a system. In distributed memory architecture, each processor has its own local memory that can be accessed directly only by that processor [10]. For a processor to access data in the local memory of another processor, a copy of the desired data element must be sent from one processor to the other through message passing. XML data are used with the Optimized Distributed Association Rule Mining Algorithm.
A parallel application can be divided into a number of tasks that execute concurrently on different processors in the system [9]. However, the performance of a parallel application on a distributed system depends mainly on how the tasks comprising the application are allocated to the available processors.
Analyzing different kinds of databases, such as scientific, medical, financial, and marketing transaction data, and finding the critical hidden information in them has been a focus of data mining researchers. For effectively analyzing and applying these data, data mining has been the most widely discussed and frequently applied tool in recent decades. Although data mining has been applied successfully in scientific analysis, business applications, and medical research, and its computational efficiency and accuracy are improving, manual work is still required to complete the extraction process. Among the several data mining models, which include association rules, clustering, and classification, association rule mining is the most widely applied. The Apriori algorithm is the most representative algorithm for association rule mining, and many modified algorithms build on it to improve its efficiency and accuracy. For the purpose of simulation, I have employed a database of industries to assess the proposed algorithm.
The rest of this study is organized as follows. Section 2 briefly presents the general background, and the proposed method is explained in Section 3. Sections 4 and 5 illustrate the computational results on the industry database. The concluding remarks are made in Section 6.

2. LITERATURE REVIEW
Association Rule Mining: In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It analyzes and presents strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.

For example, a rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is also likely to buy a burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement. In addition to this example from market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection, and bioinformatics. Three parallel algorithms for mining association rules, an important data mining problem, are formulated in [3]. These algorithms were designed to investigate and understand the performance implications of a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information in parallel data mining [11]. Fast Distributed Mining of association rules generates a small number of candidate sets and substantially reduces the number of messages passed when mining association rules [4].

Algorithms for mining association rules from relational data have been well developed. Several query languages, such as [12] and [13], have been proposed to assist association rule mining. The topic of mining XML data has received little attention, as the data mining community has focused on the development of techniques for extracting common structure from heterogeneous XML data. For instance, [14] has proposed an algorithm to construct a frequent tree by finding common subtrees embedded in the heterogeneous XML data. On the other hand, some researchers focus on developing a standard model to represent the knowledge extracted from the data using XML. JAM [15] has been developed to gather information from sparse data sources and induce a global classification model. The PADMA system [16] is a document analysis tool that works in a distributed environment and is based on cooperative agents. It works without any relational database underneath. Instead, PADMA agents perform several relational operations on the information extracted from the documents.
 
ASSOCIATION RULE MINING ALGORITHMS
An association rule is a rule that implies certain association relationships among a set of objects in a database (such as "occur together" or "one implies the other"). Given a set of transactions, where each transaction is a set of literals (called items), an association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database that contain X tend to contain Y. Association rule mining (ARM) is one of the data mining techniques used to extract hidden knowledge from data sets, which an organization's decision makers can use to improve overall profit [2].
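
To make "transactions that contain X tend to contain Y" concrete, the sketch below measures a rule on a handful of toy transactions using the two quantities usually called support and confidence. The transactions are invented for illustration (they reuse the onions, potatoes, and burger example from the literature review) and are not data from the paper; the helper names `support` and `confidence` are likewise only illustrative.

```python
# Toy transactions invented for illustration; not data from the paper.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"burger", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

x, y = {"onions", "potatoes"}, {"burger"}
rule_support = support(x | y, transactions)        # 2 of the 5 transactions
rule_confidence = confidence(x, y, transactions)   # 2 of the 3 that contain X
```

Here the rule from onions and potatoes to burger covers two of five transactions, and two of the three baskets containing onions and potatoes also contain a burger.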

2.1 Apriori Algorithm
Apriori, an association rule mining algorithm, was developed by IBM's Quest project team for rule mining in large transaction databases [4]. An item set is a non-empty set of items.
They decomposed the problem of mining association rules into two parts:
1.   Find all combinations of items that have transaction support above minimum support. Call those combinations frequent item sets.
2.   Use the frequent item sets to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent item sets, then we can determine whether the rule AB ⇒ CD holds by computing the ratio
                     r = support(ABCD) / support(AB).
The rule holds only if r >= minimum confidence. Note that the rule will have minimum support because ABCD is frequent. The algorithm is highly scalable [8].
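
The two-part decomposition above can be sketched in Python as follows. This is a minimal textbook-style illustration, not the Quest implementation and not the paper's optimized distributed algorithm; the function names, the toy transactions, and the thresholds are all assumptions chosen so that the AB and ABCD example from the text appears in the output.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Part 1: level-wise search for item sets with support >= min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(kept)
        # Join frequent k-item sets into (k+1)-item candidates and keep only
        # those whose every k-item subset is frequent (the Apriori pruning).
        k = len(next(iter(kept))) + 1 if kept else 0
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_confidence):
    """Part 2: for each frequent item set, emit X => Y with r >= min_confidence."""
    out = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for x in map(frozenset, combinations(itemset, size)):
                r = sup / frequent[x]  # r = support(X and Y together) / support(X)
                if r >= min_confidence:
                    out.append((set(x), set(itemset - x), r))
    return out

# Toy transactions over items A-D, invented for illustration.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"A", "B", "C"},
    {"C", "D"},
]
freq = frequent_itemsets(transactions, min_support=0.4)
found = rules(freq, min_confidence=0.5)
```

With these toy transactions, support(ABCD) is 2/5 and support(AB) is 4/5, so the ratio r for the rule from AB to CD is 0.5. The lookup `frequent[x]` in part 2 is safe because every subset of a frequent item set is itself frequent and was therefore found in part 1.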
The Apriori algorithm used in Quest for finding all frequent item sets is given below.

Read More: Click here...