International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011 1

ISSN 2229-5518

Evolving Data Mining Algorithms on the Prevailing Crime Trend – An Intelligent Crime Prediction Model

A. Malathi and Dr. S. Santhosh Baboo

Abstract— Crime is a deviation from socially accepted norms of behavior that causes loss and harm to people. Crimes are a social nuisance and cost our society dearly in several ways. Crime prevention is a significant issue that people have been dealing with for centuries. In this paper we look at the use of missing value (MV) handling and clustering algorithms for crime data using data mining. We examine an MV algorithm and the Apriori algorithm, with some enhancements, to aid in the process of filling missing values and identifying crime patterns. We applied these techniques to real crime data. We also use a semi-supervised learning technique for knowledge discovery from the crime records and to help increase the predictive accuracy.

Index Terms— Crime-patterns, clustering, data mining, law-enforcement, Apriori.
—————————— • ——————————

1 INTRODUCTION

Crime is a behavior disorder that is an integrated result of social, economic and environmental factors. Crimes are a social nuisance and cost our society in several ways. Crime analysis is gaining significance worldwide, and one of its most popular disciplines is crime prediction. Stakeholders in crime prevention want to forecast the place, time, number and types of crimes in order to take precautions. With these intentions in mind, a crime prediction model is developed in this paper.

Today, security is considered one of the major concerns, and the issue continues to grow in intensity and complexity. Security is given top priority by political leaders and governments worldwide, who aim to reduce crime incidence [5]. In light of serious incidents such as the September 11, 2001 attacks, the Indian Parliament attack of 2001 and the Taj Hotel attack of 2008, and amid growing concerns about theft, arms trafficking and murder, the importance of crime analysis based on historical data is growing. Law enforcement agencies are actively collecting domestic and foreign intelligence to prevent future attacks.

The model is generated using crime data for the years 2006 to 2010. The methodology starts by obtaining clusters with different clustering algorithms. The clustering methods are then compared to select the most appropriate one.

• A. Malathi, Assistant Professor, PG and Research Department of Computer Science, Government Arts College, Coimbatore. She is currently pursuing a doctorate program in the Research and Development Centre, Bharathiar University, India, PH-09942526000.

E-mail: malathi.arunachalam@yahoo.com

• Dr. S. Santhosh Baboo, Reader, Post Graduate and Research Department of Computer Science, D. G. Vaishnav College, Chennai, India. PH-0999.

E-mail: santhos2001@sify.com

Later, the crime data is divided into daily epochs to observe the spatiotemporal distribution of crime. To predict crime in the time dimension, a model is fitted for each weekday; the forecasted crime occurrences in time are then disaggregated according to the spatial crime cluster patterns. Hence the model proposed in this paper can give crime predictions in both space and time to help police departments in tactical and planning operations.
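The disaggregation step described above can be sketched as follows. This is only an illustration under stated assumptions: the forecast for a weekday is taken to be a single expected count, and each spatial cluster is assumed to receive a share proportional to its historical crime count. The function name, cluster names and numbers are hypothetical and are not taken from the dataset used in this paper.

```python
def disaggregate(forecast_total, cluster_history):
    """Split a forecast count across spatial clusters in proportion
    to each cluster's historical share of crimes (assumed rule)."""
    total = sum(cluster_history.values())
    if total == 0:
        raise ValueError("no historical crimes to derive shares from")
    return {name: forecast_total * count / total
            for name, count in cluster_history.items()}


# Illustrative numbers only (hypothetical, not from the paper).
history = {"cluster_A": 60, "cluster_B": 30, "cluster_C": 10}
shares = disaggregate(20.0, history)
for name, expected in shares.items():
    print(name, expected)
```

The shares always sum to the forecast total, so the temporal forecast is preserved while the spatial pattern comes from the clusters.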

The high volume of crime datasets and the complexity of the relationships between these kinds of data have made criminology an appropriate field for applying data mining techniques. Identifying crime characteristics is the first step for developing further analysis. The knowledge gained from data mining approaches is a very useful tool that can help and support police forces [8]. According to [9], solving crimes is a complex task that requires human intelligence and experience, and data mining is a technique that can assist investigators with crime detection problems. The idea here is to capture years of human experience in computer models via data mining.
In the present scenario, criminals are becoming technologically sophisticated in committing crimes [1]. Therefore, the police need a crime analysis tool to catch criminals and to remain ahead in the eternal race between criminals and law enforcement. The police should use current technologies [4] to give themselves the much-needed edge. Availability of relevant and timely information is of utmost necessity in the daily business and activities of the police, particularly in crime investigation and the detection of criminals. Police organizations everywhere handle a large amount of such information and a huge volume of records. There is an urgent need to analyze the increasing number of crimes, which amounts to approximately 17 lakh Indian Penal Code (IPC) crimes and 38 lakh local and Special Law crimes per year.
An ideal crime analysis tool should be able to identify crime patterns quickly and efficiently for future crime pattern detection and action. However, in the present scenario, the following major challenges are encountered.
• Increase in the size of crime information that has to be stored and analyzed.
• Difficulty of identifying techniques that can accurately and efficiently analyze these growing volumes of crime data.
• Different methods and structures used for recording crime data.
• The available data is inconsistent and incomplete, which makes formal analysis far more difficult.
• Investigation of a crime takes longer due to the complexity of the issues involved.
All of the above challenges motivated this research work to focus on providing solutions that can enhance the process of crime analysis for identifying and reducing crime in India. The main aim of this research work is to develop analytical data mining methods that can systematically address the complex problems related to various forms of crime. Thus, the main focus is to develop a crime analysis tool that assists the police in

o Detecting crime patterns and performing crime analysis

o Providing information to formulate strategies for crime prevention and reduction

o Identifying and analyzing common crime patterns to reduce further occurrences of similar incidents

The present research work proposes the use of an amalgamation of data mining techniques that are linked by the common aim of developing such a crime analysis tool. For this purpose, the following specific objectives were formulated.

o To develop a data cleaning algorithm that

• cleans the crime dataset by removing unwanted data
• fills missing values in an efficient manner

o To explore and enhance clustering algorithms to identify crime patterns from historical data

o To explore and enhance classification algorithms to predict future crime behaviour based on previous crime trends

o To develop anomaly detection algorithms to identify changes in crime patterns
These techniques do not have a set of predefined classes for assigning items. Some researchers use the statistics-based concept space algorithm to automatically associate different objects such as persons, organizations and vehicles in crime records [7]. Using link analysis techniques to identify similar transactions, the Financial Crimes Enforcement Network AI System [10] exploits Bank Secrecy Act data to support the detection and analysis of money laundering and other financial crimes. Clustering crime incidents can automate a major part of crime analysis but is limited by the high computational intensity typically required.

2 LITERATURE REVIEW

Data mining in the study and analysis of criminology can be categorized into two main areas: crime control and crime suppression. Crime control uses knowledge from the analyzed data to control and prevent the occurrence of crime, while crime suppression tries to catch a criminal by using his or her recorded history.
Brown [3] constructed a software framework called ReCAP (Regional Crime Analysis Program) for mining data in order to catch professional criminals using data mining and data fusion techniques. Data fusion was used to manage, fuse and interpret information from multiple sources. The main purpose was to overcome confusion from conflicting reports and cluttered or noisy backgrounds. Data mining was used to automatically discover patterns and relationships in large databases.
Crime detection and prevention techniques are applied to applications ranging from cross-border security and Internet security to household crimes. Abraham and de Vel [2] proposed a method that employs computer log files as history data to search for relationships using the frequency of occurrence of incidents. They then analyzed the results to produce profiles, which can be used to understand the behavior of criminals.

De Bruin et al. [6] introduced a framework for crime trends using a new distance measure for comparing all individuals based on their profiles and then clustering them accordingly. This method also provided a visual clustering of criminal careers and identification of classes of criminals.
From the literature study, it can be concluded that crime data is growing to very large quantities, running into yottabytes (10^24 bytes). This in turn increases the need for advanced and efficient analysis techniques. Data mining, as an analysis and knowledge discovery tool, has immense potential for crime data analysis. As with any other new technology, the requirements for such a tool keep changing, a trend reinforced by the new and advanced technologies used by criminals. All these facts confirm that the field is not yet mature and needs further investigation.

3 PREPROCESSING

Data preprocessing consists of data cleaning, data integration and data transformation, and is usually carried out by a computer program. It is intended to reduce noisy, incomplete and inconsistent data. The results of the preprocessing step can later be processed by data mining algorithms.
The dataset used in the experiment contains items such as year, state code, status of the administrative unit, name of the administrative unit, number of crimes with respect to murder, dacoity, riots and arson, area of the administrative unit in sq. meters, Estimated Mid-Year Population of the Administrative Unit in 1000s (begins in 1964), Actual Civil Police Strength (number of personnel), Actual Armed Police Strength (number of personnel) and Total Police Strength (Civil and Armed Police).

3.1 Missing value handling

The experiment concentrates on only those attributes that are related to crime data, that is, year, state, administrative unit name and number of crimes, for the years 1971 to 2006. The quality of the results of the mining process is directly proportional to the quality of the preprocessed data. Careful scrutiny revealed that the dataset has missing data in the state and number-of-crimes attributes.

3.2 Missing value handling for number of crimes occurred attribute

In the present research work, two methods were used to fill the missing number of crimes related to murder, dacoity, riots and arson. Initially, all four fields are checked for empty values. If all four attributes are empty for a particular record, the entire record is considered irrelevant and is deleted.
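A minimal sketch of the record-deletion rule described above, assuming each record is a dictionary and that a missing count is stored as None. The field names and sample values are hypothetical.

```python
# Hypothetical field names for the four crime-count attributes.
CRIME_FIELDS = ("murder", "dacoity", "riots", "arson")


def drop_empty_records(records):
    """Keep a record only if at least one crime count is present."""
    return [r for r in records
            if any(r.get(field) is not None for field in CRIME_FIELDS)]


# Illustrative records only (not from the paper's dataset).
records = [
    {"year": 1971, "murder": 12, "dacoity": None, "riots": 3, "arson": None},
    {"year": 1972, "murder": None, "dacoity": None, "riots": None, "arson": None},
]
cleaned = drop_empty_records(records)
print(len(cleaned))
```

Records with only some of the four counts missing are kept, so that the remaining gaps can be imputed afterwards.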
While taking individual attributes into consideration, a novel KNN-based imputation method is proposed. In this method, the missing values of an instance are imputed by considering a given number of instances that are most similar to the instance of interest. The similarity of two instances is determined using a distance function.
The new algorithm is as follows:
1. Divide the data set D into two parts. Let Dm be the set containing the instances in which at least one of the features is missing. The remaining instances, which have complete feature information, form a set called Dc.
2. For each vector x in Dm:
a. Divide the instance vector into observed and missing parts as x = [xo, xm].
b. Calculate the distance between xo and all the instance vectors from the set Dc.
c. In this distance, use only those features of the instance vectors from the complete set Dc that are observed in the vector x.
d. Use the P closest instance vectors and perform a majority-voting estimate of the missing values for categorical attributes. For continuous attributes, replace the missing value with the mean value of the attribute over the P closest instances.
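The steps above can be sketched in plain Python as follows. This is an illustration of ordinary KNN imputation, not the enhanced method of this paper: it assumes missing values are stored as None, uses Euclidean distance on numeric features and a 0/1 mismatch on categorical ones, and the sample data and the value of p are hypothetical.

```python
import math
from collections import Counter


def _distance(x, y, observed):
    """Distance over the observed features of x only (step c)."""
    total = 0.0
    for i in observed:
        if isinstance(x[i], (int, float)):
            total += (x[i] - y[i]) ** 2
        else:
            total += 0.0 if x[i] == y[i] else 1.0
    return math.sqrt(total)


def knn_impute(data, p=2):
    """Fill None entries from the p nearest complete instances."""
    complete = [row for row in data if None not in row]  # the set Dc
    if not complete:
        raise ValueError("no complete instances to impute from")
    filled = []
    for row in data:
        missing = [i for i, v in enumerate(row) if v is None]
        if not missing:
            filled.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]
        neighbours = sorted(
            complete, key=lambda c: _distance(row, c, observed))[:p]
        new_row = list(row)
        for i in missing:
            values = [n[i] for n in neighbours]
            if all(isinstance(v, (int, float)) for v in values):
                new_row[i] = sum(values) / len(values)  # continuous: mean
            else:
                new_row[i] = Counter(values).most_common(1)[0][0]  # vote
        filled.append(new_row)
    return filled


# Hypothetical rows: [murders, riots, crime zone]
data = [
    [10, 4, "low"],
    [12, 6, "low"],
    [50, 30, "high"],
    [52, 33, "high"],
    [11, None, "low"],
    [48, 28, None],
]
filled = knn_impute(data, p=2)
print(filled[4], filled[5])
```

Sorting all complete instances for every incomplete one is the exhaustive search whose cost motivates the speed-up discussed in the text.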
The decisions that have to be made carefully are:
(i) The choice of the distance function. In the present work, four distance measures, Euclidean, Manhattan, Mahalanobis and Pearson, are considered, and the one that produced the best result is used.
(ii) The search cost. The KNN algorithm searches through the whole dataset looking for the most similar instances. This is a very time-consuming process and can be critical in data mining, where large databases are analyzed. To speed up this process, a method that combines the missing value handling process with classification is proposed.
(iii) The choice of k, the number of neighbors. Experiments showed that a value of 10 produces the best results in terms of accuracy, and hence it is used in further experimentation.
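The four candidate distance measures named in item (i) can be sketched as follows, using their textbook definitions. The Pearson distance is taken here as one minus the correlation coefficient, which is one common convention and an assumption on our part; the helper names are illustrative.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """1 - Pearson correlation (assumed convention)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def covariance(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def mahalanobis(x, y, cov):
    """sqrt(d^T S^-1 d) for d = x - y and covariance matrix S."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))


print(euclidean([0, 0], [3, 4]))   # 5.0
print(manhattan([0, 0], [3, 4]))   # 7
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a simple sanity check. The covariance matrix must be non-singular for solve to work.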
Thus, the traditional KNN imputation method was enhanced in two ways. The first enhancement is achieved by proposing a new distance metric, and the second by using LVQ (Learning Vector Quantization) methods combined with generalized relevance learning to perform classification and missing value treatment simultaneously. Combined, these enhancements produce a model (E-KDD) that is efficient in terms of speed and accuracy.
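For readers unfamiliar with LVQ, the basic LVQ1 update rule can be sketched as follows. This is plain LVQ1 with squared Euclidean distance, not the generalized relevance variant and not the E-KDD model; it shows only the classification part, and the prototypes, learning rate and sample points are hypothetical.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda k: sum((p - a) ** 2
                                 for p, a in zip(prototypes[k][0], x)))


def train_lvq1(samples, prototypes, rate=0.1, epochs=20):
    """samples and prototypes are lists of (vector, label) pairs."""
    protos = [(list(vec), label) for vec, label in prototypes]
    for _ in range(epochs):
        for x, label in samples:
            vec, proto_label = protos[nearest(protos, x)]
            # Attract the winner if labels agree, repel it otherwise.
            sign = 1.0 if proto_label == label else -1.0
            for i in range(len(vec)):
                vec[i] += sign * rate * (x[i] - vec[i])
    return protos


def classify(protos, x):
    return protos[nearest(protos, x)][1]


# Illustrative two-class data (hypothetical).
samples = [([1.0, 1.0], "low"), ([1.2, 0.8], "low"),
           ([5.0, 5.0], "high"), ([4.8, 5.2], "high")]
initial = [([0.0, 0.0], "low"), ([6.0, 6.0], "high")]
protos = train_lvq1(samples, initial)
print(classify(protos, [1.1, 0.9]), classify(protos, [5.1, 4.9]))
```

Because classification only compares a query with a handful of prototypes rather than with every stored instance, this kind of model avoids the exhaustive neighbour search of plain KNN.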

3.3 Missing value handling in the prediction of the population size of the city

The first task is the prediction of the size of the population of a city. The calculation of per capita crime statistics helps to put crime statistics into proportion. However, some of the records were missing one or more values. Worse yet, half the time the missing value was the "city population size", which means there were no per capita statistics for the entire record. Some of the cities did not report any population data for any of their records. To improve the calculation of "yearly average per capita crime rates", and to ensure the detection of all "per capita outliers", it was necessary to fill in the missing values. The basic approach was to cluster population sizes, create classes from the clusters, and then classify records with unknown population sizes. The justification for using clustering is that classes derived from clusters are more likely to represent the actual population sizes of the cities. The only value needed to cluster population sizes was the population size of each record. These values were clustered using the EM algorithm, and initially 10 clusters were chosen because this produced clusters whose mean values would give per capita calculations close to the actual values.
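The clustering of population sizes with EM can be sketched as a one-dimensional Gaussian mixture, shown below in plain Python. This is a generic EM sketch under our own assumptions (deterministic initialization, a variance floor, two components instead of the 10 used in the paper), and the population figures are hypothetical.

```python
import math


def em_1d(values, k, iterations=50):
    """Fit a k-component 1-D Gaussian mixture; returns means, variances, weights."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: means spread evenly over the sorted data.
    means = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    grand_mean = sum(data) / n
    spread = sum((v - grand_mean) ** 2 for v in data) / n
    variances = [spread if spread > 0 else 1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in data:
            dens = [w * math.exp(-(v - m) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)
                    for m, s, w in zip(means, variances, weights)]
            total = sum(dens)
            if total == 0.0:  # numerical underflow: use nearest component
                near = min(range(k), key=lambda j: abs(v - means[j]))
                resp.append([1.0 if j == near else 0.0 for j in range(k)])
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, data)) / nj
            var = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids division by zero
    return means, variances, weights


def assign_class(value, means, variances, weights):
    """Index of the most probable component for a population value."""
    scores = [w * math.exp(-(value - m) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)
              for m, s, w in zip(means, variances, weights)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical city populations in 1000s.
populations = [48, 50, 52, 55, 480, 500, 510, 530]
means, variances, weights = em_1d(populations, k=2)
print(sorted(means))
```

Each fitted component then serves as a population class, and the component mean can stand in for the unknown population when a per capita rate is computed.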

4 CRIME PREDICTION MODEL

Given a set of objects, clustering is the process of class discovery, where the objects are grouped into clusters and the classes are unknown beforehand. Two clustering techniques, K-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), are considered for this purpose. The algorithm for k-means is given below.
The HYB algorithm, which clusters the data into m groups, where m is predefined, is given below.
Input – Crime type, Number of Clusters, Number of Iterations
The initial seeds may play an important role in the final result.
Step 1: Randomly choose cluster centers
Step 2: Assign instances to clusters based on their distance to the cluster centers
Step 3: Adjust the centers of the clusters
Step 4: Go to Step 2 until convergence
Step 5: Output C0, C1, C2, C3
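The listed steps follow the usual centroid-based iteration (choose centers, assign, adjust, repeat), which can be sketched as below. This is standard k-means in plain Python; any additional enhancements that distinguish HYB are not described in the steps and are therefore not reproduced. The seed, function name and sample points are hypothetical.

```python
import random


def kmeans(points, m, iterations=100, seed=0):
    """Cluster points into m groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Randomly choose initial cluster centers from the data.
    centers = [list(p) for p in rng.sample(points, m)]
    assignment = None
    for _ in range(iterations):
        # Assign each instance to its nearest center.
        new_assignment = [
            min(range(m),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(m):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Illustrative points only (hypothetical crime-rate pairs).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(points, m=2)
print(assignment)
```

Changing the seed changes the initial centers, which is why the choice of seeds can influence the final clusters on less clearly separated data.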
From the clustering result, the city crime trend for each type of crime was identified for each year. Further, by slightly modifying the clustering seed, the various states were grouped into high crime zones, medium crime zones and low crime zones. From these homogeneous groups, the efficiencies of police administration units, i.e. states, can be measured using the method given below.
Output Function of Crime Rate = 1/Crime Rate
Here, the crime rate is obtained by dividing the total crime density of the state by the total population of that state. The police of a state are called efficient if its crime rate is low, i.e. if the output function of crime rate is high.
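As a small worked example of the output function, the sketch below ranks states by efficiency. The state names and figures are hypothetical and serve only to show the arithmetic.

```python
def crime_rate(crime_density, population):
    """Crime rate = total crime density / total population."""
    return crime_density / population


def output_function(rate):
    """Efficiency measure: the reciprocal of the crime rate."""
    return 1.0 / rate


# Hypothetical (crime density, population) figures per state.
states = {"State A": (50, 1000), "State B": (200, 1000)}
efficiency = {name: output_function(crime_rate(c, p))
              for name, (c, p) in states.items()}
ranking = sorted(efficiency, key=efficiency.get, reverse=True)
print(ranking)
```

State A has the lower crime rate (0.05 against 0.2) and therefore the higher output value, so it is ranked as the more efficient unit.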
Thus the two clustering techniques were analyzed with respect to their efficiency in forming accurate clusters, speed of creating clusters, efficiency in identifying crime trends, identification of crime zones, crime density of a state and efficiency of a state in controlling the crime rate. Experimental results showed that the HYB algorithm gives improved results when compared with the k-means algorithm, and it was therefore used in further investigations.

Crime Trend Prediction

The next task is the prediction of future crime trends. This involves tracking crime rate changes from one year to the next and using data mining to project those changes into the future. The basic method involves clustering the states having the same crime trend and then using "next year" cluster information to classify records. This is combined with state poverty data to create a classifier that will predict future crime trends.
There are many categories of crime, such as crime against women, property crime and road accidents. The major crimes under property crime are discussed here.
• Murder
• Murder for Gain
• Dacoity
• Robbery
• Burglary
• Theft
A classification algorithm was applied to the clustered results to predict the future crime pattern. The classification was performed to find which category a cluster would be in the next year. This allows us to build a predictive model that predicts next year's records using this year's data. The C4.5 decision tree algorithm was used for this purpose. The generalized tree was used to predict the unknown crime trend for the next year. Experimental results showed that the technique used for prediction is accurate and fast. The following four clusters were produced, depending on the nature of the crime:
• C0: Crime is steady or dropping. Theft is the primary crime; it increased slightly and then dropped.
• C1: Crime is rising or in flux. Dacoity is the primary crime whose rate is changing.
• C2: Crime is generally increasing. Robbery, murder, murder for gain and burglary are the primary crimes on the rise.
• C3: Few crimes are in flux. Dacoity is in flux: it went down, increased, and then went down again.
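For the classification step described above, C4.5 selects the attribute to split on by gain ratio. The sketch below computes that criterion for a toy table in which this year's cluster and a poverty level are used to predict next year's cluster. It illustrates only the split criterion, not a full decision tree, and the attribute names and rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, attr, target):
    """Information gain of splitting on attr divided by the split information."""
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical training rows: this year's cluster and poverty level,
# with next year's cluster as the class to predict.
rows = [
    {"cluster": "C0", "poverty": "low", "next": "C0"},
    {"cluster": "C0", "poverty": "high", "next": "C0"},
    {"cluster": "C2", "poverty": "low", "next": "C2"},
    {"cluster": "C2", "poverty": "high", "next": "C2"},
]
for attr in ("cluster", "poverty"):
    print(attr, gain_ratio(rows, attr, "next"))
```

In this toy table the current cluster fully determines next year's cluster, so it receives the highest possible gain ratio, while the poverty level carries no information and scores zero.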

5 IMPLEMENTATION

Two major crimes, burglary and murder, were taken to analyse the existing crime. Burglary had been increasing; it decreased in 2006 and then kept increasing until 2010. Murder kept increasing from 2006 to 2010. The sample crimes burglary and murder belong to cluster C2.

Fig. 1. Crime Burglary Analysis

Fig. 2. Crime Murder Analysis
Murder was taken to analyse future crime prediction. This crime was analysed for the period 2006 to 2009. Both the existing algorithm and the new algorithm were executed on the same data set. The existing algorithm predicted the crime with 83% accuracy, and the new algorithm with 89% accuracy.

6 CONCLUSION

A major challenge facing all law-enforcement and intelligence-gathering organizations is accurately and efficiently analyzing the growing volumes of crime data. As information science and technology progress, sophisticated data mining and artificial intelligence tools are increasingly accessible to the law enforcement community. These techniques, combined with state-of-the-art computers, can process thousands of instructions in seconds, saving precious time. In addition, installing and running software often costs less than hiring and training personnel. Computers are also less prone to errors than human investigators, especially those who work long hours.

This research work focuses on developing a crime analysis tool for the Indian scenario, using different data mining techniques that can help law enforcement departments handle crime investigation efficiently. The proposed tool enables agencies to easily and economically clean, characterize and analyze crime data to identify actionable patterns and trends. Applied to crime data, it can serve as a knowledge discovery tool for reviewing extremely large datasets, and it incorporates a vast array of methods for accurate handling of security issues.

The development of the crime analysis tool has four steps, namely data cleaning, clustering, classification and outlier detection. The data cleaning stage removes unwanted records and predicts missing values. The clustering technique is used to group data according to the different types of crime. From the clustered results it is easy to identify the crime trend over the years, which can be used to design precautionary methods for the future. The classification of data is mainly used to predict future crime trends. The last step is mainly used to identify newly emerging crimes by applying outlier detection to crime data.

Experimental results show that the tool is effective in terms of analysis speed, identification of common crime patterns and future prediction. The developed tool has promising value in the current changing crime scenario and can be used as an effective tool by Indian police and law enforcement organizations for crime detection and prevention.

REFERENCES

1. Amarnathan, L.C. (2003) Technological Advancement: Implications for Crime, The Indian Police Journal, April-June.

2. Abraham, T. and de Vel, O. (2002) Investigative profiling with computer forensic log data and association rules, in Proceedings of the IEEE International Conference on Data Mining (ICDM'02), Pp. 11-18.

3. Brown, D.E. (1998) The regional crime analysis program (RECAP): A framework for mining data to catch criminals, in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3, Pp. 2848-2853.

4. Corcoran, J.J., Wilson, I.D. and Ware, J.A. (2003) Predicting the geo-temporal variations of crime and disorder, International Journal of Forecasting, Vol. 19, Pp. 623-634.

5. David, G. (2006) Globalization and International Security: Have the Rules of the Game Changed?, Annual meeting of the International Studies Association, California, USA, http://www.allacademic.com/meta/p98627_index.html.

6. de Bruin, J.S., Cocx, T.K., Kosters, W.A., Laros, J. and Kok, J.N. (2006) Data mining approaches to criminal career analysis, in Proceedings of the Sixth International Conference on Data Mining (ICDM'06), Pp. 171-177.

7. Hauck, R.V., Atabakhsh, H., Ongvasith, P., Gupta, H. and Chen, H. (2002) Using Coplink to Analyze Criminal-Justice Data, Computer, Vol. 35, No. 3, Pp. 30-37.

8. Keyvanpour, M.R., Javideh, M. and Ebrahimi, M.R. (2010) Detecting and investigating crime by means of data mining: a general crime matching framework, Procedia Computer Science, World Conference on Information Technology, Elsevier B.V., Vol. 3, Pp. 872-880.

9. Nath, S. (2007) Crime data mining, Advances and innovations in systems, K. Elleithy (ed.), Computing Sciences and Software Engineering, Pp. 405-409.

10. Senator, T.E., Goldberg, H.G., Wooton, J., Cottini, M.A., Khan, A.F.U., Klinger, C.D., Llamas, W.M., Marrone, M.P. and Wong, R.W.H. (1995) The FinCEN Artificial Intelligence System: Identifying Potential Money Laundering from Reports of Large Cash Transactions, AI Magazine, Vol. 16, No. 4, Pp. 21-39.