Topic: A DKIM based Architecture for Combating Good Word Attack in Statistical Spam Fil

Author : Kashefa Kowser.K, Saruladha.K, Packiavathy.M
International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011
ISSN 2229-5518

Abstract— Abuse of E-Mail by unwanted users causes an exponential increase of E-Mails in user mailboxes, which is known as spam. Spam, whether unsolicited commercial E-Mail or unsolicited bulk E-Mail, causes huge economic loss to large-scale organizations through high network bandwidth consumption and heavy mail server processing load. Statistical spam filters can categorize incoming E-Mails as legitimate or spam, but they are vulnerable to the Good Word attack, which inserts “good words” into spam messages to make them appear legitimate. This paper proposes a counterattack strategy against the insertion of good words: an architecture based on enhanced DKIM (DomainKeys Identified Mail). Our experimental results show that DKIM performs best because it incorporates sender evidence with random values in the E-Mail messages, which makes it difficult for spammers to evade the E-Mail filtering process. Misclassifying spam E-Mail as legitimate E-Mail reduces the performance of text classifiers. With DKIM, the misclassification percentage decreases as the number of E-Mails increases.
Index Terms— spam filtering, good word attack, DomainKeys Identified Mail (DKIM)

1   INTRODUCTION
Statistical spam filters use machine learning techniques to automatically sort texts into categories from a predefined set. These techniques are broadly classified into reinforcement learning, supervised learning, semi-supervised learning and unsupervised learning, and the learning method differs for each. In supervised learning the training data are labeled, unsupervised learning trains on unlabeled data, semi-supervised learning uses both labeled and unlabeled data for training, and reinforcement learning makes use of an agent that learns by interacting with its environment.
The text categorization approach saves considerable labor in organizing and handling text data compared with the knowledge engineering approach, which requires the data to be collected with the help of domain experts, either through direct interaction or through questions put to them. Although text classification techniques have proven useful in statistical spam filters, spammers systematically modify E-Mail messages so that malicious content bypasses the filters and reaches the user’s host. One such attack is the Good Word Attack, in which spam messages are injected with enough good words to lead the text classifier to classify spam as legitimate E-Mail. Spammers learn the features (keywords) that occur most often in legitimate E-Mails and add those good words (the most frequently occurring words in legitimate E-Mails) to make spam messages appear legitimate.
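The effect of a Good Word Attack can be illustrated with a minimal sketch. It assumes a toy multinomial naive Bayes filter with add-one smoothing and a four-message training set invented for illustration; the function names (train, spam_log_odds) and all the data are hypothetical and are not taken from the paper's experiments.

```python
import math
from collections import Counter

def train(messages):
    """Count token occurrences per class from (text, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def spam_log_odds(text, counts, totals):
    """Log-odds that text is spam (positive means spam), add-one smoothing."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    n_spam = sum(counts["spam"].values())
    n_ham = sum(counts["ham"].values())
    score = math.log(totals["spam"] / totals["ham"])
    for tok in text.lower().split():
        p_spam = (counts["spam"][tok] + 1) / (n_spam + len(vocab))
        p_ham = (counts["ham"][tok] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical toy training data, invented for this illustration only.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("win money now cheap offer", "spam"),
    ("meeting agenda for project review", "ham"),
    ("please review the project report before the meeting", "ham"),
]

counts, totals = train(TRAINING)

spam_text = "buy cheap pills now"
# The attacker appends words that are frequent in legitimate mail.
attacked_text = spam_text + " meeting project review report agenda the meeting project"

original_score = spam_log_odds(spam_text, counts, totals)
attacked_score = spam_log_odds(attacked_text, counts, totals)

print("original message flagged as spam:", original_score > 0)
print("attacked message flagged as spam:", attacked_score > 0)
```

In this toy setting each appended good word pulls the log-odds toward the legitimate class, so a sufficiently long tail of good words flips the decision even though the spam payload is unchanged.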
They also pad spam keywords with spaces and punctuation symbols so that the statistical spam filters do not catch them. Even though a large body of research has addressed the good word attack, the misclassification of features has received little attention. DKIM [8] is a defense mechanism that uses digital signatures to provide an authenticated E-Mail service. Further, DomainKeys offers end-to-end integrity from the sender to the intended recipient with randomly generated evidence values.
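The integrity property mentioned here can be sketched with the body hash that standard DKIM places in the bh= tag of the signature header. This is only an illustration of that one step under stated assumptions: it does not implement the enhanced architecture or the random evidence values proposed in the paper, it omits the public-key signature over the headers, and the function names are hypothetical. Normalizing bare LF line endings to CRLF is a convenience added for the example and is not part of DKIM itself.

```python
import base64
import hashlib

def canonicalize_body_simple(body: bytes) -> bytes:
    """Approximate DKIM 'simple' body canonicalization.

    Trailing empty lines are ignored and the body ends with exactly one CRLF.
    Assumption for this sketch: bare LF is first normalized to CRLF.
    """
    lines = body.replace(b"\r\n", b"\n").split(b"\n")
    while lines and lines[-1] == b"":
        lines.pop()
    return b"\r\n".join(lines) + b"\r\n"

def body_hash(body: bytes) -> str:
    """Base64-encoded SHA-256 of the canonicalized body (the bh= value)."""
    digest = hashlib.sha256(canonicalize_body_simple(body)).digest()
    return base64.b64encode(digest).decode("ascii")

# Hypothetical message bodies, invented for illustration.
signed_body = b"Quarterly report attached.\r\nRegards,\r\nAlice\r\n"
tampered_body = signed_body + b"meeting project review agenda\r\n"

signed_hash = body_hash(signed_body)

# Extra trailing blank lines do not change the hash...
print(body_hash(signed_body + b"\r\n\r\n") == signed_hash)
# ...but words inserted after signing do, so verification would fail.
print(body_hash(tampered_body) == signed_hash)
```

A verifier recomputes this hash from the received body and compares it with the signed value, so any words inserted into the body after the signing domain processed the message are detectable.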
This paper is organized as follows. Section 2 summarizes the related work, Section 3 discusses the architecture design of the proposed work, Section 4 discusses the experimental results and Section 5 is the conclusion.
2   RELATED WORK
Enrico Blanzieri presents an overview of machine learning applications for spam filtering and compares the different filtering methods. The overview also discusses other branches of anti-spam protection and the use of various approaches in commercial and noncommercial anti-spam software solutions [1].
Fabrizio Sebastiani compares various automated text categorization approaches in terms of how the classifiers are constructed, and further evaluates these approaches for document indexing within the general machine learning paradigm [2].
Sirisanyalak et al. use an E-Mail feature extraction technique that extracts a set of four features and use those features as input to a spam detection model in artificial immune spam systems [3].
Gregory Wittel et al. examine general attack methods, such as the common word attack and the dictionary attack, that target the filter’s feature generation through tokenization or obfuscation, along with the challenges faced by developers and spammers [4].
Daniel Lowd et al. describe the naïve Bayes and maximum entropy statistical spam filters and evaluate the effectiveness of active and passive good word attacks on those filters [5].
Zach Jorgensen et al. apply multiple instance logistic regression to bags of instances (segments): an E-Mail is classified as legitimate if all the instances in its bag are legitimate, and as spam if at least one instance in the bag is spam [6].
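The bag-level decision rule described for the multiple instance approach can be sketched as follows. This is a minimal illustration, assuming fixed-size word segments and a keyword-counting segment classifier that stands in for the multiple instance logistic regression model of [6]; the names split_into_segments, toy_segment_is_spam and classify_bag, the word list and the example messages are all hypothetical.

```python
from typing import Callable, List

def split_into_segments(text: str, size: int = 5) -> List[str]:
    """Split an E-Mail into a bag of fixed-size word segments (instances)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Hypothetical spam vocabulary used only by the toy segment classifier.
SPAM_WORDS = {"buy", "cheap", "pills"}

def toy_segment_is_spam(segment: str) -> bool:
    """Stand-in for a trained per-instance classifier: two or more spam words."""
    hits = sum(1 for word in segment.lower().split() if word in SPAM_WORDS)
    return hits >= 2

def classify_bag(segments: List[str],
                 is_spam_segment: Callable[[str], bool]) -> str:
    """Spam if at least one instance is spam; legitimate if all are legitimate."""
    return "spam" if any(is_spam_segment(s) for s in segments) else "legitimate"

attacked = "buy cheap pills now today meeting project review report agenda"
clean = "meeting project review report agenda"

attacked_label = classify_bag(split_into_segments(attacked), toy_segment_is_spam)
clean_label = classify_bag(split_into_segments(clean), toy_segment_is_spam)
print(attacked_label, clean_label)
```

Because the decision is taken per segment, good words appended to the message land in their own segments and cannot dilute the segment that carries the spam payload.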
Allman [7] defines DKIM as a domain-level authentication framework based on digital signatures. It permits potential E-Mail signers to publish information about their E-Mail signing practices so that E-Mail receivers can make additional assessments about messages, using key server technology, public-key cryptography and Mail Transfer Agents (MTAs) or Mail User Agents (MUAs).
Barry Leiba focuses on verifying the digital signature that creates the evidence, assuring both the sender and the recipient that the mail originates from where it says it does [8].
Erkut Sinan Ayla Havelsan discusses an intra-domain E-Mail security system that keeps E-Mail messages in the corresponding mailboxes in encrypted form. A Trusted Mail Gateway process keeps the encrypted E-Mail messages in the mailboxes and records processing results in a database as notary information [9].
Ya-Jeng Lin discusses the Lightweight, Pollution-Attack Resistant Multicast authentication scheme (PARM), which generates evidence that receivers can validate on a fast, per-packet basis. The fault-tolerance coding algorithm that is also discussed [10] tolerates packet loss, and signature amortization reduces the computation and communication overhead.
