RapidMiner Tutorial: Email Spam Detection Using Support Vector Machine Classification

August 19, 2013 at 11:51 PM, Buddy James

RapidMiner Email spam detection using support vector machine classification.

Hello and welcome to another tutorial on RapidMiner, one of the best open-source data mining tools available. This article discusses how to use Support Vector Machine (SVM) classification in RapidMiner to detect spam in your POP3 email account. Since we will read email from a POP3 or IMAP account, you may want to check out my previous post, "Text mining: How to mine e-mail data from an IMAP account using RapidMiner".

Process setup

Unlike my other tutorials, this tutorial explains a RapidMiner solution that consists of two processes. The first process creates a support vector machine classification model that is trained to identify two classes of text documents, spam and not spam. We then store the model in the RapidMiner repository for use in the second process. The second process reads through an email account that you specify and classifies each email as spam or not spam based on the SVM model that you created in the first process. The methods used in these processes can be used to classify many different types of documents.

Process one: Train the model using text files

In the first process, we will use the Process Documents from Files operator (called Process Documents in the rest of this article) to read two directories where you will create text files that represent examples of spam and non-spam emails. It's important to remember that the more examples you provide, the better the model will perform when it's time to classify unlabeled emails.

As you may know, the Process Documents operator contains a subprocess (identified by the little blue box in the bottom right corner). When you double click on the blue box, you can add operators that will process the documents that are read through the main operator. When text mining to generate word vectors, there are several operators that you will find yourself using often. In this particular example, you will use Transform Case (to change all words to lowercase), Tokenize (to separate the text into tokens), Filter Tokens (by Length) (to keep only words that are at least 4 characters long), Stem (Snowball) (to reduce each word to its "base word"), and finally Filter Stopwords, which removes common words that carry little meaning.

The Process Documents operator has a property called "text directories" with a button titled "Edit List". Click this button and create two entries. The class name represents a class that you will use for classification. The directory is the one that contains the training text files that you've created for that class. For our experiment, we will create a class called Spam and provide a directory that contains text files where you've copied and pasted the text from spam email messages. The other class will be called not spam, and its directory will contain text files with the text from legitimate emails. Please understand that the accuracy of the model will depend on the number of examples that you provide.
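To make the subprocess chain more concrete, here is a rough Python sketch of the same steps outside of RapidMiner. It is only an illustration under stated assumptions: the stopword list is a tiny hypothetical one, `crude_stem` is a simplistic stand-in for Snowball stemming (not the real algorithm), and the function names are invented for this example.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"this", "that", "with", "from", "your", "have", "will"}

# Crude stand-in for Snowball stemming: strip one of a few common suffixes.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 4


def crude_stem(token: str) -> str:
    """Strip the first matching suffix if at least MIN_STEM_LENGTH characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LENGTH:
            return token[: -len(suffix)]
    return token


def preprocess(text: str, min_length: int = 4) -> List[str]:
    """Mirror the operator chain: case, tokenize, length filter, stem, stopwords."""
    lowered = text.lower()                                # Transform Case
    tokens = re.findall(r"[a-z]+", lowered)               # Tokenize (runs of letters)
    tokens = [t for t in tokens if len(t) >= min_length]  # Filter Tokens (by Length)
    tokens = [crude_stem(t) for t in tokens]              # Stem (stand-in only)
    return [t for t in tokens if t not in STOPWORDS]      # Filter Stopwords


def word_vector(text: str) -> Counter:
    """Count how often each remaining term occurs in the document."""
    return Counter(preprocess(text))


if __name__ == "__main__":
    print(word_vector("Winning prizes! You have WON free prizes, claim this prize now."))
```

Each document becomes a bag of term counts, which is the kind of word vector the classifier is trained on. RapidMiner can also weight the terms (for example with TF-IDF) instead of using raw counts.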

Here is an image of the first process that we will use to train and store the model.

As you can see, we have a Process Documents operator, an X-Validation operator (cross validation) to calculate the performance of our model, and two Store operators: one stores the word vectors created by Process Documents, and the other stores the SVM model that comes from the output of the cross validation operator. The Process Documents and X-Validation operators are both nested processes (which means they contain subprocesses). I will provide screenshots so you can see how each one is set up in its subprocess area.

Process documents 

 

X-Validation

Run the process to save the model and to view the performance of the model.
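For readers who want to see what cross validation of an SVM amounts to, here is a small self-contained Python sketch. It is not RapidMiner's implementation: the trainer is a minimal linear SVM fitted with plain subgradient descent on the hinge loss, the folds are assigned round-robin rather than by RapidMiner's sampling, and the example word vectors and all names are made up for illustration.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]
Model = Tuple[Vector, float]  # (weights, bias)


def score(weights: Vector, bias: float, x: Vector) -> float:
    """Linear decision value w.x + b for a sparse word vector."""
    return sum(weights.get(term, 0.0) * value for term, value in x.items()) + bias


def train_linear_svm(xs: List[Vector], ys: List[int], epochs: int = 20,
                     learning_rate: float = 0.1, reg: float = 0.01) -> Model:
    """Minimise the L2-regularised hinge loss with per-example subgradient steps."""
    weights: Vector = {}
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * score(weights, bias, x)
            for term in weights:                 # shrink step from the L2 penalty
                weights[term] *= 1.0 - learning_rate * reg
            if margin < 1.0:                     # example violates the margin
                for term, value in x.items():
                    weights[term] = weights.get(term, 0.0) + learning_rate * y * value
                bias += learning_rate * y
    return weights, bias


def predict(model: Model, x: Vector) -> int:
    """Return +1 (spam) or -1 (not spam)."""
    weights, bias = model
    return 1 if score(weights, bias, x) >= 0.0 else -1


def cross_validate(xs: List[Vector], ys: List[int], k: int = 3) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    n = len(xs)
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = train_linear_svm([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Made-up word vectors: +1 = spam, -1 = not spam.
EXAMPLES: List[Vector] = [
    {"prize": 2, "free": 1}, {"meeting": 2, "report": 1},
    {"prize": 1, "free": 2}, {"meeting": 1, "report": 2},
    {"prize": 1, "free": 1, "claim": 1}, {"meeting": 1, "report": 1, "agenda": 1},
]
LABELS = [1, -1, 1, -1, 1, -1]

if __name__ == "__main__":
    print("mean accuracy:", cross_validate(EXAMPLES, LABELS, k=3))
```

The point of holding out one fold at a time is that the accuracy is measured on documents the model has not seen during training, which is closer to how it will behave on real, unlabeled email.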

Process two: Classify emails using the stored model

In the next process, we will read the SVM model that we stored in the repository in process one and use it to classify emails read by the Process Documents from Mail Store operator. This process is pretty straightforward. Simply add a Read operator to read the model from the repository. Next, add a Process Documents from Mail Store operator to read the documents to be classified by the model. If you need to know how to set the properties on the Process Documents from Mail Store operator, please check out my previous article, "Text mining: How to mine e-mail data from an IMAP account using RapidMiner".

Here is how the 2nd process should look:

The Read operator reads the previously trained SVM model from the repository, and the Apply Model operator uses it to produce the classification. The Process Documents from Mail Store operator reads the email messages and supplies them as the unlabeled input to Apply Model. Its subprocess contains the same text mining operators as the Process Documents operator in process one, so that the emails are turned into word vectors in the same way as the training documents. Next, just run the process and you will see all emails and their predicted class.
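The store, read, and apply steps can be pictured with the following Python sketch. It is an analogy only: a JSON file stands in for the RapidMiner repository, the weights are hand-picked hypothetical values rather than a trained model, and every function name here is invented for the example.

```python
import json
import os
import tempfile
from typing import Dict, List

Vector = Dict[str, float]


def store_model(path: str, weights: Vector, bias: float) -> None:
    """Persist a linear model so a separate process can reuse it later."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"weights": weights, "bias": bias}, handle)


def retrieve_model(path: str) -> dict:
    """Load a previously stored model."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def apply_model(model: dict, vectors: List[Vector]) -> List[str]:
    """Label each unlabeled word vector as 'spam' or 'not spam'."""
    predictions = []
    for vector in vectors:
        value = model["bias"] + sum(
            model["weights"].get(term, 0.0) * count for term, count in vector.items()
        )
        predictions.append("spam" if value >= 0.0 else "not spam")
    return predictions


def demo() -> List[str]:
    """Store hypothetical weights, read them back, and classify two emails."""
    weights = {"prize": 0.6, "free": 0.4, "meeting": -0.5, "report": -0.4}
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "svm_model.json")
        store_model(path, weights, bias=0.0)       # end of process one
        model = retrieve_model(path)               # start of process two
    unlabeled = [{"prize": 1, "free": 1}, {"meeting": 2}]
    return apply_model(model, unlabeled)


if __name__ == "__main__":
    print(demo())
```

The unlabeled emails must be converted to word vectors with the same preprocessing that was used for training; otherwise the terms in the model and the terms in the new documents will not line up.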

Remember, the class names of the model are arbitrary; you could just as easily set them to Positive/Negative, Good/Bad, Happy/Sad, etc. This process can be used for many other applications, such as sentiment analysis.

Thanks for reading!



Comments (2)

Rizwan Ali (Islamic Republic of Pakistan) says:

hi james

is there any operator that verifies the links in the mail body ?

i m considering that if we can identify links in mail body we can use it as one of the parameters to identify spam

also can you share thoughts on how we can test the model like the mail that comes in my inbox is usually checked for spam n stuff

kindly please reply

regards

Rizwan


Buddy James (United States) says:

Thanks for reading Rizwan!

As far as validating the email addresses, I'm not aware of any "out of the box" operators. However, I would probably write a RESTful web service in ASP.NET and then write a macro in RapidMiner that would extract the email address and call the web service to validate it. It would be interesting and a bit of a challenge, but RapidMiner is a wonderful product, and I'm sure that you could find a way to make it happen.

As far as your second question, I believe you are asking how RapidMiner could check the email in your inbox and move mail that it considers spam out of your inbox. If that is what you are asking, I'm not really sure how that would work. For something like that, I'd probably suggest another web service to handle the housekeeping of your mailbox, and have RapidMiner call the web service and report the messages that should be marked as spam. You can also extend RapidMiner if you know Java (RapidMiner 5 is open source). I hope these suggestions help you on your quest to find answers. Thanks again for reading. Sincerely, Buddy.
