Email spam detection using support vector machine classification.
Hello and welcome to another tutorial on RapidMiner, one of the best open-source data mining tools available. This article discusses how to use Support Vector Machine (SVM) classification in RapidMiner to detect spam in your POP3 email account. Since we will read email from a POP3 or IMAP account, you may want to check out my previous post, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.
Unlike my other tutorials, this tutorial explains a RapidMiner solution that consists of two processes. The first process creates a support vector machine classification model that is trained to identify two classes of text documents, spam and not spam. We will then store the model in the RapidMiner repository for use in the second process. The second process will read through an email account that you specify and classify each email as spam or not spam based on the SVM model that you created in the first process. The methods used in these processes can be used to classify many different types of documents.
Process one: Train the model using text files
In the first process, we will use the Process Documents from Files operator to read two directories containing text files that you create as examples of spam and non-spam emails. The more examples you provide, the better the model is likely to perform when it is time to classify unlabeled emails.
The Process Documents from Files operator contains a subprocess (identified by the little blue box in the bottom right corner). When you double click on the blue box, you can add operators that process each document read by the main operator. When text mining to generate word vectors, there are several operators that you will find yourself using often. In this particular example, you will use Transform Cases (to change all words to lowercase), Tokenize (to separate the text into word tokens), Filter Tokens (by Length) (to keep only words that are at least 4 characters long), Stem (Snowball) (to reduce each word to its base form), and finally Filter Stopwords, which removes common words that carry little meaning.
The Process Documents from Files operator has a property called Text directories with a button titled "Edit List". Click this button and create two entries. The class name is the label that will be used for classification, and the directory is the folder containing the training text files that you have created for that class. For our experiment, we will create a class called Spam and point it to a directory of text files full of spam email messages. The other class will be called not spam, and its directory will contain text files where you have copied and pasted the text from legitimate emails.
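For readers who think in code, the subprocess chain and the Text directories mapping can be sketched roughly in Python. This is an illustrative analogue, not what RapidMiner runs internally: the function names, the tiny stop word list, and the crude suffix stripper (a stand-in for the real Snowball stemmer) are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical, deliberately tiny stop word list (RapidMiner ships a far larger one).
STOPWORDS = {"this", "that", "with", "from", "have", "your", "will", "been", "here"}


def crude_stem(token):
    """Stand-in for Stem (Snowball): strips a few suffixes. NOT the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_length=4):
    """Apply the five steps in the same order as the subprocess."""
    lowered = text.lower()                                  # Transform Cases
    tokens = re.findall(r"[a-z]+", lowered)                 # Tokenize (assumed: split on non-letters)
    tokens = [t for t in tokens if len(t) >= min_length]    # Filter Tokens (by Length)
    tokens = [crude_stem(t) for t in tokens]                # Stem (crude stand-in)
    return [t for t in tokens if t not in STOPWORDS]        # Filter Stopwords


def load_training_documents(text_directories):
    """Mirror the Text directories list: {class name: directory of .txt files}."""
    examples = []
    for class_name, directory in text_directories.items():
        for path in sorted(Path(directory).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="ignore")
            examples.append((preprocess(text), class_name))
    return examples
```

A call such as `load_training_documents({"Spam": "data/spam", "not spam": "data/ham"})` (both paths are placeholders) would return one `(tokens, class name)` pair per text file, which is roughly the information the word vectors are built from.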
Here is an image of the first process that we will use to train and store the model.
As you can see, we have a Process Documents from Files operator, an X-Validation operator (cross validation) to estimate the performance of our model, and two Store operators: one stores the word vectors created by Process Documents from Files, and the other stores the SVM model that comes from the output of the cross validation operator. The Process Documents from Files and X-Validation operators are both nested processes (which means they contain subprocesses). I will provide screenshots so you can see how each one is set up in its subprocess area.
Run the process to save the model and to view the performance of the model.
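To make the train, validate, and store steps concrete, here is a minimal pure-Python sketch under stated assumptions. It swaps in a tiny linear SVM trained by sub-gradient descent on the hinge loss, which is not the SVM implementation RapidMiner uses, and a simple k-fold loop in place of X-Validation. The toy documents, the +1/-1 labels, and all function names are invented for illustration; `pickle` plays the role of the Store operator.

```python
import pickle
from collections import Counter


def vectorize(tokens, vocabulary):
    """Term-count vector over a fixed word list."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch sub-gradient descent on L2-regularised hinge loss; labels are +1 / -1."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # example is inside the margin or misclassified
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def cross_validate(X, y, k=3):
    """Hold out every k-th example in turn and return the overall accuracy."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)


# Toy, already-tokenised training documents: +1 = spam, -1 = not spam.
documents = [
    (["free", "prize", "click"], 1),
    (["free", "money", "click"], 1),
    (["win", "prize", "money"], 1),
    (["meeting", "project", "report"], -1),
    (["project", "lunch", "meeting"], -1),
    (["report", "lunch", "schedule"], -1),
]
vocabulary = sorted({token for tokens, _ in documents for token in tokens})
X = [vectorize(tokens, vocabulary) for tokens, _ in documents]
y = [label for _, label in documents]

accuracy = cross_validate(X, y)    # performance estimate, as X-Validation reports
model = train_linear_svm(X, y)     # final model trained on all examples
stored = pickle.dumps({"vocabulary": vocabulary, "model": model})  # analogue of Store
```

The word list is stored together with the model because a new document has to be vectorized against the same vocabulary before the model can score it. In practice you would use a library SVM rather than this hand-rolled one.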
Process two: Classify emails using the stored model
In the next process, we will read the SVM model that we stored in the repository in process one and use it to classify emails read by the Process Documents from Mail Store operator. This process is pretty straightforward. Simply add a Retrieve operator to read the model from the repository. Next, add a Process Documents from Mail Store operator to read the documents to be classified by the model. If you need to know how to set the properties on the Process Documents from Mail Store operator, please check out my previous article, Text mining: How to mine e-mail data from an IMAP account using RapidMiner.
Here is how the 2nd process should look:
The Retrieve operator reads the previously trained SVM model from the repository, and an Apply Model operator uses it to produce the classification. A Process Documents from Mail Store operator reads the email messages and supplies them as the unlabeled input to Apply Model. Its subprocess contains the same text mining operators as the Process Documents operator in process one. Next, just run the process and you will see all emails and their predicted class.
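Outside RapidMiner, the same read-then-classify flow could look roughly like the Python sketch below, which uses only the standard library. The helper names are hypothetical, `predict_label` stands for whatever stored classifier you load, and the host, user, and password arguments are placeholders you would supply yourself.

```python
import imaplib
from email import message_from_bytes, policy


def extract_subject_and_body(raw_message):
    """Parse raw RFC 822 bytes and return (subject, plain-text body)."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    subject = str(msg["subject"] or "")
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content() if part is not None else ""
    return subject, body


def fetch_raw_messages(host, user, password, mailbox="INBOX"):
    """Download every message in the mailbox over IMAP (opened read-only)."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)
        _, data = conn.search(None, "ALL")
        raw_messages = []
        for num in data[0].split():
            _, parts = conn.fetch(num, "(RFC822)")
            raw_messages.append(parts[0][1])
        return raw_messages


def classify_emails(raw_messages, predict_label):
    """Apply a classifier (any callable taking text, returning a label) to each email."""
    results = []
    for raw in raw_messages:
        subject, body = extract_subject_and_body(raw)
        results.append((subject, predict_label(subject + "\n" + body)))
    return results
```

Here `predict_label` would wrap the same preprocessing used at training time followed by the stored model, just as the mail store subprocess repeats the operators from process one.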
Remember, the class names used by the model are arbitrary; you could just as easily set them to Positive/Negative, Good/Bad, Happy/Sad, etc. This process can be used for many different applications, such as sentiment analysis.
Thanks for reading!