Training the spam engine
Training
For the GWAVA antispam engine to work properly, the engine needs to be provided a representative sample of ham and spam to learn from.
This section will describe methods for feeding Ham and Spam to the antispam engine. Please do not make any configuration changes before reading this entire section thoroughly. Misconfiguration can severely diminish the effectiveness of the spam engine.
Suggestions
Here are a few basic guidelines to follow when classifying ham and spam:
- It is extremely important to feed the antispam engine e-mail that represents what your company normally receives. A wide range of samples, both ham and spam, is best. Adding the same message 500 times to your spam corpus will NOT lead to good results. Quantity is important, but the quality of the messages you provide matters even more.
- The most important thing is to have positive results, and to obtain those consistently. Decide beforehand what is acceptable as ham, and what is unacceptable. If you deal with messages consistently you will receive positive results. For example, you might want newsletters (mass mailings from Novell, Dell, HP, etc.) to get through the system, so make sure that everyone helping you will mark those messages as ham.
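If you are collecting messages yourself before feeding them, one way to avoid piling up copies of the same message is to drop duplicates first. The sketch below is not a GWAVA feature; it is a standalone illustration that assumes you hold the raw MIME source of each message as bytes. The function names are hypothetical, and treating two messages as duplicates when their subject and text parts match is an assumption you may want to adjust.

```python
import hashlib
from email import message_from_bytes


def body_fingerprint(raw: bytes) -> str:
    """Hash the subject and text parts of a MIME message, ignoring other headers."""
    msg = message_from_bytes(raw)
    parts = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            parts.append(payload.decode("utf-8", errors="replace"))
    # Collapse whitespace and case so trivial reformatting does not hide a duplicate.
    normalized = " ".join(" ".join(parts).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_messages(raw_messages):
    """Return the messages whose fingerprint has not been seen before, in order."""
    seen = set()
    kept = []
    for raw in raw_messages:
        fingerprint = body_fingerprint(raw)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(raw)
    return kept
```

Run `unique_messages` over each corpus (ham and spam) separately, and still review the result by hand; the fingerprint only catches near-identical copies, not misclassified mail.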
Using a Mail Feeder to Train GWAVA
This is an easy way to sort Ham and Spam using an ordinary email account. This will require some intervention by the user, but will also be the most accurate way of adding Spam and Ham to the corpus because there should not be any errors. The following sections describe how to set this up.
Spam/Ham Mailbox setup
By now you should have realized that to use the new probability engine effectively, it is very important to provide the engine with enough good-quality examples of ham and spam. The next few sections explain how to obtain the right type of messages in the most efficient manner.
Initially GWAVA cannot automatically teach itself the difference between ham and spam, but once you have this set up properly it should require minimal if any direct attention to keep the anti-spam engine tuned.
The first thing you will want to do is set up a mailbox to which ham and spam examples can be sent. Make sure that this mailbox is IMAP enabled, and then create a ham folder and a spam folder in the root of the mailbox. Refer to your mail client's documentation for details on how to do this.
For example we have set up a mailbox in GroupWise for the user GWAVAMAN. You could have users forward spam or ham to this mailbox to get some examples. To get some ham in this mailbox you could set up any released messages to be bcc'd to this mailbox. In order to do that you should follow the directions found here.
Once you have some mail in this box you need to sort it into ham and spam. Since we have created a ham folder and a spam folder, you can simply drag and drop the messages into the appropriate folder. It is very important that ham is actually ham and that spam is really spam, which is why each example message should be reviewed to ensure accuracy.
Once the mailbox has been set up, you can configure GWAVA to pull the ham and spam messages automatically from the folders you created.
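Before pointing GWAVA at the mailbox, it can be useful to confirm that the folders are reachable over IMAP and contain the messages you sorted. The following is a minimal sketch using Python's standard `imaplib`; it is an independent check and not part of GWAVA. The host, credentials, and the folder names `Ham` and `Spam` are placeholders for whatever you created.

```python
import imaplib


def count_messages(conn, folder: str) -> int:
    """Return how many messages are in `folder`, opening it read-only."""
    status, _ = conn.select(folder, readonly=True)
    if status != "OK":
        raise RuntimeError(f"cannot open folder {folder!r}")
    status, data = conn.search(None, "ALL")
    if status != "OK":
        raise RuntimeError(f"search failed in {folder!r}")
    # data[0] is a space-separated byte string of message sequence numbers.
    return len((data[0] or b"").split())


def check_mailbox(host: str, user: str, password: str, folders=("Ham", "Spam")) -> dict:
    """Log in over IMAP and report the message count for each training folder."""
    conn = imaplib.IMAP4(host)  # use imaplib.IMAP4_SSL if the server requires TLS
    try:
        conn.login(user, password)
        return {name: count_messages(conn, name) for name in folders}
    finally:
        conn.logout()
```

For example, `check_mailbox("192.0.2.10", "GWAVAMAN", "secret")` would return a dictionary of counts per folder if the login succeeds; the address and password here are illustrative only.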
GWAVA Mail Feeder Setup
The feeder for GWAVA automatically pulls email from the ham and spam folders of an IMAP-enabled mailbox. To configure it:
- Click Enable Learning Feeder Services.
- Under Spam/Ham, select Spam or Ham.
- Under Server, enter the IP address of the post office.
- Under Login, enter the username of the mailbox, then enter the password.
- Under Mailbox, enter the name of the folder you created for the type (Spam or Ham) selected above.
- Click the green plus sign and save the changes.
You probably noticed that the flood protect option was skipped; that is because it needs a little more explanation. To control the amount of ham or spam that is learned by the probability engine, GWAVA 4 uses flood protection. The option is explained in more detail in the Flood Protection section. However, since you are manually sorting the mail in this mailbox, you probably won't want flood protect to be on for this mailbox.
Auto-Learning
It would be a laborious process if the only way to teach the spam engine were to sort ham and spam by hand each day and let the feeders do the rest. Manual sorting is very useful and necessary to begin with, but the goal of GWAVA4 is to automate the process. Keep in mind that an incorrect configuration in these sections will heavily skew the effectiveness of the new spam engine. Please do not make any configuration changes until you have read the suggestions for each section.
Spam Auto-Learning
After opening up the folder, you will have a list of all the events that could fire for a message. From here you can choose events that will only fire on spam (you need to be 100% sure it will be spam). When a message comes into the system and that event fires it will automatically be added to the spam corpus and be learned by the spam engine.
Because we want to be very certain that what we are feeding is actually spam, the easiest thing to do is to check SURBL, because SURBL is typically very accurate.
RBL is NOT recommended, because it is too susceptible to false positives. In general, none of the services other than SURBL should be checked unless you are 100% confident that everything they flag will be spam.
You could create some text filters with certain words that appear only in spam messages. You could also have users send any spam that got through to the mailbox you have set up for your mail feeders, and then put it in the spam folder for processing.
Ham Auto-Learning
After opening up the folder, you will have a list of all the events that could fire for a message. From here you can choose events that will only fire on ham (you need to be 100% sure it will be ham). When a message comes into the system and that event fires it will automatically be added to the ham corpus and be learned by the spam engine.
Watermarking
You can also "watermark" a message, using the Learn By Example link.
This will be your primary source for getting ham into your corpus. To get started click to add a training example.
You may either upload a previously saved MIME file, or copy and paste the contents of the MIME message and upload it that way (useful for copying and pasting the message source out of your mail client). When you add a message it will automatically add any messages from that sender to the ham corpus. Don't worry, there is no possibility of spoofing with the watermarking method.
Once you have uploaded a file, you can make sure that these messages are never blocked by the anti-spam system by checking Exclude this sender from spam scanning. You can put whatever label you want and the sender is pulled from the MIME file. Then click the Submit Watermark button and you're done. If you would like to submit another example, click the link to go through the process again.
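Since the watermark is tied to the sender taken from the MIME file, it can help to preview which address a saved message carries before you upload it. The sketch below uses only Python's standard `email` package and assumes the sender of interest is the `From` header; which header GWAVA itself reads is not stated here, so treat this as an approximation. The function name is hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def sender_address(mime_source: str) -> str:
    """Return the lower-cased address from the From header, or '' if absent."""
    msg = message_from_string(mime_source)
    _display_name, address = parseaddr(str(msg.get("From", "")))
    return address.lower()
```

Paste the message source into a string (or read it from the saved file) and pass it to `sender_address` to see the address you are about to watermark.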
As you add examples, you can see what you've added by going back to the initial watermark page.
Of course if you added one by accident you can delete the entry by clicking on the red 'x'.
One thing you could do is create a text filter that matches text in your company signature, but be sure that you do NOT associate any services (block, quarantine, etc.) with the filter, because we want the mail to be delivered. This is what was done in the screenshot example, but watermarking would be a better choice. The only reason to create the filter is so that it shows up in the event list for Ham Auto-Learning. With this in place, when someone replies to an email that carries your company signature, the reply will be added to the ham corpus.
For the most part we recommend that you use the Learn by example (watermarking) feature, because it is the most accurate. The more watermarks you have the better. Try to limit the watermarks you add to one on one communications, since that is probably more reliable. That way you have a consistent source of ham that gets submitted to the spam engine for training.
Watermark Feeder
Just as we set up mail feeders for ham and spam, we can set one up for watermarking. Follow the directions on how to set up the ham/spam mailbox, and then create another folder in the root of the mailbox called Watermark.
You then follow the same process as you would with ham and spam. Just drag and drop the messages you want watermarked into the Watermark folder.
Now we need to configure GWAVA so it will automatically pull those messages from the watermark folder. To do that, click Watermark Quick-Feeder and then click Add a new training source. Fill out the information that would connect you to your mailbox and the newly created Watermark folder. The create exceptions checkbox will make sure that message will never get blocked again.
- NOTE - It is a good idea to establish beforehand what you want to watermark. For example you might only want to watermark one-on-one communication, because every time a message comes in from that user it will automatically be added to the ham corpus, so you want to be sure that the source is valid.
Flood Protection
Flood protection limits the amount of mail that can be automatically learned by the spam system. The idea behind flood protection is that there is a quota of mail to be added to the ham and spam corpus: about 2 messages of ham and 2 messages of spam are expected each minute. So if 100 spam messages were added in one minute from SURBL hits, the engine would take only 2 of them in that minute and discard the rest. The same is true of email that is added from the ham auto-feeders. However, since mail is checked every minute, unused quota carries over: if no mail was transferred for 5 minutes, the quota for the next minute would be 12 messages (2 for each of the 5 idle minutes plus 2 for the current minute), and up to that many messages could be taken.
The Spam/Ham Auto-Learning functions activate flood protection automatically. For Mail Feeders, you control whether flood protection is enabled. You will likely not want flood protection enabled for your Mail Feeders, because you want all the mail you have sorted personally to be entered into the corpus.
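The quota arithmetic described above can be sketched as a small simulation. This is only a model of the behaviour as described in this section, not GWAVA's actual implementation: it assumes the quota grows by 2 each minute, that unused quota carries over without an upper limit, and that messages above the quota are discarded rather than queued. The function name is hypothetical.

```python
def simulate_flood_protection(arrivals, per_minute=2):
    """Return the number of messages learned in each minute.

    arrivals[i] is how many messages are offered in minute i. The quota grows
    by `per_minute` every minute, unused quota carries over, and anything
    above the available quota is discarded (assumption: no cap, no queue).
    """
    quota = 0
    learned = []
    for offered in arrivals:
        quota += per_minute
        taken = min(offered, quota)
        quota -= taken
        learned.append(taken)
    return learned
```

With 100 messages offered in the first minute, only 2 are learned; with five idle minutes followed by 100 messages, 12 are learned in the sixth minute, matching the example in the text.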
Manually Train Engine
Some administrators may already have collected a spam/ham corpus and want to import it manually into the GWAVA system. Before you do this, please go through your ham/spam corpus and correct any obvious classification errors.
To import the files, simply drop the MIME files into the /opt/beginfinite/gwava/services/autoblocker/transfer/<interface_ID>/[Ham or Spam] folder. GWAVA will automatically use these files to train itself.
- NOTE - These folders do have flood protection turned on, so you will not be able to drop all of the files in at once. You will need to break it down into chunks and add pieces until you have added your entire corpus.
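Breaking a large corpus into chunks can be scripted. The sketch below is a hedged illustration, not a GWAVA tool: it splits a list of MIME files into batches and copies one batch at a time into a destination directory that you supply (the Ham or Spam transfer folder for your interface). The helper names and the batch size are assumptions; this section does not state a recommended chunk size or how long to wait between chunks.

```python
import shutil
from pathlib import Path


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def copy_batch(files, transfer_dir):
    """Copy one batch of MIME files into the transfer folder; return the new paths."""
    dest = Path(transfer_dir)
    return [Path(shutil.copy2(f, dest / Path(f).name)) for f in files]
```

A typical use would be to call `batches(sorted(Path("corpus/spam").iterdir()), 50)`, copy the first batch with `copy_batch`, wait until GWAVA has had time to process it, and then copy the next one. The source directory and the size of 50 are examples only.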









