public class BayesianAnalysis
extends org.apache.mailet.base.GenericMailet
Spam detection mailet using Bayesian analysis techniques.
Sets an email message header indicating the probability that an email message is spam.
Based upon the principles described in A Plan for Spam by Paul Graham, extended with those in Paul Graham's Better Bayesian Filtering.
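For orientation, the combining rule from A Plan for Spam can be sketched in a few lines. This is a minimal illustration and not the code used by this mailet: the class name GrahamCombiner and the method combine are hypothetical, and the inputs are assumed to be per-token spam probabilities already clamped strictly between 0 and 1 (Graham's essay uses 0.01 and 0.99 as bounds).

```java
/** Illustrative only: Graham-style combination of per-token spam probabilities. */
final class GrahamCombiner {

    private GrahamCombiner() {
    }

    /**
     * Combines per-token probabilities p_i into
     * prod(p_i) / (prod(p_i) + prod(1 - p_i)).
     * Assumes every p_i lies strictly between 0 and 1.
     */
    static double combine(double... tokenProbabilities) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : tokenProbabilities) {
            spamProduct *= p;
            hamProduct *= (1.0 - p);
        }
        return spamProduct / (spamProduct + hamProduct);
    }

    public static void main(String[] args) {
        // Two strongly spammy tokens and one hammy token (made-up values).
        System.out.println(combine(0.99, 0.95, 0.20));
    }
}
```

Tokens with probability 0.5 carry no evidence either way, so a message made only of such tokens scores 0.5.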
The analysis capabilities are based on token frequencies (the Corpus) learned through a training process (see BayesianAnalysisFeeder) and stored in a JDBC database. After a training session, the Corpus must be rebuilt from the database in order to acquire the new frequencies. Every 10 minutes a special thread in this mailet checks whether the feeder has made any change to the database, and rebuilds the Corpus if necessary.
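The periodic check described above can be pictured with the following sketch. It is an assumption-laden illustration, not the mailet's actual implementation: the class CorpusReloadSketch, the lastDatabaseUpdate supplier and the rebuildCorpus callback are all hypothetical stand-ins for whatever the mailet uses internally to detect feeder changes and reload the Corpus.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Illustrative only: rebuild a corpus when the backing store has changed. */
final class CorpusReloadSketch {

    // Hypothetical: returns the timestamp of the last change made by the feeder.
    private final LongSupplier lastDatabaseUpdate;
    // Hypothetical: reloads token frequencies from the database.
    private final Runnable rebuildCorpus;
    private volatile long lastCorpusLoadTime;

    CorpusReloadSketch(LongSupplier lastDatabaseUpdate, Runnable rebuildCorpus) {
        this.lastDatabaseUpdate = lastDatabaseUpdate;
        this.rebuildCorpus = rebuildCorpus;
    }

    long getLastCorpusLoadTime() {
        return lastCorpusLoadTime;
    }

    /** Rebuilds only if the database changed after the last load; returns true if it rebuilt. */
    boolean reloadIfStale() {
        long updated = lastDatabaseUpdate.getAsLong();
        if (updated <= lastCorpusLoadTime) {
            return false;
        }
        rebuildCorpus.run();
        // In this sketch the load time is the timestamp of the change that was loaded.
        lastCorpusLoadTime = updated;
        return true;
    }

    /** Runs the check every 10 minutes on a daemon thread. */
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "corpus-reload-sketch");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::reloadIfStale, 10, 10, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

Comparing timestamps before rebuilding keeps the common case cheap: when the feeder has not run, the check does no database reload at all.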
An org.apache.james.spam.probability mail attribute will be created containing the computed spam probability as a Double. A message header with the name configured in headerName will also be created, containing the same probability in floating-point representation.
Sample configuration:
<mailet match="All" class="BayesianAnalysis">
<repositoryPath>db://maildb</repositoryPath>
<!--
Set this to the header name to add with the spam probability
(default is "X-MessageIsSpamProbability").
-->
<headerName>X-MessageIsSpamProbability</headerName>
<!--
Set this to true if you want to ignore messages coming from local senders
(default is false).
By local sender we mean a return-path with a local server part (server listed
in <servernames> in config.xml).
-->
<ignoreLocalSender>true</ignoreLocalSender>
<!--
Set this to the maximum message size (in bytes) that a message may have
to be considered spam (default is 100000).
-->
<maxSize>100000</maxSize>
<!--
Set this to false if you do not want to tag the message subject when spam is detected (default is true).
-->
<tagSubject>true</tagSubject>
</mailet>
The probability of being spam is prepended to the subject if it is greater than 0.1 (10%).
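The threshold rule can be illustrated as follows. This is only a sketch: the class SubjectTagSketch, the method tagSubject and in particular the "[NN% spam]" tag format are assumptions made for the example, and the text that the mailet really prepends may look different.

```java
import java.util.Locale;

/** Illustrative only: prepend a spam probability to a subject above a threshold. */
final class SubjectTagSketch {

    // Threshold taken from the description above: tag only when probability > 0.1.
    static final double TAG_THRESHOLD = 0.1;

    private SubjectTagSketch() {
    }

    /** Returns the subject unchanged at or below the threshold, otherwise a tagged subject. */
    static String tagSubject(String subject, double probability) {
        String original = (subject == null) ? "" : subject;
        if (probability <= TAG_THRESHOLD) {
            return original;
        }
        // Hypothetical tag format; Locale.ROOT keeps the digits locale independent.
        String percent = String.format(Locale.ROOT, "%.0f%%", probability * 100.0);
        return "[" + percent + " spam] " + original;
    }

    public static void main(String[] args) {
        System.out.println(tagSubject("Quarterly report", 0.05));
        System.out.println(tagSubject("You have won", 0.5));
    }
}
```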
The required tables are created automatically if they do not already exist (see sqlResources.xml). The token field in both the ham and spam tables is case sensitive.
See also: BayesianAnalysisFeeder, BayesianAnalyzer, JDBCBayesianAnalyzer
Constructor and Description |
---|
BayesianAnalysis() |
Modifier and Type | Method and Description |
---|---|
long | getLastCorpusLoadTime() Getter for property lastCorpusLoadTime. |
String | getMailetInfo() Return a string describing this mailet. |
int | getMaxSize() Getter for property maxSize. |
void | init() Mailet initialization routine. |
void | service(org.apache.mailet.Mail mail) Scans the mail and determines the spam probability. |
void | setDataSource(DataSource datasource) |
void | setFileSystem(FileSystem fs) |
void | setMaxSize(int maxSize) Setter for property maxSize. |
public String getMailetInfo()
getMailetInfo in interface org.apache.mailet.Mailet
getMailetInfo in class org.apache.mailet.base.GenericMailet

public int getMaxSize()

public void setMaxSize(int maxSize)
maxSize - New value of property maxSize.

public long getLastCorpusLoadTime()

public void setDataSource(DataSource datasource)

public void setFileSystem(FileSystem fs)

public void init() throws javax.mail.MessagingException
init in class org.apache.mailet.base.GenericMailet
javax.mail.MessagingException - if a problem arises

public void service(org.apache.mailet.Mail mail) throws javax.mail.MessagingException
service in interface org.apache.mailet.Mailet
service in class org.apache.mailet.base.GenericMailet
mail - The Mail message to be scanned.
javax.mail.MessagingException - if a problem arises

Copyright © 2002-2012 The Apache Software Foundation. All Rights Reserved.