org.apache.james.util
Class BayesianAnalyzer

java.lang.Object
  extended by org.apache.james.util.BayesianAnalyzer
Direct Known Subclasses:
JDBCBayesianAnalyzer

public class BayesianAnalyzer
extends java.lang.Object

Determines the probability that a text contains spam.

Based upon Paul Graham's A Plan for Spam. Extended to Paul Graham's Better Bayesian Filtering.

Sample method usage:

Use: void addHam(Reader) and void addSpam(Reader) methods to build up the Maps of ham & spam tokens/occurrences. Both addHam and addSpam assume they are reading one message at a time. If you feed more than one message per call, be sure to adjust the appropriate message counter (hamMessageCount or spamMessageCount) to match. Then...

Use: void buildCorpus() to build the final token/probabilities Map. Use your own methods for persistent storage of either the individual ham/spam corpus & message counts, and/or the final corpus. Then you can...

Use: double computeSpamProbability(Reader) to determine the probability that a particular text contains spam. A returned result of 0.9 or above indicates that the text is spam.

If you use persistent storage, use: void setCorpus(Map) before calling computeSpamProbability.
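
The class leaves persistent storage to the caller. As one possible approach, the sketch below round-trips a corpus map through the java.util.Properties text format. It assumes the corpus maps String tokens to Double probabilities (the API itself only declares a raw java.util.Map). CorpusStore, toText and fromText are hypothetical names and are not part of James.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of James: stores a token/probability map as text.
final class CorpusStore {
    private CorpusStore() {}

    // Serializes each token/probability pair as one Properties entry.
    static String toText(Map<String, Double> corpus) {
        Properties props = new Properties();
        for (Map.Entry<String, Double> entry : corpus.entrySet()) {
            props.setProperty(entry.getKey(), Double.toString(entry.getValue()));
        }
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Parses text produced by toText back into a token/probability map.
    static Map<String, Double> fromText(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Double> corpus = new HashMap<>();
        for (String token : props.stringPropertyNames()) {
            corpus.put(token, Double.parseDouble(props.getProperty(token)));
        }
        return corpus;
    }
}
```

A map restored this way would be passed to setCorpus(Map) before calling computeSpamProbability(Reader). If the raw ham/spam token counts are kept instead, the two message counts need to be stored alongside them.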

Since:
2.3.0

Constructor Summary
BayesianAnalyzer()
          Basic class constructor.
 
Method Summary
 void addHam(java.io.Reader stream)
          Adds a message to the ham list.
 void addSpam(java.io.Reader stream)
          Adds a message to the spam list.
 void buildCorpus()
          Builds the corpus from the existing ham & spam counts.
 void clear()
          Clears all analysis repositories and counters.
 double computeSpamProbability(java.io.Reader stream)
          Computes the probability that the stream contains spam.
 java.util.Map getCorpus()
          Public getter for corpus.
 int getHamMessageCount()
          Public getter for hamMessageCount.
 java.util.Map getHamTokenCounts()
          Public getter for the hamTokenCounts Map.
 int getSpamMessageCount()
          Public getter for spamMessageCount.
 java.util.Map getSpamTokenCounts()
          Public getter for the spamTokenCounts Map.
 void setCorpus(java.util.Map corpus)
          Public setter for corpus.
 void setHamMessageCount(int hamMessageCount)
          Public setter for hamMessageCount.
 void setHamTokenCounts(java.util.Map hamTokenCounts)
          Public setter for the hamTokenCounts Map.
 void setSpamMessageCount(int spamMessageCount)
          Public setter for spamMessageCount.
 void setSpamTokenCounts(java.util.Map spamTokenCounts)
          Public setter for the spamTokenCounts Map.
 void tokenCountsClear()
          Clears token counters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BayesianAnalyzer

public BayesianAnalyzer()
Basic class constructor.

Method Detail

setHamTokenCounts

public void setHamTokenCounts(java.util.Map hamTokenCounts)
Public setter for the hamTokenCounts Map.

Parameters:
hamTokenCounts - The new ham Token counts Map.

getHamTokenCounts

public java.util.Map getHamTokenCounts()
Public getter for the hamTokenCounts Map.


setSpamTokenCounts

public void setSpamTokenCounts(java.util.Map spamTokenCounts)
Public setter for the spamTokenCounts Map.

Parameters:
spamTokenCounts - The new spam Token counts Map.

getSpamTokenCounts

public java.util.Map getSpamTokenCounts()
Public getter for the spamTokenCounts Map.


setSpamMessageCount

public void setSpamMessageCount(int spamMessageCount)
Public setter for spamMessageCount.

Parameters:
spamMessageCount - The new spam message count.

getSpamMessageCount

public int getSpamMessageCount()
Public getter for spamMessageCount.


setHamMessageCount

public void setHamMessageCount(int hamMessageCount)
Public setter for hamMessageCount.

Parameters:
hamMessageCount - The new ham message count.

getHamMessageCount

public int getHamMessageCount()
Public getter for hamMessageCount.


clear

public void clear()
Clears all analysis repositories and counters.


tokenCountsClear

public void tokenCountsClear()
Clears token counters.


setCorpus

public void setCorpus(java.util.Map corpus)
Public setter for corpus.

Parameters:
corpus - The new corpus.

getCorpus

public java.util.Map getCorpus()
Public getter for corpus.


buildCorpus

public void buildCorpus()
Builds the corpus from the existing ham & spam counts.
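
For orientation, the per-token probability described in A Plan for Spam can be sketched as follows. This illustrates the formula from that essay. Whether buildCorpus() applies exactly these constants is an assumption, and TokenProbabilitySketch and tokenProbability are hypothetical names.

```java
import java.util.OptionalDouble;

// Illustrative only: per-token spam probability as described in "A Plan for Spam".
final class TokenProbabilitySketch {
    private TokenProbabilitySketch() {}

    // Returns empty when the token has been seen too rarely to be trusted.
    static OptionalDouble tokenProbability(int hamCount, int spamCount,
                                           int hamMessages, int spamMessages) {
        if (hamMessages <= 0 || spamMessages <= 0) {
            throw new IllegalArgumentException("message counts must be positive");
        }
        int g = 2 * hamCount; // ham occurrences are doubled to bias against false positives
        int b = spamCount;
        if (g + b < 5) {
            return OptionalDouble.empty();
        }
        double hamRatio = Math.min(1.0, (double) g / hamMessages);
        double spamRatio = Math.min(1.0, (double) b / spamMessages);
        double p = spamRatio / (hamRatio + spamRatio);
        // Clamp so that no single token is treated as absolute proof.
        return OptionalDouble.of(Math.max(0.01, Math.min(0.99, p)));
    }
}
```

Under these assumptions, a token seen 10 times in 100 spam messages and never in 100 ham messages is clamped to 0.99, while a token seen only a handful of times yields no probability at all.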


addHam

public void addHam(java.io.Reader stream)
            throws java.io.IOException
Adds a message to the ham list.

Parameters:
stream - A reader stream on the ham message to analyze
Throws:
IOException - If any error occurs

addSpam

public void addSpam(java.io.Reader stream)
             throws java.io.IOException
Adds a message to the spam list.

Parameters:
stream - A reader stream on the spam message to analyze
Throws:
IOException - If any error occurs
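
As a rough picture of what addHam and addSpam accumulate, the sketch below counts token occurrences from a Reader into a Map. The token shape (letters, digits, dashes, apostrophes and dollar signs, with all-digit tokens skipped) follows the description in A Plan for Spam and is an assumption here. The tokenizer actually used by this class may differ, and TokenCountSketch and countTokens are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a simplified tokenizer, not the one used by James.
final class TokenCountSketch {
    // Assumed token shape: letters, digits, dollar signs, apostrophes, dashes.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9$'-]+");
    private static final Pattern ALL_DIGITS = Pattern.compile("[0-9]+");

    private TokenCountSketch() {}

    // Adds one occurrence to counts for every token read from the stream.
    static void countTokens(Reader stream, Map<String, Integer> counts) throws IOException {
        BufferedReader reader = new BufferedReader(stream);
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher matcher = TOKEN.matcher(line);
            while (matcher.find()) {
                String token = matcher.group();
                if (ALL_DIGITS.matcher(token).matches()) {
                    continue; // purely numeric tokens are ignored
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
    }
}
```

In this sketch one call corresponds to one message, which mirrors the one-message-per-call assumption stated for addHam and addSpam.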

computeSpamProbability

public double computeSpamProbability(java.io.Reader stream)
                              throws java.io.IOException
Computes the probability that the stream contains spam.

Parameters:
stream - The text to be analyzed for spamminess.
Returns:
A probability between 0.0 and 1.0
Throws:
IOException - If any error occurs
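
The way individual token probabilities combine into one message score in A Plan for Spam can be sketched as below. The limit of 15 most interesting tokens comes from that essay. Whether computeSpamProbability uses the same limit and selection rule is an assumption, and CombineSketch and combine are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: combines token probabilities as in "A Plan for Spam".
final class CombineSketch {
    static final int MAX_INTERESTING_TOKENS = 15; // value taken from Graham's essay

    private CombineSketch() {}

    // Assumes every probability lies strictly between 0 and 1 (for example 0.01 to 0.99).
    static double combine(List<Double> tokenProbabilities) {
        // Keep the tokens whose probability is farthest from the neutral 0.5.
        List<Double> interesting = tokenProbabilities.stream()
                .sorted(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)))
                .limit(MAX_INTERESTING_TOKENS)
                .collect(Collectors.toList());
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : interesting) {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

With this combination rule, two tokens at 0.99 already push the result above the 0.9 threshold mentioned in the class description, while one token at 0.99 and one at 0.01 cancel out to 0.5.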


Copyright © 2002-2007 The Apache Software Foundation. All Rights Reserved.