Abstract
Spam (unsolicited and undesirable email) has become a significant problem for email users. This study investigated the current state of the art in statistical spam filtering. Established methods, inspired by the work of Paul Graham, were examined, and new techniques were introduced and tested. Tests were conducted using two private corpora of email messages and one publicly available corpus.

A base configuration of a spam filter program, similar in technique to a popular production spam filter, was implemented and tested. This configuration achieved high accuracy while maintaining a low false positive rate. One main objective of this paper was to develop a new weighted token probability function. The data contained in header fields are important, and it was believed that weighting header data more heavily than data in the body of the message could improve accuracy. The new weighted token probability function strengthens or weakens header and phrase tokens. Under header weighting, the weight is applied to every token drawn from a header field, while all body tokens are given unit weight.
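The idea of a weighted token probability can be illustrated with a minimal sketch. The abstract does not give the function itself, so everything below is an assumption made for illustration: the per-token estimate follows the general shape of Graham-style filtering, the weighting rule (raising a token's spam odds to the power of its weight) is one plausible choice and not necessarily the one used in this study, and the names `WEIGHTS`, `token_probability`, `weighted_token_probability`, and `combine` as well as the weight values are hypothetical.

```python
from math import prod

# Hypothetical weights, chosen only for illustration (not taken from the study).
# Body tokens keep unit weight, as the abstract states.
WEIGHTS = {"header": 1.5, "phrase": 1.2, "body": 1.0}


def token_probability(spam_count, ham_count, n_spam, n_ham):
    """Graham-style estimate of P(spam | token) from corpus counts."""
    b = min(1.0, spam_count / n_spam)
    # Ham counts are doubled to bias the filter against false positives.
    g = min(1.0, 2 * ham_count / n_ham)
    if b + g == 0:
        return 0.4  # token never seen: mildly innocent default
    # Clamp so that no single token can be fully decisive.
    return max(0.01, min(0.99, b / (b + g)))


def weighted_token_probability(p, kind):
    """Assumed weighting rule: raise the token's spam odds to the power w.

    w > 1 pushes p away from 0.5 (strengthens the token), w < 1 pulls it
    toward 0.5 (weakens it), and w == 1 leaves it unchanged.
    """
    w = WEIGHTS[kind]
    spam, ham = p ** w, (1.0 - p) ** w
    return spam / (spam + ham)


def combine(probabilities):
    """Naive Bayes combination of per-token probabilities into one score."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)
```

Under these assumptions, a token seen in 30 of 100 spam messages and 5 of 100 legitimate messages gets the same base estimate wherever it occurs, but `weighted_token_probability(p, "header")` moves that estimate further from 0.5 than `weighted_token_probability(p, "body")`, which returns it unchanged. A weight below 1 would instead weaken the token.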