Shaun Savage wrote:

> It looks at raw text. The tokens are found using a fixed set of
> delimiters. The reason for this is the mozilla spam filter uses the
> html tags to help determine spam, a lot of spam uses 'color' font. Also
> one of the delimiters is '<' '>' so it can't determine what is a html
> tag.

Thanks! Unfortunate that it is only looking at raw text. There is valuable info in the formatted text, precisely because of this hack of splitting words with HTML comments, so that word-recognizing filters like Bayes won't recognize "pe<!-- interruption -->nis" as "penis".

The spammer can move the interruption back and forth across the word, and put arbitrarily clean text (e.g. from Project Gutenberg) in the "interruption", forcing 10X training time on the Bayesian filter.

Crispin

-- 
Crispin Cowan, Ph.D.           http://immunix.com/~crispin/
Chief Scientist, Immunix       http://immunix.com
                               http://www.immunix.com/shop/
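P.S. To make the comment-splitting hack concrete, here is a minimal sketch in Python. It is not the Mozilla tokenizer: the delimiter set (whitespace plus '<' and '>') is an assumption based only on the description quoted above, and the function names raw_tokens and comment_stripped_tokens are made up for illustration.

```python
import re

# Assumed delimiter set: whitespace plus '<' and '>'. The quoted description
# only says the set is fixed and includes '<' and '>'.
DELIMITERS = re.compile(r"[\s<>]+")

# Matches an HTML comment, including one that spans several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def raw_tokens(text):
    """Tokenize raw text on the fixed delimiter set (hypothetical helper)."""
    return [tok for tok in DELIMITERS.split(text) if tok]


def comment_stripped_tokens(text):
    """Drop HTML comments first, then tokenize the same way."""
    return raw_tokens(HTML_COMMENT.sub("", text))


sample = "buy pe<!-- interruption -->nis pills"

# The raw tokenizer sees the word fragments and the comment's contents.
print(raw_tokens(sample))

# With comments removed, the split word is rejoined before tokenizing.
print(comment_stripped_tokens(sample))
```

On the raw text the filter only ever sees "pe" and "nis" plus whatever filler sits inside the comment, so each new split point or filler text looks like new vocabulary. Stripping comments first is just one possible countermeasure, shown here to illustrate the point about formatted text.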