This note has been flagged as
    Likely PORN
    Possibly SPAM

The following test(s) were positive
    the word(s) 'penis' followed by the phrase(s) 'move',
    'back and forth across', and 'forcing'
    Canadian Grammar/Phrasing/Spelling

The results of the test(s) show that this is
    77% likely PORN (23/30)
    81% likely SPAM (27/33)

On Tue, 2003-06-03 at 13:39, Crispin Cowan wrote:
> Shaun Savage wrote:
>
> > It looks at raw text. The tokens are found using a fixed set of
> > delimiters. The reason for this is the Mozilla spam filter uses the
> > HTML tags to help determine spam; a lot of spam uses 'color' font. Also
> > one of the delimiters is '<' '>' so it can't determine what is an HTML
> > tag.
>
> Thanks!
>
> Unfortunate that it is only looking at raw text. There is valuable info
> in the formatted text, precisely because of this hack of splitting words
> with HTML comments, so that word-recognizing filters like Bayes won't
> recognize "pe<!-- interruption -->nis" as "penis". The spammer can move
> the interruption back and forth across the word, put arbitrarily clean
> text (e.g. from Project Gutenberg) in the "interruption", forcing 10X
> training time on the Bayesian filter.
>
> Crispin
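
For anyone following along, the comment-splitting trick Crispin describes can be sketched in a few lines of Python. This is only an illustration, not the Mozilla filter or Shaun's code: the delimiter set and the function names (raw_tokens, comment_stripped_tokens) are made up for the example, since the thread does not list the real delimiters.

```python
import re

# Hypothetical delimiter set: whitespace, '<', '>' and common punctuation.
# The actual delimiters used by the filter are not given in the thread.
DELIMITERS = re.compile(r"[\s<>.,;:!?\"'()]+")

# Matches an HTML comment, including one that spans several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def raw_tokens(text):
    """Tokenize the raw text on the fixed delimiter set, ignoring markup."""
    return [tok.lower() for tok in DELIMITERS.split(text) if tok]


def comment_stripped_tokens(text):
    """Remove HTML comments first, then tokenize the same way."""
    return raw_tokens(HTML_COMMENT.sub("", text))


sample = "pe<!-- interruption -->nis"

# Raw-text view: the word is split into fragments around the comment.
print(raw_tokens(sample))

# Comment-aware view: the fragments rejoin into the word the reader sees.
print(comment_stripped_tokens(sample))
```

With the raw-text tokenizer the word never appears as a token, only its fragments plus whatever filler the spammer placed in the comment. Stripping comments before tokenizing restores it, which is one reason looking at the formatted text carries extra information.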
This archive was generated by hypermail 2b30 : Wed Jun 04 2003 - 01:22:46 PDT