that's hilarious!!! ;)

On Wed, 4 Jun 2003, SPAM/PORN FILTER wrote:

> This note has been flagged as
> Likely PORN
> Possibly SPAM
>
> The following test(s) were positive
> the word(s)
> 'penis'
> followed by the phrase(s)
> 'move', 'back and forth across', and 'forcing'
>
> Canadian Grammar/Phrasing/Spelling
>
> The results of the test(s) show that this is
> 77% likely PORN (23/30)
> 81% likely SPAM (27/33)
>
>
> On Tue, 2003-06-03 at 13:39, Crispin Cowan wrote:
> > Shaun Savage wrote:
> >
> > > It looks at raw text. The tokens are found using a fixed set of
> > > delimiters. The reason for this is the Mozilla spam filter uses the
> > > HTML tags to help determine spam; a lot of spam uses 'color' font. Also
> > > one of the delimiters is '<' '>' so it can't determine what is an HTML
> > > tag.
> >
> > Thanks!
> >
> > Unfortunate that it is only looking at raw text. There is valuable info
> > in the formatted text, precisely because of this hack of splitting words
> > with HTML comments, so that word-recognizing filters like Bayes won't
> > recognize "pe<!-- interruption -->nis" as "penis". The spammer can move
> > the interruption back and forth across the word, put arbitrarily clean
> > text (e.g. from Project Gutenberg) in the "interruption", forcing 10X
> > training time on the Bayesian filter.
> >
> > Crispin
> >
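
To make the comment-splitting trick from the quoted thread concrete, here is a rough sketch in Python. It is not the Mozilla filter's code. The delimiter set, the function names, and the sample string are all made up for illustration; the thread only says that the delimiters are fixed and that '<' and '>' are among them. It compares a raw-text tokenizer with one that drops HTML comments before splitting.

```python
import re

# Hypothetical fixed delimiter set. The thread only tells us that
# '<' and '>' are delimiters; the rest is an assumption.
DELIMITERS = " \t\r\n<>\"'.,;:!?()-"
SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + "]+")

# Matches an HTML comment, including one that spans several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def raw_tokens(text):
    """Split raw text on the fixed delimiters, with no HTML awareness."""
    return [tok.lower() for tok in SPLIT_PATTERN.split(text) if tok]


def comment_stripped_tokens(text):
    """Remove HTML comments first, then split on the same delimiters."""
    return raw_tokens(HTML_COMMENT.sub("", text))


sample = "buy pe<!-- interruption -->nis pills"

# The raw tokenizer sees fragments plus the filler word.
print(raw_tokens(sample))
# The comment-aware tokenizer sees the word the reader sees.
print(comment_stripped_tokens(sample))
```

With the raw tokenizer, each new position of the interruption (for example "p<!-- x -->enis") produces a different pair of fragments, so a word-based filter would have to learn every split separately. Stripping comments first folds all of those variants back into one token.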
This archive was generated by hypermail 2b30 : Wed Jun 04 2003 - 01:57:22 PDT