Re: CRIME Interesting way around spam filter

From: Alan (alan@private)
Date: Wed Jun 04 2003 - 22:29:12 PDT


    On Wed, 2003-06-04 at 04:39, Crispin Cowan wrote:
    > Shaun Savage wrote:
    > 
    > > It looks at raw text. The tokens are found using a fixed set of
    > > delimiters.  The reason for this is that the Mozilla spam filter uses
    > > the HTML tags to help determine spam; a lot of spam uses 'color' fonts.
    > > Also, '<' and '>' are among the delimiters, so it can't determine what
    > > is an HTML tag.
    > 
    > Thanks!
    > 
    > Unfortunate that it is only looking at raw text. There is valuable info 
    > in the formatted text, precisely because of this hack of splitting words 
    > with HTML comments, so that word-recognizing filters like Bayes won't 
    > recognize "pe<!-- interruption -->nis" as "penis". The spammer can move 
    > the interruption back and forth across the word, put arbitrarily clean 
    > text (e.g. from Project Gutenberg) in the "interruption", forcing 10X 
    > training time on the Bayesian filter.
    
    This is one of the reasons that scoring filters need multiple stages of
    filtering to determine whether a message is spam.
    
    One of the changes I am thinking of making to SpamAssassin is having it
    count the number of HTML comments in the text. (Give it a score of 0.5
    per comment.)  Then I would strip all the comments and pass the message
    through the filter rules again to catch all the penis and viagra
    references. The Bayesian filter would only run after the chaff was
    removed. (I have seen base-64 encoding used to avoid filters as well.)
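
A rough sketch of that idea in Python, just to make the stages concrete.
This is not SpamAssassin code: the function name score_message, the keyword
list, and the keyword weights are all made up for illustration. Only the
0.5-per-comment figure comes from the paragraph above.

```python
import re

# Matches HTML comments, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Hypothetical values for illustration; not taken from SpamAssassin.
COMMENT_SCORE = 0.5
KEYWORD_SCORES = {"penis": 2.0, "viagra": 2.0}


def score_message(body):
    """Return (score, cleaned_body) for a raw message body."""
    # Stage 1: count the comments and charge a fixed score for each one.
    comments = HTML_COMMENT.findall(body)
    score = COMMENT_SCORE * len(comments)

    # Stage 2: strip the comments so that split words are rejoined.
    cleaned = HTML_COMMENT.sub("", body)

    # Stage 3: rerun the keyword rules on the cleaned text.
    lowered = cleaned.lower()
    for word, weight in KEYWORD_SCORES.items():
        if word in lowered:
            score += weight

    return score, cleaned
```

With these made-up weights, score_message("pe<!-- interruption -->nis")
returns (2.5, "penis"): 0.5 for the one comment plus 2.0 for the rejoined
word. The cleaned text is what would then be handed to the Bayesian stage.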
    
    That would handle your "penis interruptus" problem.  (Until they find
    some other dirty trick.)
    
    
    
    -- 
    Alan <alan@private>
    


