Re: CRIME Interesting way around spam filter

From: Crispin Cowan (crispin@private)
Date: Tue Jun 03 2003 - 13:39:49 PDT


    Shaun Savage wrote:
    
    > It looks at raw text. The tokens are found using a fixed set of
    > delimiters.  The reason for this is the Mozilla spam filter uses the
    > HTML tags to help determine spam; a lot of spam uses 'color' font.
    > Also, '<' and '>' are among the delimiters, so it can't determine what
    > is an HTML tag.
    
    Thanks!
    
    Unfortunate that it is only looking at raw text. There is valuable info 
    in the formatted text, precisely because of this hack of splitting words 
    with HTML comments, so that word-recognizing filters like Bayes won't 
    recognize "pe<!-- interruption -->nis" as "penis". The spammer can move 
    the interruption back and forth across the word and put arbitrarily clean 
    text (e.g. from Project Gutenberg) in the "interruption", forcing 10X 
    training time on the Bayesian filter.
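    
    To make the difference concrete, here is a minimal sketch in Python.
    It is not the Mozilla filter's code: the delimiter set, the function
    names, and the comment-stripping regex are all assumptions made up for
    illustration. It only removes HTML comments before tokenizing, which is
    far short of real HTML rendering, but it is enough to show why the
    raw-text view never sees the whole word.

```python
import re

# Hypothetical delimiter set; the thread only says the set is fixed and
# includes '<' and '>'. The rest is assumed for this example.
DELIMITERS = re.compile(r"[\s<>!\-.,;:()]+")

# Matches an HTML comment, including one that spans several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def raw_tokens(text):
    """Tokenize the raw text on the fixed delimiter set."""
    return [t.lower() for t in DELIMITERS.split(text) if t]


def rendered_tokens(text):
    """Drop HTML comments first, then tokenize what the reader would see."""
    return raw_tokens(HTML_COMMENT.sub("", text))


sample = "pe<!-- interruption -->nis"
print(raw_tokens(sample))       # the word arrives in pieces
print(rendered_tokens(sample))  # the word arrives whole
```

    With the raw view, every new position of the interruption produces a
    different pair of fragments for the filter to learn, plus whatever
    clean words were stuffed into the comment. With the comment removed
    first, all of those variants collapse to the same single token.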
    
    Crispin
    
    -- 
    Crispin Cowan, Ph.D.           http://immunix.com/~crispin/
    Chief Scientist, Immunix       http://immunix.com
                http://www.immunix.com/shop/
    


