FC: Hexamail's Finn Johansen on how to filter naughty words

From: Declan McCullagh (declanat_private)
Date: Thu Jun 12 2003 - 23:12:40 PDT

    Previous Politech message:
    http://www.politechbot.com/p-04831.html
    
    ---
    
    From: "Finn Johansen" <finnjat_private>
    To: <declanat_private>
    References: <1055311145.2fe7328.finna.net@[216.110.36.217]>
    Subject: Re: Interscan blocks musician's email due to use of "whore"
    Date: Thu, 12 Jun 2003 11:39:01 +0200
    
    Declan,
    
    I don't usually write this type of email, as readers may consider it spam.
    However, the problem described is very interesting and shows the lack of
    intelligence in various spam filtering solutions.
    
    Blocking emails on the basis of single terms in the email is rather
    pointless. It may sound amusing in the situation below, but it is certainly
    not amusing to Linda or her contacts. It is, as Thomas also says, a bit
    scary. Leaving critical business correspondence to this type of
    evaluation is a bit like gambling. If you're lucky, the information passes
    through to the recipient; if not, it may just "disappear" somewhere
    without anyone knowing where it is.
    
    New spam filtering solutions are emerging almost every day, but only a
    minority of them are able to use a contextual approach in evaluating
    emails. Even though reports show that the global ratio of spam reached
    the 50% mark in May 2003, there are still millions of legitimate emails
    passing among servers every day. Relying on solutions that analyze
    emails by single terms will certainly block a large number of these
    legitimate emails and leave behind frustrated people like Linda, whose
    business information is not delivered correctly.
    
    The only way to overcome the limitation of keyword inspection of emails
    is to analyze the content of the email contextually. Words like f*ck have a
    pattern that is understandable to humans, but not to keyword searches unless
    they are explicitly told about it. Statistical pattern matching technology
    is able to 'understand' such a pattern as either good or bad given the
    patterns surrounding it. Using this technique, new patterns from
    spammers can be caught, as they are usually found together with other
    patterns that are already known by the system. The statistical approach
    cannot catch 100% of spam emails without also producing some false
    positives. However, our tests show that by accepting a block ratio of 96%,
    you end up with 0.01% false positives. Pretty good figures. And best of
    all, it doesn't block emails like this one containing single 'bad' terms
    scattered around the document.
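
    As a rough illustration of the difference between the two approaches,
    here is a minimal Python sketch. It is not Hexamail's implementation. It
    is a simplified token-probability filter in the spirit of the article
    referenced below, and the names (keyword_block, TinyStatFilter), the
    smoothing constants s and x, the combining formula, and the four training
    messages are all assumptions made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word-like tokens; '*' and '$' are kept so "f*ck" stays one token.
    return re.findall(r"[a-z0-9*$!']+", text.lower())


def keyword_block(text, bad_words):
    # Naive single-term filter: block if any listed word appears at all.
    return any(tok in bad_words for tok in tokenize(text))


class TinyStatFilter:
    """Hypothetical toy filter: per-token spam probabilities, combined per message."""

    def __init__(self, s=1.0, x=0.5):
        self.s = s  # assumed weight given to the prior
        self.x = x  # assumed prior probability for a rarely seen token
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_prob(self, tok):
        sc, hc = self.spam[tok], self.ham[tok]
        n = sc + hc
        if n == 0:
            return self.x  # never seen: fall back to the prior
        sr = sc / max(self.n_spam, 1)
        hr = hc / max(self.n_ham, 1)
        p = sr / (sr + hr)
        # Pull the raw estimate toward the prior when evidence is thin.
        return (self.s * self.x + n * p) / (self.s + n)

    def score(self, text):
        probs = [self.token_prob(t) for t in set(tokenize(text))]
        if not probs:
            return 0.5
        n = len(probs)
        # Geometric means computed in log space; every prob is strictly in (0, 1).
        spammy = 1 - math.exp(sum(math.log(1 - f) for f in probs) / n)
        hammy = 1 - math.exp(sum(math.log(f) for f in probs) / n)
        return (1 + (spammy - hammy) / (spammy + hammy)) / 2  # above 0.5 leans spam


# Invented training data, for illustration only.
flt = TinyStatFilter()
flt.train("hot whore pics click here free viagra", is_spam=True)
flt.train("free viagra click now cheap pills", is_spam=True)
flt.train("band rehearsal on friday bring the setlist", is_spam=False)
flt.train("new song titled whore of babylon on the setlist for friday", is_spam=False)

legit = "friday setlist includes the song whore of babylon"
junk = "click here for free cheap viagra pills"

blocked_by_keyword = keyword_block(legit, {"whore"})
legit_score = flt.score(legit)
junk_score = flt.score(junk)
```

    With this toy data the keyword check blocks the legitimate message, while
    the statistical score puts it on the legitimate side of 0.5 because the
    surrounding words have only been seen in legitimate mail. A real system
    would need far more training data and a tuned threshold.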
    
    More reading on the method we use can be found in Gary Robinson's
    excellent article on spam filtering:
    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
    
    
    Regards,
    
    Finn Johansen
    CEO
    Hexamail Ltd.
    
    Email: finnjat_private
    http://www.hexamail.com/
    
    
    
    
    -------------------------------------------------------------------------
    POLITECH -- Declan McCullagh's politics and technology mailing list
    You may redistribute this message freely if you include this notice.
    -------------------------------------------------------------------------
    To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
    This message is archived at http://www.politechbot.com/
    Declan McCullagh's photographs are at http://www.mccullagh.org/
    Like Politech? Make a donation here: http://www.politechbot.com/donate/
    -------------------------------------------------------------------------
    



    This archive was generated by hypermail 2b30 : Thu Jun 12 2003 - 23:58:40 PDT