FC: Hexamail's Finn Johansen on how to filter naughty words

From: Declan McCullagh (declanat_private)
Date: Thu Jun 12 2003 - 23:12:40 PDT

Next message: Declan McCullagh: "FC: Gohsuke Takama: Report on privacy enhancing technologies"

Previous message: Declan McCullagh: "FC: Senate votes on CAPPS II passenger profiling amendment"

Previous Politech message:
http://www.politechbot.com/p-04831.html

---

From: "Finn Johansen" <finnjat_private>
To: <declanat_private>
References: <1055311145.2fe7328.finna.net@[216.110.36.217]>
Subject: Re: Interscan blocks musician's email due to use of "whore"
Date: Thu, 12 Jun 2003 11:39:01 +0200

Declan,

I don't usually write this type of email, as readers may consider it spam.
However, the problem described is very interesting and shows the lack of
intelligence in various spam filtering solutions.

Blocking emails on the basis of single terms in the email content is rather
pointless. It may sound amusing in the situation below, but it is certainly
not amusing to Linda or her contacts. It is, as Thomas also says, a bit
scary. To leave critical business correspondence to this type of content
evaluation is a bit like gambling. If you're lucky, the information passes
through to the recipient; if not, it may just "disappear" somewhere
without anyone knowing where it is.

New spam filtering solutions are emerging almost every day. But only a
minority of these are able to use a contextual approach in evaluating
emails. Even though reports show that the global ratio of spam reached
the 50% mark in May 2003, there are still millions of legitimate emails
passing among servers every day. Relying on solutions that analyze
emails by single terms will certainly block a large number of these
legitimate emails and leave behind frustrated people like Linda, whose
business information is not delivered correctly.
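
To make this failure mode concrete, here is a minimal Python sketch of a
single-term filter. The blocklist, the function name keyword_blocked, and
both sample messages are hypothetical, chosen only for illustration; they
are not taken from Interscan or any other real product.

```python
import re

# Hypothetical blocklist; an assumption for illustration only.
BLOCKLIST = {"whore", "viagra"}

def keyword_blocked(message: str) -> bool:
    """Return True if any single blocklisted term appears in the message."""
    words = re.findall(r"[a-z]+", message.lower())
    return any(word in BLOCKLIST for word in words)

# A legitimate (made-up) message that merely mentions a flagged word.
legit = "Attached is the set list; we open with the song 'Whore' on Friday."
# A made-up spam message that obfuscates the listed term.
spam = "Buy v1agra now, lowest prices, click here"

print("legitimate mail blocked:", keyword_blocked(legit))
print("obfuscated spam blocked:", keyword_blocked(spam))
```

The legitimate message is rejected because of one word, while the
obfuscated spam matches nothing on the list and passes.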

The only way to overcome the limitation of keyword investigation of emails
is to contextually analyze the content of the email. Words like f*ck have a
pattern that is understandable to humans, but not to keyword searches, unless
explicitly told so. Given the context of this pattern, statistical pattern
matching technology is able to 'understand' it as either good or bad based on
the patterns surrounding it. Using this technique, new patterns from
spammers can be caught, as they are usually found together with other
patterns that are already known by the system. The statistical approach will
not catch 100% of spam emails without leaving behind some false
positives. However, our tests show that by accepting a block ratio of 96%,
you end up with 0.01% false positives. Pretty good figures. And best of
all - it doesn't block emails like this one containing single 'bad' terms
scattered around the document.
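
As a rough illustration of the statistical idea, the Python sketch below
scores each word by how often it has appeared in spam versus legitimate
mail, then combines the per-word scores with geometric means, loosely
following the combining scheme in Gary Robinson's article. It is a sketch
under stated assumptions, not Hexamail's actual implementation: the
training counts, the corpus sizes, the parameters s and x, and all function
names are hypothetical.

```python
import math
import re

# Hypothetical training data: word -> (spam messages containing it,
# legitimate messages containing it). Invented for illustration.
COUNTS = {
    "whore": (30, 2), "click": (40, 3), "prices": (35, 4),
    "set": (1, 20), "list": (5, 25), "friday": (2, 30),
}
N_SPAM, N_HAM = 100, 100  # assumed corpus sizes

def word_score(word, s=1.0, x=0.5):
    """Spam probability for one word, pulled toward the prior x
    (with strength s) when the word has rarely been seen."""
    spam_count, ham_count = COUNTS.get(word, (0, 0))
    n = spam_count + ham_count
    if n == 0:
        return x  # unknown word: fall back to the neutral prior
    spam_freq = spam_count / N_SPAM
    ham_freq = ham_count / N_HAM
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)

def message_score(message):
    """Combine word scores; near 1 looks like spam, near 0 legitimate."""
    words = [w for w in re.findall(r"[a-z]+", message.lower()) if w in COUNTS]
    if not words:
        return 0.5
    scores = [word_score(w) for w in words]
    n = len(scores)
    # Geometric means computed in log space to avoid underflow.
    P = 1.0 - math.exp(sum(math.log(1.0 - f) for f in scores) / n)
    Q = 1.0 - math.exp(sum(math.log(f) for f in scores) / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0

legit = "The set list for Friday opens with the song 'Whore'."
spam = "Whore pics, click here, lowest prices"
print("legit score:", round(message_score(legit), 3))
print("spam score:", round(message_score(spam), 3))
```

With these made-up counts, the legitimate message scores below 0.5 even
though it contains a 'bad' term, because the surrounding words are typical
of legitimate mail, while the spam message scores well above 0.5.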

More reading about the method we use can be found in Gary Robinson's
excellent article on spam filtering:
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

Regards,

Finn Johansen
CEO
Hexamail Ltd.

Email: finnjat_private
http://www.hexamail.com/

-------------------------------------------------------------------------
POLITECH -- Declan McCullagh's politics and technology mailing list
You may redistribute this message freely if you include this notice.
-------------------------------------------------------------------------
To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
This message is archived at http://www.politechbot.com/
Declan McCullagh's photographs are at http://www.mccullagh.org/
Like Politech? Make a donation here: http://www.politechbot.com/donate/
-------------------------------------------------------------------------


This archive was generated by hypermail 2b30 : Thu Jun 12 2003 - 23:58:40 PDT