Previous Politech message:
http://www.politechbot.com/p-04831.html

---

From: "Finn Johansen" <finnjat_private>
To: <declanat_private>
References: <1055311145.2fe7328.finna.net@[216.110.36.217]>
Subject: Re: Interscan blocks musician's email due to use of "whore"
Date: Thu, 12 Jun 2003 11:39:01 +0200

Declan,

I usually don't write this type of email, as readers may consider it spam. However, the problem described is very interesting and shows the lack of intelligence in various spam filtering solutions.

Blocking emails on the basis of single terms found in the email is rather pointless. It may sound amusing in the situation below, but it is certainly not amusing to Linda or her contacts. It is, as Thomas also says, a bit scary. Leaving critical business correspondence to this type of evaluation is a bit like gambling. If you're lucky, the information passes through to the recipient; if not, it may just "disappear" somewhere without anyone knowing where it is.

New spam filtering solutions are emerging almost every day, but only a minority of them are able to use a contextual approach in evaluating emails. Even though reports show that the global ratio of spam reached the 50% mark in May 2003, there are still millions of legitimate emails passing among servers every day. Relying on solutions that analyze emails by single terms will certainly block a large number of these legitimate emails and leave behind frustrated people like Linda - not getting their business information delivered correctly.

The only way to overcome the limitations of keyword inspection is to analyze the content of the email contextually. Words like f*ck have a pattern that is understandable to humans, but not to keyword searches unless they are explicitly told about it. Statistical pattern matching technology is able to 'understand' such a pattern as either good or bad, given the patterns surrounding it.
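As a rough illustration of the difference, here is a minimal Python sketch that runs a single-term keyword filter and a statistical score over the same two messages. It is a sketch only and not Hexamail's implementation: the blocklist, the per-token probabilities, the sample messages (including the made-up song title) and all function names are hypothetical, and the combining rule is a simplified take on the geometric-mean scheme described in Gary Robinson's article.

```python
import math
import re

# Hypothetical single-term blocklist, as a naive keyword filter might use.
BLOCKLIST = {"whore"}

# Hypothetical per-token spam probabilities, standing in for values that a
# real system would learn from a corpus of known spam and legitimate mail.
TOKEN_SPAM_PROB = {
    "whore": 0.90, "viagra": 0.99, "free": 0.85, "click": 0.90,
    "concert": 0.05, "album": 0.04, "tour": 0.10, "tickets": 0.30,
    "band": 0.05,
}
UNKNOWN_PROB = 0.5  # assumed neutral value for tokens never seen before


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_filter(text):
    """Return True (block) if any single blocklisted term appears."""
    return any(token in BLOCKLIST for token in tokenize(text))


def spam_score(text):
    """Combine per-token probabilities into one score between 0 and 1.

    Scores above 0.5 lean towards spam, scores below 0.5 towards
    legitimate mail. An empty message gets the neutral score 0.5.
    """
    probs = [TOKEN_SPAM_PROB.get(t, UNKNOWN_PROB) for t in tokenize(text)]
    if not probs:
        return 0.5
    n = len(probs)
    # Geometric means are computed in log space to avoid underflow.
    spamminess = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    hamminess = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    indicator = (spamminess - hamminess) / (spamminess + hamminess)
    return (1.0 + indicator) / 2.0


# Made-up sample messages for illustration.
LEGIT = ("The band plays Dirty Whore on the new album; "
         "concert tickets for the tour are on sale")
SPAM = "Click here for free viagra"

if __name__ == "__main__":
    for name, message in (("legit", LEGIT), ("spam", SPAM)):
        print(name,
              "| keyword filter blocks:", keyword_filter(message),
              "| statistical score:", round(spam_score(message), 2))
```

With these made-up numbers the keyword filter blocks the musician's message and lets the spam through, while the statistical score lands below 0.5 for the former and above 0.5 for the latter, because the surrounding tokens outweigh the single 'bad' term.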
Using this technique, new patterns from spammers can be caught, as they are usually found together with other patterns that are already known to the system.

The statistical approach will not catch 100% of spam emails without leaving behind some false positives. However, our tests show that by accepting a block ratio of 96%, you end up with 0.01% false positives. Pretty good figures. And best of all - it doesn't block emails like this one, which contain single 'bad' terms scattered around the document.

Further reading about the method we use can be found in Gary Robinson's excellent article on spam filtering:
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

Regards,
Finn Johansen
CEO
Hexamail Ltd.
Email: finnjat_private
http://www.hexamail.com/

-------------------------------------------------------------------------
POLITECH -- Declan McCullagh's politics and technology mailing list
You may redistribute this message freely if you include this notice.
-------------------------------------------------------------------------
To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
This message is archived at http://www.politechbot.com/
Declan McCullagh's photographs are at http://www.mccullagh.org/
Like Politech? Make a donation here: http://www.politechbot.com/donate/
-------------------------------------------------------------------------
This archive was generated by hypermail 2b30 : Thu Jun 12 2003 - 23:58:40 PDT