FC: What's so bad about Total Information Awareness? by Ben Brunk

From: Declan McCullagh (declanat_private)
Date: Mon Dec 09 2002 - 20:57:16 PST

    ---
    
    Date: Mon, 09 Dec 2002 22:34:13 -0500
    From: Ben Brunk <brunkbat_private>
    To: declanat_private
    Subject: Debunking TIA
    
    Declan,
    
    I'm in the middle of writing a dissertation relating to online privacy, but 
    I have been completely sidetracked by the recent discussion over the Total 
    Information Awareness program authorized by the Homeland Security bill that 
    just passed into law.  All I've seen so far are a lot of reactionary 
    editorials written by people who haven't put an ounce of effort into 
    analyzing the proposed system.  They seem infatuated with the TIA logo, its 
    slogan, and Poindexter.  I have read, with avid fascination, all the dire 
    predictions and scary stories about a new Big Brother system spearheaded by 
    a felon who managed to avoid accountability.  What I have yet to see is a 
    rational analysis of the idea itself from someone who knows something about 
    computers, databases and statistics. I hope to fill in that gap as best I 
    can, though I'm sure there are experts out there with even better 
    background in the appropriate research fields.
    
    From what I have been able to find out about the TIA program, it is 
    supposed to be a massive computerized dragnet that culls information from 
    dozens of different sources and is intended to locate potential terrorists 
    so that government agents can scrutinize them more closely.  This system 
    will draw data from sources such as credit reports, bank records, airline 
    reservation systems, police records, gun purchase records, and many others.
    
    Many of these sources of information are private databases owned and 
    maintained by the corporations that rely on them.  Even if they were all 
    implemented in, say, Oracle, it would be difficult to match up records to 
    any reliable degree.  Who knows if the John Poindexter in one database is 
    the same as Jon Pointdexter in another?  The social security number, which 
    is apparently the holy grail of database keys, is not necessarily going to 
    help since many of these companies did not collect it or use it as a key. 
    Name and address might make a good cross-referencing key, but people 
    move all the time; I get three catalogs from a company I have purchased 
    from only three times, because even their internal database is not 
    sophisticated enough to detect slight differences in spacing, or my 
    apartment number written with a '#' instead of 'apt' or 'apartment'.  
    This is just inside one 
    organization; we're not even trying to connect any dots yet.  It will be 
    easier to match records kept by the government, especially if they include 
    SSNs and fingerprints.  However, errors in government databases are well 
    documented (although not readily admitted to). Those systems contain large 
    numbers of errors, and even when errors are located and fixed, they have a 
    nasty tendency of recurring when data is shared or re-shared.  If you fix 
    an error in your Experian credit report but not in TransUnion's, the 
    Experian error will often reappear when the bureaus next exchange data.  
    Many people play this sort of "whack-a-mole" game for years.
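
    To make the matching problem concrete, here is a minimal sketch in 
    Python of the kind of normalization and fuzzy comparison a system 
    like TIA would need before it could even begin to connect dots.  The 
    names, the addresses, and the 0.85 cutoff are all hypothetical:

        import difflib
        import re

        def normalize_address(addr):
            """Collapse apartment-number variants and extra spacing."""
            addr = addr.lower()
            addr = re.sub(r'\b(apt|apartment|unit)\b', '#', addr)
            addr = re.sub(r'\s+', ' ', addr).strip()
            return addr

        def similarity(a, b):
            """Crude string similarity between 0.0 and 1.0."""
            return difflib.SequenceMatcher(None, a, b).ratio()

        # Two records that may or may not describe the same person.
        rec1 = ("John Poindexter", "123 Main St Apt 4")
        rec2 = ("Jon Pointdexter", "123  Main St # 4")

        name_score = similarity(rec1[0].lower(), rec2[0].lower())
        addr_score = similarity(normalize_address(rec1[1]),
                                normalize_address(rec2[1]))

        print("name similarity: %.2f" % name_score)  # high, but not 1.0
        print("addr similarity: %.2f" % addr_score)  # 1.0 after cleanup

        # Any cutoff is a guess: set it high and true matches are
        # missed; set it low and different people get merged into
        # one dossier.
        THRESHOLD = 0.85  # hypothetical cutoff
        print("same person?",
              name_score >= THRESHOLD and addr_score >= THRESHOLD)

    Every knob in that sketch (what to normalize, how to score, where to 
    set the cutoff) is a new source of error, and that is with only two 
    tidy records.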
    
    Another matter that no journalist has touched on, and the one I think 
    is the biggest nail in TIA's coffin, is that database error rates are 
    several orders of magnitude higher than the incidence of terrorists in 
    the world.  All databases contain errors.  Data culled from multiple, 
    heterogeneous sources is going to have lots of errors.  I don't have 
    current estimates on the average expected error rate in a database, but 
    let's suppose it is 5%.  That means that in any given database, 95% of the 
    data is right and 5% of it is junk.  Garbage in, garbage out.  Errors 
    include misspellings, flipped bits, transposed numbers, and transaction 
    entries that never took place or were unintentionally duplicated or 
    omitted.  Five 
    percent isn't a big deal until you look at it on the scale of what TIA is 
    proposing.  There are approximately 300 million people in the United 
    States.  Those 300 million people are very busy consumers, and their paper 
    trail is enormous.  There are trillions of transaction records, log 
    entries, and documents that TIA would have to amass, standardize, and then 
    examine.  Even if the government buys all the necessary computing power and 
    the very best staff, the government can't do anything about randomness. The 
    5% expected error rate is the monkey wrench in the works.  Say the 
    system tracks 500 data points for each person.  That is 300 million 
    times 500, or 150,000,000,000 data points, and 5% of that number 
    leaves us with 7,500,000,000.  Seven and one half billion erroneous 
    data points if they want to look at every American.  Worse, this is not a 
    one-time scan.  For any hope of success, they would have to look 
    longitudinally.  That is, every year, month, day, hour, whatever.  Some 
    indications of terrorism are very subtle:  People who plan terror don't 
    just run out and buy their entire list of bomb-making ingredients in one 
    day and then book a flight.  Terrorists are slow and methodical.  They plan 
    over months and years.  So what we're looking at here is 7.5 billion data 
    points examined day in and day out for years and years.  With a 5% error 
    rate, the number of false positives is outrageous, no matter what 
    analysis technique is used (and any analysis technique will have its 
    own error rate). 
    There is not enough manpower in the entire federal government to possibly 
    track down every lead generated, even if much of that work is automated. 
    With each passing day, homeland security will drown a little more in a 
    hopeless pile of randomly generated false leads, a pile that grows even 
    on weekends and holidays.
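
    The back-of-the-envelope arithmetic is easy to check.  In the Python 
    below, the 500 data points per person and the 5% error rate are the 
    same illustrative assumptions as above:

        # Back-of-the-envelope arithmetic; the data points per person
        # and the error rate are illustrative assumptions, not
        # measured figures.
        population = 300 * 1000 * 1000   # approximate US population
        points_per_person = 500          # assumed
        error_rate = 0.05                # assumed

        total_points = population * points_per_person
        bad_points = total_points * error_rate

        print("total data points:     %d" % total_points)  # 150,000,000,000
        print("erroneous data points: %d" % bad_points)    # 7,500,000,000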
    
    Let's suppose there are 1,000 terrorists hiding out in the USA, waiting to 
    strike, which I personally think is a greatly exaggerated number.  We know 
    from the actions taken on 9/11 that these people are fairly cunning.  They 
    know how to hide from the system and how to hide in plain sight.  They pay 
    in cash, or they buy what they need by proxy, and they don't act any 
    differently than anyone else.  Like the millions of illegal immigrants in the 
    US, terrorist operatives are good at using social networks to "fly below 
    the radar" and subvert the system.  One thousand people is a lot, but 1,000 
    out of 300 million is 3.33 * 10^-6, or about 0.00033%.  In other words, 
    TIA would be looking for a minuscule fraction of 1% of the population 
    in their 
    database, the exact people who are going out of their way to escape 
    detection.  With an error rate of even 1%, detecting such a tiny fraction 
    would be impossible.  You would not be able to separate the signal from the 
    noise, no matter what techniques were used.  Pollsters run into this 
    problem every election season when the 'margin of error' rises to a level 
    greater than the projected differential between the candidates.  A 3% margin 
    of error in a race where the candidates differ by 1% is "too close to 
    call."  The same problem exists for scanning all airport baggage, but that 
    is fodder for another day.  The only way TIA would work is if some high 
    percentage of Americans were terrorists: 20%, 50%, whatever.  Only then 
    could there be enough comparison data in both sets to draw testable 
    conclusions from and be assured that those conclusions were not just random 
    error phenomena.
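
    This is the classic base-rate problem, and it is worth grinding 
    through the numbers once.  The sketch below assumes a hypothetical 
    detector with a 99% detection rate and a 1% false positive rate, both 
    absurdly generous to the government:

        # Base-rate arithmetic for a hypothetical screening system.
        # Detection and false-positive rates are assumptions chosen
        # to flatter the system; real ones would be worse.
        population = 300 * 1000 * 1000
        terrorists = 1000                # the (exaggerated) figure above

        detection_rate = 0.99            # P(flagged | terrorist), assumed
        false_positive_rate = 0.01       # P(flagged | innocent), assumed

        true_positives = terrorists * detection_rate
        false_positives = (population - terrorists) * false_positive_rate
        flagged = true_positives + false_positives

        print("people flagged:    %d" % flagged)         # ~3,000,980
        print("actual terrorists: %d" % true_positives)  # 990
        print("odds a flagged person is a terrorist: 1 in %d"
              % (flagged / true_positives))              # ~1 in 3,031

    Even with those fantasy numbers, roughly 3,000 innocent people get 
    flagged for every actual terrorist; with realistic rates it is far 
    worse.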
    
    Let's look at this on a much smaller scale:  Suppose the system worked well 
    enough each day to render a list of 10,000 people, one (1) of whom is an 
    actual terrorist (unbelievably good odds for the government).  The 
    government has a 1-in-10,000 (0.01%) chance of picking out the 
    terrorist at random each day (using this system alone).  Could the 
    FBI/CIA/NSA/whatever even 
    investigate 10,000 people with other techniques carefully enough each day 
    to locate the one terrorist?  Could they do it in a month or a year?  I 
    suppose the government could err on the side of caution and detain large 
    numbers of people, place them in custody, and hold them indefinitely 
    without due process until certain that they weren't terrorists.  But that 
    action presents nightmarish logistical and humanitarian prospects.  The US 
    prison population is bursting at the seams at an all-time high of two 
    million.  There would have to be enormous concentration camps for the 
    millions of suspected terrorists who would be detained until their 
    innocence is proven.  That raises the question:  Is it even possible to prove 
    you are innocent in the current legal climate?  The Red Scare (and the more 
    recent FBI watch lists) has already taught us the folly of blacklists and 
    unsubstantiated accusations.
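
    Even granting that fantastically short daily list, the staffing 
    arithmetic is ugly.  In the sketch below, the 8 agent-hours per lead 
    is a pure guess:

        # Investigative workload for 10,000 leads per day; the hours
        # per lead figure is a guess for illustration only.
        leads_per_day = 10000
        hours_per_lead = 8               # assumed minimal background check
        hours_per_agent_day = 8

        agents = leads_per_day * hours_per_lead // hours_per_agent_day
        print("full-time agents tied up every day: %d" % agents)  # 10,000
        print("leads to chase per year: %d" % (leads_per_day * 365))

    Ten thousand agents a day, every day, chasing a list that is 99.99 
    percent innocent people.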
    
    Lastly, data mining as a useful technique has been thoroughly debunked.  It 
    never lived up to its promises.  This is why you don't hear much about data 
    mining in the CS and IS literature these days; what is left of it has 
    morphed into the more esoteric "knowledge discovery" (KDD).  Like AI, it 
    turned out to be quite a bit more difficult to do than expected and has 
    been largely abandoned.  Had anyone in the government actually bothered to 
    read any of the literature, they would already know this.
    
    All in all, I can't see how TIA will do anything except harm innocent 
    people and create new jobs for bureaucrats.  Any numerate person who spends 
    five minutes thinking about what is proposed will come to the same 
    conclusion.  If our system is going to become this arbitrary, there are 
    going to be an awful lot of lives ruined in this country.  I fail to see 
    how the TIA approach could do anything positive for the war on terror or 
    for America in general.  It will eat up resources better spent on more 
    proven and acceptable approaches.  In fact, such a data-driven approach 
    might actually be more successful if it simply took a random sampling of 
    the population each day.
    
    My hope is that this editorial will awaken those who are even more skilled 
    in computer science, statistics, game theory, etc., and that they will find the 
    courage to speak up so we can put the brakes on the wasteful and 
    destructive blind alley called TIA.
    
    
    Benjamin Brunk
    
    
    
    
    -------------------------------------------------------------------------
    POLITECH -- Declan McCullagh's politics and technology mailing list
    You may redistribute this message freely if you include this notice.
    To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
    This message is archived at http://www.politechbot.com/
    Declan McCullagh's photographs are at http://www.mccullagh.org/
    -------------------------------------------------------------------------
    Like Politech? Make a donation here: http://www.politechbot.com/donate/
    Recent CNET News.com articles: http://news.search.com/search?q=declan
    -------------------------------------------------------------------------
    


