FC: What's so bad about Total Information Awareness? by Ben Brunk

From: Declan McCullagh (declanat_private)
Date: Mon Dec 09 2002 - 20:57:16 PST

    ---
    
    Date: Mon, 09 Dec 2002 22:34:13 -0500
    From: Ben Brunk <brunkbat_private>
    To: declanat_private
    Subject: Debunking TIA
    
    Declan,
    
    I'm in the middle of writing a dissertation relating to online privacy, but 
    I have been completely sidetracked by the recent discussion over the Total 
    Information Awareness program authorized by the Homeland Security bill that 
    just passed into law.  All I've seen so far are a lot of reactionary 
    editorials written by people who haven't put an ounce of effort into 
    analyzing the proposed system.  They seem infatuated with the TIA logo, its 
    slogan, and Poindexter.  I have read, with avid fascination, all the dire 
    predictions and scary stories about a new Big Brother system spearheaded by 
    a felon who managed to avoid accountability.  What I have yet to see is a 
    rational analysis of the idea itself from someone who knows something about 
    computers, databases and statistics. I hope to fill in that gap as best I 
    can, though I'm sure there are experts out there with even better 
    background in the appropriate research fields.
    
    From what I have been able to find out about the TIA program, it is 
    supposed to be a massive computerized dragnet that culls information from 
    dozens of different sources and is intended to locate potential terrorists 
    so that government agents can scrutinize them more closely.  This system 
    will draw data from sources such as credit reports, bank records, airline 
    reservation systems, police records, gun purchase records, and many others.
    
    Many of these sources of information are private databases owned and 
    maintained by the corporations that rely on them.  Even if they were all 
    implemented in, say, Oracle, it would be difficult to match up records to 
    any reliable degree.  Who knows if the John Poindexter in one database is 
    the same as Jon Pointdexter in another?  The social security number, which 
    is apparently the holy grail of database keys, is not necessarily going to 
    help since many of these companies did not collect it or use it as a key. 
    Name and address might make a good cross-referencing key, but people 
    move all the time; I get three catalogs from a company I have purchased 
    from only three times, because even their internal database is not 
    sophisticated enough to detect slight differences in spacing, or my 
    apartment number written with a '#' instead of 'apt' or 'apartment'.  
    This is just inside one 
    organization; we're not even trying to connect any dots yet.  It will be 
    easier to match records kept by the government, especially if they include 
    SSNs and fingerprints.  However, errors in government databases are well 
    documented (although not readily admitted to). Those systems contain large 
    numbers of errors, and even when errors are located and fixed, they have a 
    nasty tendency of recurring when data is shared or re-shared.  If you fix 
    an error in your Experian credit report but not in TransUnion's, the 
    Experian error will often reappear when the bureaus next exchange data.  
    Many people play this sort of "whack-a-mole" game for years.
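
    To make the matching problem concrete, here is a minimal sketch in 
    Python of the kind of normalization and fuzzy comparison a system 
    like TIA would need before it could even begin to connect dots.  The 
    names, the addresses, and the 0.85 cutoff are all hypothetical:

        import difflib
        import re

        def normalize_address(addr):
            """Collapse apartment-number variants and extra spacing."""
            addr = addr.lower()
            addr = re.sub(r'\b(apt|apartment|unit)\b', '#', addr)
            addr = re.sub(r'\s+', ' ', addr).strip()
            return addr

        def similarity(a, b):
            """Crude string similarity between 0.0 and 1.0."""
            return difflib.SequenceMatcher(None, a, b).ratio()

        # Two records that may or may not describe the same person.
        rec1 = ("John Poindexter", "123 Main St Apt 4")
        rec2 = ("Jon Pointdexter", "123  Main St # 4")

        name_score = similarity(rec1[0].lower(), rec2[0].lower())
        addr_score = similarity(normalize_address(rec1[1]),
                                normalize_address(rec2[1]))

        print("name similarity: %.2f" % name_score)  # high, but not 1.0
        print("addr similarity: %.2f" % addr_score)  # 1.0 after cleanup

        # Any cutoff is a guess: set it high and true matches are
        # missed; set it low and different people get merged into
        # one dossier.
        THRESHOLD = 0.85  # hypothetical cutoff
        print("same person?",
              name_score >= THRESHOLD and addr_score >= THRESHOLD)

    Every knob in that sketch (what to normalize, how to score, where to 
    set the cutoff) is a new source of error, and that is with only two 
    tidy records.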
    
    Another matter that no journalist has touched on, and the one I think 
    is the biggest nail in TIA's coffin, is that database error rates are 
    several orders of magnitude higher than the incidence of terrorists in 
    the world.  All databases contain errors.  Data culled from multiple, 
    heterogeneous sources is going to have lots of errors.  I don't have 
    current estimates on the average expected error rate in a database, but 
    let's suppose it is 5%.  That means that in any given database, 95% of the 
    data is right and 5% of it is junk.  Garbage in, garbage out.  Errors 
    include misspellings, flipped bits, transposed numbers, and transaction 
    entries that never took place or were unintentionally duplicated or 
    omitted.  Five 
    percent isn't a big deal until you look at it on the scale of what TIA is 
    proposing.  There are approximately 300 million people in the United 
    States.  Those 300 million people are very busy consumers, and their paper 
    trail is enormous.  There are trillions of transaction records, log 
    entries, and documents that TIA would have to amass, standardize, and then 
    examine.  Even if the government buys all the necessary computing power and 
    the very best staff, the government can't do anything about randomness. The 
    5% expected error rate is the monkey wrench in the works.  Say the 
    system tracks 500 data points for each person.  That is 300 million 
    times 500, or 150,000,000,000 data points, and 5% of that number 
    leaves us with 7,500,000,000.  Seven and one half billion erroneous 
    data points if they want to look at every American.  Worse, this is not a 
    one-time scan.  For any hope of success, they would have to look 
    longitudinally.  That is, every year, month, day, hour, whatever.  Some 
    indications of terrorism are very subtle:  People who plan terror don't 
    just run out and buy their entire list of bomb-making ingredients in one 
    day and then book a flight.  Terrorists are slow and methodical.  They plan 
    over months and years.  So what we're looking at here is 7.5 billion data 
    points examined day in and day out for years and years.  With a 5% error 
    rate, the number of false positives is outrageous, no matter what 
    analysis technique is used (and any analysis technique will have its 
    own error rate). 
    There is not enough manpower in the entire federal government to possibly 
    track down every lead generated, even if much of that work is automated. 
    With each passing day, homeland security will drown a little more in a 
    hopeless pile of randomly generated false leads, a pile that grows even 
    on weekends and holidays.
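
    The back-of-the-envelope arithmetic is easy to check.  In the Python 
    below, the 500 data points per person and the 5% error rate are the 
    same illustrative assumptions as above:

        # Back-of-the-envelope arithmetic; the data points per person
        # and the error rate are illustrative assumptions, not
        # measured figures.
        population = 300 * 1000 * 1000   # approximate US population
        points_per_person = 500          # assumed
        error_rate = 0.05                # assumed

        total_points = population * points_per_person
        bad_points = total_points * error_rate

        print("total data points:     %d" % total_points)  # 150,000,000,000
        print("erroneous data points: %d" % bad_points)    # 7,500,000,000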
    
    Let's suppose there are 1,000 terrorists hiding out in the USA, waiting to 
    strike, which I personally think is a greatly exaggerated number.  We know 
    from the actions taken on 9/11 that these people are fairly cunning.  They 
    know how to hide from the system and how to hide in plain sight.  They pay 
    in cash, or they buy what they need by proxy, and they don't act any 
    differently than anyone else.  Like the millions of illegal immigrants in the 
    US, terrorist operatives are good at using social networks to "fly below 
    the radar" and subvert the system.  One thousand people is a lot, but 1,000 
    out of 300 million is 3.33 * 10^-6, or about 0.00033%.  In other words, 
    TIA would be looking for a minuscule fraction of 1% of the population 
    in their 
    database, the exact people who are going out of their way to escape 
    detection.  With an error rate of even 1%, detecting such a tiny fraction 
    would be impossible.  You would not be able to separate the signal from the 
    noise, no matter what techniques were used.  Pollsters run into this 
    problem every election season when the 'margin of error' rises to a level 
    greater than the projected differential between the candidates.  A 3% margin 
    of error in a race where the candidates differ by 1% is "too close to 
    call."  The same problem exists for scanning all airport baggage, but that 
    is fodder for another day.  The only way TIA would work is if some high 
    percentage of Americans were terrorists: 20%, 50%, whatever.  Only then 
    could there be enough comparison data in both sets to draw testable 
    conclusions from and be assured that those conclusions were not just random 
    error phenomena.
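
    This is the classic base-rate problem, and it is worth grinding 
    through the numbers once.  The sketch below assumes a hypothetical 
    detector with a 99% detection rate and a 1% false positive rate, both 
    absurdly generous to the government:

        # Base-rate arithmetic for a hypothetical screening system.
        # Detection and false-positive rates are assumptions chosen
        # to flatter the system; real ones would be worse.
        population = 300 * 1000 * 1000
        terrorists = 1000                # the (exaggerated) figure above

        detection_rate = 0.99            # P(flagged | terrorist), assumed
        false_positive_rate = 0.01       # P(flagged | innocent), assumed

        true_positives = terrorists * detection_rate
        false_positives = (population - terrorists) * false_positive_rate
        flagged = true_positives + false_positives

        print("people flagged:    %d" % flagged)         # ~3,000,980
        print("actual terrorists: %d" % true_positives)  # 990
        print("odds a flagged person is a terrorist: 1 in %d"
              % (flagged / true_positives))              # ~1 in 3,031

    Even with those fantasy numbers, roughly 3,000 innocent people get 
    flagged for every actual terrorist; with realistic rates it is far 
    worse.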
    
    Let's look at this on a much smaller scale:  Suppose the system worked well 
    enough each day to render a list of 10,000 people, one (1) of whom is an 
    actual terrorist (unbelievably good odds for the government).  The 
    government has a 1-in-10,000 (0.01%) chance of picking out the 
    terrorist at random each day (using this system alone).  Could the 
    FBI/CIA/NSA/whatever even 
    investigate 10,000 people with other techniques carefully enough each day 
    to locate the one terrorist?  Could they do it in a month or a year?  I 
    suppose the government could err on the side of caution and detain large 
    numbers of people, place them in custody, and hold them indefinitely 
    without due process until certain that they weren't terrorists.  But that 
    action presents nightmarish logistical and humanitarian prospects.  The US 
    prison population is bursting at the seams at an all-time high of two 
    million.  There would have to be enormous concentration camps for the 
    millions of suspected terrorists who would be detained until their 
    innocence is proven.  That raises the question:  Is it even possible to prove 
    you are innocent in the current legal climate?  The Red Scare (and the more 
    recent FBI watch lists) has already taught us the folly of blacklists and 
    unsubstantiated accusations.
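
    Even granting that fantastically short daily list, the staffing 
    arithmetic is ugly.  In the sketch below, the 8 agent-hours per lead 
    is a pure guess:

        # Investigative workload for 10,000 leads per day; the hours
        # per lead figure is a guess for illustration only.
        leads_per_day = 10000
        hours_per_lead = 8               # assumed minimal background check
        hours_per_agent_day = 8

        agents = leads_per_day * hours_per_lead // hours_per_agent_day
        print("full-time agents tied up every day: %d" % agents)  # 10,000
        print("leads to chase per year: %d" % (leads_per_day * 365))

    Ten thousand agents a day, every day, chasing a list that is 99.99 
    percent innocent people.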
    
    Lastly, data mining as a useful technique has been thoroughly debunked.  It 
    never lived up to its promises.  This is why you don't hear much about data 
    mining in the CS and IS literature these days; what is left of it has 
    morphed into the more esoteric "knowledge discovery" (KDD).  Like AI, it 
    turned out to be quite a bit more difficult to do than expected and has 
    been largely abandoned.  Had anyone in the government actually bothered to 
    read any of the literature, they would already know this.
    
    All in all, I can't see how TIA will do anything except harm innocent 
    people and create new jobs for bureaucrats.  Any numerate person who spends 
    five minutes thinking about what is proposed will come to the same 
    conclusion.  If our system is going to become this arbitrary, there are 
    going to be an awful lot of lives ruined in this country.  I fail to see 
    how the TIA approach could do anything positive for the war on terror or 
    for America in general.  It will eat up resources better spent on more 
    proven and acceptable approaches.  In fact, such a data-driven approach 
    might actually be more successful if it simply took a random sampling of 
    the population each day.
    
    My hope is that this editorial will awaken those who are even more skilled 
    in computer science, statistics, game theory, etc., and that they will find the 
    courage to speak up so we can put the brakes on the wasteful and 
    destructive blind alley called TIA.
    
    
    Benjamin Brunk
    
    
    
    
    -------------------------------------------------------------------------
    POLITECH -- Declan McCullagh's politics and technology mailing list
    You may redistribute this message freely if you include this notice.
    To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
    This message is archived at http://www.politechbot.com/
    Declan McCullagh's photographs are at http://www.mccullagh.org/
    -------------------------------------------------------------------------
    Like Politech? Make a donation here: http://www.politechbot.com/donate/
    Recent CNET News.com articles: http://news.search.com/search?q=declan
    -------------------------------------------------------------------------
    


