[Politech] Why is Whitehouse.gov blocking Iraq directories from search engines?

From: Declan McCullagh (declan@private)
Date: Wed Oct 29 2003 - 21:56:01 PST


    will say that the Democratic Party's allegations don't entirely make sense, 
    though. --Declan]
    
    ---
    
    From: "Paul \"Evil Genius\" Music" <evlpawl@private>
    To: "DeClan" <declan@private>
    Subject: (SPAM?) Why is whitehouse.gov disallowing "Iraq" directories from 
    search engine crawling?
    Date: Mon, 27 Oct 2003 21:57:21 -0600
    
    
    http://www.bway.net/~keith/whrobots/
    
    Whitehouse.gov Robots.txt
    
    Why is whitehouse.gov (the official White House website) disallowing "Iraq" 
    directories from search engine crawling?
    
    [The discussion below is for those with a small bit of technical knowledge 
    (i.e. those who already know what a robots.txt file is and what it does). 
    If you don't know about robots.txt, go here for a less technical explanation.]
    
    
    As of Oct 24, 2003 the robots.txt file at whitehouse.gov (you can access 
    the current version here or an archived version, here) is 1631 lines long. 
    There are two blank lines between sections, one line (at the very top) that 
    identifies the file, and 8 lines at the very bottom that are instructions 
    to a user-agent called "whsearch" which appears to be the internal 
    whitehouse.gov crawler. The bulk of the file is the section directed to all 
    external search engine robots / crawlers / spiders, which is 1,620 lines 
    long and has 1,619 "Disallow" statements.
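
    The breakdown above can be reproduced with a short script. What follows 
    is a minimal sketch, not the method used for this write-up: the 
    function name tally_sections is made up for illustration, the sample 
    text is a tiny stand-in for the real file, and the "/cgi-bin" rule in 
    the whsearch section is a placeholder, since the contents of that 
    section are not quoted here.

```python
def tally_sections(robots_text):
    """Return {user_agent: number_of_Disallow_lines} for a robots.txt string."""
    counts = {}
    current_agents = []
    in_rules = False  # True once the current group has at least one rule
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value)
            counts.setdefault(value, 0)
        elif field == "disallow":
            in_rules = True
            for agent in current_agents:
                counts[agent] += 1
    return counts


# Illustrative excerpt only; the whsearch rule is a placeholder.
sample = """\
# robots.txt (illustrative excerpt, not the real file)
User-agent: *
Disallow: /holiday/2002/barney/iraq
Disallow: /kids/eggroll/iraq

User-agent: whsearch
Disallow: /cgi-bin
"""

print(tally_sections(sample))
print(len(sample.splitlines()), "lines in the sample")

# Sanity check on the figures quoted above: 1 identifying line, 2 blank
# lines, 8 whsearch lines and the 1,620-line main section.
assert 1 + 2 + 8 + 1620 == 1631
```

    Run against a saved copy of the real file, the same function would give 
    the per-section "Disallow" counts.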
    
    There are 862 instances of the term "text" in the file, which is easily 
    explained because whitehouse.gov generally uses directory paths that end in 
    "text" for printable pages -- the pages that are duplicates of the normal 
    display pages except that they are formatted for printing. It's easy to see 
    why the term "text" appears so often in this file, since disallowing these 
    directories helps lessen the "clutter" in search by excluding the 
    essentially duplicate pages.
    
    There are 783 instances of the term "iraq" in this file, almost all of them 
    appended to paths that already exist in the file. These appear to have been 
    added haphazardly, since the term appears in many path names for which no 
    such terminal "iraq" directory exists, such as:
    
    Disallow: /holiday/2002/barney/iraq
    or
    Disallow: /kids/eggroll/iraq
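
    Counts like the ones above are easy to check. This is a rough sketch 
    under stated assumptions: the helper names count_term and 
    paths_ending_with are invented for illustration, and the 
    "/kids/eggroll/text" line in the sample is assumed rather than quoted 
    from the file.

```python
def count_term(robots_text, term):
    """Count case-insensitive occurrences of term anywhere in the text."""
    return robots_text.lower().count(term.lower())


def paths_ending_with(robots_text, term):
    """List Disallow paths whose final path segment equals term."""
    paths = []
    for line in robots_text.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip().rstrip("/")
        if path.rsplit("/", 1)[-1].lower() == term.lower():
            paths.append(path)
    return paths


# Two lines quoted above plus one assumed "text" line for contrast.
sample = """\
Disallow: /holiday/2002/barney/iraq
Disallow: /kids/eggroll/iraq
Disallow: /kids/eggroll/text
"""

print(count_term(sample, "iraq"), count_term(sample, "text"))
print(paths_ending_with(sample, "iraq"))
```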
    
    However, this robots.txt file does exclude external search engine robots 
    from some 75 directories that actually exist on whitehouse.gov. Here are 75 
    of the currently excluded directories.
    
    
    Some things to note:
    
    1) The White House internal search (scroll to the bottom of the robots.txt 
    file to see that section) is far less exclusionary than that for external 
    search robots.
    
    For instance, a search of whitehouse.gov with Google for the phrase "unique 
    urgency" returns no hits:
    
    http://www.google.com/search?as_q="unique%20urgency"&as_sitesearch=http://www.whitehouse.gov
    
    But an internal search of Whitehouse.gov for the phrase does find hits from 
    directories that Google is excluded from:
    
    http://www.whitehouse.gov/query.html?col=colpics&qt=%22unique+urgency%22%22&submit.x=0&submit.y=0
    
    [This search-result URL may expire. If so, go to http://www.whitehouse.gov 
    and enter the phrase "unique urgency" (with the quotation marks) to perform 
    the search.]
    
    
    
    2) While the exclusion appears to have effectively kept spiders out of 
    any directory on whitehouse.gov that includes "iraq" in the name of the 
    directory itself, there are files that discuss or mention Iraq in 
    non-excluded directories. In other words, you can still find "iraq" hits on 
    whitehouse.gov using Google, but not nearly as many as there would be if 
    these exclusions were not in place.
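
    The difference between the external and internal sections can be 
    illustrated with Python's standard urllib.robotparser module. This is 
    only a sketch: the robots.txt text is a small stand-in, the "/cgi-bin" 
    rule for whsearch is a placeholder (the real section is not quoted 
    here), and the index.html URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Stand-in rules; only the two "iraq" paths are quoted from the real file.
ROBOTS = """\
User-agent: *
Disallow: /holiday/2002/barney/iraq
Disallow: /kids/eggroll/iraq

User-agent: whsearch
Disallow: /cgi-bin
"""


def allowed(agent, url, robots_text=ROBOTS):
    """Return True if robots_text lets the named user-agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical page under one of the disallowed paths.
url = "http://www.whitehouse.gov/kids/eggroll/iraq/index.html"

print("Googlebot:", allowed("Googlebot", url))  # falls under "User-agent: *"
print("whsearch: ", allowed("whsearch", url))   # has its own section
```

    Rules are matched as path prefixes, so a single "Disallow" line covers 
    everything beneath that directory.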
    
    The information above is based on the robots.txt file at whitehouse.gov on 
    Friday, October 24, 2003. That file is archived here. For a historical look 
    at the whitehouse.gov robots.txt file, here are a few examples:
    
    Google's cache (retrieved from Google on 10/26/03, but actual caching date 
    unspecified) of whitehouse.gov robots.txt. I've archived the cache as it is 
    at this writing here. This file is 1579 lines long, with 754 instances of 
    "iraq."
    
    The most current whitehouse.gov file archived at the Internet Archive is 
    from April 16, 2003. This file is 780 lines long, with only 10 instances of 
    the word "iraq."
    
    Sometime between April 2003 and late October, 2003, hundreds of instances 
    of the term "iraq" were added to the whitehouse.gov robots.txt file.
    
    On a quick look, it appears that the Google-cached version of the 
    robots.txt file could very nearly be created by searching for the string 
    "text" in the April 16 file, replacing it with the string "iraq", and then 
    adding the newly changed lines ("text" to "iraq") to the original file, 
    keeping the lines with "text" in them as well. That's from a quick look, so 
    I'm not sure it would hold up, but it appears it could explain the bulk of 
    the "iraq" appearances.
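
    The search-and-replace guess can be written down as a few lines of 
    code. This is a sketch of the hypothesis only, not a claim about how 
    the file was actually produced. The function name add_iraq_variants is 
    invented, the sample lines are assumed rather than taken from the April 
    16 file, and placing each new line directly after its original is an 
    arbitrary choice.

```python
def add_iraq_variants(lines, old="text", new="iraq"):
    """Keep every line; after each line containing old, add a copy using new."""
    result = []
    for line in lines:
        result.append(line)
        if old in line:
            result.append(line.replace(old, new))
    return result


# Assumed stand-in for a few lines of the April 16 file.
april = [
    "User-agent: *",
    "Disallow: /kids/eggroll/text",
    "Disallow: /cgi-bin",
]

derived = add_iraq_variants(april)
for line in derived:
    print(line)

# To test the guess against a later copy of the file, compare the two
# sets of lines and inspect whatever is left over, for example:
#   leftover = set(later_lines) ^ set(derived)
```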
    
    It would be a rather haphazard way of doing things, but the inclusion of 
    odd Disallowed directories like the "barney" directory mentioned above 
    indicates that this was done in a slapdash way. Still, it did effectively 
    disallow many directories with "Iraq" in the path, while also adding many 
    non-existent directories to the robots.txt file.
    
    From looking at the directory structure of whitehouse.gov, it appears that 
    it probably eliminated from external search every directory with "Iraq" in 
    the directory path.
    _______________________________________________
    Politech mailing list
    Archived at http://www.politechbot.com/
    Moderated by Declan McCullagh (http://www.mccullagh.org/)
    



    This archive was generated by hypermail 2b30 : Wed Oct 29 2003 - 23:32:32 PST