RE: Possible DOS against search engines?

From: Rob Shein (shotenat_private)
Date: Mon Feb 03 2003 - 15:45:00 PST

    I see a few problems here.  Problems are listed below each concept, for
    clarity, and assume a decent webcrawler.
    
    > 
    > 1. You create a generator for fake web pages, whose purpose 
    > is to spit out HTML containing a huge amount of (pseudo) 
    > random _non-existing_ words, as well as links to other pages 
    > within the generator;
    
    I doubt this would make even a slight dent in things.  Seeing as how
    webcrawlers already walk the entire internet, with its various languages,
    enormous expanse, and endless misspellings, I think anything you could
    create would end up being a drop in the bucket.
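    For concreteness, the generator being described is trivial to build.  A
    minimal Python sketch (the /gen?page= URL scheme and all names here are my
    own illustration, not from the original post):

```python
import random
import string

def fake_word(rng, min_len=4, max_len=12):
    """One pseudo-random lowercase 'word' that exists in no language."""
    return "".join(rng.choice(string.ascii_lowercase)
                   for _ in range(rng.randint(min_len, max_len)))

def fake_page(seed, n_words=500, n_links=10):
    """Render an HTML page of nonsense words plus links back into the
    generator.  The seed doubles as a page id, so every URL yields a
    deterministic but unique-looking page."""
    rng = random.Random(seed)
    words = " ".join(fake_word(rng) for _ in range(n_words))
    links = "".join('<a href="/gen?page=%d">more</a>\n' % rng.randrange(10**9)
                    for _ in range(n_links))
    return "<html><body><p>%s</p>%s</body></html>" % (words, links)
```

    Each page links to ten "new" pages, so the crawl frontier grows without
    the generator having to store anything.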
    
    > 
    > 2. You place that generator somewhere and submit the URL to 
    > search engines for crawling;
    > 
    > 3. The search engines then crawl the site, possibly reaching 
    > their pre-defined maximum of crawling depth (or, if badly 
    > broken, crawl the site indefinitely, jumping from one freshly 
    > generated page to another);
     
    But they don't crawl indefinitely.  Crawlers keep a record of the URLs
    they have already visited, so when they hit two sites that link to each
    other, they notice the cycle and move on.
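    That loop-avoidance is nothing exotic; a visited set plus a depth cap is
    enough.  A rough Python sketch (fetch_links is an assumed helper that
    returns a page's outgoing URLs):

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=5, max_pages=10000):
    """Breadth-first crawl that cannot loop: the visited set skips any URL
    seen before, and the depth/page caps bound the walk even against an
    endless generator site."""
    visited = set()
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append((link, depth + 1))
    return visited
```

    Two mutually-linking pages terminate after two fetches; an infinitely
    deep site terminates at the depth or page cap.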
    
    > 4. Upon adding the gathered words to the search engine's 
    > index, the index becomes heavily overloaded with the newly 
    > added words, as they are outside of the real-language words 
    > already present in the index. The following should be 
    > theoretically possible:
     
    But who would search on them?
    
    >     - craft fake words so that they attack a specific hash 
    > function. Make a bunch of fakes that hash to the same value 
    > as a legitimate word in the English language. This will 
    > possibly impact the performance of search engines using that 
    > particular hash function when they try to look up the 
    > legitimate words that are being targeted.
    
    This would be noticed by the search engine long before it became a real
    problem, and it would be addressed.  This is how they deal with many things,
    including people who try to influence their ranking using various means.
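    For what it's worth, crafting colliding words is only cheap against a
    weak hash.  A toy demonstration against a naive additive hash (real
    engines do not, to my knowledge, use anything this poor):

```python
def weak_hash(word, nbuckets=1024):
    """Toy additive hash: any permutation of the same letters collides."""
    return sum(ord(c) for c in word) % nbuckets

# Every anagram of a targeted word lands in the same bucket, so an
# attacker could flood one hash chain with crafted "words".
target = "search"
fakes = ["hcraes", "raeshc", "aechrs"]
assert all(weak_hash(f) == weak_hash(target) for f in fakes)
```

    Switching to a hash whose output depends on letter order, or to a keyed
    hash, kills this whole class of crafted collision.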
     
    >     - craft fake words so that they unbalance a b-tree 
    > index, if one is used. I am not entirely sure, however it 
    > appears to me that it is possible to craft words in such a 
    > way as to alter the shape of the b-tree and thus impact the 
    > performance of lookups where it is used.
    > 
    >     - craft fake words randomly so that the index just grows. 
    > To the best of my understanding, most search engines will 
    > index and retain keywords that are only seen on one web page 
    > in the entire Internet. However, I think the capacity of the 
    > search engines to keep track of such one-time non-English 
    > letter sequences is limited and can be eventually exhausted.
    
    It is my belief that, again, they will notice the impact on their database
    and quickly address the issue.  What about a bit of code that states that if
    more than 5% of the words in a page are unique in the database, the page
    is dropped?
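    That filter is only a few lines.  A sketch of the heuristic I mean, with
    index_vocab standing in for the engine's set of already-indexed words
    (the names are illustrative, not any engine's real API):

```python
def should_drop(page_words, index_vocab, max_unique_ratio=0.05):
    """Heuristic filter: drop a page when more than 5% of its distinct
    words have never been seen anywhere else in the index."""
    distinct = set(page_words)
    if not distinct:
        return False
    unseen = sum(1 for w in distinct if w not in index_vocab)
    return unseen / len(distinct) > max_unique_ratio
```

    A page of freshly-generated random letter sequences is nearly 100%
    unseen words and trips the filter immediately, while ordinary pages
    with a handful of misspellings pass.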
    
    > If the above-mentioned things are feasible, then one can even 
    > construct a worm of some sort, that will auto-install such 
    > fake page generators on valid sites, thus increasing the 
    > traffic to the crawler even more. Writing a short Apache 
    > handler meant to be silently installed in httpd.conf at 
    > root-kit installation should not be that difficult. When is 
    > the last time you reviewed the module list of your Apache? 
    > Will you spot a malicious module if it is called 
    > mod_ip_vhost_alias, loaded in between two other modules that 
    > you never knew were vital or not?
    
    No, but I'd notice an abrupt lack of space on my web server.  And the sudden
    oddly-named URLs in my logs.  And the corresponding oddly-named pages in my
    site.  And if I didn't notice, my hosting provider would.
    
    > Please note that the setup described differs from the 
    > practice of generating fake pages containing a lot of real 
    > (mostly adult) keywords. After all, such real-language words 
    > already exist in the index, whereas I suggest bombing the 
    > index with a huge number of not-previously-existing 
    > freshly-generated random letter sequences. Also, please note 
    > that the purpose of the attack is to damage the index, and 
    > not to make the crawler consume bandwidth by going in an 
    > endless loop or something like that (though the crawler has 
    > to scan the pages first so that the generated keywords are 
    > ultimately delivered to the index).
    > 
    > I will appreciate any and all thoughts on the issue.
    > 
    > Philip Stoev
    > 
    



    This archive was generated by hypermail 2b30 : Mon Feb 03 2003 - 15:53:51 PST