I see a few problems here. Problems are listed below each concept, for
clarity, and assume a decent webcrawler. Sketches illustrating several
of the points are collected at the end of this message.

> 1. You create a generator for fake web pages, whose purpose
> is to spit out HTML containing a huge amount of (pseudo)
> random _non-existing_ words, as well as links to other pages
> within the generator;

I doubt this would make even a slight dent in things. Seeing as how
webcrawlers already walk the entire internet, with its various
languages, enormous expanse, and endless misspellings, I think anything
you could create would end up being a drop in the bucket. (A sketch of
such a generator is below.)

> 2. You place that generator somewhere and submit the URL to
> search engines for crawling;
>
> 3. The search engines then crawl the site, possibly reaching
> their pre-defined maximum crawling depth (or, if badly
> broken, crawl the site indefinitely, jumping from one freshly
> generated page to another);

But they don't crawl indefinitely. What do they do if they hit two
sites that link to each other? They notice this, and move on. (The
crawl sketch below shows the usual mechanism.)

> 4. Upon adding the gathered words to the search engine's
> index, the index becomes heavily overloaded with the newly
> added words, as they are outside of the real-language words
> already present in the index. The following should be
> theoretically possible:

But who would search on them?

> - craft fake words so that they attack a specific hash
> function. Make a bunch of fakes that hash to the same value
> as a legitimate word in the English language. This will
> possibly impact the performance of search engines using that
> particular hash function when they try to look up the
> legitimate words that are being targeted.

This would be noticed by the search engine long before it became a real
problem, and it would be addressed. This is how they deal with many
things, including people who try to influence their ranking using
various means. (The collision sketch below shows the mechanism in
miniature.)

> - craft fake words so that they unbalance a b-tree
> index, if one is used. I am not entirely sure, however it
> appears to me that it is possible to craft words in such a
> way as to alter the shape of the b-tree and thus impact the
> performance of the lookups where it is used.
>
> - craft fake words randomly so that the index just grows.
> To the best of my understanding, most search engines will
> index and retain keywords that are only seen on one web page
> in the entire Internet. However, I think the capacity of the
> search engines to keep track of such one-time non-English
> letter sequences is limited and can be eventually exhausted.

It is my belief that, again, they will notice the impact on their
database and quickly address the issue. What about a bit of code that
states that if more than 5% of the words in a page are unique in the
database, then that page is dropped? (A sketch of such a filter is
below.)

> If the above-mentioned things are feasible, then one can even
> construct a worm of some sort, that will auto-install such
> fake page generators on valid sites, thus increasing the
> traffic to the crawler even more. Writing a short Apache
> handler meant to be silently installed in httpd.conf at
> root-kit installation should not be that difficult. When is
> the last time you reviewed the module list of your Apache?
> Will you spot a malicious module if it is called
> mod_ip_vhost_alias, loaded in between two other modules that
> you never knew were vital or not?

No, but I'd notice an abrupt lack of space on my web server. And the
sudden oddly-named URLs in my logs. And the corresponding oddly-named
pages on my site. (A sketch for auditing the module list is also
below.)
And if I didn't notice, my hosting provider would.

> Please note that the setup described differs from the
> practice of generating fake pages containing a lot of real
> (mostly adult) keywords. After all, such real-language words
> already exist in the index, whereas I suggest bombing the
> index with a huge number of not-previously-existing,
> freshly-generated random letter sequences. Also, please note
> that the purpose of the attack is to damage the index, and
> not to make the crawler consume bandwidth by going in an
> endless loop or something like that (though the crawler has
> to scan the pages first so that the generated keywords are
> ultimately delivered to the index).
>
> I will appreciate any and all thoughts on the issue.
>
> Philip Stoev
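The promised sketches follow, all in Python. First, point 1, the
generator itself. This is roughly the thing being described, a page of
random letter sequences plus links back into the generator; the names
here (make_page, the /gen/<id> URL scheme) are my own illustration,
not anything from an actual attack:

    import random
    import string

    def random_word(rng, length=8):
        # A random lowercase letter sequence; almost certainly not a
        # real word in any language.
        return "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(length))

    def make_page(page_id, words=500, links=10):
        # Seed from the page id so repeated fetches of the same URL
        # return the same page, which looks more legitimate to a crawler.
        rng = random.Random(page_id)
        body = " ".join(random_word(rng) for _ in range(words))
        anchors = " ".join('<a href="/gen/%d">next</a>'
                           % rng.randrange(10 ** 9)
                           for _ in range(links))
        return "<html><body>%s %s</body></html>" % (body, anchors)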
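Second, why mutual links don't trap a decent crawler (point 3): a
visited set plus a depth cap. fetch and extract_links are placeholders
here, not real crawler internals:

    from collections import deque

    def crawl(start_url, fetch, extract_links, max_depth=5):
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            page = fetch(url)
            if depth >= max_depth:
                continue                  # depth cap: stop expanding here
            for link in extract_links(page):
                if link not in seen:      # two pages linking to each other
                    seen.add(link)        # are caught by this check
                    queue.append((link, depth + 1))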
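Third, the hash-collision idea in miniature. It is real in principle,
but only against a weak hash; the additive hash below is an assumption
made for the sake of the toy, not something any real engine uses. Since
this hash ignores character order, every anagram of a legitimate word
collides with it, and lookups of that word degrade to a linear scan of
the bucket's chain:

    from itertools import permutations

    def weak_hash(word, buckets=1024):
        # Additive hash: character order is ignored, so anagrams collide.
        return sum(ord(c) for c in word) % buckets

    target = "search"
    crafted = {"".join(p) for p in permutations(target)}  # 720 fake words

    # Every crafted word lands in the same bucket as the real word.
    assert len({weak_hash(w) for w in crafted}) == 1
    print(len(crafted), "crafted words in bucket", weak_hash(target))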
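Fourth, the 5% filter I suggested. known_words stands in for the
engine's existing index; the threshold is just the number from my
reply, not anything a real engine documents:

    def should_drop(page_words, known_words, threshold=0.05):
        # Drop a page when too large a fraction of its distinct words
        # have never been seen anywhere else in the index.
        distinct = set(page_words)
        if not distinct:
            return False
        unseen = sum(1 for w in distinct if w not in known_words)
        return unseen / float(len(distinct)) > threshold

Run against the generator's output above, essentially every word is
unseen, so the page is dropped on first sight.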
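Finally, auditing the Apache module list. The low-tech answer is to
diff the LoadModule lines against a baseline saved while the box was
known-clean; the paths below are assumptions, and a real audit would
also have to follow Include directives:

    BASELINE = "/root/httpd-modules.known"  # LoadModule lines saved earlier
    CONF = "/etc/httpd/conf/httpd.conf"

    with open(BASELINE) as f:
        known = set(line.strip() for line in f if line.strip())

    with open(CONF) as f:
        for line in f:
            line = line.strip()
            if line.startswith("LoadModule") and line not in known:
                print("not in baseline: " + line)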