Google first responders

The Cheswick web pages are the first hits on Google for "Internet mapping", "mccollough effect", and maybe others. My question: what others?

I have written little programs to walk a collection of web pages, extract the text, and print out the word pairs. These are submitted to Google (and later, some other search engines), and the results examined for interesting stuff.
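The original programs are not shown here, so the following is only a minimal sketch of the extract-and-pair step, assuming the pages are HTML and that a "word pair" means two adjacent words in the visible text. The names TextExtractor and word_pairs are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_pairs(html):
    """Return the distinct adjacent word pairs of a page, in first-seen order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: words are lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", " ".join(parser.chunks).lower())
    seen = set()
    pairs = []
    for pair in zip(words, words[1:]):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    sample = "<h1>Internet mapping</h1><p>The McCollough effect</p>"
    for first, second in word_pairs(sample):
        print(first, second)
```

Note that this sketch pairs words across element boundaries (the last word of a heading with the first word of the next paragraph), which may or may not be what you want.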

http://www.lumeta.com word pairs google hits
http://www.lumeta.com/research word pairs google hits
http://www.cheswick.com word pairs google hits

What are the word pairs for which cheswick.com (or the mirror, www.lumeta.com/research) is the first Google hit?

Number one hits for word pairs found on www.cheswick.com and the mirror, research.lumeta.com/ches. The order of the unquoted word pairs changes the search results.

The above tables, when scanned by searchbots, yield thousands of links for them to follow and index. This was actually disruptive for me: some of my normal searches ended up returning the above data pages. For a while I removed the data from the web page. Now I have put the data back, with instructions to robots to lay off it. We will see what happens.
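The page does not say how the robots were told to stay away. One common mechanism is a robots.txt file, and the sketch below assumes that approach. The /googledata/ path and the ExampleBot agent name are invented placeholders, not the real layout of the site. It uses Python's standard urllib.robotparser to check what a well-behaved robot would conclude from such rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /googledata/ stands in for wherever the
# word-pair tables actually live.
ROBOTS_TXT = """\
User-agent: *
Disallow: /googledata/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a robot honoring these rules may fetch the path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed("/googledata/pairs.html"))
    print(allowed("/index.html"))
```

This only models robots that honor the rules; nothing forces a crawler to do so.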

Open Questions

How tolerant is Google of such activities?

I have no idea. They seem to let me perform thousands of queries without getting mad, slowing results, or giving bogus answers. Sometimes the queries, which usually run at about one per second, stop for a minute or so. Is this Google throttling, or just Internet weirdness? Google has vast capacity: they say that over a thousand computers touch each query. To minimize this processing, and to simplify handling the results, I make the queries look like they are coming from a PDA.

Google does supply an API for making queries. They say it is limited to 1000 queries per day.
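The pacing described above (about one query per second) can be sketched as a small loop. This is an illustration under assumptions, not the actual program: run_queries is a made-up name, and search is a caller-supplied function, so the sketch makes no claim about how the real queries are sent. It records how long each query took, which is one way to spot the minute-long stalls afterwards.

```python
import time


def run_queries(pairs, search, interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call search(query) for each word pair, starting queries at least
    `interval` seconds apart.

    Returns a list of (query, result, elapsed_seconds) tuples.
    """
    results = []
    next_start = clock()
    for first, second in pairs:
        wait = next_start - clock()
        if wait > 0:
            sleep(wait)  # pace ourselves rather than hammering the server
        started = clock()
        query = f"{first} {second}"
        result = search(query)  # hypothetical: supplied by the caller
        results.append((query, result, clock() - started))
        next_start = started + interval
    return results


if __name__ == "__main__":
    # Stand-in search function; a real one would contact a search engine.
    demo = run_queries([("internet", "mapping")], search=lambda q: q.upper())
    for query, result, elapsed in demo:
        print(query, result, round(elapsed, 3))
```

For scale: at one query per second, 1000 queries take under 17 minutes, so a limit of 1000 queries per day is far slower for a list of several thousand word pairs.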

What related work is out there?

People must have made explorations like this before. What Google queries would help locate such efforts?

Is this information actually useful?

Our marketing people are chewing over the data. This certainly finds misspellings. It is easy to be the first hit for a misspelled word, and I have found a couple.

What are the most commonly reported sites that are not Lumeta's?