Google first responders

The Cheswick web pages are the first hits on Google for "Internet mapping", "mccollough effect", and perhaps other phrases. My question: which others?

I have written small programs that walk a collection of web pages, extract the text, and print out the word pairs. The word pairs are submitted to Google (and, later, to some other search engines), and the results are examined for anything interesting.
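The extraction step might look roughly like the following. This is a minimal sketch, not the actual programs: it assumes that a "word pair" means two adjacent words in the page text, and the names `TextExtractor` and `word_pairs` are invented for illustration. It works on an HTML string, so fetching the pages is left out.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects page text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_pairs(html):
    """Return each distinct pair of adjacent words, in order of first appearance.

    Assumption: pairs are adjacent words, lowercased, and may span
    element boundaries (e.g. the end of a heading and the next paragraph).
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"[a-z0-9']+", text.lower())

    seen = set()
    pairs = []
    for pair in zip(words, words[1:]):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Each pair comes out once, so a phrase repeated across a page is only submitted to the search engine a single time.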

http://www.lumeta.com word pairs google hits
http://research.lumeta.com word pairs google hits
http://www.cheswick.com word pairs google hits

For which word pairs is cheswick.com (or the mirror, research.lumeta.com) the first Google hit?

Number-one hits for word pairs found on www.cheswick.com and the mirror, research.lumeta.com/ches. The order of the words in an unquoted pair changes the search results.
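Because word order matters for unquoted pairs, one pair can yield several distinct queries. The sketch below builds three of them: both unquoted orderings and the quoted phrase. The function name `query_variants` is hypothetical, and the plain `https://www.google.com/search?q=...` URL form is an assumption about how the queries are submitted.

```python
from urllib.parse import urlencode

# Assumed query endpoint; the original programs may have used something else.
SEARCH_URL = "https://www.google.com/search"


def query_variants(first, second):
    """Return search URLs for both unquoted orderings and the quoted phrase."""
    queries = [
        f"{first} {second}",      # unquoted, original order
        f"{second} {first}",      # unquoted, reversed order
        f'"{first} {second}"',    # quoted: exact phrase
    ]
    return [SEARCH_URL + "?" + urlencode({"q": q}) for q in queries]
```

Comparing the first hit for each variant shows how much a ranking depends on word order and on quoting.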

What are the most commonly reported sites that are not Lumeta's?
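One way to answer this is to tally the host name of the first hit for every word pair and drop our own sites. The sketch below assumes the results are already collected in a dictionary mapping each word pair to the URL of its first hit, and it treats both lumeta.com and cheswick.com (with their subdomains) as "ours". The names `OWN_DOMAINS`, `is_own`, and `top_other_sites` are invented for this example.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: these two domains, and any subdomains, count as our own sites.
OWN_DOMAINS = ("lumeta.com", "cheswick.com")


def is_own(host):
    """True if host is one of our domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in OWN_DOMAINS)


def top_other_sites(first_hits, limit=10):
    """Count first-hit host names that are not ours, most frequent first.

    first_hits maps a word pair to the URL of its first search hit.
    """
    counts = Counter()
    for url in first_hits.values():
        host = (urlsplit(url).hostname or "").lower()
        if host and not is_own(host):
            counts[host] += 1
    return counts.most_common(limit)
```

The suffix check uses a leading dot, so a host such as notlumeta.com is not mistaken for a Lumeta subdomain.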

Open Questions

How tolerant is Google of such activities? I have no idea. They seem to let me perform thousands of queries without getting mad, slowing the results, or giving bogus answers. Sometimes the queries, which usually run at about one per second, stop for a minute or so. Is this Google throttling, or just Internet weirdness? Google has vast capacity: they say that over a thousand computers touch each query. To minimize this load, and to simplify my own processing, I make the queries look like they are coming from a PDA.
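A paced query loop along these lines would run at about one query per second and record how long each query took, which makes the minute-long stalls easy to spot afterwards. This is a sketch under stated assumptions, not the actual programs: the `User-Agent` value is a placeholder rather than a real PDA's string, the search URL is assumed, and the names `build_request` and `run_queries` are invented here. The caller supplies the `fetch` function, so the loop itself does no network I/O.

```python
import time
import urllib.parse
import urllib.request

# Placeholder only: not the User-Agent string of any real device.
PDA_USER_AGENT = "ExamplePDA/1.0"
# Assumed query endpoint.
SEARCH_URL = "https://www.google.com/search"


def build_request(query, user_agent=PDA_USER_AGENT):
    """Build a search request that presents the given User-Agent."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def run_queries(queries, fetch, interval=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Run queries no faster than one per `interval` seconds.

    Returns a list of (query, response, elapsed_seconds) tuples. A large
    elapsed value marks a stall, whatever its cause.
    """
    results = []
    for query in queries:
        started = clock()
        response = fetch(build_request(query))
        elapsed = clock() - started
        results.append((query, response, elapsed))
        remaining = interval - elapsed
        if remaining > 0:
            sleep(remaining)  # pad fast queries out to the full interval
    return results
```

For real use, `fetch` could be something like `lambda req: urllib.request.urlopen(req).read()`. Timing alone cannot say whether a stall is throttling or network trouble, but a log of elapsed times at least shows how often stalls happen and how long they last.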

Google does supply an API for making queries. They say it is limited to 1000 queries per day.

What related work is out there? People must have made explorations like this before. What Google queries would help locate such efforts?

Is this information actually useful? Our marketing people are chewing over the data. This certainly finds misspellings. It is easy to be the first hit for a misspelled word, and I have found a couple.