Google first responders
The Cheswick web pages are the first hits on Google for
"Internet mapping",
"mccollough effect",
and maybe others. My question: what others?
I have written little programs to walk a collection of web pages, extract
the text, and print out the word pairs. These are submitted to Google (and
later, some other search engines), and the results examined for interesting
stuff.
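A minimal sketch of the word-pair step, assuming a "word pair" means two adjacent words in the visible text of a page. The names TextExtractor and word_pairs are made up for illustration, and this is not the original program:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth of script/style nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def word_pairs(html):
    """Return each distinct pair of adjacent words, in order of first appearance."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: words are runs of letters, digits, and apostrophes, lowercased.
    words = re.findall(r"[a-z0-9']+", " ".join(parser.chunks).lower())
    seen = set()
    pairs = []
    for pair in zip(words, words[1:]):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    for first, second in word_pairs("<p>Internet mapping project</p>"):
        print(first, second)
```

Each printed line is one candidate query. Submitting them to a search engine is a separate step and is left out here.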
What are the word pairs for which cheswick.com (or the mirror,
www.lumeta.com/research)
are the first Google hits?
Number one hits for word pairs found on
www.cheswick.com and the mirror, research.lumeta.com/ches.
The order of the unquoted word pairs changes the search results.
The above tables, when scanned by searchbots, yield thousands of links that
they react to. This was actually disruptive for me: some of my normal search
results ended up hitting the above data pages. For a while I removed the data
from the web page. Now I have put the data back, with instructions to robots
to lay off them. We will see what happens.
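Instructions to robots are usually given in a robots.txt file. The sketch below uses Python's standard urllib.robotparser to check what such a file would permit. The /data/ path and the rule text are hypothetical, not the rules actually used on the site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: ask every robot to stay out of /data/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/
"""


def allowed(path, agent="*"):
    """Return True if a robot honoring ROBOTS_TXT may fetch the given path."""
    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/data/pairs.html"):
        print(path, allowed(path))
```

This only shows what a well-behaved robot would conclude from the file. Nothing forces a robot to obey it.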
Open Questions
How tolerant is Google of such activities?
I have no idea. They seem to let me perform thousands of queries without
getting mad, or slowing results, or giving bogus answers.
Sometimes, the queries, which usually run at about 1 per second, stop for
a minute or so. Is this Google throttling, or just Internet weirdness?
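One way to study the pauses is to pace the queries explicitly and record which ones stall. The sketch below is an assumption about how that could be done, not the original program. The name run_paced, the fetch callback, and the 10-second stall threshold are all made up for illustration:

```python
import time


def run_paced(queries, fetch, interval=1.0, stall_threshold=10.0,
              clock=time.monotonic, sleep=time.sleep):
    """Call fetch(query) no more often than once per `interval` seconds.

    Returns (results, stalls), where stalls is a list of (query, seconds)
    for every call that took at least `stall_threshold` seconds.
    """
    results, stalls = [], []
    next_start = clock()
    for query in queries:
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)  # wait out the rest of the interval
        started = clock()
        results.append(fetch(query))
        elapsed = clock() - started
        if elapsed >= stall_threshold:
            stalls.append((query, elapsed))
        next_start = started + interval
    return results, stalls
```

The log of stalled queries and their durations does not say who caused a pause, but running the same paced loop against a second site at the same time would help separate throttling from general network trouble.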
Google has vast capacity: they say that over a thousand computers touch each
query. To minimize this load, and to simplify my own processing of the results,
I make the queries look like they are coming from a PDA.
Google does supply an API for making queries. They say it is limited to 1000
queries per day.
What related work is out there?
People must have made explorations like this before. What Google
queries would help locate such efforts?
Is this information actually useful?
Our marketing people are chewing over the data.
This certainly finds misspellings. It is easy to be the first hit for a misspelled
word, and I have found a couple.
What are the most commonly reported sites that are not Lumeta's?