Google’s Code Search

Google Code Search launched today. It was originally an internal site, for their internal code. But, it being such a great application (and with Krugle and Koders having established a market), they cleaned it up and made it public. I don’t expect to see Krugle on eBay straight away, but the other code search engines do have a bit of a fight on their hands.

Code Search’s main advantage is that it lets you use regexps for searching. To be more accurate, it defaults to regexps; it takes a little effort not to use them. I can’t even speculate as to how they made regexp searching fast enough for the user and practical enough for the data center. A quick estimate puts the number of files searched at 40 million (wave to Gary, who found the search term with the most results - 32.4 million for “x”), which is about 1/1000 the size of their main web search index, so we might see regexp searching on their web search sooner than I would have thought possible.
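
For a sense of why the speed is surprising, here is a minimal sketch of the naive approach: run the regexp over every line of every file. It only illustrates the cost of a brute-force scan and is not a guess at what Google actually does; the search_files function and the sample corpus are made up for the example.

```python
import re

def search_files(files, pattern):
    """Naive regexp search: scan every line of every file.

    `files` maps a file name to its text. The work grows with the total
    size of the corpus, which is what makes this impractical at scale.
    """
    regex = re.compile(pattern)
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, lineno, line.strip()))
    return hits

# Made-up sample corpus, purely for illustration.
corpus = {
    "a.c": "int x = 0;\nreturn x;\n",
    "b.py": "y = 1\nprint(y)\n",
}

for hit in search_files(corpus, r"\bx\b"):
    print(hit)
```

Every query here touches every byte of every file, so tens of millions of files would need something much smarter than this loop.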

I do have to ignore the fact that Code Search has a much smaller audience and hence a much smaller load. If I were to guess, I would say it gets about one ten-millionth of the searches web search gets. I’d also have to ignore the possibility that Code Search might have a very different index structure. Then I’d have to find a reason to use regexps for a web search before Google perfects their natural language queries, which are probably more important to them than giving us regexp search. But I digress.

Code Search also has a license filter, which is a nice bonus feature. It might save me a little time, but I like it more as an example of Google’s intent to parse and understand the world’s data, regardless of structure. Adopting this approach removes a significant mental barrier to understanding what you can and can’t do.

In among this goodness are a couple of dangers. Will spammers run a search with an email regexp? (a very simple one - 3.9 million results, or a slightly more complete one - 7.9 million results) They may know that programmers are among the least susceptible to spam. Then again, knowing your audience is a pretty big advantage for a spammer. Will it be easier for hackers to attack sites, when it’s so much easier to search for vulnerabilities? (username file:wp-config.php, from kottke) Will programmers start copying and pasting code, regardless of license? Will normal people learn the truth about programmers? (wtf, damn, hack)
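
To show what a "very simple" versus a "slightly more complete" email regexp might look like, here is a small sketch. Both patterns are assumptions written for this example, not the ones behind the result counts above, and the sample source text is invented.

```python
import re

# Assumed patterns, for illustration only.
# Simple: word characters, an @, then exactly one dot in the domain.
SIMPLE_EMAIL = re.compile(r"\w+@\w+\.\w+")
# Fuller: allows dots, plus signs and hyphens in the local part,
# and hyphens and multiple labels in the domain.
FULLER_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Invented source snippet with addresses in comments and strings.
sample = """
/* Maintainer: jane.doe@example.org */
// contact bob@example.com for patches
# bugs to ops@build-farm.example.net
"""

simple = SIMPLE_EMAIL.findall(sample)
fuller = FULLER_EMAIL.findall(sample)

print(len(simple), simple)
print(len(fuller), fuller)
```

On this sample the simple pattern truncates the dotted local part and misses the hyphenated domain entirely, while the fuller one picks up all three addresses whole.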

Like many other Google products, Code Search shakes things up a bit, by entering other companies’ markets and by making once hard things easy, which removes existing barriers (albeit artificial ones) to abuse of other people’s data.

