This was quietly mentioned at the last session of the Search SIG, hosted by John Battelle, and has been discussed in the ranks of the searcherati (search engine specialists): what happens to the market the day a search engine index becomes available for free (or close enough to free), along with an entire crawling infrastructure? This is actually something that Inktomi had been doing for a while with its index - for a fee.
Mike Arrington just pinged the Web 2.0 Workgroup (the 4th top blog network according to this new ranking?) about the announcement by Alexa (aka Amazon) that it did just that: it is making its index and crawling infrastructure available to anyone for free - OK, not free, but very close to it. John Battelle (who else :-) was briefed on this, and clarifies the “offering”:
In short, Alexa, an Amazon-owned search company started by Bruce Gilliat and Brewster Kahle (and the spider that fuels the Internet Archive), is going to offer its index up to anyone who wants it. Alexa has about 5 billion documents in its index - about 100 terabytes of data. It's best known for its toolbar-based traffic and site stats, which are much debated and, regardless, much used across the web. [...]
Anyone can also use Alexa's servers and processing power to mine its index to discover things - perhaps, to outsource the crawl needed to create a vertical search engine, for example. Or maybe to build new kinds of search engines entirely, or ...well, whatever creative folks can dream up. And then, anyone can run that new service on Alexa's (er...Amazon's) platform, should they wish.
It's all done via web services. It's all integrated with Amazon's fabled web services platform. And there's no licensing fees. Just “consumption fees” which, at my first glance, seem pretty reasonable. (“Consumption” meaning consuming processor cycles, or storage, or bandwidth).
The fees? One dollar per CPU hour consumed. $1 per gig of storage used. $1 per 50 gigs of data processed. $1 per gig of data uploaded (if you are putting your new service up on their platform).
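To get a feel for what these consumption fees add up to, here is a rough back-of-the-envelope sketch. It only uses the four rates quoted above; the function name, the parameter names, and the sample workload are my own assumptions for illustration, and the quote does not say over what period storage is billed, so the sketch simply charges each gigabyte once.

```python
# Rough cost estimator based only on the four rates quoted above (USD).
# Names and the sample workload are hypothetical; this is not Alexa's billing logic.

CPU_HOUR_RATE = 1.0          # $1 per CPU hour consumed
STORAGE_GB_RATE = 1.0        # $1 per gig of storage used (billing period not stated)
PROCESSED_GB_PER_DOLLAR = 50 # $1 per 50 gigs of data processed
UPLOAD_GB_RATE = 1.0         # $1 per gig of data uploaded


def estimate_cost(cpu_hours, storage_gb, processed_gb, uploaded_gb):
    """Return the estimated consumption fee in dollars for one job."""
    cpu_cost = cpu_hours * CPU_HOUR_RATE
    storage_cost = storage_gb * STORAGE_GB_RATE
    processing_cost = processed_gb / PROCESSED_GB_PER_DOLLAR
    upload_cost = uploaded_gb * UPLOAD_GB_RATE
    return cpu_cost + storage_cost + processing_cost + upload_cost


if __name__ == "__main__":
    # Hypothetical job: 100 CPU hours, 20 GB stored, 500 GB processed, 5 GB uploaded.
    total = estimate_cost(cpu_hours=100, storage_gb=20, processed_gb=500, uploaded_gb=5)
    print(f"Estimated fee: ${total:.2f}")
```

With these made-up numbers, CPU time dominates the bill while data processing is nearly free, which suggests that the cost of mining the index depends far more on how heavy the computation is than on how much of the crawl is scanned.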
So... What does it mean to have an index just 25% (-ish) the size of Google's and Yahoo's available to anyone and everyone? A few thoughts:
- Alexa has always been perceived as an approximate source of traffic statistics, since its numbers are based on the usage of the toolbar, and one can wonder about the distribution of the 5 billion pages available in the index.
- Search engine indexes are one step closer to being a commodity - at least for the “Surface web” (as opposed to the Deep Web). When will an open source index be developed for everyone to use? A proper plugin infrastructure would be required to allow specialized search engines like Truveo to apply their specific heuristics (in that case, to find code that likely suggests that there might be videos on a site) - as well as scheduling.
- This furthers the notion that the value of a search engine is in the application(s) built on top of it, like ad networks and the ability to match relevant ads to content, or any specific vertical search functionality.
- Amazon is further leveling the playing field by offering this commodity infrastructure to competitors of the large search players - and making a little bit of money around a third layer of business: selling stuff it holds in inventory, facilitating the sale of stuff that affiliates hold in inventory, and now selling access to information about stuff one can find on the Internet.
The implications of that announcement will further develop over the next couple of days. More can be found on the Alexa blog and the Alexa Web Search site:
The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools. Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service.
Update: the blogosphere is buzzing around the topic this morning - not surprisingly (here is the Memeorandum thread). I found this article from Phil Wainewright quite on point regarding the changes that this announcement might imply for Google. At the end of the day, their value is the depth and breadth of their advertising network, and their ability to match relevant ads - and therefore they most likely will continue to do well.
It's a good point, but be careful of overstating it.
On the one hand, open source solutions like Nutch and now this Alexa service make it easier to get crawl and index data on the web.
On the other hand, comprehensiveness and timeliness are still issues. A partial, stale index of the web has a lot less value.
By the way, it is important to distinguish between the index and the relevance rank. Relevance rank is what gets the right information to the right people. Search is all about relevance, and relevance rank is where most of the value is.
Posted by: Greg Linden | December 13, 2005 at 07:53 AM
Greg> Good points, and I should have expanded on my comment regarding the distribution of the 5 billion pages indexed by Alexa with regard to freshness.
What is interesting here is that Alexa is making available a huge infrastructure as a service - whereas Nutch is a piece of software that you have to run. But you are right, we don't know what the update rate of the crawler(s) is, or whether it is possible to "force" a refresh of a specific set of web sites.
And regarding relevance, this is where applications built on top of the index will be able to introduce some level of differentiation. Relevance might however not be optimized at all because of the implied layering of data/processing.
Posted by: Jeff Clavier | December 13, 2005 at 09:41 AM
Jeff, if Alexa and others start providing their search engine infrastructure for everyone to create their own search engines in this way, I don't feel these new search engines will become even as famous as Alexa itself! Developing and offering a service is one thing; marketing, promoting and establishing it in its target market is another. And while we have big giants such as Google, Yahoo, MSN and others not giving up their billion dollar infrastructure and their yearly revenues, I don't think these upcoming hundreds of thousands of search engines will make any sense or become popular. Is this just a source that will offer free content that one can harness along with contextual advertising options? Again, the ball is in Google's court!
Cheers! The earth is round!
Posted by: Susan | June 05, 2006 at 03:11 AM