In my last post, referring to the San Jose Mercury News piece on Glenbrook Networks, I mentioned that we would dig further into the technology used to build the Glendor Showcase. This first post covers the extraction of data from the Deep Web.
The majority of web pages one can access through search engines were collected by crawling the so-called Static or Surface Web. This is the smaller portion of the Internet, reportedly containing between 8 and 20 billion pages (the Google vs. Yahoo index sizes). Though this number is already very large, the total number of pages available on the Web is estimated at 500 billion. The rest of the Web is often referred to as the Deep Web, the Dynamic Web, or the Invisible Web. All these names reflect features of this gigantic source of information: it is stored deep down in databases, rendered through DHTML, and not accessible to standard crawlers. Pages in the Deep Web typically do not have a standard URL and cannot be addressed in a standard fashion. In many cases they do not even exist until a user asks a question by filling in the fields of a form, at which point a response (page) is generated. Typical examples of Deep Web applications are airline reservation systems, online dictionaries, and the like.
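To make the distinction concrete, here is a minimal sketch (in Python, against a purely hypothetical travel site with made-up field names) of the difference between fetching a static page and generating a Deep Web page through a form:

```python
import requests

# A Surface Web page: it has a fixed URL, so a standard crawler can fetch it.
static_page = requests.get("https://travel.example.com/about.html")

# A Deep Web page: it does not exist until a question is asked through a form.
# The response is generated on the fly from a database and has no stable URL
# that a crawler could have discovered by following links.
deep_page = requests.post(
    "https://travel.example.com/search",                      # hypothetical form action
    data={"from": "SFO", "to": "JFK", "date": "2005-11-15"},  # made-up form fields
)
print(deep_page.status_code, len(deep_page.text))
```

The second response is assembled on demand; there is no address a crawler could have followed to reach it.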
It is supposedly quite easy for a human to navigate the Deep Web. One just needs to fill in a form by choosing among options such as destinations and dates on a travel site, or by entering a word to look up its meaning or translation. It is much more difficult for a machine to do so automatically and generically. Because the Deep Web contains a lot of factual information, it can be seen metaphorically as an ocean with a lot of fish. That is why we call the system that navigates the Deep Web a trawler.
There are two major problems with navigating the Deep Web automatically. First, the trawler needs to understand what questions to ask through the aforementioned forms, and to ask them exhaustively. Second, the trawler cannot easily navigate from one page to another, since the pages do not have fixed URLs or might not even exist yet. That's why the trawler needs to remember where it came from and return to the surface (like a whale) before "diving" again to ask the next question.
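A minimal sketch of this dive-and-surface cycle, assuming a hypothetical door URL and made-up form fields (the real Trawler is certainly more sophisticated than this), might look as follows:

```python
import itertools
import requests

# Hypothetical "door": a form page on the Surface Web leading into the Deep Web.
DOOR_URL = "https://site.example.com/search"

# The questions to ask, enumerated exhaustively as combinations of field values.
FIELD_VALUES = {
    "category": ["books", "music", "movies"],
    "region": ["CA", "NY", "TX"],
}

def dive(door_url, field_values):
    """Ask every combination of questions, surfacing between dives."""
    harvested = []
    names = list(field_values)
    for combo in itertools.product(*(field_values[n] for n in names)):
        question = dict(zip(names, combo))
        # "Dive": generate a Deep Web page by submitting one question.
        response = requests.post(door_url, data=question)
        harvested.append((question, response.text))
        # "Surface": the generated page has no stable URL to crawl onward from,
        # so the loop returns to the remembered door and asks the next question.
    return harvested
```

Each iteration corresponds to one dive: one question asked through the door, one generated page brought back to the surface.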
If the number of sites is relatively small, say a few thousand, each set of forms could be described manually through a templating system. The major limitations of this approach are scalability and its lack of resilience to changes in page formats.
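For illustration only, a hand-written template for a single site might amount to little more than a description of where the form lives and which values to try; every name and URL below is invented, and the template breaks as soon as the site changes its page layout:

```python
# A hypothetical, hand-written template for one site's form (all values invented).
SITE_TEMPLATE = {
    "door_url": "https://catalog.example.com/search",  # where the form lives
    "method": "POST",
    "fields": {                                        # values to try, exhaustively
        "keyword": ["dictionary", "thesaurus", "atlas"],
        "language": ["en", "fr", "de"],
    },
    "result_selector": "div.result-item",              # where answers show up on the page
}
```

Writing a few thousand of these by hand is feasible; keeping them current as each site tweaks its markup is where the approach breaks down.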
There is a third problem, related to the size of the Deep Web. It is so big that one needs to focus on a particular subset (a vertical) to have a chance of trawling it with some level of success, especially if high precision is an important factor. Since the task of determining what questions to ask involves understanding semantics and context, the focus on a vertical comes in handy.
Glenbrook's approach to building a trawler is based on mimicking the behavior of a (human) user. This is a useful approach, since the "doors" into the Deep Web were built with a human in mind and reflect the standards (no matter how loose) that humans use to navigate the Web.
The Trawler consists of five layers (a rough sketch of how they might fit together follows the list):
- Discoverer - locates prospective target home pages on the Surface Web
- Scout - navigates the Surface Web part of a web site and finds the "doors": DHTML pages that contain forms leading to the Deep Web part of the site
- Locksmith - fills in the forms with various requests and collects the responses
- Assessor - analyzes the responses and decides whether to use the door as a candidate for querying the Deep Web part of the site or to move elsewhere
- Harvester - collects all relevant pages from the Surface and Deep Web parts of the web site
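The function names and signatures below are not Glenbrook's; they are only a sketch of how five such layers could be chained into a pipeline, each stage consuming the previous stage's output, with the real logic stubbed out:

```python
def discoverer(seed_query):
    """Locate prospective target home pages on the Surface Web."""
    return []  # placeholder: e.g. results of a Surface Web search

def scout(home_page):
    """Walk the Surface Web part of a site and return candidate 'door' pages."""
    return []  # placeholder: pages whose forms might lead into the Deep Web

def locksmith(door_page):
    """Fill in the door's forms with various requests and collect the responses."""
    return []  # placeholder: generated response pages

def assessor(responses):
    """Decide whether the door is worth querying exhaustively."""
    return bool(responses)  # placeholder: a real assessor would inspect the content

def harvester(door_page):
    """Collect all relevant Surface and Deep Web pages behind an accepted door."""
    return []  # placeholder: the harvested pages

def trawl(seed_query):
    """Chain the five layers: discover, scout, test the doors, then harvest."""
    pages = []
    for home_page in discoverer(seed_query):
        for door in scout(home_page):
            if assessor(locksmith(door)):
                pages.extend(harvester(door))
    return pages
```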
After all potentially relevant pages are harvested, the Extractor takes over. The Extractor is a hybrid system that applies Pattern Recognition, Natural Language Processing, and other AI techniques to extract facts, combine them, and populate a database that is used to provide factual answers to search queries.
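Purely as a strawman, and without anticipating that post, a hybrid extractor could pair simple surface patterns with a fact store; the patterns and schema below are invented for illustration and say nothing about Glenbrook's actual techniques:

```python
import re
import sqlite3

# Invented surface patterns standing in for the pattern-recognition layer.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
}

def extract_facts(page_url, text, db):
    """Pull simple facts out of one harvested page and store them for querying."""
    db.execute("CREATE TABLE IF NOT EXISTS facts (url TEXT, kind TEXT, value TEXT)")
    for kind, pattern in PATTERNS.items():
        for value in pattern.findall(text):
            db.execute("INSERT INTO facts VALUES (?, ?, ?)", (page_url, kind, value))
    db.commit()

db = sqlite3.connect(":memory:")
extract_facts("https://site.example.com/page1",
              "Contact us at info@example.com or (408) 555-0100.", db)
print(db.execute("SELECT * FROM facts").fetchall())
```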
The Extractor will be the subject of another post.
Tag: Glendor