Thursday, June 30, 2011

Crawling in Open Source, Part 1

Introduction

In this article I will give you a short introduction to crawling in general and then move on to Apache Nutch: its history, its architecture, and an explanation of its core processing steps and MapReduce functions at a fairly technical level. After reading this article you should be reasonably familiar with the basic crawling concepts and the core MapReduce jobs in Nutch.


What is a web crawler?

A web crawler is a computer program that typically discovers and downloads content from the web over the HTTP protocol. The discovery process of a crawler is usually simple and straightforward. The crawler is first given a set of URLs, often called seeds. Next it downloads the content from those URLs and extracts hyperlinks, that is new URLs, from the downloaded content. This is exactly what happens in the real world when a human uses a web browser and clicks on links from a homepage, and then from the pages that follow, one after another.

When a web crawler discovers new URLs to follow, it goes and fetches the content from those URLs, just like a normal browser fetches content when links on a web page are clicked. The download part is even simpler. In the simplified HTTP case, a network connection to the host is opened, an HTTP request is sent, an HTTP response is read from the wire, and the network connection is closed. A crawler might also support other means of getting content, making it capable of finding and fetching content from many different places such as local or shared file systems, FTP servers, mailboxes, and so on.
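In Python, the simplified HTTP case described above looks roughly like the following minimal sketch, using the standard library's http.client module and a made-up host and user agent; a real crawler would add redirect handling, retries, and robots.txt checks.

import http.client

# Open a connection to the host, send an HTTP request, read the HTTP
# response from the wire, and close the connection again.
conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
conn.request("GET", "/", headers={"User-Agent": "ExampleCrawler/0.1"})
response = conn.getresponse()
body = response.read()
conn.close()
print(response.status, len(body))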

Crawler use cases

Generally speaking, crawlers are used to find and bring in the content that interests you. Often the goal is to make that information searchable by indexing it. The use case and context for crawling can vary greatly depending on the task at hand. For example, the most popular web search engines, like Google and Yahoo, deploy web crawlers that primarily find content that is publicly available via the web. The main challenge for such a generic web crawler is obviously scalability. The crawler architecture needs to be efficient, because the amount of content to crawl is large and growing all the time at ever increasing speeds. Another example, of a completely different kind of use case, is a price comparison site where users are presented with products and their price information from various online shops.

To get such information one could implement a crawler that continuously re-crawls content from a predefined set of web sites, and is generally interested in just a few specific URLs from those sites. A crawler in an enterprise search context requires yet another set of features important for that context. Usually, an enterprise crawler needs to understand and extract information from various document formats and access those documents from many different places such as mailboxes, CRM systems, and content management systems. Understanding many different document formats plays an important role there. And there are many other crawling needs that lie in between these examples, each requiring the crawler to fulfill a slightly different set of requirements.

Essential crawler components

There are a few features or characteristics that are very important for a crawler. The most important single feature every crawler should include is a sense of politeness. Since crawling content from web sites uses resources from the target server, like network bandwidth and processor time, it is mandatory to limit the frequency at which a crawler accesses any single server; otherwise the target server would soon become inaccessible to real users. Owners of a web site may also want to expose only part of the available content to web crawlers. For such cases there exists a protocol called the Robots Exclusion Standard that gives the owner of a web site the means to control what is fetched from the server, and how often. A polite crawler respects such rules. Politeness is not just following the rules, but also respecting a site's resources even when such rules are not explicitly specified, by limiting the frequency of your crawler's visits. The most basic components of a crawler include a queue, a fetcher, an extractor, and a content repository.
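As a small illustration of the Robots Exclusion Standard, here is a sketch using the robots.txt parser from Python's standard library; the host, path, and user agent are made up, and Nutch of course ships its own robots.txt handling.

from urllib import robotparser

# Ask the Robots Exclusion rules of a (hypothetical) host whether a given
# user agent may fetch a given URL before actually fetching it.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleCrawler", "http://www.example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")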

The queue contains the URLs to be fetched. It may be a simple memory-based, first-in-first-out queue, but usually it is more sophisticated and consists of host-specific queues, a way to prioritize fetching of more important URLs, the ability to store parts or all of the data structures on disk, and so on. The fetcher is the component that does the actual work of getting a single piece of content, for example one single HTML page. The extractor is the component responsible for finding new URLs to fetch, for example by extracting that information from an HTML page. The newly discovered URLs are then normalized and queued to be fetched. The content repository is the place where you store the content; later the content is processed and ultimately indexed. I am not going to go into more detail about crawler architectures and implementation details, but will instead take a closer look at one open source Java crawler implementation: Apache Nutch.
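To make the roles of these components concrete, here is a deliberately minimal, in-memory sketch of such a crawl loop. This is not how Nutch is implemented: the queue, extractor, and content repository are plain standard-library data structures, politeness is reduced to a fixed delay, and robots.txt handling is omitted.

import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # The "extractor": pulls href attributes out of anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href" and value]

def crawl(seeds, max_pages=10, delay=1.0):
    # The "queue" of URLs to fetch and the in-memory "content repository".
    queue, seen, repository = deque(seeds), set(seeds), {}
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        try:
            # The "fetcher": download one piece of content.
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        repository[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)          # simplified normalization
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)                          # crude politeness delay
    return repository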

Apache Nutch

Nutch was originally implemented by Doug Cutting and Michael Cafarella et al. around 2002. The goal was to make Nutch a web-scale crawler and search application, capable of fetching billions of URLs per month, maintaining an index of these pages, and allowing searches of that index 1000 times per second. The original design was proven to scale up to 100 million documents, but became impractical because of maintenance problems at that scale. During 2004, after the incubation process, Nutch became part of Apache. Soon after Google published its MapReduce paper in 2004, the Nutch architecture was rewritten to take advantage of MapReduce, and a new data storage system called the Nutch Distributed File System (NDFS) was implemented.

This new architecture allowed Nutch to be run on a large cluster of machines, making the architecture scalable in both processing power and data storage. Later, both the MapReduce execution framework and NDFS were promoted to a top-level Apache project called Apache Hadoop. Hadoop solves many of the maintenance issues Nutch had in the pre-MapReduce era. After setting up a Hadoop cluster you can control Nutch crawling from one single machine, regardless of the size of the cluster you are running Nutch on. Nutch as it exists today is still pretty much an application that helps you build a generic web search engine. It supports fetching content over various protocols such as HTTP, HTTPS, FTP, and the local file system. Nutch can also extract textual content from several document formats like HTML, RSS, Atom, PDF, and Microsoft Office formats (Word, Excel, PowerPoint) right out of the box. Nutch is not for everyone, though, as some fairly low-level skills are still required to run a crawl and maintain the indexes.

Apache Nutch Internals

Running Nutch consists of executing a number of commands from the shell in sequence. Each of those commands launches one or more MapReduce jobs that are executed by Hadoop. Next I will walk you through some of the most important commands and how they are constructed as MapReduce jobs and tasks. I'll use Python-like pseudocode to outline the functions in simplified examples for illustration purposes, but these should still capture the most important aspects of each job. Before going into the individual commands, the core terms and components that I will refer to later in this article need to be explained. The Crawl Database is a data store where Nutch stores every URL it knows about, together with its metadata. In Hadoop terms it is a SequenceFile (meaning all records are stored in a sequential manner) consisting of tuples of URL and CrawlDatum. Many other data structures in Nutch have a similar structure, and no relational databases are used. The rationale behind this kind of data structure is scalability.
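To picture what such a tuple holds, one crawldb entry can be sketched conceptually as a URL key mapped to a CrawlDatum-like value. The field names and values below are illustrative only, not the exact CrawlDatum layout.

# One conceptual crawldb record: a URL mapped to a CrawlDatum-like value.
# Field names and values are illustrative, not the exact CrawlDatum fields.
crawldb_record = (
    "http://www.example.com/",
    {
        "status": "db_fetched",          # e.g. unfetched, fetched, gone, redirect
        "fetch_time": "2011-06-30T10:00:00Z",
        "fetch_interval_days": 30,
        "score": 1.0,
        "signature": "20ab1f...",        # content hash used for de-duplication
    },
)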

The model of simple, flat data storage works well in a distributed environment. Each node gets one or more splits of the whole data set and operates on them (the map phase of MapReduce). Data can be stored inside the Hadoop Distributed File System, so nodes can access the splits from the nearest host that contains a replica. Operations (like inserts, deletes, and updates) on the Crawl Database and the other data structures are processed in batch mode; a conceptual example of the contents of the crawldb is sketched above.

The Fetch List is a data structure (SequenceFile, URL -> CrawlDatum) that contains the URL/CrawlDatum tuples that are going to be fetched in one batch. Usually the Fetch List contents are a subset of the Crawl Database, created by the generate command.

The Fetch List is stored inside a Segment. A Segment is basically a folder containing all the data related to one fetching batch. Besides the Fetch List, the fetched content itself is stored there, in addition to the extracted plain-text version of the content, the anchor texts and URLs of outlinks, protocol- and document-level metadata, and so on.

The Link Database is a data structure (SequenceFile, URL -> Inlinks) that contains all the inverted links. In the parsing phase Nutch can extract the outlinks from a document and store them in the format source_url -> target_url, anchor_text. In the inversion process we invert the order and aggregate all instances, so that the records in the Link Database look like target_url -> anchor_text[], and we can use that information later when individual documents are indexed.

Inject

The Inject command in Nutch has one responsibility: to inject more URLs into the Crawl Database. Usually you should gather a set of URLs to add and then process them in one batch, to keep the time of a single insert small.
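As a Python-like sketch of Inject (the real implementation runs separate MapReduce jobs for converting the seed list and for merging it with the existing crawldb; both steps are collapsed here and the CrawlDatum is modeled as a plain dict):

def inject_map(line):
    # Map: one line of the seed file -> (url, CrawlDatum-like dict).
    # URL normalization and filtering are reduced to a trivial check here.
    url = line.strip()
    if url.startswith("http"):
        yield url, {"status": "injected", "score": 1.0}

def inject_reduce(url, datums):
    # Reduce: merge new seeds with the existing crawldb. If the URL is
    # already known, keep the existing datum so its fetch state survives.
    datums = list(datums)
    existing = [d for d in datums if d["status"] != "injected"]
    yield url, existing[0] if existing else datums[0]

The important point is the merge in the reduce step: injecting a URL that already exists in the Crawl Database must not reset its existing state.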

Generate

The Generate command in Nutch is used to generate a list of URLs to fetch from the Crawl Database; URLs with the highest scores are preferred.
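A Python-like sketch of the selection logic follows. The real Generate job also applies URL filters, per-host limits, and partitioning by host so that a fetcher task gets all URLs of a host; the topn value and scoring below are simplified assumptions.

import heapq
from urllib.parse import urlparse

def generate_map(url, datum, now=0):
    # Map: emit candidates that are due for fetching, keyed by negated score
    # so that the sort/shuffle phase orders them best-first.
    if datum.get("fetch_time", 0) <= now:
        yield -datum.get("score", 0.0), (url, datum)

def generate_reduce(scored_entries, topn=1000):
    # Reduce: keep only the topn highest-scoring URLs and group them by host,
    # so a single fetch list does not hammer one server.
    fetch_list = {}
    best = heapq.nsmallest(topn, scored_entries, key=lambda entry: entry[0])
    for _, (url, datum) in best:
        fetch_list.setdefault(urlparse(url).netloc, []).append((url, datum))
    return fetch_list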

Fetch

The Fetcher is responsible for fetching content from URLs and writing it to disk. It also optionally parses the content. The URLs are read from a Fetch List generated by the Generator.
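A minimal Python-like sketch of the per-URL fetch step is shown below; real Nutch uses many fetcher threads, per-host queues, and protocol plugins, and records much richer status information.

import urllib.request

def fetch_map(url, datum):
    # Map: fetch one URL from the fetch list and emit the raw content
    # together with an updated datum. Error handling is simplified.
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            content = response.read()
        yield url, {"datum": dict(datum, status="fetched"), "content": content}
    except OSError:
        yield url, {"datum": dict(datum, status="fetch_failed"), "content": None}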

Parse

The Parser reads the raw fetched content, parses it, and stores the results.
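A simplified sketch of parsing HTML content follows. Nutch actually delegates this to pluggable parser plugins that handle many document formats; here the standard library's HTMLParser stands in for them, collecting plain text and outlink URLs.

from html.parser import HTMLParser

class _TextAndLinkExtractor(HTMLParser):
    # A tiny stand-in for Nutch's pluggable parsers: collects href attributes
    # and visible text from an HTML document.
    def __init__(self):
        super().__init__()
        self.outlinks = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.outlinks += [value for name, value in attrs if name == "href" and value]

    def handle_data(self, data):
        self.text.append(data)

def parse_map(url, fetched):
    # Map: raw fetched content -> (url, parse result with plain text and outlinks).
    if fetched.get("content") is None:
        return
    extractor = _TextAndLinkExtractor()
    extractor.feed(fetched["content"].decode("utf-8", errors="replace"))
    yield url, {"text": " ".join(extractor.text), "outlinks": extractor.outlinks}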

Invert links

The Invert Links step inverts the link information, so that when a document is indexed we can use the anchor texts of other documents that point to it together with the rest of that document's data.
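A Python-like sketch of the inversion is shown below; it assumes the parse step produced, for each page, a list of (target_url, anchor_text) pairs, which is a simplification of the actual parse data structures.

def invertlinks_map(source_url, parse_data):
    # Map: for each outlink of a page, emit (target_url, (source_url, anchor)).
    # parse_data["outlinks"] is assumed to hold (target_url, anchor_text) pairs.
    for target_url, anchor_text in parse_data["outlinks"]:
        yield target_url, (source_url, anchor_text)

def invertlinks_reduce(target_url, inlinks):
    # Reduce: collect all inbound links of a target URL, so its anchor texts
    # are available later when that document is indexed.
    yield target_url, list(inlinks)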

Final words

I have now gone through the core MapReduce jobs in Nutch that are related to crawling. There are also many other data processing jobs in Nutch, such as indexing, page score calculation, and removing duplicate content from the index. The best way to get familiar with the remaining algorithms is to look at the resources listed in the references. In the next article of this series I will take a look at other open source crawlers.
