Nalanda_iVia_Crawler Documentation
2.5.0
The Nalanda-iVia Focused Crawler (hereafter NiFC) is a typical focused crawler. It uses text classification to try to collect only those Web pages that belong to a previously defined topic. Unlike simpler focused crawlers, it can use two different classifiers for this task. The first, referred to here as the baseline classifier, estimates the probability that a Web page is on topic. The second, optional classifier, called the apprentice classifier, estimates how likely a link is to lead to an on-topic page. When there is an abundance of pages, the apprentice can be used to prioritize which links to follow first.
Please see the related pages for instructions on Installing Nalanda iVia Crawler and information on Configuring Nalanda iVia Crawler.
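The interplay of the two classifiers can be illustrated with a small, self-contained sketch. This is not NiFC's implementation: the in-memory TOY_WEB, the keyword-based scoring functions, and the choice to follow links from off-topic pages are all assumptions made for illustration. The baseline score decides whether a fetched page is kept as on topic, and the apprentice score orders the frontier so that the most promising links are followed first.

```python
import heapq

# Hypothetical in-memory "Web": URL -> (page text, [(anchor text, target URL), ...]).
TOY_WEB = {
    "http://a.example/": ("crawler search index",
                          [("focused crawling", "http://b.example/"),
                           ("cooking recipes", "http://c.example/")]),
    "http://b.example/": ("focused crawler classifier", []),
    "http://c.example/": ("pasta recipes",
                          [("crawler tips", "http://d.example/")]),
    "http://d.example/": ("crawler frontier queue", []),
}

# Hypothetical topic vocabulary used by both stand-in classifiers.
TOPIC_WORDS = {"crawler", "crawling", "classifier", "index", "frontier"}


def _topic_fraction(text):
    """Fraction of words in `text` that belong to the topic vocabulary."""
    words = text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / len(words) if words else 0.0


def baseline_score(page_text):
    """Stand-in baseline classifier: estimated probability a page is on topic."""
    return _topic_fraction(page_text)


def apprentice_score(anchor_text):
    """Stand-in apprentice classifier: estimated probability a link leads on topic."""
    return _topic_fraction(anchor_text)


def focused_crawl(seeds, threshold=0.5, max_pages=10):
    """Visit pages in order of apprentice priority; keep those the baseline accepts."""
    frontier = [(-1.0, url) for url in seeds]  # negated priority gives a max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited, on_topic = [], []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in TOY_WEB:
            continue  # simulated fetch failure
        text, links = TOY_WEB[url]
        visited.append(url)
        if baseline_score(text) >= threshold:
            on_topic.append(url)
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-apprentice_score(anchor), target))
    return visited, on_topic


if __name__ == "__main__":
    visited, on_topic = focused_crawl(["http://a.example/"])
    print("visit order:", visited)
    print("on topic:   ", on_topic)
```

In this toy run the link with the more promising anchor text is visited before the less promising one, which is the prioritization the apprentice is meant to provide.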
NiFC provides, amongst others, the following features:
-
High speed: 2 to 3 times as fast as, for example, Nutch.
-
Provides focused crawling by using dynamically loadable Web page classification modules.
-
Open source (GPL) and therefore freely available.
-
Easy blacklisting of URL patterns based on lists of regular expressions.
-
Honors robots.txt requirements (unless re-configured).
-
Uses load-balancing between different Web servers.
-
High-performance single-threaded implementation.
-
Highly memory efficient, e.g. 9% of the memory usage of Nutch in our tests.
-
Supports filtering based on URL patterns or Web page classification based on content.
-
Configuration either through a command-line interface or by providing a text-based configuration file.
-
Has support for an "apprentice" classifier to prioritize following the most promising links based on link context (HTML DOM tree) classification.
-
Automated, scripted installation for various Linux distributions.
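As an illustration of the URL-pattern blacklisting feature listed above, the following minimal sketch rejects any URL that matches one of a list of regular expressions. The patterns and the function name are hypothetical, and NiFC's own blacklist file format and regular-expression dialect are not shown in this document, so this only demonstrates the general idea.

```python
import re

# Hypothetical blacklist entries, chosen only for illustration.
BLACKLIST_PATTERNS = [
    r"\.(jpg|png|gif|zip)$",        # binary or image resources
    r"^https?://[^/]*/cgi-bin/",    # dynamic script directories
    r"[?&]sessionid=",              # session-specific URLs
]
BLACKLIST = [re.compile(p, re.IGNORECASE) for p in BLACKLIST_PATTERNS]


def is_blacklisted(url):
    """Return True if the URL matches any blacklist pattern."""
    return any(pattern.search(url) for pattern in BLACKLIST)


if __name__ == "__main__":
    for candidate in ("http://example.org/index.html",
                      "http://example.org/logo.PNG",
                      "http://example.org/cgi-bin/search"):
        print(candidate, "->", "skip" if is_blacklisted(candidate) else "crawl")
```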
-
Create a top-level crawl directory (here we'll refer to it as $CRAWL_DIR).
-
Create a working directory under $CRAWL_DIR, e.g. $CRAWL_DIR/working-dir.
-
NiFC works by loading one classifier (the baseline classifier) or two classifiers (baseline and apprentice) from serialized classifier files. These files have to be created by an external program and must use classifiers based on the libiViaCore Classifier object. Once the one or two classifier files have been created, copy them into $CRAWL_DIR/working-dir. The files must be named "baseline.classifier" and "apprentice.classifier".
-
Create a file with seed URLs, one per line, in $CRAWL_DIR/working-dir, e.g. called "seed.urls".
-
Create a "focused_crawler.conf" file in $CRAWL_DIR. (There are several potentially useful example config files in the .../Nalanda_iVia_Crawler/src/focused_crawler directory.) This config file should contain the following entries in the [Defaults] section:
working-dir set to $CRAWL_DIR/working-dir; action set to baseline if you use only the baseline classifier, or to meta if you use both the baseline and apprentice classifiers; and seed-urls set to $CRAWL_DIR/seed.urls.
-
Finally, start the crawler with "focus_crawler --conf-file-path=$CRAWL_DIR".
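The directory layout described in these steps can be prepared with a short script. The sketch below is an illustration under stated assumptions: the function name prepare_crawl_dir is hypothetical, the "key = value" syntax written by Python's configparser is assumed to be acceptable to NiFC (check the example config files shipped with the crawler), and seed-urls is pointed at the location where the script actually writes the seed file. It does not create the classifier files; it only reports which ones still need to be copied into the working directory.

```python
import configparser
import tempfile
from pathlib import Path


def prepare_crawl_dir(crawl_dir, seed_urls, use_apprentice=False):
    """Create the crawl directory layout and a focused_crawler.conf file.

    Returns the config file path and the classifier files still missing.
    """
    crawl_dir = Path(crawl_dir)
    working_dir = crawl_dir / "working-dir"
    working_dir.mkdir(parents=True, exist_ok=True)

    # Seed URLs, one per line.
    seed_file = working_dir / "seed.urls"
    seed_file.write_text("".join(url + "\n" for url in seed_urls))

    # Assumption: an INI-style [Defaults] section with "key = value" lines.
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["Defaults"] = {
        "working-dir": str(working_dir),
        "action": "meta" if use_apprentice else "baseline",
        "seed-urls": str(seed_file),
    }
    conf_path = crawl_dir / "focused_crawler.conf"
    with conf_path.open("w") as handle:
        config.write(handle)

    # The serialized classifiers must be produced by an external program.
    expected = ["baseline.classifier"]
    if use_apprentice:
        expected.append("apprentice.classifier")
    missing = [name for name in expected if not (working_dir / name).exists()]
    return conf_path, missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf_path, missing = prepare_crawl_dir(tmp, ["http://www.ucr.edu"])
        print(conf_path.read_text())
        print("Classifier files still to copy in:", missing)
```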
One unconventional way to use NiFC is to crawl a Web community defined by a, typically small, set of URL regular expressions. To accomplish this, provide baseline and apprentice "classifiers" that return an estimated on-topic probability of 1.0 for a URL matching one of the desired patterns and 0.0 for non-matching URLs. This method has been used to crawl all of the ucr.edu domain by "training" two RegExpBasedBinaryClassifiers with "(http|https)://([^/]*\.)?ucr\.edu(:[0-9]+)?(/.*)?" as the pattern.
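The behaviour of such a regular-expression "classifier" can be sketched in a few lines. This is not the libiViaCore RegExpBasedBinaryClassifier; the function name is hypothetical, and it is an assumption that the pattern must match the whole URL (hence fullmatch) and that Python's re dialect agrees with the library's for this pattern.

```python
import re

# Pattern quoted in the text above; whole-URL matching is an assumption.
UCR_PATTERN = re.compile(r"(http|https)://([^/]*\.)?ucr\.edu(:[0-9]+)?(/.*)?")


def url_on_topic_probability(url):
    """Return 1.0 if the URL matches the community pattern, else 0.0."""
    return 1.0 if UCR_PATTERN.fullmatch(url) else 0.0


if __name__ == "__main__":
    for url in ("http://www.ucr.edu/",
                "https://cs.ucr.edu:8080/page",
                "http://notucr.edu/",
                "http://ucr.edu.evil.example/"):
        print(url, url_on_topic_probability(url))
```

Because the host part must be either exactly ucr.edu or end in ".ucr.edu", look-alike hosts such as notucr.edu are scored 0.0 under this reading of the pattern.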
-
Install libiViaCore, libiViaInvertedIndex, libiViaOaiPmh, libiViaMetadata, Nalanda, iVia
-
Change directories into your Nalanda-installed directory and create a directory called "working"
-
Change directories into your Nalanda-installed/etc directory. There should be a template configuration file. Edit this file and make the following changes.
-
log_filename = PATH_TO_NALANDA_INSTALLED/working/crawler.log
-
working_dir = PATH_TO_NALANDA_INSTALLED/working
-
seed_url_filename = PATH_TO_NALANDA_INSTALLED/working/seed.urls
-
classify_urls = true
-
topic = "positive"
-
use_wfv_documents = "false"
-
Change directories into the Nalanda-installed/working directory and do the following:
-
Create and edit "classifier.txt" and add the following line (you can also edit this regular expression to your liking): positive:^https?://([^/]+\.)?ucr\.edu(/.*)?$
-
Type: iViaCore-create-reg-exp-based-binary-classifier classifier.txt baseline.classifier
-
Create and edit "seed.urls" and add one line: "http://www.ucr.edu". (This is where the crawler will start. You can also add more than one seed URL.)
-
You should now be ready to run the crawl. Change directories into the Nalanda-installed/bin directory and run ./focused_crawler ../etc/CONFIG_FILE
-
You can view the progress by using "tail -f" on Nalanda-installed/working/crawler.log
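Before starting a crawl set up this way, it can be useful to check that every seed URL is accepted by the pattern in "classifier.txt", since a seed that the classifier rejects gives the crawler nothing on topic to start from. The helper below is a hypothetical sketch, not part of NiFC or iViaCore. It assumes the "label:pattern" layout shown in the steps above, splits on the first colon, uses the pattern with the dot in ucr.edu escaped, and assumes Python's re dialect agrees with the one used by the classifier tool for this pattern.

```python
import re


def parse_classifier_line(line):
    """Split a 'label:pattern' line on the first colon and compile the pattern."""
    label, _, pattern = line.strip().partition(":")
    if not label or not pattern:
        raise ValueError("expected 'label:pattern', got %r" % line)
    return label, re.compile(pattern)


def seeds_not_matching(classifier_line, seed_urls):
    """Return the seed URLs that the classifier pattern would not accept."""
    _, regex = parse_classifier_line(classifier_line)
    return [url for url in seed_urls if not regex.search(url)]


if __name__ == "__main__":
    line = r"positive:^https?://([^/]+\.)?ucr\.edu(/.*)?$"
    seeds = ["http://www.ucr.edu", "http://www.example.org"]
    rejected = seeds_not_matching(line, seeds)
    print("Seeds the pattern rejects:", rejected)
```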