The Expert-Guided Crawl Service

Warning: This page contains some outdated screen shots.

The Expert-Guided Crawl Service lets people (and computer programs) use iVia's Expert-guided Crawler with Drill-down tool to crawl a Web site, discover the useful resources on that Web site, and create a set of iVia records that describe the discovered resources.

When RiSI is enabled, a form is provided for submitting URLs to the Metadata Assignment service on the iVia Adders' Homepage.

A common use for the Expert-Guided Crawl Service is to crawl a "Virtual Library" site like ChemInfo (http://www.indiana.edu/~cheminfo/), discover all the links to "offsite" resources, and create a record describing each such resource. Another potential use is to crawl a Web site, find all the PDF files, and create a record describing each PDF.

Be aware that the crawler is polite: it obeys robots.txt files, and usually only downloads one Web page at a time from the Web site being crawled. This means that it can take a long time to crawl a large site. Consequently, the Expert-Guided Crawl Service only operates in background mode.
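
As an illustration of what obeying robots.txt involves, the following minimal sketch uses Python's standard urllib.robotparser module. It is not the iVia crawler's own code: the robots.txt content, the user-agent name example-crawler and the example.org URLs are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# file from the site being crawled before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

checker = make_checker(ROBOTS_TXT)

# "example-crawler" is an illustrative user-agent name, not iVia's.
allowed = checker.can_fetch("example-crawler", "http://example.org/bestbugs/index.htm")
blocked = checker.can_fetch("example-crawler", "http://example.org/private/notes.htm")
```

A polite crawler would skip any URL for which can_fetch returns False, and would fetch the remaining pages one at a time rather than in parallel.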

This chapter explains how to use the Expert-Guided Crawl Service by working through the steps required to harvest all the useful links on the Best Bugs site at http://pests.ifas.ufl.edu/bestbugs.

Step 1: Setting the Crawl Parameters

A simple form for submitting Expert-Guided Crawls to the iVia installation is provided from the Adders' Homepage (if RiSI is enabled). This form is shown in Figure 1 below, configured to crawl the "Best Bugs" Web site.

Figure 1: Expert-Guided Crawler Configuration Screenshot (out of date!).

The form is divided into three sections, from the most important at the top to the least important at the bottom.

Configuring the Expert-Guided Crawl Parameters

The Expert-Guided Crawl Parameters section is used to set the fundamental crawl parameters. The first is the Start URL, which represents the "top" of the site to be crawled. The crawler will start crawling from this URL, and will only crawl pages which are on the same host, and which are its "children" on the Web site. These pages are said to be the onsite pages.
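
The onsite test can be sketched as follows. This is one possible reading of the rule, not the crawler's actual implementation: it assumes that a "child" is any page on the same host whose path begins with the directory part of the Start URL's path. The function names site_prefix and is_onsite are invented for this example.

```python
from urllib.parse import urlsplit

def site_prefix(start_url: str) -> str:
    """Directory part of the start URL's path, always ending in '/'."""
    path = urlsplit(start_url).path or "/"
    return path[: path.rfind("/") + 1]

def is_onsite(start_url: str, candidate: str) -> bool:
    """True if candidate is on the same host and below the start directory."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    same_host = (cand.hostname or "") == (start.hostname or "")
    return same_host and (cand.path or "/").startswith(site_prefix(start_url))

# Same host, below the start directory: onsite.
onsite = is_onsite("http://pests.ifas.ufl.edu/bestbugs/",
                   "http://pests.ifas.ufl.edu/bestbugs/criteria.htm")

# Different host: offsite, so it is a candidate result rather than a page to crawl.
offsite = is_onsite("http://pests.ifas.ufl.edu/bestbugs/",
                    "http://entnemdept.ifas.ufl.edu/")
```

Under this assumption, a Start URL written without a trailing slash (such as http://pests.ifas.ufl.edu/bestbugs) has the directory "/", so every page on the host counts as onsite.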

The Target Media Types parameter is used to specify what URLs the crawler should report based on their resource types. You can select any (or all) of a range of resource types. Note that not all Media Types are handled by the Metadata Assigner, so even though some URLs are reported, no records will be created from them.

The Target Resource Location parameter is used to specify what URLs the crawler should report based on their target location. The options are:

If the Create record for start URL option is set, then an additional record will be created describing the Start URL. I have decided not to select this option (the default).

Configuring Display, Notification and Storage of Results

The Display, Notification and Storage of Results section is used to control the way the results of the crawl are communicated to the user. The Harvest Tag option is a short identifier that is added to every record created by the Expert-Guided Crawl, and will be used to "harvest" the result set with OAI-PMH. You will almost always want to choose a new, unique harvest tag for each task. In this example, we have supplied the tag bestbugs to describe our result set. If you use the default value, automatic, then a unique harvest tag will be automatically generated for the task.

The Email Notification Address is an optional parameter that can be used to specify an email address that will be notified when the remote service is complete. This is particularly useful for long-running tasks, and in this example, I have requested a notification be sent to my email address.

The Log Verbosity option is used to control the level of detail of the events that are reported to the task log file. By default, this is set to 3 (normal output). For more information (for example, to find out more about the program's behavior on a given site) it may be useful to select 4 (detailed output) instead. You will probably never want to use any level other than these.

Configuring Fine-tuning Parameters

The final set of parameters is used to fine-tune the crawl. The Crawler Time Limit option is used to limit how long the crawler runs. By default it is set to 2 hours (7200 seconds). Note that this controls the time taken by the crawl; additional time is spent creating results from the URLs discovered by the crawl, and this can take as long as the crawl itself.

The Crawler Drill-Down Limit option controls how many "levels deep" into the site the crawler may venture. A value of 0 (the default) means no limit is imposed. The Crawler On-site Page Limit is used to limit the number of onsite pages that are crawled by the crawler; typically this is set to 0, meaning no limit is imposed. The Crawler Result Limit is used to restrict the size of the result set, and again this is normally set to 0, meaning no limit.
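
The interaction of these three limits can be sketched with a small breadth-first walk over an in-memory link table. This is an illustration only: the function crawl, the link table and the page names are invented, and the sketch assumes that the drill-down limit counts levels below the start page. As on the form, a value of 0 means no limit.

```python
from collections import deque

def crawl(start, links, depth_limit=0, page_limit=0, result_limit=0):
    """Breadth-first walk over `links`, a dict: page -> list of (url, is_onsite).

    A limit of 0 means "no limit". Returns (pages_visited, results).
    """
    visited, results = [], []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if page_limit and len(visited) >= page_limit:
            break  # onsite page limit reached
        visited.append(page)
        for url, onsite in links.get(page, []):
            if url in seen:
                continue
            seen.add(url)
            if onsite:
                # Only queue onsite pages that are within the drill-down limit.
                if not depth_limit or depth + 1 <= depth_limit:
                    queue.append((url, depth + 1))
            else:
                if result_limit and len(results) >= result_limit:
                    return visited, results  # result limit reached
                results.append(url)
    return visited, results

# Hypothetical three-level site: each page links to one offsite resource.
LINKS = {
    "top": [("a", True), ("x1", False)],
    "a": [("b", True), ("x2", False)],
    "b": [("x3", False)],
}

unlimited = crawl("top", LINKS)
one_level = crawl("top", LINKS, depth_limit=1)
one_page = crawl("top", LINKS, page_limit=1)
two_results = crawl("top", LINKS, result_limit=2)
```

With no limits the walk visits all three pages and reports all three offsite links; each limit cuts the walk short in a different way.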

The Links Anchored By Images option can be used to make the crawler ignore links whose anchors are images (as opposed to text fragments). On some sites, these images are uniformly advertising material, and better ignored. However, the default is to report these links because some sites use images to provide useful navigational icons. Finally, the Crawler Verifies Result Links Work option can be used to force the crawler to check every link in the result set as it works, and to discard broken links. Generally, this option is not useful because it is faster to wait until the end and then check all the results in parallel (this is the default behavior).
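
The idea of checking all the results in parallel at the end can be sketched with a thread pool. This is not the crawler's own code: filter_working is an invented name, and the link check is passed in as a function so that the example runs without network access. A real check would request each URL and inspect the response.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_working(urls, link_works, max_workers=8):
    """Check every URL concurrently and keep only those that pass.

    `link_works` is any callable taking a URL and returning True or False.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(link_works, urls))  # results keep input order
    return [url for url, ok in zip(urls, flags) if ok]

# Stand-in check for the example: pretend URLs ending in "/broken" fail.
def stub_check(url):
    return not url.endswith("/broken")

# Hypothetical result set.
RESULTS = [
    "http://example.org/a",
    "http://example.org/broken",
    "http://example.org/b",
]

working = filter_working(RESULTS, stub_check)
```

Because the result URLs usually live on many different servers, checking them concurrently does not conflict with the one-page-at-a-time politeness applied to the site being crawled.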

Step 2: Launch the Task

Once we have configured the crawl parameters as described above, we start the task by clicking on the Submit Query button. This action launches the crawler as a background task, then forwards the user directly to the Log File Viewer for this task.

Step 3: Monitoring the Crawl with the Log File Viewer

Once the crawl is underway, we can monitor it in the iVia Log File Viewer. From the summary screen in Figure 2, we click on the Log Link, and see the log file in Figure 3.

Figure 3: Viewing the Task Log.

The iVia Log File Viewer displays a log file with additional hypertext mark-up for ease of reading and navigation. The box at the top of the page describes the file and page being displayed. Below it the log file is visible.

The Expert-Guided Crawl Service log file is divided into three main sections. The first, which starts with the bold entry Remote iVia Service Interface: Initiating task 13364, outputs the task parameters and useful links. In this case, 13364 is the unique task identifier assigned to this request. The harvest_tag line shows the harvest tag that will later be used to retrieve the results, and the Email Notification line shows who will be notified when the task is complete.

The Log line provides a link to the task log displayed in the iVia Log File Viewer (i.e. the current page).

The Review Records line contains a link to the results generated by the task. Since the Expert-Guided Crawl does not generate records for its results until it has finished crawling the Web site, the search will initially return no results; only when the crawl is complete will this be useful. It is discussed in more detail below.

The other lines provide more technical information. The details of the user who requested the task are shown, and the Sample OAI-PMH Request line provides a link to an OAI-PMH ListRecords page generated using the crawl's harvest tag as the set parameter.

The second major section begins with the boldface heading Launching Expert-Guided Crawl and marks the beginning of the first major phase of the task: crawling the Best Bugs Web site. A link is provided to a more detailed log file, called 13364.log-crawl.log, and a summary entry notes that 23 results were found.

The third major section begins with the boldface heading Creating records for the new resources and marks the beginning of the second major phase of the task: creating new records representing the resources discovered. Again, a link is provided to a supporting log file, 13364.log-create.log, and an informational message notes that 22 new records were created.

The log of the crawl itself can be viewed by clicking on the link to 13364.log-crawl.log. Figure 4 shows this log in the log file viewer.

Figure 4: Viewing the Crawler Log.

The crawler starts by processing the Root URL at the bold line Processing http://pests.ifas.ufl.edu/bestbugs. The following lines show the useful URLs that are extracted from this page, and whether they are onsite URLs which will be crawled (like http://pests.ifas.ufl.edu/bestbugs/criteria.htm and http://pests.ifas.ufl.edu/software/poster.htm), or results that will have records created.

The second page is processed beginning with the bold line Processing http://pests.ifas.ufl.edu/bestbugs/criteria.htm. Although five result URLs are extracted from this page, they all link to the Web pages of the authors of the Web site, and as such are not useful members of our result set. We will discuss how to blacklist pages like these below.

The other supporting log file is 13364.log-create.log. This log file lists each result URL, and the record created from it. A short extract is shown in Figure 5.

Figure 5: Viewing the Record Creation Log.

Note that the last URL visible on the page could not be downloaded, so no record was created. This is why the main log reports that 23 results were discovered, but only 22 records were created.

Step 4: Refining the Result Set

Once the crawl is complete, we can review the result set with the iVia SQL Search program. There is a link at the start of the task log file labelled Review Results, visible in Figure 3. Following this link leads to the screen visible in Figure 6.

Figure 6: Reviewing the Result Set.

This result set was generated by performing a search for all the records in the iVia database that have their harvest tag set to bestbugs. Each of the records is displayed with its URL and Title metadata, and can be viewed, edited, deleted, and otherwise updated. In practice the most useful actions at this point are:

In Figure 6, the first two records visible appear to be useful insect-related records: exactly the type of result we are looking for. The third and fourth results (records 101968 and 101967), however, are the homepages of two of the page creators, and we do not want them in our result set (we can confirm this by viewing the records or visiting the pages). Although it would be simple to delete these records, it is better to blacklist the pages so they are never added again.

We note at this point that both these homepages are on the same server, http://entnemdept.ifas.ufl.edu, so we decide to blacklist this entire Web server (since it contains only staff homepages, which are unlikely to be useful). We begin by clicking on the "Blacklist URL" link next to the record for "Malcolm T. Sanford". This launches the iVia Blacklist URL program, and we choose the Reject all URLs on the site http://entnemdept.ifas.ufl.edu option and Submit our selection. This blacklists the two bad URLs visible in Figure 6, and two other URLs discovered during the crawl which are not visible in the Figure. The records will remain in the database for now, but will be ignored by most operations and will eventually be deleted. See the Blacklisting URLs chapter for a discussion of the implications of blacklisting a URL.

Looking through the result list a little further, we discover three other URLs that we do not want: one is another staff URL, which we blacklist (i.e. Reject the URL http://csssrvr.entnem.ufl.edu/~walker); and the others are a link to a textbook that is not required, and to the publisher of that textbook. We blacklist the publisher's Web site (i.e. Reject all URLs containing .ifasbooks.ufl.edu/merchant2/).
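
The three kinds of blacklist rule used in this walkthrough (reject one URL, reject a whole site, reject any URL containing a fragment) can be modelled as simple predicates. This sketch is only an illustration of the matching logic; it does not show how the iVia Blacklist URL program stores or applies its rules, and all the function names and the test URLs marked as hypothetical are invented.

```python
from urllib.parse import urlsplit

def reject_url(url):
    """Rule matching exactly one URL (ignoring a trailing slash)."""
    target = url.rstrip("/")
    return lambda candidate: candidate.rstrip("/") == target

def reject_site(site_url):
    """Rule matching every URL on the same host as site_url."""
    host = urlsplit(site_url).hostname
    return lambda candidate: urlsplit(candidate).hostname == host

def reject_containing(fragment):
    """Rule matching every URL that contains the given text."""
    return lambda candidate: fragment in candidate

# The three rules chosen in the walkthrough above.
RULES = [
    reject_site("http://entnemdept.ifas.ufl.edu"),
    reject_url("http://csssrvr.entnem.ufl.edu/~walker"),
    reject_containing(".ifasbooks.ufl.edu/merchant2/"),
]

def is_blacklisted(url, rules=RULES):
    return any(rule(url) for rule in rules)

# Hypothetical URLs used only to exercise the rules.
staff_page = is_blacklisted("http://entnemdept.ifas.ufl.edu/staff/page.htm")
publisher = is_blacklisted("http://www.ifasbooks.ufl.edu/merchant2/item")
kept = is_blacklisted("http://pests.ifas.ufl.edu/bestbugs/criteria.htm")
```

A site-wide rule is the broadest of the three, so it is best reserved for servers, like the staff homepage server here, where no page is likely to be wanted.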

After blacklisting these unwanted records, the result set is complete. If we were to run the crawl again with a different harvest tag (e.g. bestbugs-2), then on the second run the crawler would ignore the blacklisted URLs, and not introduce them into the result set.

It is also possible to blacklist onsite URLs so that the crawler will not traverse them looking for results. For example, we could blacklist the onsite URL http://pests.ifas.ufl.edu/bestbugs/criteria.htm visible in Figure 4, and then the crawler would not visit it, and would not add any of the poor-quality result URLs appearing on that page. (There is a form available from the Adders' Homepage for blacklisting any URL.)

Step 5: Harvesting the Result Set

When the task is complete, the results are stored as records in the iVia database. They can be harvested using OAI-PMH, with the harvest tag as the OAI-PMH set. The location of the OAI-PMH server, and an example ListRecords query, appear in the main log file for easy reference.

In our example, our harvest tag is bestbugs, and the OAI-PMH query is http://hostname.here/cgi-bin/OAI-PMH-server?verb=ListRecords&metadataPrefix=nsdl_dc&set=bestbugs.
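
A harvesting program can build this query from its parts. The sketch below only constructs the request URL; the function name list_records_url is invented, and hostname.here is the same placeholder used above and must be replaced with the address of a real iVia installation.

```python
from urllib.parse import urlencode

def list_records_url(base_url, harvest_tag, metadata_prefix="nsdl_dc"):
    """Build an OAI-PMH ListRecords request that uses the harvest tag as the set."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": harvest_tag,
    })
    return f"{base_url}?{query}"

# Placeholder server address, as in the text above.
url = list_records_url("http://hostname.here/cgi-bin/OAI-PMH-server", "bestbugs")
```

Using urlencode rather than joining strings by hand ensures that a harvest tag containing spaces or other special characters is escaped correctly.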

Conclusion

This chapter has presented a step-by-step explanation of how to use the Expert-Guided Crawl Service to crawl a Web site for useful URLs, create records from the URLs and harvest the result set. An important additional step is to review the result set and blacklist undesirable URLs: this expert interaction greatly improves the quality of the final result.