Blacklisting URLs
The Remote iVia Service Interface can be used to produce useful sets of records completely automatically; however, it is often useful for a human expert to review the results and blacklist undesirable URLs. This expert interaction greatly improves the quality of the final result set.
When a URL is blacklisted, that URL will be ignored in (almost) all future iVia operations. For example, blacklisted URLs are ignored by the Web Crawlers, the Record Builder, and the OAI-PMH importer. Additionally, most iVia installations delete records with blacklisted URLs; this is described in more detail below.
How to Blacklist
URLs are blacklisted using the iVia Blacklist URLs program. This can be accessed in two ways. First, Web pages in the iVia Adders' Web site that are used to view records, including the RiSI "Review Result Set" page, will provide a "Blacklist URL" icon which will send the record's URL to the blacklisting interface. The icon looks like this:
Second, the Adders' Homepage contains a Blacklist URLs link that can be used to blacklist arbitrary URLs.
Why to Blacklist
In general, it is a better use of an expert's time to blacklist URLs than it is to delete records. When URLs are blacklisted, they are never added again, even if the task is repeated later. Furthermore, a URL that has been blacklisted by an expert working on a specific task is blacklisted for all programs on that iVia installation: that URL will be ignored in all future tasks. This prevents duplication of effort.
Figure 1: A Blacklisted URL in a Result set. Note the Blacklist URL icon is absent, and the URL is described as (blacklisted!). This record will be deleted by the delete_blacklisted_records program overnight.
Blacklisting URLs, Web sites and Patterns
Although we speak of blacklisting URLs (and occasionally of blacklisting records), the blacklisting process does not target particular URLs as such. Instead, it helps us define blacklist patterns that are matched against URLs. If a URL matches any of the known blacklist patterns, then that URL is said to be blacklisted.
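The matching rule described above can be sketched in a few lines of Python. This is only an illustration: the source does not say what syntax iVia uses for blacklist patterns, so the sketch assumes they are regular expressions, and the patterns, the example.com and example.org URLs, and the is_blacklisted function name are all hypothetical.

```python
import re

# Hypothetical blacklist patterns, written as regular expressions.
# iVia's real pattern syntax may differ; this is an assumption.
BLACKLIST_PATTERNS = [
    r"^http://www\.example\.com/ads/",                # one directory on one site
    r"^https?://([^/]+\.)?spam\.example\.org(/|$)",   # a site and its subdomains
]


def is_blacklisted(url, patterns=BLACKLIST_PATTERNS):
    """Return True if the URL matches any of the blacklist patterns."""
    return any(re.search(pattern, url) for pattern in patterns)
```

A URL is blacklisted as soon as one pattern matches, so adding a single general pattern can blacklist many URLs at once.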
The advantage of using blacklist patterns, instead of specific URLs, is that we can blacklist several URLs (or several hundred URLs) at once. We can use a very specific pattern to blacklist a single URL, or a more general pattern to blacklist a group of related URLs, or an even more general pattern to blacklist all the URLs on a Web site (or sites). For example, suppose we want to blacklist http://www.explore.cornell.edu/scene.cfm?scene=Beetle+Science. The Blacklist URL program lets us choose one of four patterns that will blacklist this URL, shown in Figure 2.
Figure 2: Blacklisting a URL.
The first pattern is very specific: it will blacklist the initial URL, and only that URL. The next two are much more general: they allow you to blacklist all the URLs on the site, or on related sites, respectively. The final option is somewhere in between: it blacklists all URLs that use the same CGI script as the original. The list of blacklist options will vary depending on the URL that is submitted.
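To make the four kinds of option concrete, the following Python sketch derives one candidate pattern of each kind from a URL. It is not the Blacklist URLs program's actual logic, which the source does not describe. The function name candidate_patterns, the use of regular expressions, and the rule that "related sites" means hosts sharing the last two labels of the hostname are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def candidate_patterns(url):
    """Return (description, regex) pairs, from most to least specific use.

    Hypothetical reconstruction: the real option list may be built differently.
    """
    parts = urlsplit(url)
    scheme = re.escape(parts.scheme)
    site = re.escape(parts.netloc)
    # Assumption: "related sites" share the last two hostname labels,
    # e.g. www.explore.cornell.edu -> cornell.edu.
    domain = re.escape(".".join((parts.hostname or "").split(".")[-2:]))

    options = [
        ("this URL only", "^" + re.escape(url) + "$"),
        ("all URLs on this site", "^" + scheme + "://" + site + "(/|$)"),
        ("all URLs on related sites",
         "^" + scheme + "://([^/]+\\.)?" + domain + "(/|$)"),
    ]
    # The script option only makes sense when the URL has a query string,
    # which is one reason the list of options can vary by URL.
    if parts.query:
        options.append(
            ("all URLs using the same script",
             "^" + scheme + "://" + site + re.escape(parts.path) + "\\?"))
    return options
```

For the example URL above, this sketch yields four options. A URL with no query string yields only three, since there is no script to generalize over.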
Viewing the Blacklist and Log File
The "Adders' Blacklist" is a file containing all the blacklist patterns that Adders have defined on an iVia installation. You can view it with the iVia Log File Viewer. Most iVia installations have an additional, permanent blacklist (that was distributed with iVia).
When an Adder blacklists a URL, this operation is recorded in a log file. We are keeping this log to use as training data if we decide to pursue research into automatic blacklisting. The log file can be viewed in the iVia Log File Viewer.
Records with Blacklisted URLs are Deleted Overnight
When a URL is blacklisted, any occurrence of the URL is ignored in almost all subsequent operations. However, existing records associated with that URL are not immediately deleted, because the addition of one blacklisting pattern may cause several (even hundreds of) records to be blacklisted, and finding all the affected records is time-consuming.
Instead, each night a program called delete_blacklisted_records is run as part of iVia's nightly processing, and it is usually configured to delete any local, robot-created records whose URLs are blacklisted. A log of its operations can be viewed in the iVia Log File Viewer.
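The selection rule for the nightly run can be sketched as follows. This is not the delete_blacklisted_records implementation, which is not shown in this document. The Record class, its is_local and robot_created fields, and the use of regular expressions for patterns are hypothetical stand-ins chosen to mirror the description above.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record fields; real iVia records have a different schema.
    url: str
    is_local: bool
    robot_created: bool


def select_records_to_delete(records, patterns):
    """Return the local, robot-created records whose URL is blacklisted."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        record for record in records
        if record.is_local
        and record.robot_created
        and any(regex.search(record.url) for regex in compiled)
    ]
```

Under this rule, records created by a human expert, or records that are not local, are kept even if their URLs match a blacklist pattern.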