Chapter 7: The URL Checker
The URL Checker is a tool for maintaining the URLs in iVia. A program called the Site Checker regularly checks every record to make sure its URL is still working; a second program, the URL Checker, is used to examine the results.
The Site Checker
Each web site in the live database is regularly examined to see whether the site has changed or its URL has grown stale. In INFOMINE, each record is checked once each week.
When each record is checked, it is assigned a Status code that indicates whether the record requires human maintenance.
While the Site Checker is checking a site, it can also download several pages of documents for use in iVia full-text searches.
Site Checker Status codes
Here are the status codes that the site checker returns:
- Invalid URL: The Site Checker skipped this record because it does not have a valid URL.
- Not an HTTP URL: The Site Checker skipped this record because it does not have a Web URL (i.e., the URL does not start with http://). This may not be an error: telnet URLs, for example, may form the basis of useful iVia records.
- Not an HTML Document: The Site Checker skipped this record because the document is not an HTML document. Again, this may be intentional.
- Blocked by robots.txt: The URL points to a resource that is blocked by a robots.txt file, so the site_crawler cannot download it and examine it.
- Redirected onsite: The URL is being automatically redirected to another URL on the same Web site (which will be suggested). This does not necessarily mean that the suggested URL is better than the current one. Sometimes these redirections are temporary; at other times they redirect a standard top-level URL to a more complicated URL (in which case it is more useful to keep the top-level URL).
- Redirected offsite: The URL is being automatically redirected to another URL on a different Web site (which will be suggested). This often means the Web site has moved or changed.
- Modified: The URL is available, but its content has changed.
- Failed: The Site Checker could not reach the URL over the Internet the last time it ran.
- No error: There is no problem with the record's URL.
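The distinctions between several of these codes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Site Checker's actual code: the function names precheck and classify_redirect are hypothetical, and comparing host names exactly is a simplification (a real checker might treat www.example.org and example.org as the same site).

```python
from urllib.parse import urlsplit


def precheck(url):
    """Return a skip status for URLs that would not be fetched, or None.

    Hypothetical helper: mirrors the "Invalid URL" and "Not an HTTP URL"
    descriptions above, where a Web URL is one that starts with http://.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        return "Invalid URL"
    if not parts.scheme or not parts.netloc:
        return "Invalid URL"
    if parts.scheme.lower() != "http":
        return "Not an HTTP URL"
    return None


def classify_redirect(original_url, final_url):
    """Label a fetch by comparing the requested URL with the final URL.

    Assumption: "same Web site" means an identical host name.
    """
    if original_url == final_url:
        return "No error"
    original_host = (urlsplit(original_url).hostname or "").lower()
    final_host = (urlsplit(final_url).hostname or "").lower()
    if original_host == final_host:
        return "Redirected onsite"
    return "Redirected offsite"
```

For example, a telnet URL is skipped by precheck as "Not an HTTP URL", while a record whose URL lands on a different host is labeled "Redirected offsite".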
The URL Checker
The URL Checker interface is designed to help you quickly find records that need to be maintained. It lets you query the database for records based on their Site Checker status, INFOMINE Category, and the record creator’s institution.
The top of the page contains the URL Checker Search box, which lets you set up a query. You can choose any (or all) of the Site Checker Status results described above, and click on the Search button to get a list of records with that status.
You can filter URL Checker Search results by INFOMINE Category or by the institution of the record creator. Only records from the selected categories are shown; selecting no categories is the same as selecting them all. Finally, there are options to limit your search to a particular number of records (10 is the default), to expert records only (recommended), and to local records only (recommended).
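The way these search options combine can be sketched as a simple filter. This is a minimal illustration, not the iVia implementation: the Record fields and the url_checker_search function are hypothetical names, and the defaults simply follow the recommendations above.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical record shape, for illustration only.
    record_id: int
    title: str
    url: str
    status: str
    categories: set = field(default_factory=set)
    institution: str = ""
    expert: bool = True
    local: bool = True


def url_checker_search(records, statuses, categories=(), institution=None,
                       limit=10, expert_only=True, local_only=True):
    """Return up to `limit` records matching the selected options."""
    wanted_statuses = set(statuses)
    wanted_categories = set(categories)
    results = []
    for rec in records:
        if rec.status not in wanted_statuses:
            continue
        # Selecting no categories is the same as selecting them all.
        if wanted_categories and not (rec.categories & wanted_categories):
            continue
        if institution is not None and rec.institution != institution:
            continue
        if expert_only and not rec.expert:
            continue
        if local_only and not rec.local:
            continue
        results.append(rec)
        if len(results) >= limit:
            break
    return results
```

Calling url_checker_search(records, {"Failed", "Redirected offsite"}) would return at most ten expert, local records with either status, from any category.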
The result list presents the records that match your query. The Record Id number, Title and URL of the resource are shown; clicking on the URL will open that URL in a new window.
Several standard record editing icons are displayed: the "View Record", "Comment on Record", "Edit Record", and "Delete Record" buttons work as they do elsewhere. The "Clear" icon can be used to clear the Site Checker Status for a particular record, which will remove the record from the list (until the next time the Site Checker sets its status).
For Redirected URLs, an Update URL icon appears when a suggested new URL has been identified. Pressing the button replaces the record’s current URL (which is shown, marked "Current URL") with the suggested new URL (which is also shown, marked "Suggestion"). The suggestion is not always appropriate (see the "Redirected onsite" description above).
Finally, there are a few unusual features in the result list that do not appear anywhere else. The third major column looks something like this:
Failed
07/13/2004
The topmost element is the record's Site Checker Status, in this case Failed. The middle element is the date that the site_checker last changed the status of this record. In this example, the date is several weeks ago, which means the Site Checker has looked at the record several times, and each time it has returned the same status: Failed.
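To make the reasoning about the date concrete, the following sketch estimates how many full weeks a record has held its current status. It assumes the date is displayed as MM/DD/YYYY, as in the example above; the function name weeks_at_status is hypothetical.

```python
from datetime import date, datetime


def weeks_at_status(status_date, today):
    """Return the number of full weeks since the status last changed.

    Assumption: status_date is formatted MM/DD/YYYY.
    """
    changed = datetime.strptime(status_date, "%m/%d/%Y").date()
    return max((today - changed).days // 7, 0)
```

With weekly checks, as in INFOMINE, a result of 4 suggests the Site Checker has rechecked the record about four times since the status was set and found the same problem each time.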
Finally, the "No Fix" button is a special operation that is used to tell the site_checker to ignore this record in the future. If you click on "No Fix", the record will never show up in the URL Checker again, even if there is a problem with the site. This feature has a number of uses. For example, if you maintain a record for a frequently changing URL (e.g. a news site like CNN.com), you can mark it "No Fix" and it will no longer be displayed in the list of query results.