The INFOMINE Virtual Library Crawler
The INFOMINE Virtual Library Crawler is a program that uses the 1300 academic virtual libraries cataloged in INFOMINE as a source of academic resources. It has contributed over 100,000 robot-created records to the INFOMINE database.
The Virtual Library Crawler has four goals: to automatically discover new resources for inclusion in INFOMINE; to measure the "worth" of each new resource by counting how many academic virtual libraries link to it; to automatically create INFOMINE records for the new resources; and to suggest to our editors which new resources most merit review by librarian experts.
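The "worth" measure described above can be illustrated with a minimal sketch. This is not the crawler's actual code: the function name rank_by_library_count, the input layout (a mapping from each virtual library to the URLs harvested from it), and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def rank_by_library_count(library_links, known_urls=frozenset()):
    """Rank newly discovered URLs by how many libraries link to them.

    library_links: dict mapping a library name to the URLs harvested from it
                   (hypothetical layout, assumed for this sketch).
    known_urls:    URLs that already have records and are therefore not "new".
    """
    counts = Counter()
    for urls in library_links.values():
        # A library counts at most once per URL, however often it repeats the link.
        counts.update(set(urls) - set(known_urls))
    # Highest count first; ties are broken alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Under these assumptions, the URLs at the top of the returned list would be the first candidates to suggest to editors for review.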
Technical details
The INFOMINE Virtual Library Crawler is based on the Site Crawler code in the INFOMINE libiViaCore library. It respects robots.txt files, and it crawls the pages of any one virtual library site sequentially, one request at a time, to keep the load on that site low. Because there are so many sites to be crawled (over 1300 at last count), the crawler is parallelized across sites: it crawls ten different virtual libraries at once. Only URLs are harvested from these sites; metadata and full text are generated from the primary Internet resource itself.
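The scheduling pattern described above (sequential within a site, parallel across sites) might be sketched as follows. This is an illustration only, not the libiViaCore implementation: the names crawl_site and crawl_all, the caller-supplied fetch function, and the use of a thread pool are assumptions. Only the limit of ten sites at once comes from the text.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_SITES = 10  # from the text: ten virtual libraries at once


def crawl_site(site, pages, fetch):
    """Fetch one site's pages strictly one after another.

    fetch(site, page) is a hypothetical callable returning the URLs found
    on that page; a real crawler would also check robots.txt before fetching.
    """
    harvested = []
    for page in pages:  # sequential: never two requests to one site at once
        harvested.extend(fetch(site, page))
    return site, harvested


def crawl_all(sites, fetch):
    """Crawl many sites, at most MAX_PARALLEL_SITES concurrently.

    sites: dict mapping a site name to the list of pages to visit
           (assumed layout for this sketch).
    """
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_SITES) as pool:
        futures = [pool.submit(crawl_site, site, pages, fetch)
                   for site, pages in sites.items()]
        return dict(future.result() for future in futures)
```

Each worker handles a whole site, so parallelism raises overall throughput without increasing the request rate seen by any single virtual library.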
The Crawler's User Agent string looks something like this:
INFOMINE/8.0 VLCrawler (see http://infomine.ucr.edu/useragents/)
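Site operators who want to see how a robots.txt rule would apply to a crawler identifying itself this way could test it with Python's standard urllib.robotparser module, as in the sketch below. Note the assumptions: the robots.txt content is invented for the example, and matching on the product token "INFOMINE" reflects how Python's parser works (it compares the part of the User Agent string before the first "/"), which is not necessarily how the INFOMINE crawler itself matches rules.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "INFOMINE/8.0 VLCrawler (see http://infomine.ucr.edu/useragents/)"

# Hypothetical robots.txt for an example site; not taken from any real server.
EXAMPLE_ROBOTS_TXT = """\
User-agent: INFOMINE
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)
```

With the example rules, is_allowed(EXAMPLE_ROBOTS_TXT, "http://example.org/private/page.html") is False, while a URL outside /private/ is allowed.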
An overview of INFOMINE's crawling activities is available on the INFOMINE User Agent page.
Feedback
If you have any questions (or complaints!) about the crawler, please contact us.