The Nalanda iVia Focused Crawler

The Nalanda iVia Focused Crawler (NIFC) is a focused Web crawler. It was created by Dr. Soumen Chakrabarti (Indian Institute of Technology Bombay) and developed with the support of IIT Bombay, the iVia Team and the U.S. Institute of Museum and Library Services.

Focused crawlers are "Web robots" that search the Web looking for pages that are related to a specific topic. Focused crawling identifies significant Internet resources within specific communities of shared subject interest, and represents an appropriately scaled approach for many library and academic community applications.

How it works

NIFC crawls the Internet for resources that are strongly inter-linked and that belong to, and contain content similar to, the same or related learning communities as those represented in INFOMINE and other significant academic IPDVLCs. High quality data from these IPDVLCs is often used as seed sets, or for training, to guide the crawler.

As the crawl progresses, an inter-linkage graph is built describing how resources link to one another (i.e., cite and co-cite). Good resources focused on a common topic often cite one another. Highly inter-linked resources are evaluated, differentiated, and rated according to the degree to which they are linked to and from, and according to their capacities as authorities (e.g., a primary resource, such as a database, that receives many in-links from other resources) or hubs (e.g., a secondary source, such as a virtual library collection, that provides out-links to other, authoritative resources).

After these assessments, a second automated process rates resources, as a second indirect measure of quality, by comparing the content of each potential new resource (e.g., its key words and vocabulary) for similarity with resources already in the collection. The most linked-to and linked-from authorities and hubs, whose terminology is most similar to that of other high quality collections, become prime candidates either for addition to the collection as automatically created records or for expert review and refinement.

There are numerous algorithms and approaches for detecting relevant resources, including co-citation or linkage analysis and text similarity analysis, among others. These areas of inquiry and software development are rapidly expanding frontiers of computer science research where great advances are taking place.
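The hub/authority rating step described above has the general shape of iterative link-graph scoring (as in Kleinberg's HITS algorithm). The following is a minimal sketch of that kind of scoring over a hypothetical toy link graph; it illustrates the idea only and is not iVia's actual implementation:

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a directed link graph.

    links: dict mapping each page to the list of pages it links out to.
    Returns (hubs, authorities) dicts of normalized scores.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auths = {p: sum(hubs[q] for q in links if p in links.get(q, []))
                 for p in pages}
        norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # Hub score: sum of authority scores of the pages it links out to.
        hubs = {p: sum(auths[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths

# Toy graph: "portal" acts as a hub (many out-links); "database" acts as
# an authority (many in-links). Names are illustrative only.
graph = {
    "portal": ["database", "archive"],
    "review": ["database"],
    "database": [],
    "archive": [],
}
hubs, auths = hits(graph)
```

In this toy graph, "database" earns the highest authority score because both "portal" and "review" link to it, while "portal" earns the highest hub score because it links out to the most authoritative pages.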
The reward, and the reason this work is being pursued, is much greater efficiency and accuracy in the automated discovery of significant Internet resources and, in turn, iVia and Data Fountains systems and services with greatly improved capabilities.
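The content-based rating pass described above — comparing the key words and vocabulary of candidate resources against resources already in the collection — can be sketched as a cosine similarity over term-frequency vectors. The tokenization and sample texts below are simplified assumptions for illustration:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical collection record and two candidate pages.
collection_record = "digital library focused crawler web resources"
candidate = "focused crawler discovers web resources for digital library"
off_topic = "recipe for sourdough bread and pastry"

sim_on = cosine_similarity(collection_record, candidate)
sim_off = cosine_similarity(collection_record, off_topic)
```

A candidate sharing the collection's vocabulary scores near 1.0, while an off-topic page scores near 0, so a threshold on this score can rank candidates for automatic record creation or expert review.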

Preferential focused crawling, linkage analysis, and topic similarity/semantic analysis are areas of active work that have benefited, and will continue to benefit, iVia and Data Fountains. Improved preferential focused crawling addresses accuracy in automatically selecting the "better" links to crawl from among all those available on a page (i.e., the URL frontier). This involves an "apprentice" learning program (Chakrabarti, 2002) that intelligently detects the clues in a resource that a human user would notice regarding which links are the most promising to follow (e.g., visually emphasized links, link placement on the page, anchor text, and text windows around anchor text) (Chakrabarti, 2001, 2002; Glover, 2002; Menczer et al., 2004; Menczer, 2004). Relatedly, the concepts of reinforcement learning in focused crawling, as well as soft focused crawling (e.g., relaxing crawls to overcome bottlenecks), are being examined (Chakrabarti, 2002, 2003; Flake, 2002; Menczer et al., 2004; Rennie and McCallum, 1999).
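The "URL frontier" idea above — expanding the most promising link first rather than crawling breadth-first — amounts to a priority queue keyed on a learned promise score. The sketch below assumes such a score (e.g., from anchor-text clues) is already available; the class name, URLs, and scores are hypothetical:

```python
import heapq

class Frontier:
    """Priority queue of URLs, expanded in order of estimated promise."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, promise):
        """Queue a URL with a promise score in [0, 1]; duplicates are skipped."""
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so negate the score to pop the best first.
            heapq.heappush(self._heap, (-promise, url))

    def next_url(self):
        """Pop the most promising URL, or None when the frontier is empty."""
        if self._heap:
            return heapq.heappop(self._heap)[1]
        return None

frontier = Frontier()
frontier.add("http://example.org/library", 0.9)   # strong anchor-text match
frontier.add("http://example.org/contact", 0.1)   # weak clue
frontier.add("http://example.org/library", 0.9)   # duplicate, ignored
```

An apprentice learner would supply the promise scores; the frontier itself only orders the crawl so that high-promise links are fetched before low-promise ones.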

Download

Nalanda iVia is available for download from the iVia download page.

References

Chakrabarti, S. (2003), Mining the Web: Discovering Knowledge from Hypertext, Morgan Kaufmann, San Francisco.

Chakrabarti, S., et al. (2002), "Accelerated Focused Crawling through Online Relevance Feedback", WWW 2002, Honolulu, HI, at: http://www2002.org/CDROM/refereed/336/

Chakrabarti, S. (2001), "Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction", WWW 10, Hong Kong, May 2001, at: http://www10.org/cdrom/papers/489

Flake, G., et al. (2002), "Self-organization and Identification of Web Communities", IEEE Computer, 35 (3), March, at: http://computer.org/computer/co2002/r3066abs.htm

Glover, E.J., et al. (2002), "Using Web Structure for Classifying and Describing Web Pages", WWW 2002, May 7-11, 2002, Honolulu, at: http://www2002.org/CDROM/refereed/504/index.html

Menczer, F., G. Pant, and P. Srinivasan (2004), "Topical Web Crawlers: Evaluating Adaptive Algorithms", to appear in ACM TOIT, at: http://www.informatics.indiana.edu/fil/Papers/TOIT.pdf

Menczer, F. (2004), "Correlated topologies in citation networks and the Web" (working paper), at: http://www.informatics.indiana.edu/fil/Papers/web-topologies.pdf

Rennie, J. and A. McCallum (1999), "Using reinforcement learning to spider the web efficiently", Proceedings of the Sixteenth International Conference on Machine Learning, at: http://www.cs.cmu.edu/~mccallum/papers/rlspider-icml99s.ps.gz