iVia Notes on Focused Crawling, Link Analysis, Text Similarity Analysis

A very partial listing, drawn from our other bibliographies, of mostly very recent citations.

Lexical and Semantic Clustering by Web Links.

F. Menczer:
To appear in JASIST
http://www.informatics.indiana.edu/fil/Papers/JASIST-04.pdf
Recent Web searching and mining tools are combining text and link analysis to improve ranking and crawling algorithms. The central assumption behind such approaches is that there is a correlation between the graph structure of the Web and the text and meaning of pages. Here I formalize and empirically evaluate two general conjectures drawing connections from link information to lexical and semantic Web content. The link-content conjecture states that a page is similar to the pages that link to it, and the link-cluster conjecture that pages about the same topic are clustered together. These conjectures are often simply assumed to hold, and Web search tools are built upon such assumptions. The present quantitative confirmation sheds light on the connection between the success of the latest Web mining techniques and the small world topology of the Web, with encouraging implications for the design of better crawling algorithms.
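A minimal sketch of how the link-content conjecture could be checked on a small collection, assuming plain term-frequency vectors and cosine similarity. The function names, the `pages` and `inlinks` structures, and the toy data are hypothetical and are not taken from the paper, which may use different measures.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_inlink_similarity(page, pages, inlinks):
    """Average similarity between `page` and the pages that link to it."""
    sources = inlinks.get(page, [])
    if not sources:
        return 0.0
    total = sum(cosine_similarity(pages[page], pages[s]) for s in sources)
    return total / len(sources)

# Hypothetical toy collection: page id -> text, and page id -> in-linking pages.
pages = {
    "a": "focused crawler topic relevance",
    "b": "topic crawler relevance ranking",
    "c": "cooking recipes pasta sauce",
}
inlinks = {"a": ["b"], "c": []}

print(mean_inlink_similarity("a", pages, inlinks))
```

If the conjecture holds on a collection, the mean in-link similarity should on average exceed the similarity between randomly chosen page pairs.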


Topical Web Crawlers: Evaluating Adaptive Algorithms.

F. Menczer, G. Pant, P. Srinivasan:
To appear in ACM TOIT
http://www.informatics.indiana.edu/fil/Papers/TOIT.pdf
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We noticed that the best performance is achieved by a novel combination of exploratory and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
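The exploration/exploitation tradeoff mentioned in the abstract can be illustrated with a small best-first crawler over an in-memory link graph. This is only a sketch under stated assumptions: the epsilon-greedy rule, the `crawl` function, and the precomputed `scores` are hypothetical stand-ins and are not the algorithms evaluated in the paper.

```python
import heapq
import random

def crawl(graph, scores, seed, max_pages, epsilon=0.0, rng=None):
    """Visit up to max_pages pages starting from `seed`.

    With probability `epsilon` a random frontier link is followed
    (exploration); otherwise the highest-scoring link is followed
    (exploitation). `scores` stands in for a relevance estimate.
    """
    rng = rng or random.Random(0)
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-scores.get(seed, 0.0), seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        if rng.random() < epsilon:
            # Exploration: remove a random entry, then restore the heap.
            idx = rng.randrange(len(frontier))
            frontier[idx], frontier[-1] = frontier[-1], frontier[idx]
            _, url = frontier.pop()
            heapq.heapify(frontier)
        else:
            # Exploitation: follow the most promising link.
            _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-scores.get(link, 0.0), link))
    return visited

# Hypothetical toy graph: the relevant page "y1" sits behind low-scoring "y".
graph = {"seed": ["x", "y"], "x": ["x1"], "y": ["y1"], "x1": [], "y1": []}
scores = {"seed": 1.0, "x": 0.9, "y": 0.2, "x1": 0.1, "y1": 0.95}

print(crawl(graph, scores, "seed", max_pages=5))
```

In the toy graph a purely greedy crawl reaches "y1" only after exhausting the better-scoring "x", which is the kind of case where some exploratory bias can pay off.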


Crawling/Web Graph Theory:

Colin Cooper, Alan Frieze
On a General Model of Web Graphs
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2003/power.pdf

Colin Cooper, Alan Frieze
Crawling on Web Graphs
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2002/spider.pdf


Expert Interactions with Focused Crawlers:

Persona: A Contextualised and Personalized Web Search
http://www.hicss.hawaii.edu/HICSS_35/HICSSpapers/PDFdocuments/DTDMI01.pdf
Recent advances in graph-based search techniques derived from Kleinberg's work [1] have been impressive. This paper further improves the graph-based search algorithm in two dimensions. Firstly, variants of Kleinberg's techniques do not take into account the semantics of the query string nor of the nodes being searched. As a result, polysemy of query words cannot be resolved. This paper presents an interactive query scheme utilizing the simple web ontology provided by the Open Directory Project to resolve meanings of a user query. Secondly, we extend a recently proposed personalized version of the Kleinberg algorithm [3]. Simulation results are presented to illustrate the sensitivity of our technique. We outline the implementation of our algorithm in the Persona personalized web search system.
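For orientation, a minimal sketch of the basic hub/authority iteration from Kleinberg's work that the abstract builds on. It does not include the ontology-based query refinement or the personalization described in the paper; the `hits` function and the toy graph are hypothetical.

```python
import math

def hits(graph, iterations=50):
    """Basic hub/authority iteration over a dict: page -> list of linked pages."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Hypothetical toy graph: two hub pages pointing at two authority pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))
```

Because the scores depend only on link structure, two senses of an ambiguous query word are not distinguished, which is the limitation the abstract sets out to address.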

H. Chang et al.
Creating Customized Authority Lists
http://citeseer.ist.psu.edu/chang99creating.html
concept of lifting good hubs