Harnessing New Technologies in Support of Collaborative Service Design and Amplification of Expert Effort
July 2004
A number of machine-learning-based text and information processing technologies, developed over the last five to ten years in the computer sciences, are beginning to deliver useful advances in automatically or semi-automatically detecting, extracting, organizing and providing good access to Internet-delivered information. Machine processes that aid in building finding tools and other collections, whether fully automated or guided by experts, represent rapidly advancing frontiers that yield significant resource and time savings in areas that have been the traditional province of librarians. A number of these advances, as represented in iVia, are discussed below. It should be noted that, while many collection building activities can run fully automatically and yield useful results, many of these do not yet provide the accuracy required of academic finding tools. A special interest of ours, therefore, also dealt with below, is exploring the interface between collection builder expertise and machine processes. This includes fleshing out areas where expert input can improve machine processes as well as areas where machine assistance can improve and amplify expert effort. The net result is that collections can become much larger as well as easier, quicker and less expensive to build. Collections thus scale better with the very large number of significant research and educational resources on the Internet, while approaching academic library levels of accuracy and quality (currently achieved through expensive manual processes).
The following emphasizes our work in automated and semi-automated resource discovery, metadata generation and rich full-text identification and extraction. Also of note, though not discussed here, are iVia advancements in information retrieval, in handling multiple types of data streams and import/export challenges and, more generally, in managing contributions from multiple institutions and the related institutional identity issues.
iVia uses a range of programs known as crawlers to traverse the Web and help identify new, important academic resources on the Internet. The crawlers function as "collection development" tools. The simplest crawler is the "Expert Guided Crawler", which crawls expert-specified pages or sites. A "Virtual Library Crawler" occupies the mid-range. The most advanced is our "Focused Crawler". It is actually a next-generation focused crawler that moves beyond standard approaches by using techniques known as preferential focused crawling to increase performance (Chakrabarti, et al., 2002a).
The Virtual Library Crawler (vlcrawler) crawls a set of close to 1,000 academic virtual libraries or subject directories of Internet resources that librarians have selected as high quality, expert-vetted collections. Only those pages or resources that are present in multiple expert collections are harvested and further processed. If inclusion in an expert virtual library can be considered a "vote" from the expert community attesting to the value of the resource, iVia only builds records for resources that, in effect, are relatively popular in the academic virtual library community, having received multiple votes. In this way iVia informally polls and utilizes community opinion on the value of a site. Vlcrawler results are also used to help create high quality seed sets (important beginning points for a crawl) for the focused crawler. Vlcrawler respects "no robot" exclusion tags (it won't crawl a site that does not care to be crawled). All iVia metadata generated and rich full-text harvested comes from the resource itself, not the virtual library that is crawled.
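The "multiple votes" rule can be illustrated with a minimal Python sketch. It is not iVia's code: the function names, the URL normalization rule and the two-vote threshold are assumptions made for the example, and the collections are toy data.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host plus path (assumed rule: scheme, query string,
    fragment and trailing slash are ignored) so that equivalent links match."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def popular_resources(collections, min_votes=2):
    """Return resources listed by at least `min_votes` distinct collections."""
    votes = Counter()
    for links in collections.values():
        # A collection votes once per resource, however often it links to it.
        for key in {normalize(u) for u in links}:
            votes[key] += 1
    return sorted(k for k, n in votes.items() if n >= min_votes)


# Toy data: three hypothetical virtual libraries and the links they contain.
collections = {
    "library_a": ["http://example.org/atlas/", "http://example.org/flora"],
    "library_b": ["http://EXAMPLE.org/atlas", "http://example.net/maps"],
    "library_c": ["http://example.org/atlas", "http://example.org/flora/"],
}
print(popular_resources(collections))
```

Here the atlas and flora pages are kept because two or more collections list them, while the maps page, with a single vote, is dropped.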
Focused crawling makes possible the accurate identification of significant Internet resources within specific, focused communities of shared subject interest (Chakrabarti, 2003) and represents an appropriately scaled approach for many library and academic community applications. Academic communities as represented on the Internet are often fairly distinct, self-identifying and coherent Web communities that lend themselves to this discovery technique.
iVia's most advanced crawler, one that is under continuing development, is the Nalanda iVia Focused Crawler (NIFC). NIFC is a program that crawls the Internet to find resources that are strongly inter-linked within focused subject/discipline/research-specific communities, and which contain academic content similar to that found in INFOMINE and other academic virtual libraries. Seed sets of representative, high quality resources from the targeted community are used to begin the focused crawl process (these can be manually suggested or originate from the vlcrawler, as mentioned, or other sources). As the crawling progresses, an inter-linkage graph (typically representing thousands to millions of resources) is developed that records which resources link to one another (i.e., cite and co-cite). Good resources focused around a common topic often cite one another. Highly linked resources are evaluated, differentiated and rated according to the degree to which they are linked to/from, as well as for their capacities as authoritative resources (e.g., an important resource such as a database which receives many in-links from other resources) or hubs (e.g., secondary sources such as virtual library collections which provide out-links to other, authoritative resources). The degree and type of inter-linkage is, like co-citation in journal articles, often a useful measure of community opinion of the value of a site or resource. After such assessments have occurred, a second automated process is put into play which rates resources, as a second indirect measure of resource quality, by comparing the content of potential new resources (e.g., keywords and vocabulary) for similarity with resources already in the collection.
The most linked to/from authorities and hubs, with terminology most similar to that in other high quality collections, thus become preferred candidates for either adding to the collection as automatically created records or for expert review and metadata refinement.
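As an illustration of how hub and authority ratings can be derived from a link graph, the following Python sketch runs the classic hubs-and-authorities (HITS) iteration on a toy graph. This is an assumption made for illustration: the text does not state that NIFC uses exactly this formulation, and the function name and graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores.

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) dicts, each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph: a directory and a portal both point at a database.
links = {
    "directory": ["database", "journal"],
    "portal": ["database"],
    "database": [],
}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph the database, which receives the most in-links, emerges as the top authority, and the directory, which points to the most authorities, as the top hub.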
NIFC moves beyond the focused crawling described above through the use of preferential focused crawling. This improves focused crawling performance and accuracy by enabling the NIFC to automatically select the "better" links to crawl among all those available (i.e., the URL frontier) on a page. This involves an "apprentice" learning program (Chakrabarti, 2002) that intelligently detects clues in a resource, which a human user would notice, regarding which links are the most promising to follow (e.g., visually emphasized links, link placement on the page, anchor text and text windows around anchor text) (Chakrabarti, 2002, 2001; Flake, et al., 2002; Glover, et al., 2002; Menczer, et al., 2004; Menczer, 2004a, 2004b).
Expert interaction and/or semi-automated approaches to improve crawling are special emphases in iVia, as mentioned above. Among the expert interventions possible is expert creation and refinement of seed sets to begin or renew a crawl by selectively choosing among seeds generated from expert collections. iVia can use the expert-vetted results from vlcrawler, as mentioned, for this. Expert-community-created "blacklists" of URLs, covering types of sites or pages that are not valuable, also save crawling time. There is such a blacklist for iVia and there will be one for each Data Fountain. Rules for determining which sites should be fully crawled, as opposed to being sampled or only crawled at top levels, are being developed as well to save crawler time.
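The idea of choosing the most promising links first can be sketched as a priority-ordered URL frontier. In the Python sketch below, a hand-set scoring function stands in for the learned apprentice; the topic vocabulary, weights and class name are all assumptions made for illustration, not NIFC's actual features or model.

```python
import heapq
import itertools

TOPIC_TERMS = {"entomology", "insect", "beetle"}  # hypothetical topic vocabulary


def link_score(anchor_text, context_text, emphasized=False):
    """Hand-set stand-in for a learned apprentice: higher is more promising."""
    anchor = set(anchor_text.lower().split())
    context = set(context_text.lower().split())
    # Assumed weights: anchor text counts double, surrounding text once.
    score = 2.0 * len(anchor & TOPIC_TERMS) + 1.0 * len(context & TOPIC_TERMS)
    if emphasized:  # e.g. a bold or heading link
        score += 0.5
    return score


class Frontier:
    """URL frontier that always yields the highest-scoring unseen link."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.add("http://example.org/contact", link_score("Contact us", "email the webmaster"))
frontier.add("http://example.org/insects", link_score("insect keys", "identification"))
frontier.add("http://example.org/beetles",
             link_score("Beetle collection", "insect specimens", emphasized=True))
while frontier:
    print(frontier.pop())
```

The beetle page is crawled first, the insect page second and the contact page last, regardless of the order in which the links were discovered.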
Interactive, expert topic distillation is very useful. This can take a few forms. Expert feedback, either interactively or after a crawl (prior to re-crawling), is one of these forms and works by providing example based learning (i.e., specifying positive and/or negative examples) to indicate to the crawler the resources most relevant to a subject.
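A minimal Python sketch of example-based feedback follows. It scores a candidate page by its cosine similarity to the expert's positive examples minus its similarity to the negative examples. This is a simple relevance-feedback heuristic chosen for illustration, not the method iVia uses, and the function names and example texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(texts):
    """Summed term counts; cosine similarity is unaffected by the scale."""
    total = Counter()
    for t in texts:
        total.update(vectorize(t))
    return total


def relevance(candidate, positives, negatives):
    """Positive when the candidate resembles the positive examples more."""
    v = vectorize(candidate)
    return cosine(v, centroid(positives)) - cosine(v, centroid(negatives))


positives = ["insect taxonomy and beetle identification", "field guide to beetle larvae"]
negatives = ["software bug tracker release notes"]
print(relevance("beetle larvae field guide", positives, negatives) > 0)
print(relevance("software bug tracker", positives, negatives) < 0)
```

Scores of this kind could be used to re-rank candidate resources before a re-crawl; how the crawler actually incorporates the feedback is outside the scope of this sketch.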
Web graph interaction is also useful as a form of topic distillation. Graph visualization tools enable large numbers of inter-linking resources and communities to be sensibly viewed and discriminated among. Expert truing of crawler-built Web graphs, either during or after a crawl, is being explored to improve crawling accuracy. Outlier reduction is one useful technique: resources or communities of sites that are of little or peripheral value to a focused crawl (often weakly linked on the margins of the visualized graph) can be detected by experts and eliminated. Expert "lifting" of the values assigned to selected hubs and authorities is another technique; it results in re-configuration of the crawl and more relevant Web graphing and linkage analysis (Chang, et al., 2000; Tanudjaja and Mui, 2002). Similarly, lists of representative, preliminary crawl results can be ranked as positive or negative with similar effect (Yu, et al., 2004).
Discussions of "semi-automated" modes of crawling bring into relief iVia's Expert Guided Crawler with Drill Down and Drill Out. This is iVia's simplest and most expert-involved mode of crawling, in the sense that all pages or sites crawled, as well as both the depth of crawling into the site (most sites being hierarchically organized) and the number of links to crawl out from the site (those pages and sites linked to from the original site but otherwise not a part of it), are determined by the expert operating the tool. This is valuable in crawling sites (or collections of sites, including virtual libraries) that an expert knows to be excellent. This approach is of great benefit as well in crawling known, non-open Web collections (e.g., pre-print or article databases) where known objects exist and they and they alone are to be crawled, classified and augmented with metadata. Work on this has been an important aspect of customization and improvement of iVia for the National Science Digital Library.
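The drill-down and drill-out limits can be pictured as two bounds on a breadth-first crawl. The Python sketch below is an illustration only: the parameter names, the rule that a site is identified by its host name, and the `get_links` callable (which stands in for fetching and parsing a page) are all assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def guided_crawl(start, get_links, drill_down=2, drill_out=1):
    """Breadth-first crawl bounded by expert-chosen limits.

    drill_down: maximum link depth followed within the starting site.
    drill_out:  maximum number of hops followed away from the starting site.
    get_links:  hypothetical callable returning the outgoing URLs of a page.
    """
    home = urlsplit(start).netloc
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, depth inside site, hops outside site)
    order = []
    while queue:
        url, depth, hops = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            # A link counts as internal only while the crawl is still on the home site.
            internal = urlsplit(link).netloc == home and hops == 0
            if internal:
                entry, allowed = (link, depth + 1, 0), depth + 1 <= drill_down
            else:
                entry, allowed = (link, depth, hops + 1), hops + 1 <= drill_out
            if allowed:
                seen.add(link)
                queue.append(entry)
    return order


# Toy link graph standing in for real pages.
graph = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/x/y"],
    "http://a.example/x/y": ["http://a.example/x/y/z"],
    "http://b.example/": ["http://c.example/"],
}
print(guided_crawl("http://a.example/", lambda u: graph.get(u, [])))
```

With a drill-down of 2 and a drill-out of 1, the crawl visits the two deeper pages on the home site and the directly linked external site, but stops short of the third-level page and the site two hops out.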
iVia, as mentioned, supports fully expert created metadata, fully automatically created metadata, and metadata that is automatically created and then expert refined and augmented (i.e., semi-automatically created). It also supports the identification, harvest and storage of rich full-text. Similarly, controlled subject terms (LCSH and LCC and, shortly, iVia's own 600 term hierarchical research disciplines ontology), keywords, and descriptions are among the metadata that can be created automatically by iVia. These automated extraction and classifier programs are part of a suite of programs known as the "Record Builder".
The Record Builder, whenever possible, first identifies and then extracts and prioritizes existent author supplied metadata in the HTML/XML of the resource. Conventions in author applied metadata are used to identify and extract this type of metadata. Title, URL, keyword, description and author or creator metadata can be found this way. This author supplied metadata, if available, is then supplemented by creating original metadata for a number of, mostly thematic, fields (e.g., LCSH, LCC, Keyword, Subject discipline and Description) as described below. When author supplied metadata is not available (and/or to augment author supplied metadata), iVia automatically generates its own for the resource, as described below.
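Extraction of author-supplied metadata can be sketched with Python's standard `html.parser` module. The class name, the list of meta-tag names and the sample page are assumptions made for illustration; the Record Builder's actual rules are not shown in this document.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> and selected <meta name=... content=...> values."""

    # Assumed set of conventional meta-tag names worth harvesting.
    WANTED = {"keywords", "description", "author",
              "dc.title", "dc.creator", "dc.subject", "dc.description"}

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            if name in self.WANTED and content:
                # Keep the first value seen for each field.
                self.metadata.setdefault(name, content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.metadata.setdefault("title", data.strip())


def extract_author_metadata(html):
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata


page = """<html><head><title> Beetle Atlas </title>
<meta name="Keywords" content="beetles, Coleoptera">
<meta name="description" content="An atlas of beetles.">
</head><body><p>Welcome.</p></body></html>"""
print(extract_author_metadata(page))
```

Fields that the author did not supply are simply absent from the result, which is where automatically generated metadata would take over.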
The automated creation of original metadata (after the author-included metadata has been noted and processed) begins by selectively identifying text in the Internet page/site/object that is thematically "rich". Among the first steps of the classification process, rich text identification involves determining the "aboutness" text in the entry pages for a resource or site, or in sections of a document that are intended by the author(s) to be rich in descriptive information about the topics within (e.g., abstracts, introductions, summaries, "about" pages, FAQs). Accurate rich text identification in turn yields more accurate identification and application of key phrases and, from these, more accurate controlled subject term and other metadata application.
Refinement of iVia's "aboutness" measure (Mitchell, et al., 2003) in identifying rich text is an important and ongoing task that involves identifying rich text placement in specific types of Internet resources and documents. Different resource types often have differing areas where rich text can be found and different types of rich text. Aboutness hunting has increasingly involved detection of semantic patterns and patterns in site and document layout that yield rules and measures which help identify rich text placement in diverse formats (e.g., HTML/XML, PDF, PostScript, Word), resource types (e.g., web sites, articles, tech reports) and languages. Rich text identification is crucial to the whole classification process: from it key phrases, and then subjects and other metadata, are generated.
Rich full-text is also important from an information retrieval perspective, because the natural language terminology it contains partially corrects for the limitations inherent in most metadata and subject schema approaches. For example, new or specialized subject terminology is often slow to appear in library standard subject schemas. These schemas can also be opaque to average users and/or too general for specialists or practitioners.
General: Original, machine-generated thematic metadata currently applied by iVia includes key phrases, LCSH, LCC, and broad research subject disciplines. These are derived from one another, roughly in the order listed, starting with key phrases extracted from the text of the Internet object. The set of key phrases obtained serves as a surrogate for each Internet resource and summarizes the resource's content. Text classifiers are programs that are trained to take an object and, using various methods or algorithms that identify important text (through statistical means and through author-supplied clues in the resource such as large fonts), deduce which classes or labels (e.g., key phrases, LCSH, etc.) best represent the object according to both natural language terms and terms from controlled vocabularies.
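A very simple layout-based rule of the kind described above can be sketched in Python as follows. The cue words, the weights and the function names are assumptions made for illustration and do not reproduce iVia's aboutness measure.

```python
# Assumed heading cues that suggest descriptive, "about" text.
CUE_WORDS = {"abstract", "about", "introduction", "summary", "overview", "faq"}


def aboutness_score(heading, position, total):
    """Score a section from its heading and its position in the document."""
    words = set(heading.lower().replace(":", " ").split())
    cue = 1.0 if words & CUE_WORDS else 0.0
    early = 1.0 - position / total  # earlier sections get a small bonus
    return 2.0 * cue + early        # assumed weights


def rich_text(sections, top_n=2):
    """Return the text of the top-scoring sections, in document order.

    sections: list of (heading, text) pairs in document order.
    """
    total = len(sections)
    scored = [(aboutness_score(h, i, total), i, text)
              for i, (h, text) in enumerate(sections)]
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_n]
    return [text for _, _, text in sorted(best, key=lambda s: s[1])]


sections = [
    ("Abstract", "We survey beetle diversity ..."),
    ("Methods", "Specimens were collected ..."),
    ("Results", "A total of ..."),
    ("Summary", "Beetle diversity is ..."),
]
print(rich_text(sections))
```

In this toy document the abstract and the summary are selected. A production measure would need format-specific and resource-type-specific rules, as the paragraph above notes.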
Classifiers build a model starting with the most important natural language key phrases identified and extracted. The model encapsulates the relationships between these natural language key phrases and the set of controlled language terms making up LCSH. Using this, the closest corresponding set of LCSHs is assigned. LCC is then assigned through mappings from the LCSH (Frank and Paynter, 2004). In turn, the research subject disciplines are assigned by mappings from the LCC. The classification algorithms we have used to date include Support Vector Machines (SVM), Naïve Bayes (NB), k-Nearest Neighbor (kNN) and, shortly perhaps, Logistic Regression (LR).
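The chain from key phrases to LCSH, then to LCC, then to research disciplines can be pictured with the Python sketch below. Plain lookup tables stand in for the learned models and mappings, and the tables themselves are toy data invented for the example, so this shows only the order of the steps, not how iVia assigns terms.

```python
from collections import Counter

# Toy tables, invented for illustration; the real mappings are learned.
PHRASE_TO_LCSH = {
    "beetle": "Beetles",
    "insect": "Insects",
    "larvae": "Insects",
    "support vector machine": "Machine learning",
}
LCSH_TO_LCC = {"Beetles": "QL", "Insects": "QL", "Machine learning": "Q"}
LCC_TO_DISCIPLINE = {"QL": "Zoology", "Q": "Science (General)"}


def assign(key_phrases, max_headings=2):
    """Derive LCSH from key phrases, LCC from LCSH, disciplines from LCC."""
    votes = Counter(PHRASE_TO_LCSH[p] for p in key_phrases if p in PHRASE_TO_LCSH)
    lcsh = [heading for heading, _ in votes.most_common(max_headings)]
    lcc = sorted({LCSH_TO_LCC[h] for h in lcsh})
    disciplines = sorted({LCC_TO_DISCIPLINE[c] for c in lcc})
    return {"lcsh": lcsh, "lcc": lcc, "disciplines": disciplines}


print(assign(["beetle", "larvae", "insect", "museum"]))
```

Key phrases with no known mapping, such as "museum" here, contribute nothing to the assignment.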
Classifier Training: The model built by the classifier is learned from training datasets that consist of large numbers of records from library catalogs (23 million from the University of California's Melvyl Catalog and 5 million from Cornell University's Catalog) and virtual libraries, where LCSH and key phrase metadata, as well as selected full-text from Internet resources that have been catalogued, are used to describe a given resource. Leveraging labeled training data with partially labeled or unlabeled training data is an area that we continue to explore, given the large amount of library catalog data available to us (Ghani and Jones, 2002; Jones, et al., 2003; Park and Zhang, 2003; Thelen and Riloff, 2002). If the partially labeled and unlabeled catalog data could be used more effectively, it would give us much greater accuracy in automated library subject schema assignment.
Classification Algorithms: iVia work has included much effort in analyzing and customizing various text classification algorithms. Though SVMs are relatively accurate, model building using SVM can be problematic. While they excel at two-class discriminative learning problems and are more accurate than generative classifiers such as Naïve Bayes (NB), they are more complex than NB and have greater performance problems in building models in a timely manner, especially when multi-class problems and large training datasets are involved (Godbole, et al., 2002; Yu, et al., 2003). iVia requires multi-class classification, though, since we are taking Internet resources (say, a multiple-subject site or ebook) and often assigning more than two classes (e.g., LCSHs representing multiple subjects). Consequently, we have needed to explore beyond SVM or NB in developing our approach to classification.
Among our explorations in improving upon iVia classification algorithms have been the following. There are a number of means of improving SVM performance and accuracy that have been or are being examined (Shih, et al., 2002; Yu, et al., 2003). We are exploring a multiple linear discriminant algorithm for text classification, known as SIMPL (Chakrabarti, et al., 2002b). Naïve Bayes is another reliable and fast, though previously not extremely accurate, classification algorithm that is being improved (Rennie, et al., 2003). We have evaluated new designs in regard to all of these. In addition, there may be advantages to combining multiple classifiers, either to further refine the work done by each or to confer greater accuracy in varying classification situations (Shih, et al., 2002). Relatedly, we have looked at combining SVM with Naïve Bayes, where appropriate (ibid.). In fact, NB has been re-deployed to aid SVM classifiers in a new technique for multi-way classification which exploits the accuracy of SVMs and the speed of NB (Godbole, et al., 2002).
One of the more promising "new" classification algorithms, and one which iVia is working with, is Logistic Regression (LR) (Hastie, et al., 2001). Though it has been around for a while, it has not been widely utilized; recent improvements in this algorithm's classification speed and stability (Komarek, 2004; Komarek and Moore, 2003) may render it highly useful for iVia. LR has a well developed statistical foundation that gives it depth and flexibility for further development.
Improving the mapping of the important natural language key phrases and keywords found on Internet resources to LCSH terms, through greater accuracy in detecting synonymy, background context and intent in word usage, is another major area we have been pursuing.
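For readers unfamiliar with the baseline, the following Python sketch implements a plain multinomial Naïve Bayes text classifier with add-one smoothing. It is the textbook form, not the improved variants cited above and not iVia's implementation; the class name and the four training documents are toy assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, doc):
        """Log prior plus smoothed log likelihood for every class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n / total_docs)
            for w in doc.lower().split():
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, doc, top_k=1):
        """Return the top_k labels, best first."""
        scores = self.log_scores(doc)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]


docs = ["beetle larvae insect", "insect wings beetle",
        "kernel margin classifier", "classifier training kernel"]
labels = ["Insects", "Insects", "Machine learning", "Machine learning"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("beetle wings"))
print(model.predict("kernel classifier beetle", top_k=2))
```

Returning the top few labels rather than a single one is one simple way to assign several subjects to a resource, which is the situation described above for multi-subject sites.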
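As a reminder of what the algorithm does, the Python sketch below trains a two-class logistic regression model by plain batch gradient descent. It is the textbook formulation, not the fast methods of Komarek and Moore and not iVia's code; the feature layout, learning rate and data are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Assumed features: counts of the terms [beetle, insect, kernel, classifier].
X = [[1, 1, 0, 0], [2, 0, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
y = [1, 1, 0, 0]  # 1 = entomology, 0 = not entomology
w, b = train_logistic(X, y)
print(predict_proba([1, 0, 0, 0], w, b) > 0.5)
print(predict_proba([0, 0, 1, 0], w, b) < 0.5)
```

Because the model outputs a probability for each class, one such model per subject heading could in principle be thresholded independently to assign several headings to a resource.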
Descriptions are created by using author created descriptions indicated by HTML/XML meta-tags in the resource and/or by extracting rich, aboutness text as described above. This text is determined based on its position in the Web resource and other document layout clues including HTML/XML structures (e.g., headings, fonts, and links which may indicate author emphasized text). Important key phrases, extracted using our phrase extractor, are also used. Description length is determined by the amount of rich text discovered on the site that is gauged to be about the site.
Using subject expert input to refine classification is a major focus of our research and development. Rules reflecting the semantics of resources in each major subject area are being developed to improve classification. This will be especially important in Data Fountains. Background context clarification (i.e., word sense disambiguation), including manually distinguishing among identical or closely related terms representing differing subjects (e.g., homonyms such as jaguar, the car, and jaguar, the animal), may be supported. "Blacklists" are also developed for terms which misrepresent a particular subject (e.g., the entomology community isn't usually interested in "software bugs"). The ontologies that the classifier populates can be updated, either during or after classification runs, by adding new classes (i.e., leaf nodes) as a parent node becomes overpopulated. Generally, manual reviews of classification results are regularly undertaken, with the results used to improve the classifier.
Chakrabarti, S., 2003, Mining the Web: Discovering Knowledge from Hypertext. Morgan Kaufmann, San Francisco.
Chakrabarti, S., 2002, The Structure of Broad Topics on the Web, WWW2002, Honolulu, HI. At: http://www2002.org/CDROM/refereed/338/index.html
Chakrabarti, S., et al., 2002a, Accelerated Focused Crawling through Online Relevance Feedback, WWW2002, Honolulu, HI. At: http://www2002.org/CDROM/refereed/336/
Chakrabarti, S., et al., 2002b, Fast and Accurate Text Classification via Multiple Linear Discriminant Projections, VLDB 2002, Hong Kong, Aug. 2002. At: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/papers/S19P01.pdf
Chakrabarti, S., 2001, Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction, WWW 10, Hong Kong, May 2001. At: http://www10.org/cdrom/papers/489
Chang, H., et al., 2000, Creating Customized Authority Lists, Proceedings of the Seventeenth International Conference of Machine Learning. At: http://www.cs.umass.edu/~mccallum/papers/lift-icml2000.ps
Flake, G., et al., 2002, Self-organization and Identification of Web Communities, IEEE Computer, 35 (3), March. At: http://computer.org/computer/co2002/r3066abs.htm
Frank, E. and Paynter, G. W., 2004, Predicting Library of Congress Classifications from Library of Congress Subject Headings, Journal of the American Society of Information Science and Technology, Vol. 55, No. 3, pp. 214-227. At: http://www.cs.waikato.ac.nz/~eibe/pubs/LCSHtoLCC.ps.gz
Ghani, R. and Jones, R., 2002, A Comparison of Efficacy and Assumptions of Bootstrapping Algorithms for Training Information Extraction Systems, LREC 2002 Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data. At: http://www-2.cs.cmu.edu/~rosie/papers/ghanijoneslrec2002.pdf
Glover, E. J., et al., 2002, Using Web Structure for Classifying and Describing Web Pages, WWW2002, May 7-11, 2002, Honolulu, HI. At: http://www2002.org/CDROM/refereed/504/index.html
Godbole, S., et al., 2002, Scaling Multi-class Support Vector Machines Using Interclass Confusion, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), July 23-26, 2002, Edmonton, Alberta. At: http://www.it.iitb.ac.in/~shantanu/work/gsc-kdd02.pdf
Hastie, T., et al., 2001, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag.
Jones, R., Ghani, R., Mitchell, T., and Riloff, E., 2003, Active Learning for Information Extraction with Multiple View Feature Sets, ECML-03 Workshop on Adaptive Text Extraction and Mining. At: http://www.cs.utah.edu/~riloff/psfiles/ecml-wkshp03.pdf
Komarek, P., 2004, Logistic Regression for Data Mining and High-Dimensional Classification, Doctoral Thesis, Carnegie Mellon University, 138 p. At: http://www.autonlab.org/autonweb/documents/papers/komarek:lr_thesis.pdf
Komarek, P. and Moore, A., 2003, Fast Logistic Regression for Data Mining, Text Classification and Link Detection. At: http://www.autonlab.org/autonweb/documents/papers/komarek:nips2003.pdf
Mason, J., et al., 2000, INFOMINE: Promising Directions in Virtual Library Development, First Monday, Volume 5, Number 6, June 5th 2000. At: http://www.firstmonday.dk/issues/issue5_6/mason/index.html
Menczer, F., Pant, G., and Srinivasan, P., 2004, Topical Web Crawlers: Evaluating Adaptive Algorithms, to appear in ACM TOIT. At: http://www.informatics.indiana.edu/fil/Papers/TOIT.pdf
Menczer, F., 2004a, Correlated Topologies in Citation Networks and the Web (working paper). At: http://www.informatics.indiana.edu/fil/Papers/web-topologies.pdf
Menczer, F., 2004b, Mapping the Semantics of Web Text and Links (working paper). At: http://dollar.biz.uiowa.edu/~fil/Papers/maps.pdf
Mitchell, S., et al., 2003, iVia Open Source Virtual Library System, D-Lib Magazine, 9 (1), January. At: http://www.dlib.org/dlib/january03/mitchell/01mitchell.html
Park, S.-B. and Zhang, B.-T., 2003, Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information, Lecture Notes in Artificial Intelligence, vol. 2637, pp. 88-99. At: http://bi.snu.ac.kr/Publications/Journals/International/LNAI2637_Park.pdf
Rennie, J., et al., 2003, Tackling the Poor Assumptions of Naive Bayes Text Classifiers, Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, August 21-24. At: http://people.csail.mit.edu/u/j/jrennie/public_html/papers/icml03-nb.pdf
Shih, L., et al., 2002, Not Too Hot, Not Too Cold: The Bundled-SVM is Just Right!, Proceedings of the ICML-2002 Workshop on Text Learning. At: http://people.csail.mit.edu/u/j/jrennie/public_html/papers/icml02-bundled.pdf
Tanudjaja, F. and Mui, L., 2002, Persona: A Contextualized and Personalized Search, Proceedings of the 35th Hawaii International Conference on System Sciences. At: http://csdl.computer.org/comp/proceedings/hicss/2002/1435/03/14350067.pdf
Thelen, M. and Riloff, E., 2002, A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing. At: http://www.cs.utah.edu/~riloff/psfiles/emnlp02-thelen.pdf
Yu, H., Han, J. and Chang, K. C.-C., 2004, PEBL: Web Page Classification Without Negative Examples, IEEE Transactions on Knowledge and Data Engineering, 16(1):70-81, January 2004. Special Section on Mining and Searching the Web. At: http://www-faculty.cs.uiuc.edu/~hanj/pdf/tkde04_pebl.pdf
Yu, H., Yang, J., and Han, J., 2003, Classifying Large Data Sets Using SVM with Hierarchical Clusters, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC. At: http://citeseer.nj.nec.com/yu03classifying.html