R&D Archive: Steve Jones' LCSH Classification Project

Note: This work has been superseded by the current LCSH assignment tool.

Steve Jones' LCSH Classification research

The INFOMINE LCSH classification research project, led by Dr. Steve Jones of the Department of Computer Science at The University of Waikato, aims to classify documents by assigning them Library of Congress Subject Headings (LCSH).

In our experiments, a set of keywords and key phrases serves as a document surrogate, summarizing a document's content. In the training stage, a model is learned that encapsulates the relationships between these key phrases and sets of LCSH. Once we have built the model, we can use it to assign LCSH to new documents by first automatically extracting key phrases from them (with Kea or PhraseRate), and then using the model to match these key phrases with a set of LCSH.

INFOMINE's goal is to take Dr. Jones' research and use it to generate LCSH metadata for Internet resources in the INFOMINE automatic record builder.

Learning a model

The model is learned from a training dataset consisting of documents that have both LCSH and key phrase metadata. Although documents annotated with LCSH are plentiful (e.g. in library catalogs), it is more difficult to find records that also have key phrases assigned. However, we have compiled several training datasets:

  1. MARC Records from library catalogs with LCSH and Table of Contents.
    Librarians occasionally use Table of Contents metadata in place of key phrases; we take a similar approach, using Kea to extract the twenty most significant phrases from the Table of Contents. We compiled two datasets, one based on 1 million records from the UCR SCOTTY catalog, and when this proved too small, another based on 24 million records from the UC's combined MELVYL catalog.
  2. Google.
    A second dataset is based on documents returned by Google searches for LCSH heads. This dataset can be arbitrarily large, but is of lower quality because the search results may not be relevant to the search terms. Currently, the dataset consists of:
  3. INFOMINE.
    INFOMINE contains over 20,000 records that have been assigned both LCSH and key phrase metadata by professional catalogers. Although this dataset is too small for training purposes, we use it as our testing dataset. This is appropriate since our ultimate application will be to assist the INFOMINE adders in their cataloging duties.

The training data is used to build a model that represents the frequency of co-occurrence of key phrases and LCSH heads. To assign LCSH to new documents, key phrases are extracted from the new documents, then similarity measures are used to find the LCSH with the most similar set of co-occurring key phrases.
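The co-occurrence approach described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names (lcsh_head, train, cosine, assign), the toy training records, the treatment of the "head" as the text before the first "--" subdivision, and the choice of cosine similarity are all assumptions made for illustration, since the text says only that "similarity measures" are used.

```python
from collections import Counter, defaultdict
from math import sqrt


def lcsh_head(heading):
    """Return the main heading, dropping '--' subdivisions (assumed convention)."""
    return heading.split("--")[0].strip().lower()


def train(records):
    """Count how often each key phrase co-occurs with each LCSH head.

    records: iterable of (key_phrases, lcsh_headings) pairs.
    Returns a mapping {head: Counter({phrase: count})}.
    """
    model = defaultdict(Counter)
    for phrases, headings in records:
        normalized = {p.strip().lower() for p in phrases}
        for head in {lcsh_head(h) for h in headings}:
            model[head].update(normalized)
    return model


def cosine(profile, phrases):
    """Cosine similarity between a phrase-count profile and a set of phrases."""
    dot = sum(profile[p] for p in phrases)  # a Counter yields 0 for unseen phrases
    norm = sqrt(sum(c * c for c in profile.values())) * sqrt(len(phrases))
    return dot / norm if norm else 0.0


def assign(model, phrases, top_n=10):
    """Rank LCSH heads by similarity to a new document's key phrases."""
    phrases = {p.strip().lower() for p in phrases}
    scored = [(cosine(profile, phrases), head) for head, profile in model.items()]
    scored = [(score, head) for score, head in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by name
    return [head for _, head in scored[:top_n]]


# Toy training records, invented for illustration only.
training = [
    (["bark beetle", "pine forest", "insect damage"],
     ["Forest insects--Control", "Bark beetles"]),
    (["pine forest", "insect damage", "defoliation"], ["Forest insects"]),
    (["rapeseed", "plant breeding", "hybrid"], ["Brassica", "Plant breeding"]),
]
model = train(training)

# Key phrases as they might come out of an extractor such as Kea.
print(assign(model, ["insect damage", "pine forest", "beetle outbreak"]))
# -> ['forest insects', 'bark beetles']
```

In this sketch, a heading seen in more training records builds a larger phrase profile, so a ranking of this kind depends heavily on how much training data each heading has.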

Problems

LCSH presents an interesting classification problem because of its size: most classifiers, such as those used in machine learning, are designed to classify data into fewer than a dozen categories, but there are hundreds of thousands of LCSH. Despite the fact that we consider only the "head" term of each LCSH, we have encountered the following problems:

Performance (preliminary)

Our evaluation is based on experiments where we learn a model on the Google dataset and test it on the INFOMINE dataset. We found:

Examples

Here are some examples of the LCSH assigned to Internet resources by the INFOMINE editors and the LCSH classifier. We are pleased with the first few results, but the last ones highlight the problems with the current system.

Example 1
  INFOMINE editors:
    1. forest insects
  LCSH classifier:
    1. forest insects
    2. bark beetles
    3. borers (insects)
    4. tobacco hornworm
    5. scolytidae
    6. greenhouse whitefly
    7. agriculture in literature
    8. mountain pine beetle
    9. boiling water reactors
    10. pests

Example 2
  INFOMINE editors:
    1. BRASSICA
    2. CROPS
    3. PLANT BREEDING
  LCSH classifier:
    1. cruciferae
    2. buriats
    3. brassica
    4. phytophagous insects
    5. plants, effect of metals on
    6. blood groups in animals
    7. rapeseed
    8. hybridization, vegetable
    9. chromosome numbers
    10. rape (plant)

Example 3
  INFOMINE editors:
    1. Elections
    2. Presidents
    3. United States
    4. Web sites
  LCSH classifier:
    1. political conventions
    2. pressure groups
    3. political candidates
    4. political parties
    5. campaign management
    6. presidential candidates
    7. political oratory
    8. white earth indian reservation (minn.)
    9. communism and intellectuals
    10. watts riot, los angeles, calif., 1965

Example 4
  INFOMINE editors:
    1. CLIMATOLOGY
    2. ENVIRONMENTAL SCIENCES
    3. POLLUTION
  LCSH classifier:
    1. atmospheric chemistry
    2. meteorology
    3. continentality (meteorology)
    4. chemical oceanography
    5. multidimensional chromatography
    6. turbulent diffusion (meteorology)
    7. aerosols
    8. precipitation scavenging
    9. satellite meteorology
    10. chemistry, technical

Example 5
  INFOMINE editors:
    1. BIODIVERSITY
    2. CLIMATE
    3. CLIMATOLOGY
    4. CONSERVATION
    5. ECOLOGY
    6. ENVIRONMENTAL SCIENCES
    7. GEOGRAPHY
    8. NATURAL RESOURCES
    9. POLLUTION
    10. SOILS
    11. WATER RESOURCES
  LCSH classifier:
    1. conservation biology
    2. continentality (meteorology)
    3. estuarine biology
    4. turbulent diffusion (meteorology)
    5. extinction (biology)
    6. cloud physics
    7. agricultural pests
    8. lepidobatrachus laevis
    9. evaporation (meteorology)
    10. satellite meteorology

Future work

This classification project is a work in progress. In the near future, we plan to start using the new MARC dataset based on 24 million MELVYL records and comparing its performance to the Google training data. We are also exploring syntactic techniques for boosting accuracy, and methods for making the classifier prefer more general terms.