R&D Archive: Steve Jones' LCSH Classification Project
Note: This work has been superseded by the current LCSH assignment tool.
Steve Jones' LCSH Classification research
The INFOMINE LCSH classification research project, led by Dr. Steve Jones of the Department of Computer Science at The University of Waikato, aims to classify documents by assigning them Library of Congress Subject Headings (LCSH).
In our experiments, a set of keywords and key phrases serves as a document surrogate, summarizing a document's content. In the training stage, a model is learned that encapsulates the relationships between these key phrases and sets of LCSH. Once we have built the model, we can use it to assign LCSH to new documents by first automatically extracting key phrases from them (with Kea or PhraseRate), and then using the model to match these key phrases with a set of LCSH.
INFOMINE's goal is to take Dr. Jones' research and use it to generate LCSH metadata for Internet resources in the INFOMINE automatic record builder.
Learning a model
The model is learned from a training dataset consisting of documents that have both LCSH and key phrase metadata. Although documents annotated with LCSH are plentiful (e.g. in library catalogs), it is more difficult to find records that also have key phrases assigned. However, we have compiled several training datasets:
- MARC Records from library catalogs with LCSH and Table of Contents.
Librarians occasionally use Table of Contents metadata in place of key phrases; we take a similar approach, using Kea to extract the twenty most significant phrases from the Table of Contents. We compiled two datasets, one based on 1 million records from the UCR SCOTTY catalog, and when this proved too small, another based on 24 million records from the UC's combined MELVYL catalog.
- Google.
A second dataset is based on documents returned by Google searches for LCSH heads. This dataset can be arbitrarily large, but is of lower quality because the search results may not be relevant to the search terms. Currently, the dataset consists of:
- 287,000 Web pages
- 34,238 distinct LCSH
- 986,000 distinct key phrases (extracted with Kea)
- INFOMINE.
INFOMINE contains over 20,000 records that have been assigned both LCSH and key phrase metadata by professional catalogers. Although this dataset is too small for training purposes, we use it as our testing dataset. This is appropriate since our ultimate application will be to assist the INFOMINE adders in their cataloging duties.
The training data is used to build a model that represents the frequency of co-occurrence of key phrases and LCSH heads. To assign LCSH to new documents, key phrases are extracted from the new documents, then similarity measures are used to find the LCSH with the most similar set of co-occurring key phrases.
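As a rough illustration of this co-occurrence approach, here is a minimal Python sketch. It is not the project's implementation: the function names (`train`, `cosine`, `assign`), the choice of cosine similarity as the similarity measure, the `top_n=10` cutoff, and the toy records are all assumptions made for illustration. Key phrases are assumed to have been extracted already (e.g. by Kea).

```python
import math
from collections import Counter, defaultdict


def train(records):
    """Count how often each key phrase co-occurs with each LCSH head.

    `records` is an iterable of (key_phrases, lcsh_heads) pairs.
    Returns a mapping {lcsh_head: Counter({key_phrase: count})}.
    """
    model = defaultdict(Counter)
    for phrases, heads in records:
        for head in heads:
            model[head].update(phrases)
    return model


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(model, phrases, top_n=10):
    """Rank LCSH heads by similarity between their co-occurrence
    profile and the key phrases of a new document (assumed measure)."""
    query = Counter(phrases)
    scored = [(cosine(query, profile), head) for head, profile in model.items()]
    scored = [(score, head) for score, head in scored if score > 0]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [head for _, head in scored[:top_n]]


# Toy training records (hypothetical, not taken from any real dataset).
records = [
    (["bark beetle", "pine forest", "insect damage"], ["forest insects"]),
    (["bark beetle", "larvae"], ["bark beetles"]),
    (["election results", "campaign finance"], ["elections"]),
]
model = train(records)
print(assign(model, ["bark beetle", "pine forest"]))
```

With so few records the ranking is trivially clean. On real data, where most phrase/heading pairs are rare, the co-occurrence profiles are much sparser and the rankings correspondingly noisier.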
Problems
The LCSH present an interesting classification problem because of their sheer number: most classifiers, including those used in machine learning, are designed to sort data into fewer than a dozen categories, but there are hundreds of thousands of LCSH. Even though we consider only the "head" term of each LCSH, we have encountered the following problems:
- There are too many LCSH.
- LCSH are subjectively assigned to documents.
- The training sets are too small relative to the number of classifications: in other words, there is little key phrase/LCSH overlap. For example, even in our largest training set (Google), 87% of key phrase/LCSH pairs occur only once.
- Roughly 20% of the LCSH in the test dataset (INFOMINE) do not occur in the training datasets at all.
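The two sparsity figures above (pairs that occur only once, and test headings never seen in training) can be measured with a few lines of Python. This is a sketch under stated assumptions, not the project's code: it assumes the "head" term is the part of a heading before the first `--` subdivision separator, that headings are compared case-insensitively, and that the single-occurrence percentage is taken over distinct pairs. All function names are hypothetical.

```python
from collections import Counter


def head_term(heading):
    """Return the main heading, dropping any '--' subdivisions (assumed
    definition of the "head" term), lowercased for comparison."""
    return heading.split("--", 1)[0].strip().lower()


def pair_counts(records):
    """Count (key phrase, LCSH head) pairs across records.

    Each record is a (key_phrases, headings) pair and contributes
    at most one count per distinct pair.
    """
    counts = Counter()
    for phrases, headings in records:
        heads = {head_term(h) for h in headings}
        for phrase in set(phrases):
            for head in heads:
                counts[(phrase, head)] += 1
    return counts


def singleton_fraction(counts):
    """Fraction of distinct pairs that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def unseen_fraction(train_records, test_records):
    """Fraction of distinct test heads that never appear in training."""
    train_heads = {head_term(h) for _, hs in train_records for h in hs}
    test_heads = {head_term(h) for _, hs in test_records for h in hs}
    if not test_heads:
        return 0.0
    return len(test_heads - train_heads) / len(test_heads)
```

A high singleton fraction means most co-occurrence counts rest on a single observation, and any heading counted by `unseen_fraction` can never be predicted, however good the similarity measure is.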
Performance (preliminary)
Our evaluation is based on experiments where we learn a model on the Google dataset and test it on the INFOMINE dataset. We found:
- Absolute performance, measured by the number of exact classifications, is low.
- Although absolute LCSH matches are infrequent, many predicted LCSH are related to the actual LCSH of a document.
- The automatic classifier tends to assign very specific LCSH, whereas human classifiers tend to assign more general terms.
- Once the model is loaded, LCSH can be assigned in a few seconds.
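One way to quantify "exact classifications" is to count predicted headings that exactly match an editor-assigned heading and report precision and recall. The sketch below is an assumption about how such a measure could be computed, not a description of the evaluation we ran. The function name `evaluate` is hypothetical, matching is case-insensitive, and the sample data is copied from the first two rows of the Examples section.

```python
def evaluate(predicted, actual):
    """Micro-averaged exact-match precision and recall.

    `predicted` and `actual` are parallel lists with one list of
    headings per document. Matching ignores case.
    """
    matched = n_predicted = n_actual = 0
    for pred, gold in zip(predicted, actual):
        pred_set = {p.lower() for p in pred}
        gold_set = {g.lower() for g in gold}
        matched += len(pred_set & gold_set)
        n_predicted += len(pred_set)
        n_actual += len(gold_set)
    precision = matched / n_predicted if n_predicted else 0.0
    recall = matched / n_actual if n_actual else 0.0
    return precision, recall


# Headings assigned by the INFOMINE editors (first two example rows).
actual = [
    ["forest insects"],
    ["BRASSICA", "CROPS", "PLANT BREEDING"],
]
# Headings assigned by the LCSH classifier for the same two resources.
predicted = [
    ["forest insects", "bark beetles", "borers (insects)", "tobacco hornworm",
     "scolytidae", "greenhouse whitefly", "agriculture in literature",
     "mountain pine beetle", "boiling water reactors", "pests"],
    ["cruciferae", "buriats", "brassica", "phytophagous insects",
     "plants, effect of metals on", "blood groups in animals", "rapeseed",
     "hybridization, vegetable", "chromosome numbers", "rape (plant)"],
]
precision, recall = evaluate(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

An exact-match measure like this gives no credit for related headings such as "bark beetles" or "cruciferae", which is why it understates how useful the suggestions may be to a cataloger.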
Examples
Here are some examples of the LCSH assigned to Internet resources by the INFOMINE editors and the LCSH classifier. We are pleased with the first few results, but the last ones highlight the problems with the current system.
| INFOMINE editors | LCSH classifier |
| --- | --- |
| forest insects | forest insects; bark beetles; borers (insects); tobacco hornworm; scolytidae; greenhouse whitefly; agriculture in literature; mountain pine beetle; boiling water reactors; pests |
| BRASSICA; CROPS; PLANT BREEDING | cruciferae; buriats; brassica; phytophagous insects; plants, effect of metals on; blood groups in animals; rapeseed; hybridization, vegetable; chromosome numbers; rape (plant) |
| Elections; Presidents; United States; Web sites | political conventions; pressure groups; political candidates; political parties; campaign management; presidential candidates; political oratory; white earth indian reservation (minn.); communism and intellectuals; watts riot, los angeles, calif., 1965 |
| CLIMATOLOGY; ENVIRONMENTAL SCIENCES; POLLUTION | atmospheric chemistry; meteorology; continentality (meteorology); chemical oceanography; multidimensional chromatography; turbulent diffusion (meteorology); aerosols; precipitation scavenging; satellite meteorology; chemistry, technical |
| BIODIVERSITY; CLIMATE; CLIMATOLOGY; CONSERVATION; ECOLOGY; ENVIRONMENTAL SCIENCES; GEOGRAPHY; NATURAL RESOURCES; POLLUTION; SOILS; WATER RESOURCES | conservation biology; continentality (meteorology); estuarine biology; turbulent diffusion (meteorology); extinction (biology); cloud physics; agricultural pests; lepidobatrachus laevis; evaporation (meteorology); satellite meteorology |
Future work
This classification project is a work in progress. In the near future, we plan to start using the new MARC dataset based on 24 million MELVYL records and to compare its performance with that of the Google training data. We are also exploring syntactic techniques for boosting accuracy, and methods for making the classifier prefer more general terms.