INFOMINE's LCC classification research aimed to automatically assign documents a Library of Congress Classification (LCC). The project was led by Dr. Eibe Frank, a Senior Lecturer at the Department of Computer Science of The University of Waikato, and Dr. Gordon W. Paynter at INFOMINE.
The aim of this project is to assign one or more classifications from the LCC Outline to each resource based on the set of Library of Congress Subject Headings (LCSH) associated with that resource. Other projects have explored similar territory: most of these are based on information retrieval techniques and use additional metadata, such as the document title. We have chosen instead to use a smaller, controlled feature set (LCSH head terms) and to base our classifier on Support Vector Machine algorithms, which represent the state of the art in machine learning techniques for text classification.
INFOMINE has two applications for this research. First, we used it to assign an LCC to the 23,000 existing INFOMINE records: for historical reasons, these records have LCSH metadata, but no LCC metadata. Second, we use this project in combination with our LCSH assignment tool to assign both LCSH and LCC to new Internet resources.
We have applied the LCC classifier to every Internet resource in INFOMINE, and created a browsing interface to this structure. This interface is a work in progress, and (for reasons discussed below) many of the resources are not classified, or are erroneously classified as QA1-43. The current assignment uses a model trained on 1,600,000 records from the library catalog of the University of California, Riverside.
LCSH to LCC is a free Java implementation of the LCC classifier. It is released under the GNU General Public License, and is available from the iVia download page.
The electronic version of the LCC Outline used in this project is the same one used by the Pharos project. Its origin is described in Ron Dolin's doctoral thesis. The original can be downloaded from the Pharos project, and a reformatted version is available from the download page.
A paper describing this project was published in JASIST: Predicting Library of Congress Classifications From Library of Congress Subject Headings.
LCSH to LCC uses a model to encapsulate the relationships between LCSH and LCC. This model is learned from a training dataset that consists of documents with both LCSH and LCC metadata assigned. Our principal dataset is based on 1,600,000 examples drawn from the University of California library catalog (MELVYL).
The model itself is based on the LCC's hierarchical structure. A Support Vector Machine is built for each node in the LCC hierarchy; each can classify an example as relevant to that node or to one of its children. To classify a new example, its LCSH are "filtered down" from the root node of the tree to more specific classifications.
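The filtering-down step can be sketched as follows. This is a minimal illustration in Python, not the project's Java code: the `Node` class, its `score` method, and the per-heading weights are hypothetical stand-ins for the trained Support Vector Machine at each node, and the toy tree covers only a handful of LCC Outline entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    """One node of a toy LCC hierarchy."""
    label: str
    # Hypothetical stand-in for the node's trained SVM: one weight per LCSH.
    weights: Dict[str, float] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def score(self, lcsh: Set[str]) -> float:
        """Sum the weights of the headings present in the example."""
        return sum(self.weights.get(h, 0.0) for h in lcsh)


def classify(root: Node, lcsh: Set[str]) -> List[str]:
    """Filter an example down from the root and return the labels visited."""
    path: List[str] = []
    node = root
    while node.children:
        best = max(node.children, key=lambda child: child.score(lcsh))
        # Stay at the current node when no child looks relevant.
        if best.score(lcsh) <= 0.0:
            break
        path.append(best.label)
        node = best
    return path


# Toy hierarchy (assumed for illustration only).
qa = Node("QA Mathematics", {"Algebra": 1.0, "Geometry": 1.0})
qc = Node("QC Physics", {"Optics": 1.0, "Thermodynamics": 1.0})
q = Node(
    "Q Science",
    {"Algebra": 1.0, "Geometry": 1.0, "Optics": 1.0, "Thermodynamics": 1.0},
    [qa, qc],
)
n = Node("N Fine Arts", {"Painting": 1.0})
root = Node("LCC", children=[q, n])

print(classify(root, {"Optics"}))  # ['Q Science', 'QC Physics']
```

An example whose headings match no child at some level stops there, so the returned path can end at an internal node rather than a leaf.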
We measure LCSH to LCC's absolute performance by using it to classify a set of 50,000 MARC records. Absolute performance (the proportion of records for which it predicts exactly the correct LCC) is around 58% for the best model, which compares well to similar work. We also attempt to measure near-misses: for example, 4% of the classifications were too specific and 3% were too general.
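One way to make the exact, too-specific and too-general categories concrete is to treat each classification as a path from the top of the LCC tree. The sketch below is an assumption about how such near-misses could be counted, not a description of the project's evaluation code; the `outcome` and `summarize` names and the sample paths are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def outcome(predicted: Path, actual: Path) -> str:
    """Compare a predicted LCC path with the correct one."""
    if predicted == actual:
        return "exact"
    # Predicted path continues below the correct node.
    if len(predicted) > len(actual) and predicted[:len(actual)] == actual:
        return "too specific"
    # Predicted path stops above the correct node.
    if len(predicted) < len(actual) and actual[:len(predicted)] == predicted:
        return "too general"
    return "wrong"


def summarize(pairs: List[Tuple[Path, Path]]) -> Dict[str, float]:
    """Return the fraction of (predicted, actual) pairs in each category."""
    counts = Counter(outcome(p, a) for p, a in pairs)
    categories = ("exact", "too specific", "too general", "wrong")
    return {c: counts[c] / len(pairs) for c in categories}


# Illustrative pairs only; these are not results from the project.
pairs = [
    (("Q", "QA"), ("Q", "QA")),            # exact
    (("Q", "QA", "QA1-43"), ("Q", "QA")),  # too specific
    (("Q",), ("Q", "QC")),                 # too general
    (("N",), ("Q", "QC")),                 # wrong
]
print(summarize(pairs))
```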
We found that when we use more training data, performance increases. The graphs below measure absolute accuracy (vertical axis) as the number of training records is increased, when 1, 2, 5, 10 and 15 LCC are extracted (training and test sets are independent).

We are particularly interested in LCC Schedule accuracy: the proportion of records that were correctly classified at the top level of the tree. Top-level accuracy is currently around 80%.
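As a rough illustration, top-level accuracy can be computed by comparing only the main class of each prediction. The sketch assumes that the main class is identified by the first letter of the class string (for example, Q for QA1-43); the function names and sample values are hypothetical.

```python
from typing import Sequence


def schedule_of(lcc: str) -> str:
    """Return the top-level class letter, e.g. 'Q' for 'QA1-43' (assumed convention)."""
    return lcc.strip()[:1].upper()


def schedule_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of records whose predicted top-level class matches the correct one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    hits = sum(schedule_of(p) == schedule_of(a) for p, a in zip(predicted, actual))
    return hits / len(actual)


# Illustrative values only; these are not results from the project.
predicted = ["QA1-43", "QC", "ND", "PR"]
actual = ["QA", "QD", "N", "Z"]
print(schedule_accuracy(predicted, actual))  # 0.75
```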
LCSH to LCC is released under the GNU General Public License, and relies on two other free software packages: the Waikato Environment for Knowledge Analysis (WEKA) for its Support Vector Machine implementation, and the Java MARC Events (James) package for extracting training data from MARC records. Versions of both packages are included in LCSH to LCC.
The main problem encountered is that some records in the test set have no LCSH in common with the records in the training data. When this happens, the model "guesses" that the record belongs in the most popular classification from the training dataset: Science > Mathematics > General (QA1-43). We have modified the classifier so that it (optionally) makes no prediction in these situations.
A more subtle problem occurs when the training data contains records that have a good primary LCSH and an additional LCSH that is only slightly related to the record topic and that does not occur elsewhere in the training data. In these cases, the model will associate the slightly related LCSH with the original topic, and is likely to make misleading predictions about any new resource containing this LCSH. We can solve this problem by increasing the quantity of training data or by insisting that every LCSH must occur in several examples before we will assign it.
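The optional "no prediction" behaviour might look something like the sketch below. It is not taken from the LCSH to LCC source: `predict_or_abstain`, the `abstain` flag, and `toy_model` are hypothetical names, and the toy model simply mimics the fallback to QA1-43 described above.

```python
from typing import Callable, List, Optional, Set

DEFAULT_CLASS = "QA1-43"  # most popular class in the training data, per the text


def predict_or_abstain(
    lcsh: List[str],
    known_lcsh: Set[str],
    model: Callable[[List[str]], str],
    abstain: bool = True,
) -> Optional[str]:
    """Return the model's prediction, or None when no heading was seen in training."""
    if abstain and not (set(lcsh) & known_lcsh):
        return None
    return model(lcsh)


def toy_model(lcsh: List[str]) -> str:
    """Hypothetical stand-in for the trained classifier."""
    table = {"Optics": "QC", "Painting": "ND"}
    for heading in lcsh:
        if heading in table:
            return table[heading]
    # With no usable evidence, the model falls back to the most popular class.
    return DEFAULT_CLASS


known = {"Optics", "Painting"}
print(predict_or_abstain(["Basket making"], known, toy_model))                 # None
print(predict_or_abstain(["Basket making"], known, toy_model, abstain=False))  # QA1-43
print(predict_or_abstain(["Optics"], known, toy_model))                        # QC
```

Returning `None` lets the caller distinguish "no evidence" from a genuine QA1-43 prediction, which the fallback alone cannot do.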
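The second remedy, a minimum number of supporting examples per LCSH, can be sketched as a preprocessing pass over the training records. The function names, the `min_examples` threshold, and the sample records are assumptions made for illustration, not details of the released software.

```python
from collections import Counter
from typing import List, Set, Tuple

Record = Tuple[List[str], str]  # (LCSH of the record, its LCC)


def frequent_lcsh(records: List[Record], min_examples: int = 2) -> Set[str]:
    """Return the headings that occur in at least min_examples records."""
    counts: Counter = Counter()
    for lcsh, _lcc in records:
        counts.update(set(lcsh))  # count each heading once per record
    return {h for h, n in counts.items() if n >= min_examples}


def prune(records: List[Record], min_examples: int = 2) -> List[Record]:
    """Drop rare headings from every record before training."""
    keep = frequent_lcsh(records, min_examples)
    return [([h for h in lcsh if h in keep], lcc) for lcsh, lcc in records]


# Toy training data: "Cats in art" appears once, attached to a physics record.
records = [
    (["Optics"], "QC"),
    (["Optics", "Cats in art"], "QC"),
    (["Painting"], "ND"),
    (["Painting"], "ND"),
]
print(prune(records)[1])  # (['Optics'], 'QC')
```

After pruning, the rare heading no longer pulls new resources toward the unrelated class, at the cost of discarding evidence that might occasionally have been useful.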
The LCSH to LCC software is no longer under development, but the Java source code remains available from the iVia downloads page.