iVia LCSH Metadata Assignment

This page describes the iVia Library of Congress Subject Heading (LCSH) assignment algorithm.

Note: This page will be (slightly) out of date after the algorithm is re-implemented in late 2005.

LCSH Metadata

The LCSH are, as the name suggests, a set of over 200,000 Subject Headings created and maintained by the Library of Congress. They consist of short, descriptive phrases, which may be modified by combining a Head term with one or more subdivisions: for example, the Head term History might have two subdivisions added to produce History -- United States -- 19th Century. Usually, a set of LCSH will be assigned to a document. The Library of Congress assignment rules suggest that the first LCSH represent the primary topic, and subsequent LCSH represent less-important topics, but indexers do not always agree on a document's topics, and the practice is not used in INFOMINE, so it is ignored in this analysis.

The primary use of LCSH, both in library catalogs and in INFOMINE, is for retrieval, as experienced librarians can perform very discriminating searches based on their knowledge of LCSH. They are also used for detailing the content of a document in a concise form when records are displayed.

The LCSH Assignment Algorithm

iVia assigns LCSH using a classification process that is conceptually related to k-nearest-neighbor or locally-weighted learning, but which has been extended to handle examples that routinely belong to several categories, and category vocabularies that are very large.

The assignment process depends on a supply of expert-assigned training data, in our case the INFOMINE collection. LCSH are assigned to a new document in two stages: first, the set of documents that are the most similar to the new document are discovered; and second, the LCSH that are assigned to the similar documents are retrieved, and the most popular are assigned to the new document. Each stage has several complexities.

In the first stage, the goal is to find the set of N most-similar documents (in practice, N is set to 15). This set is approximated by exploiting iVia's strengths. The Key phrase assignment module extracts a set of key phrases from the document, and these are formed into a disjunctive query for iVia's search engine [27]. For example, if the key phrases Africa, Sahara desert, and sand dunes are assigned, they are combined in the query africa OR (sahara AND desert) OR (sand AND dunes). The query is run against the Title, Key phrases, Description and full text fields of the iVia database, the results are sorted using the standard iVia ranking system, and the 3N best results (45 when N is 15) are retrieved. Next, each result is compared to the original document using the cosine measure, which yields a similarity score between 0 and 1. The N most similar documents are then chosen.
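The first stage can be sketched in Python as follows. This is a minimal illustration, not the iVia implementation: the function names are hypothetical, the search engine call is left out, and the cosine measure is computed over raw term counts, which is an assumption (the page does not say how terms are weighted).

```python
import math
from collections import Counter


def build_disjunctive_query(key_phrases):
    """OR the key phrases together; AND the words inside a multi-word phrase."""
    clauses = []
    for phrase in key_phrases:
        words = phrase.lower().split()
        if not words:
            continue
        if len(words) == 1:
            clauses.append(words[0])
        else:
            clauses.append("(" + " AND ".join(words) + ")")
    return " OR ".join(clauses)


def cosine_similarity(text_a, text_b):
    """Cosine measure over raw term-frequency vectors (assumed weighting)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target_text, candidates, n=15):
    """Re-rank the 3N search results (doc id -> text) and keep the top n."""
    scored = [(cosine_similarity(target_text, text), doc_id)
              for doc_id, text in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:n]]


print(build_disjunctive_query(["Africa", "Sahara desert", "sand dunes"]))
# prints: africa OR (sahara AND desert) OR (sand AND dunes)
```

Because term counts are never negative, the cosine score always falls between 0 and 1, which matches the range given above.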

The second stage proceeds by retrieving the LCSH metadata from each of the similar records and assigning a score to each full LCSH that occurs. The score is calculated by adding together the similarity scores of the documents that the LCSH appears in. This can be thought of as a voting process, where each of the similar documents "votes" for its LCSH, and each vote is weighted according to how similar the voting document is to the target document. LCSH that appear in only one of the documents are eliminated, and the six remaining LCSH with the highest scores are assigned.
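The voting step can be sketched as below. The function name, the input layout, and the alphabetical tie-break are assumptions made for illustration; the thresholds (at least two documents, at most six headings) come from the description above.

```python
from collections import defaultdict


def assign_lcsh(similar_docs, max_headings=6, min_votes=2):
    """similar_docs: list of (similarity, [full LCSH strings]) pairs."""
    scores = defaultdict(float)
    votes = defaultdict(int)
    for similarity, headings in similar_docs:
        for heading in set(headings):  # one vote per document per heading
            scores[heading] += similarity
            votes[heading] += 1
    # Drop headings that appear in only one of the similar documents.
    kept = [h for h in scores if votes[h] >= min_votes]
    # Highest score first; ties broken alphabetically (an assumption).
    kept.sort(key=lambda h: (-scores[h], h))
    return kept[:max_headings]


docs = [
    (0.9, ["Deserts -- Africa", "Sand dunes"]),
    (0.5, ["Deserts -- Africa", "Periodicals"]),
    (0.4, ["Sand dunes"]),
]
print(assign_lcsh(docs))
# prints: ['Deserts -- Africa', 'Sand dunes']
```

In this toy input, Periodicals is discarded because only one document votes for it, even though that document is fairly similar to the target.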

LCSH Assignment Evaluation

We evaluate the LCSH assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.

Evaluation Measures

Metadata evaluation metrics are explained on the iVia metadata evaluation page.

LCSH are a controlled vocabulary, and multiple values are assigned, so subfield precision and recall are the obvious measures of performance. However, the number of LCSH is very large, and the number of potential combinations even larger, so different sources will often assign different LCSH (even human catalogers will frequently disagree) and a more complex metric that allows for near misses is appropriate. Therefore we consider the content word precision and recall, and the LCSH Head precision and recall. The latter measure is the same as subfield precision and recall, but uses unique LCSH Heads as subfields (instead of the full LCSH).
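To illustrate how LCSH Head precision and recall credit near misses, here is a small sketch for a single record. It assumes that the Head is the text before the first " -- " separator and that Heads are compared as exact, case-sensitive strings; the evaluation tool may normalize them differently.

```python
def lcsh_head(heading):
    """Return the Head term: the part before the first subdivision."""
    return heading.split("--")[0].strip()


def head_precision_recall(expert, assigned):
    """Compare the unique Heads of the expert and assigned LCSH."""
    expert_heads = {lcsh_head(h) for h in expert}
    assigned_heads = {lcsh_head(h) for h in assigned}
    matching = len(expert_heads & assigned_heads)
    precision = matching / len(assigned_heads) if assigned_heads else 0.0
    recall = matching / len(expert_heads) if expert_heads else 0.0
    return precision, recall


expert = ["History -- United States -- 19th Century", "Periodicals"]
assigned = ["History -- Europe", "California", "Periodicals -- Directories"]
precision, recall = head_precision_recall(expert, assigned)
print(f"{precision:.4f} {recall:.4f}")
# prints: 0.6667 1.0000
```

None of the three assigned LCSH matches an expert LCSH in full, so subfield precision and recall would both be 0 for this record, while the Head measure gives credit for History and Periodicals.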

Running an evaluation

The evaluate_metadata_assignment program will evaluate LCSH assignment if the configuration file has the variable subjects = "Library of Congress Subject Headings" defined in the [Fields] section. An example of the LCSH output is shown below.

Library of Congress Subject Headings
Field name: subjects
Number of passes: 320
Number of attempts: 680
Number of exact matches: 5
Exact match accuracy: 0.0074
Average length of expert metadata in letters: 108.1
Average length of assigned metadata in letters: 104.8
Average length of expert metadata in words: 12.1
Average length of assigned metadata in words: 12.1
Total number of expert content words: 4939
Total number of assigned content words: 5104
Total number of matching content words: 1633
Content word precision: 0.3199
Content word recall: 0.3306
Total number of expert subfields: 2647
Total number of assigned subfields: 2936
Total number of matching subfields: 569
Subfield precision: 0.1938
Subfield recall: 0.2150
Total number of expert LCSH heads: 2353
Total number of assigned LCSH heads: 2636
Total number of matching LCSH heads: 674
LCSH head precision: 0.2557
LCSH head recall: 0.2864
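
The ratios in this sample output can be reproduced from the counts. The sketch below assumes that precision is the matching count divided by the assigned count, that recall is the matching count divided by the expert count, and that exact match accuracy is the number of exact matches divided by the number of attempts; these definitions are inferred from the figures above rather than taken from the tool's source.

```python
def precision_recall(matching, assigned, expert):
    """Return (precision, recall) for one pair of totals."""
    return matching / assigned, matching / expert


rows = {
    "content word": (1633, 5104, 4939),
    "subfield": (569, 2936, 2647),
    "LCSH head": (674, 2636, 2353),
}
for name, (matching, assigned, expert) in rows.items():
    p, r = precision_recall(matching, assigned, expert)
    print(f"{name}: {p:.4f} {r:.4f}")
# prints:
# content word: 0.3199 0.3306
# subfield: 0.1938 0.2150
# LCSH head: 0.2557 0.2864

print(f"exact match accuracy: {5 / 680:.4f}")
# prints: exact match accuracy: 0.0074
```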

Results

Row Method Tries SFP SFR LHP LHR CWP CWR
1 Current 683 of 1000 0.1894 0.2110 0.3136 0.3259 0.2505 0.2821
2 Chan [4] 100 of 100 0.3684 0.3271 0.6105 0.5421

The results table shows an evaluation of the LCSH assignment algorithm. The columns identify the method used; the number of times the method was able to make a prediction; the subfield precision (SFP) and recall (SFR); the LCSH Head precision (LHP) and recall (LHR); and the content-word precision (CWP) and recall (CWR). Row 1 shows the results for the current algorithm, described above. As in the case of uncontrolled key phrases, the precision and recall are low, and they disguise many near-misses.

LCSH assignment is particularly difficult for machine learning systems because good results depend on good training data, and there are so many different LCSH that it is difficult to find training data relating to all of them. The algorithm can therefore only assign LCSH accurately when there is already a set of similar documents that have been assigned appropriate LCSH by human experts. In INFOMINE, when no similar documents exist, the search engine tends to return a set of documents that have few or coincidental similarities to the target document, which biases the assignment towards very common LCSH like California, Periodicals, and Web Sites -- Directories.

Discussion

The LCSH assignment algorithm reduces the complexity of the LCSH problem using a form of locally-weighted learning. Instead of trying to assign six descriptors from the hundreds of thousands of possible values, we narrow the possible descriptors to the small subset that occurs in the 15 most-similar documents. This requires considerable computation at classification time (as opposed to during training), and has the potential to be very slow. However, iVia's search functions have been optimized for speed and relevance over a period of several years, so the LCSH assignment speed is still acceptable.