iVia INFOMINE Category Metadata Assignment

This page describes the iVia INFOMINE Category assignment algorithm.

INFOMINE Category Metadata

The INFOMINE Categories are used to divide an iVia collection into sub-collections that can be used to filter searches and restrict browsing to general topic areas. The set of collections can be changed but usually includes seven topic-based collections (Biological, Agricultural and Medical Sciences, Business and Economics, Humanities, Visual and Performing Arts, etc.), one format-based collection (Electronic Journals), and one publisher-based collection (Government Publications). Every expert-created record in INFOMINE is assigned to one or more of these categories.

The INFOMINE Category Assignment Algorithm

iVia assigns INFOMINE Categories using a set of binary classifiers, each of which is responsible for assigning a single category based on the text of a document. Two steps are involved: training and classification. The training process is run weekly as part of iVia's automated maintenance scripts, and takes several hours, while the classification step is invoked whenever an assignment is requested, and is almost instantaneous.

The training step builds a set of category classifiers, each of which is a probabilistic binary classifier that classifies a new example as either belonging to, or not belonging to, a particular category. The training data are INFOMINE records that describe Web pages (and whose full text is available). Each category classifier builds its own training set, which must have at least 1000 positive and negative training examples; otherwise the category is skipped (at INFOMINE, the smallest category, Cultural Diversity, is routinely skipped). The training data are then reduced in three ways: the positive and negative classes are each limited to 10,000 training examples, the ratio of negative to positive examples (and vice versa) is kept to no greater than 3:1, and the number of features is limited to 50,000 words using Chi-squared feature selection. Once the training set is built, the category classifier is trained on the data. In the current implementation, the category classifiers are binary Logistic Regression classifiers.
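The training-set limits described above can be sketched in a few lines of Python. This is only an illustration, not the iVia implementation: the function names are hypothetical, the minimum of 1000 is read as applying to each class separately, random sampling is an assumed way of discarding surplus examples, and documents are assumed to be represented as sets of words.

```python
import random

MIN_EXAMPLES = 1000     # minimum per class (from the text; per-class reading assumed)
MAX_EXAMPLES = 10_000   # cap per class (from the text)
MAX_RATIO = 3           # largest allowed class imbalance, 3:1 (from the text)
MAX_FEATURES = 50_000   # number of words kept (from the text)


def reduce_training_set(positives, negatives, rng=None):
    """Apply the size limits; return (pos, neg), or None to skip the category."""
    rng = rng or random.Random(0)
    if len(positives) < MIN_EXAMPLES or len(negatives) < MIN_EXAMPLES:
        return None  # too little data, so the category is skipped
    # Cap each class; random sampling is an assumption of this sketch.
    pos = rng.sample(positives, min(len(positives), MAX_EXAMPLES))
    neg = rng.sample(negatives, min(len(negatives), MAX_EXAMPLES))
    # Shrink the larger class until the imbalance is at most MAX_RATIO:1.
    if len(neg) > MAX_RATIO * len(pos):
        neg = rng.sample(neg, MAX_RATIO * len(pos))
    elif len(pos) > MAX_RATIO * len(neg):
        pos = rng.sample(pos, MAX_RATIO * len(neg))
    return pos, neg


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 word/class contingency table.

    a: positive documents containing the word, b: negative documents containing it,
    c: positive documents without it,          d: negative documents without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(pos_docs, neg_docs, limit=MAX_FEATURES):
    """Keep the `limit` words with the highest chi-squared score."""
    vocab = set().union(*pos_docs, *neg_docs)

    def score(word):
        a = sum(word in doc for doc in pos_docs)
        b = sum(word in doc for doc in neg_docs)
        return chi_squared(a, b, len(pos_docs) - a, len(neg_docs) - b)

    return sorted(vocab, key=score, reverse=True)[:limit]
```

With 1,200 positive and 50,000 negative examples, for instance, this sketch first caps the negatives at 10,000 and then reduces them to 3,600 to satisfy the 3:1 limit.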

The classification step is invoked to assign INFOMINE Category metadata to a new resource. Each of the category classifiers is loaded and applied to the text of the new document, and their individual predictions are combined to assign a set of categories. The procedure for combining individual assignments is:

There is one exception to this process, specified by the INFOMINE Editors. If a resource is hosted on a server whose domain name ends in .gov, then it is always assigned the category Government Publications, regardless of the output of the relevant category classifier.
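The .gov exception can be illustrated with a short Python sketch. The function name apply_gov_rule is hypothetical, and the sketch assumes the classifiers' predictions have already been combined into a set of category names; it does not attempt to reproduce the combination procedure itself.

```python
from urllib.parse import urlsplit

GOVPUB = "Government Publications"


def apply_gov_rule(url, categories):
    """Return the category set with the .gov exception applied.

    `categories` is the set already produced by combining the classifier
    outputs (how that combination works is outside this sketch).
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    result = set(categories)
    if host.endswith(".gov"):
        # Always assigned, whatever the Government Publications classifier said.
        result.add(GOVPUB)
    return result
```

Only the host name is tested, so a URL such as http://www.example.com/report.gov is not affected, and neither is a host such as example.gov.uk.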

INFOMINE Category Assignment Evaluation

We evaluate the INFOMINE Category assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.

Evaluation Measures

Metadata evaluation metrics are explained on the iVia metadata evaluation page.

The INFOMINE Categories are a simple example of a controlled metadata element which allows multiple assignments. Consequently, the two primary evaluation metrics are subfield precision and subfield recall. Because the number of possible values is so small, and the quality and quantity of training data is high, we can frequently assign a record's entire set of categories correctly, so it is also useful to measure the exact match accuracy.

Running an evaluation

The evaluate_metadata_assignment program will evaluate INFOMINE Category assignment if the configuration file has the variable categories = "INFOMINE Categories" defined in the [Fields] section. An example of the INFOMINE Category output is shown below.

INFOMINE Categories
Field name: categories
Number of examples:      1000
Number of passes:        23
Number of attempts:      977
Number of exact matches: 656
Exact match accuracy:    0.6560
Average length of expert metadata in letters:   11.0
Average length of assigned metadata in letters: 11.0
Average length of expert metadata in words:   1.5
Average length of assigned metadata in words: 1.5
Total number of expert content words:   1496
Total number of assigned content words: 1494
Total number of matching content words: 1237
Content word precision: 0.8280
Content word recall:    0.8269
Content word f-measure: 0.8274
Total number of expert stemmed content words:   1496
Total number of assigned stemmed content words: 1494
Total number of matching stemmed content words: 1237
Stemmed content word precision: 0.8280
Stemmed content word recall:    0.8269
Stemmed content word f-measure: 0.8274
Total number of expert subfields:   1496
Total number of assigned subfields: 1494
Total number of matching subfields: 1237
Subfield precision: 0.8280
Subfield recall:    0.8269
Subfield f-measure: 0.8274
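The ratios in the sample output can be reproduced from its counts. The Python sketch below is an illustration rather than the evaluation tool's code: the function names are hypothetical, each record's categories are assumed to be held as a set of names, and the definitions used (precision as matching over assigned, recall as matching over expert, exact match accuracy as exact matches over the number of examples) are inferred from the figures shown above.

```python
def count_subfields(expert_sets, assigned_sets):
    """Count matching, assigned, and expert subfields plus exact matches.

    Each argument is a list with one set of category names per record.
    """
    pairs = list(zip(expert_sets, assigned_sets))
    matching = sum(len(expert & assigned) for expert, assigned in pairs)
    n_assigned = sum(len(assigned) for _, assigned in pairs)
    n_expert = sum(len(expert) for expert, _ in pairs)
    exact = sum(expert == assigned for expert, assigned in pairs)
    return matching, n_assigned, n_expert, exact


def precision_recall_f(matching, n_assigned, n_expert):
    """Subfield precision, recall, and f-measure (harmonic mean of the two)."""
    precision = matching / n_assigned if n_assigned else 0.0
    recall = matching / n_expert if n_expert else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Counts taken from the sample output above.
p, r, f = precision_recall_f(matching=1237, n_assigned=1494, n_expert=1496)
print(f"{p:.4f} {r:.4f} {f:.4f}")   # 0.8280 0.8269 0.8274
print(f"{656 / 1000:.4f}")          # exact match accuracy: 0.6560
```

Note that the exact match accuracy in the sample output is computed over all 1000 examples rather than over the 977 attempts.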

Results

Row  Method               Tries  Number  SFP     SFR     EMA     Date
1    Current              981    1.5     0.8195  0.8854  0.6960  2005-01-28
2    No GovPub rule       980    1.5     0.8126  0.8826  0.6920  2005-01-28
3    Naive Bayes variant  382    1       0.7357  0.5249  0.4467  2005-01-06
4    Old kNN variant      676    1       0.7204  0.6040  n/a     2004-11-22

The results table shows the results of several evaluations of INFOMINE Category assignment. The columns identify the method being used for assignment, the number of times an assignment was attempted (Tries), the average number of categories assigned per record (Number), the subfield precision (SFP), subfield recall (SFR) and exact match accuracy (EMA). The rightmost column is the date that the evaluation was performed: the same process was used to select the training and test data for each evaluation, but different documents were used in the earlier evaluations, reflecting the state of the INFOMINE database on those dates.

The first row shows the current method, described above. On average 1.5 categories were assigned to each record (the same as the human experts), with precision of 82% and recall of 89%. Row 2 shows that removing the special rule for assigning Government Publications to .gov Web sites has almost no effect.

The next rows show the equivalent statistics for two other assignment methods. Row 3 shows the result of replacing the set of Logistic Regression classifiers with a set of Naive Bayes classifiers [32], and shows that Logistic Regression outperforms Naive Bayes by every measure: more assignments are attempted, and precision, recall, and exact match accuracy are all much greater. Row 4 shows the results for an older assignment algorithm, which operated in the same way as the LCSH assignment algorithm. This algorithm is also inferior to the current algorithm by every measure reported, though it is arguably preferable to the Naive-Bayes-based method (row 3), as it makes predictions in more cases yet has comparable precision and superior recall.

Discussion

One significant limitation of this algorithm is its dependence on the Logistic Regression (LR) classifier; see the Logistic Regression page for more background.