INFOMINE Category Assignment

The INFOMINE Categories are the ten top-level sub-collections, such as Biological, Agricultural and Medical Sciences, which are accessible from the iVia homepage. We have developed an algorithm, based on Logistic Regression classification, that assigns one or more INFOMINE Categories to a document.

How it works

iVia assigns INFOMINE Categories using a set of binary classifiers, each of which is responsible for assigning a single category based on the text of a document. Two steps are involved: training and classification. The training process is run weekly as part of iVia's automated maintenance scripts, and takes several hours, while the classification step is invoked whenever an assignment is requested, and is almost instantaneous.

The training step builds a set of category classifiers, each of which is a probabilistic binary classifier that classifies a new example as either belonging or not belonging to a particular category. The training data are INFOMINE records that describe Web pages and whose full text is available. Each category classifier builds its own training set, which must contain at least 1,000 positive and negative training examples; otherwise the category is skipped (the smallest INFOMINE Category, Cultural Diversity, is routinely skipped). The training set is then reduced in three ways: the positive and negative classes are each limited to 10,000 training examples; the ratio of negative to positive examples (and vice versa) is kept to no more than 3:1; and the number of features is limited to 50,000 words using Chi-squared feature selection. The category classifier is then trained on the reduced set. In the current implementation, the category classifiers are binary Logistic Regression classifiers.
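
The reduction rules above can be sketched in Python. This is only an illustration, not iVia's code: the function names are hypothetical, the minimum of 1,000 is assumed to apply to each class separately, surplus examples are assumed to be dropped by random sampling, and the Chi-squared score is the standard two-by-two contingency statistic computed over document frequencies.

```python
import random
from collections import Counter

MIN_PER_CLASS = 1_000     # from the text: below this the category is skipped
MAX_PER_CLASS = 10_000    # from the text: cap on each class
MAX_RATIO = 3             # from the text: at most 3:1 in either direction
MAX_FEATURES = 50_000     # from the text: vocabulary size after selection


def reduce_training_set(positives, negatives, seed=0):
    """Return (positives, negatives) after capping, or None if the category is skipped."""
    # Assumption: the minimum applies to each class separately.
    if len(positives) < MIN_PER_CLASS or len(negatives) < MIN_PER_CLASS:
        return None
    rng = random.Random(seed)

    def cap(examples, limit):
        # Assumption: surplus examples are dropped by uniform random sampling.
        examples = list(examples)
        return examples if len(examples) <= limit else rng.sample(examples, limit)

    pos = cap(positives, MAX_PER_CLASS)
    neg = cap(negatives, MAX_PER_CLASS)
    # Shrink whichever class exceeds three times the size of the other.
    pos = cap(pos, MAX_RATIO * len(neg))
    neg = cap(neg, MAX_RATIO * len(pos))
    return pos, neg


def chi_squared_select(pos_docs, neg_docs, max_features=MAX_FEATURES):
    """Rank words by Chi-squared score; each document is a set of words."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    n = n_pos + n_neg
    pos_df = Counter(w for doc in pos_docs for w in set(doc))
    neg_df = Counter(w for doc in neg_docs for w in set(doc))
    scores = {}
    for word in set(pos_df) | set(neg_df):
        a, b = pos_df[word], neg_df[word]   # docs containing the word
        c, d = n_pos - a, n_neg - b         # docs not containing the word
        denom = (a + b) * (c + d) * n_pos * n_neg
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:max_features]
```

With 10,000 positive and 1,000 negative examples, for instance, this sketch keeps 3,000 positives and all 1,000 negatives, so that the 3:1 limit holds.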

The classification step is invoked to assign INFOMINE Category metadata to a new resource. Each of the category classifiers is loaded and applied to the text of the new document, and their individual predictions are combined to assign a set of categories. The procedure for combining individual assignments is:

  1. Any category whose classifier reports a probability greater than 0.95 is assigned automatically.
  2. If step 1 assigns no categories, then the category (or categories, in the case of a tie) with the highest probability is assigned, unless that probability is less than 0.75, in which case no assignment is made.
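
The two-step combination rule can be written as a short Python function. This is a minimal sketch rather than iVia's implementation: the function name and the dictionary input are hypothetical, and "or categories" in step 2 is read as covering exact ties for the highest probability.

```python
AUTO_THRESHOLD = 0.95      # step 1: assign anything above this
FALLBACK_THRESHOLD = 0.75  # step 2: the best category must reach this


def combine_predictions(probabilities):
    """Map {category: probability} to a sorted list of assigned categories."""
    # Step 1: every category above the automatic threshold is assigned.
    confident = [c for c, p in probabilities.items() if p > AUTO_THRESHOLD]
    if confident:
        return sorted(confident)
    if not probabilities:
        return []
    # Step 2: fall back to the most probable category, if it is probable enough.
    best = max(probabilities.values())
    if best < FALLBACK_THRESHOLD:
        return []
    # Assumption: categories tied for the highest probability are all assigned.
    return sorted(c for c, p in probabilities.items() if p == best)
```

For example, probabilities of 0.97, 0.96 and 0.50 for three categories yield the first two under step 1, while probabilities of 0.80 and 0.60 yield only the first under step 2.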

Evaluation

We have evaluated INFOMINE Category assignment several times. In general, the method described above assigns an average of 1.5 categories to each record, with a precision of 84% and a recall of 89%.
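
As an illustration of how figures of this kind can be computed, the following Python sketch pools assignments over all records (micro-averaging). The averaging scheme used in our evaluations is not stated here, so the pooling, the function name and the input format are assumptions.

```python
def evaluate(assigned, expected):
    """Compare parallel lists of category sets, one pair per record."""
    # A true positive is a category that was both assigned and expected.
    true_pos = sum(len(a & e) for a, e in zip(assigned, expected))
    n_assigned = sum(len(a) for a in assigned)
    n_expected = sum(len(e) for e in expected)
    return {
        # Fraction of assigned categories that were correct.
        "precision": true_pos / n_assigned if n_assigned else 0.0,
        # Fraction of expected categories that were found.
        "recall": true_pos / n_expected if n_expected else 0.0,
        # Average number of categories assigned per record.
        "mean_assigned": n_assigned / len(assigned) if assigned else 0.0,
    }
```

Each element of `assigned` is the set of categories produced for one record, and the matching element of `expected` is the set recorded for it by hand.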

This performance is significantly better than our old method (based on k-nearest neighbor), which had precision and recall of 72% and 64% respectively. It is also significantly better than the result observed when the Logistic Regression classifiers are replaced with Naive Bayes classifiers (precision 74%, recall 53%).