Logistic Regression Classification

The Logistic Regression Classifier is a new classification algorithm implemented in iVia.

Introduction

At the beginning of the project, we identified the need for a new general-purpose classification algorithm to be used for broad subject categories and specific classification. The specific classification was designed to extend the pachinko-machine classification algorithm used in iVia's LCSH to LCC.

We determined that the new algorithm had three important requirements: it must be fast (comparable to Naive Bayes), accurate (comparable to Support Vector Machines) and it must be probabilistic (i.e. output a probability of correctness, not simply a true/false result). As no such algorithm was readily available, we decided to implement our own classifier based on Logistic Regression, which was generally believed to meet these criteria, but which had not been implemented in a concrete, efficient fashion (to our knowledge). We designed a flexible C++ programming interface for representing classifiers and document collections, and implemented a LR classifier using the interface. We also built other classifiers for comparison.

Improving classification with LR

We replaced our current assignment algorithms with new ones based on the new Logistic Regression classifier, INFOMINE iVia Category classification is a typical ?broad? classification task: each record is assigned one or more of the nine INFOMINE iVia Categories. By replacing our existing classification scheme with one based on LR, we improved classification accuracy (both precision and recall), coverage and speed dramatically. See Section 4.6 of the JCDL2005 paper for details.

LCC Assignment is a ?specific or hierarchical classification task. Strictly speaking, we attempt to assign LCC Outline classifications to records. We replaced our old LCSH to LCC algorithm with a new assignment scheme in C++ based on LR classifiers. The old version was based on SVM classifiers, and was slow, memory-intensive, and could only assign single LCCs. The new version was smaller, faster, performed multiple assignments (where necessary) and integrated with iVia. However, we did not have appropriate datasets for a direct performance comparison with the old version, but our intuition is that classification accuracy was not as good, largely as a result of stability problems with LR.

Although we initially intended to solve LCSH assignment in a similar manner to LCC Outline assignment, we did not have sufficient information about the LCSH structure to successfully implement it and an alternative solution was found.

Where LR fell short

A fundamental problem with the LR algorithm is that training the classifier is essentially a search problem. An algorithm is used to search the space of all possible solutions for the solution that best fits the training data, and sometimes no solution could be found. In this case, no classification can be made. We call this a stability problem because the algorithm could not converge on a stable solution.

Sometimes a solution can be found by restarting the search with different parameters, and the final version of the classifier will automatically restart the search several times if the search fails. However, in some cases no solution can be found at all, and no classification can be performed. In this case, the only available course of action is to change the problem to be solved? In practical terms, this means simplifying the problem by using fewer training examples or by using feature selection to decrease the number of features considered in the classification. When successful, this meant a classifier was available, but the classifier was based on a subset of the available training data and was not necessarily the best possible solution. Other problems could not be solved in this matter. In either case, considerable time was spent performing manual interventions at the training stage that should not have been necessary.

Possible Future Work

Classification could be improved if a different classification algorithm was found that satisfies the requirements above, and could be slotted into the classification algorithms in place of LR (once an algorithm was implemented using the iVia API, replacing LR would be trivial). At the time our project began, no such algorithm was available. Since then, significant progress has been made towards making Support Vector Machine algorithms probabilistic. Replacing LR with a stable, probabilistic Support Vector Machine algorithm could make a lot of progress to wards solving our current problems.

There has been some discussion of the effect of using additional training data in our work, particularly additional sets of MARC records or cleaner MARC records. With the current system, this is probably not likely to be useful because increasing the amount of data tends to decrease the stability of the LR algorithm. In general, more training data often leads to improved performance, but only up to a certain point, after which performance levels out and further training examples are of no more benefit.

This is illustrated in the following figure, showing LCSH to LCC results, that illustrates diminishing returns in terms of performance as the number of examples increases.

One advantage of the LCC assignment algorithm is that even when data is discarded at the "top" of the assignment hierarchy, it is retained for use further down the tree. This suggests that increasing the size of the training set may be more valuable for the new algorithm than it was for LCSH to LCC.

Cleaning up the MARC record data that we already have may be beneficial (it can hardly hurt), however, the sheer number of training examples in question mean that manual editing will be very time-consuming and is unlikely to give a significant increase inaccuracy.

Aboutness

iVia implements "aboutness" in the RichTextFinder class (and supporting classes). It can be used to analyze a Web page and find additional supporting pages that are directly linked and share the same topic, in order that these additional pages can contribute text to aid any classification task. Examples of aboutness pages include author-intended descriptions of the work, such as will be found in introductions, "about" pages, and abstracts.

The quantity of additional "rich text" found by the RichTextFinder can be controlled by setting the maximum and minimum number of pages downloaded, and bytes of text downloaded. For example, we might specify that we want to take between 10,000 and 15,000 bytes of rich text from a page, but that we want to take text from fewer than four pages. The RichTextFinder will attempt to satisfy these constraints by downloading several pages and extracting text from them. When too much text is downloaded from a set of pages, then the amount of text taken from each page will be proportional to the similarity of the page to the original page?in other words, it will be proportional to its aboutness, though this will be constrained by the amount of text appearing on each page.

Currently, we do not use the RichTextFinder in metadata assignment, though it is easy to enable it by updating some simple settings in the MetadataAssigner.conf file.

The main reason we do not enable it is that tests with the metadata evaluation tool show it does not noticeably improve metadata assignment accuracy when it is enabled. However, we have not extensively tested the RichTextFinder to determine the best set of download parameters: our initial tests specified (from memory) between 5,000 and 15,000 bytes of text from up to three pages, but it may be possible that by experimenting with other parameters to find a combination that improves performance. This work has not been carried out due to time constraints, though tools have been built for trialling different settings (see the Metadata Assignment Test page in the Adders Web Site) and testing the results against a large corpus (the metadata evaluation tool described in the JCDL2005 paper).