iVia Description Metadata Assignment

This page describes the iVia Description assignment algorithm.

Description Metadata

Description metadata provides a concise, textual representation of the intellectual content of a work. In most applications this means one or two paragraphs of text that describe the subject matter of the work. A description can take several forms including a summary, abstract, or even a table of contents.

The Description Assignment Algorithm

The iVia Description assignment process draws on two sources: HTML Meta tags and a text summarization algorithm. The first step is to check for Meta tags named description or dc:description; if either is present, its content is used as the description. If neither is found, a text summarization program is used to extract a summary.
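The two-step process above can be sketched in a few lines of Python. This is only an illustration, not the iVia implementation: the class name MetaDescriptionParser, the function assign_description, and the summarize callback are hypothetical, and preferring description over dc:description when both are present is an assumption, since this page does not say which one wins.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collect the content of Meta tags named description or dc:description."""

    # Order of preference is an assumption; the source does not specify it.
    WANTED = ("description", "dc:description")

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").strip().lower()
        content = (attr.get("content") or "").strip()
        # Keep the first non-empty value seen for each wanted name.
        if name in self.WANTED and content and name not in self.found:
            self.found[name] = content


def assign_description(html, summarize):
    """Return a Meta tag description if present, else fall back to summarize(html)."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    for name in MetaDescriptionParser.WANTED:
        if name in parser.found:
            return parser.found[name]
    # No usable Meta tag: hand the document to the summarizer (hypothetical callback).
    return summarize(html)
```

Here summarize stands in for the text summarization step described in the next paragraphs; any function that accepts the HTML string and returns a summary could be passed in.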

The summary is generated by the AutoAnnotator method. The goal of this method is to break the HTML document down into sections reflecting its structure, and then find the single paragraph of text that best represents the content of the work. The algorithm is initialized with an HTML document, and a set of words that are "important" to the document. In iVia, these important words are the content words appearing in the extracted Title and Key phrase metadata.

AutoAnnotator is based on sentence and paragraph scoring. First, the document is read and split into textual divisions, which are split into paragraphs, then into sentences, and finally into words. Each word is assigned a score based on several factors, including whether the word is decorated by HTML markup (e.g. heading tags, bold tags). Once the words are scored, their scores are combined to give a score for the sentence, paragraph, and textual division in which they occur. These scores are then adjusted to account for the position of the text in the document, as text that appears early in the document is more likely to contain a useful summary. Finally, the highest-scoring textual division is found, and the highest-scoring paragraph within it is returned. If this strategy fails, a set of contiguous high-scoring sentences is returned instead.
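The scoring pipeline can be illustrated with a much-simplified sketch. Everything specific here is an assumption made for illustration: the weights (1.0 for an important word, 0.5 for an emphasized word), the linear position decay, and the function names are hypothetical and are not taken from AutoAnnotator. The sketch also assumes the HTML has already been split into divisions and paragraphs, and it omits the fallback to contiguous high-scoring sentences.

```python
import re


def words(text):
    """Lower-cased word tokens of a piece of text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentences(paragraph):
    """Naive sentence split on terminal punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def word_score(word, important, emphasized):
    """Hypothetical weights: 1.0 for an important word, 0.5 for markup emphasis."""
    score = 0.0
    if word in important:
        score += 1.0
    if word in emphasized:
        score += 0.5
    return score


def paragraph_score(paragraph, important, emphasized):
    """Sum of sentence scores, each of which is a sum of word scores."""
    return sum(
        sum(word_score(w, important, emphasized) for w in words(s))
        for s in sentences(paragraph)
    )


def position_weight(index, total):
    """Assumed linear decay: 1.0 for the first item, approaching 0.5 for the last."""
    return 1.0 - 0.5 * index / max(total, 1)


def best_paragraph(divisions, important, emphasized=frozenset()):
    """divisions is a list of divisions, each a list of paragraph strings."""
    scored = []
    for d_index, paragraphs in enumerate(divisions):
        p_scores = [
            paragraph_score(p, important, emphasized) * position_weight(i, len(paragraphs))
            for i, p in enumerate(paragraphs)
        ]
        d_score = sum(p_scores) * position_weight(d_index, len(divisions))
        scored.append((d_score, d_index, p_scores))
    if not scored:
        return None
    d_score, d_index, p_scores = max(scored, key=lambda t: t[0])
    if d_score <= 0:
        return None  # nothing scored; the real method falls back to sentences
    best = max(range(len(p_scores)), key=p_scores.__getitem__)
    return divisions[d_index][best]
```

In iVia the important set would hold the content words from the extracted Title and Key phrase metadata, and emphasized would hold words found inside heading or bold markup.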

Description Assignment Evaluation

We evaluate the Description assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.

Evaluation Measures

Metadata evaluation metrics are explained on the iVia metadata evaluation page.

iVia's Description field acts as a summary when records are displayed, and also supports searches. In each case, it is desirable that all the topics covered in the document are represented in the summary, so recall is an important measure. A good summary will not mislead the reader by referring to unimportant material, so precision is also important. Because only one Description is assigned, and it is a free-text field, our primary measures are content word recall and precision (both stemmed and unstemmed). We also monitor the average length of the assigned metadata to ensure it is suitable for presentation.
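As a rough illustration of content word precision and recall, the sketch below compares one expert description with one assigned description. The stopword list, the tokenizer, and the choice to count repeated words (multiset matching) are assumptions; the iVia metadata evaluation page is the authority on how the tool actually counts matches. The stemmed variants would apply a stemmer to each word before counting, which is omitted here.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; the real tool's list is not given here.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def content_words(text):
    """Lower-cased tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def content_word_scores(expert, assigned):
    """Return (precision, recall) of assigned content words against the expert's."""
    expert_counts = Counter(content_words(expert))
    assigned_counts = Counter(content_words(assigned))
    # Counter intersection keeps the minimum count of each shared word.
    matching = sum((expert_counts & assigned_counts).values())
    n_expert = sum(expert_counts.values())
    n_assigned = sum(assigned_counts.values())
    precision = matching / n_assigned if n_assigned else 0.0
    recall = matching / n_expert if n_expert else 0.0
    return precision, recall
```

Precision is the fraction of assigned content words that also appear in the expert description; recall is the fraction of expert content words that the assigned description recovers.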

Running an evaluation

The evaluate_metadata_assignment program will evaluate Description assignment if the configuration file has the variable ivia_description = "Description" defined in the [Fields] section. An example of the Description output is shown below.

Description
Field name: ivia_description
Number of examples:      1000
Number of passes:        5
Number of attempts:      995
Number of exact matches: 3
Exact match accuracy:    0.0030
Average length of expert metadata in letters:   360.0
Average length of assigned metadata in letters: 246.3
Average length of expert metadata in words:   52.5
Average length of assigned metadata in words: 37.0
Total number of expert content words:   26665
Total number of assigned content words: 18769
Total number of matching content words: 5494
Content word precision: 0.2927
Content word recall:    0.2060
Content word f-measure: 0.2418
Total number of expert stemmed content words:   25784
Total number of assigned stemmed content words: 18290
Total number of matching stemmed content words: 5752
Stemmed content word precision: 0.3145
Stemmed content word recall:    0.2231
Stemmed content word f-measure: 0.2610
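The precision, recall, and f-measure lines in this output follow from the three word counts printed above them. The short check below uses the standard definitions (matching/assigned, matching/expert, and their harmonic mean), which agree with the reported figures to four decimal places; the function name precision_recall_f is hypothetical.

```python
def precision_recall_f(matching, assigned, expert):
    """Standard precision, recall and balanced f-measure from raw counts."""
    precision = matching / assigned
    recall = matching / expert
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Unstemmed counts from the example output.
p, r, f = precision_recall_f(matching=5494, assigned=18769, expert=26665)
print(f"Content word:         P={p:.4f} R={r:.4f} F={f:.4f}")

# Stemmed counts from the example output.
sp, sr, sf = precision_recall_f(matching=5752, assigned=18290, expert=25784)
print(f"Stemmed content word: P={sp:.4f} R={sr:.4f} F={sf:.4f}")
```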

Results

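Row  Method          Tries  Length  CWP     CWR     SCWP    SCWR
1    Current         992    36.2    0.2977  0.1891  0.3215  0.2072
2    Meta tags only  464    20.5    0.3907  0.0776  0.4253  0.0865
3    AutoAnnotator   992    51.3    0.2284  0.1938  0.2471  0.2118

Length is the average length of the assigned description in words. CWP and CWR are content word precision and recall; SCWP and SCWR are the stemmed equivalents.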

The first row of the table contains the results of the current method, described above. A description was suggested in 992 of the 1000 attempts, and the descriptions had an average length of 36.2 words, much shorter than the expert-created records, which had an average length of 59.0 words. The stemmed results are uniformly better than the unstemmed equivalents, suggesting that the terminology used in the assigned descriptions is similar, but not identical, to the terminology used by INFOMINE's editors. Rows 2 and 3 show the performance of assignment by Meta tags only and by AutoAnnotator only.

Interestingly, the 992 assignments included three exact matches (two from Meta tags, one from AutoAnnotator), reflecting the occasional practice of creating a description by quoting directly from the resource.

Discussion

There is an extensive body of literature on Description extraction, and many of the ideas in AutoAnnotator have been previously reported: sentence scoring and paragraph scoring are established methods, and the use of extracted key phrases as the basis of scoring has been described elsewhere [15]. Our implementation is informed by these precedents, and is robust and fast.

Several summarization products are currently available, commercial and otherwise. For example, the Open Text Summarizer is a Free Software tool that is similar to AutoAnnotator, but it works on plain text (not HTML) and uses a different set of important words for sentence scoring (the document's content words, each weighted according to its frequency in the text).