This page describes the iVia Description assignment algorithm.
Description metadata provides a concise, textual representation of the intellectual content of a work. In most applications this means one or two paragraphs of text that describe the subject matter of the work. A description can take several forms including a summary, abstract, or even a table of contents.
The iVia Description assignment process draws on two sources: HTML Meta tags and a text summarization algorithm. The first step is to check for Meta tags named description or dc:description; if either is present, its content is used as the description. If neither is found, a text summarization program is used to extract a summary.
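The two-step process can be sketched in Python as follows. This is a minimal illustration, not iVia's code: the class and function names are hypothetical, the case-insensitive matching and the preference for description over dc:description when both are present are assumptions, and the summarizer is passed in as a placeholder callable.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

# Meta tag names checked, in the order they are preferred here (the order is an assumption).
DESCRIPTION_NAMES = ("description", "dc:description")


class MetaDescriptionParser(HTMLParser):
    """Collects the content attribute of description-style Meta tags."""

    def __init__(self) -> None:
        super().__init__()
        self.found = {}  # maps lower-cased tag name -> content text

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        content = attr.get("content", "").strip()
        # Keep the first non-empty occurrence of each recognised name.
        if name in DESCRIPTION_NAMES and content and name not in self.found:
            self.found[name] = content


def assign_description(
    html: str, summarize: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return a Meta tag description if one exists, else fall back to a summarizer."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    for name in DESCRIPTION_NAMES:
        if name in parser.found:
            return parser.found[name]
    # No usable Meta tag: hand the document to the summarization step.
    return summarize(html)
```

Here `summarize` stands in for the text summarization program; any callable that accepts the HTML and returns a string (or None) will do.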
The summary is generated by the AutoAnnotator method. The goal of this method is to break the HTML document down into sections reflecting its structure, and then find the single paragraph of text that best represents the content of the work. The algorithm is initialized with an HTML document, and a set of words that are "important" to the document. In iVia, these important words are the content words appearing in the extracted Title and Key phrase metadata.
AutoAnnotator is based on sentence and paragraph scoring. First, the document is read and split into textual divisions, which are split into paragraphs, then sentences, and finally words. Each word is assigned a score based on several factors, including whether the word is decorated by HTML markup (e.g. heading tags, bold tags). The word scores are then used to calculate a combined score for the sentence, paragraph, and textual division in which the words occur. These scores are modified to account for the position of the text in the document, as text that appears early in the document is more likely to contain a useful summary. Finally, the highest-scoring textual division is found, and the highest-scoring paragraph within it is returned. If this strategy fails, a set of contiguous high-scoring sentences is returned instead.
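The scoring hierarchy can be illustrated with a simplified Python sketch. It is not the AutoAnnotator implementation: it starts from already-extracted text rather than HTML, the weights (1.0 for an important word, 0.5 extra for an emphasized word), the position discount, and all function names are assumptions, and the fallback to contiguous high-scoring sentences is omitted.

```python
import re
from typing import FrozenSet, List, Optional

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"[A-Za-z0-9']+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph on sentence-final punctuation (a rough heuristic)."""
    return [s for s in SENTENCE_END.split(paragraph.strip()) if s]


def score_word(word: str, important: FrozenSet[str], emphasized: FrozenSet[str]) -> float:
    """Score one word; the weights are illustrative assumptions."""
    w = word.lower()
    score = 1.0 if w in important else 0.0
    if w in emphasized:  # e.g. the word appeared inside a heading or bold tag
        score += 0.5
    return score


def score_paragraph(paragraph: str, important: FrozenSet[str], emphasized: FrozenSet[str]) -> float:
    """A paragraph's score is the sum of its sentence scores, each a sum of word scores."""
    return sum(
        sum(score_word(w, important, emphasized) for w in WORD.findall(sentence))
        for sentence in split_sentences(paragraph)
    )


def position_weight(index: int, decay: float = 0.1) -> float:
    """Discount later paragraphs so that early text is favoured."""
    return 1.0 / (1.0 + decay * index)


def best_paragraph(
    divisions: List[List[str]],
    important: FrozenSet[str],
    emphasized: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return the top paragraph of the top-scoring division, or None if nothing scores."""
    scored_divisions = []
    position = 0  # running paragraph index across the whole document
    for division in divisions:
        scored = []
        for paragraph in division:
            raw = score_paragraph(paragraph, important, emphasized)
            scored.append((raw * position_weight(position), paragraph))
            position += 1
        scored_divisions.append(scored)
    if not scored_divisions:
        return None
    # A division's score is the sum of its (position-adjusted) paragraph scores.
    top = max(scored_divisions, key=lambda paras: sum(score for score, _ in paras))
    if not top or sum(score for score, _ in top) == 0:
        return None
    return max(top, key=lambda pair: pair[0])[1]
```

A caller would pass the divisions as lists of paragraph strings, with `important` holding the lower-cased content words taken from the Title and Key phrase metadata.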
We evaluate the Description assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.
Metadata evaluation metrics are explained on the iVia metadata evaluation page.
iVia's Description field acts as a summary when records are displayed, and also supports searches. In each case, it is desirable that all the topics covered in the document are represented in the summary, so recall is an important measure. A good summary will not mislead the reader by referring to unimportant material, so precision is also important. Because only one Description is assigned, and it is a free-text field, our primary measures are content word recall and precision (both stemmed and unstemmed). We also monitor the average length of the assigned metadata to ensure it is suitable for presentation.
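As a rough illustration of these measures, the Python sketch below computes content word precision, recall, and f-measure for one pair of descriptions. The stop word list, the tokenizer, and the choice to count repeated words (multiset overlap) are assumptions rather than details of the iVia evaluation tool; stemming is left out, and the f-measure is taken to be the harmonic mean of precision and recall.

```python
import re
from collections import Counter
from typing import Tuple

# A small illustrative stop word list; the list used by iVia is not specified here.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
})


def content_words(text: str) -> Counter:
    """Lower-case the text, tokenize it, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def content_word_scores(expert: str, assigned: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f-measure) of assigned against expert text."""
    expert_words = content_words(expert)
    assigned_words = content_words(assigned)
    # Counter intersection keeps the minimum count of each shared word.
    matching = sum((expert_words & assigned_words).values())
    n_assigned = sum(assigned_words.values())
    n_expert = sum(expert_words.values())
    precision = matching / n_assigned if n_assigned else 0.0
    recall = matching / n_expert if n_expert else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, comparing the expert text "A guide to soil erosion and control methods" with the assigned text "Soil erosion guide for farmers" gives three matching content words out of four assigned and five expert content words.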
The evaluate_metadata_assignment program will evaluate Description assignment if the configuration file has the variable ivia_description = "Description" defined in the [Fields] section. An example of the Description output is shown below.
Description
Field name: ivia_description
Number of examples: 1000
Number of passes: 5
Number of attempts: 995
Number of exact matches: 3
Exact match accuracy: 0.0030
Average length of expert metadata in letters: 360.0
Average length of assigned metadata in letters: 246.3
Average length of expert metadata in words: 52.5
Average length of assigned metadata in words: 37.0
Total number of expert content words: 26665
Total number of assigned content words: 18769
Total number of matching content words: 5494
Content word precision: 0.2927
Content word recall: 0.2060
Content word f-measure: 0.2418
Total number of expert stemmed content words: 25784
Total number of assigned stemmed content words: 18290
Total number of matching stemmed content words: 5752
Stemmed content word precision: 0.3145
Stemmed content word recall: 0.2231
Stemmed content word f-measure: 0.2610
| Row | Method | Tries | Length | CWP | CWR | SCWP | SCWR |
|---|---|---|---|---|---|---|---|
| 1 | Current | 992 | 36.2 | 0.2977 | 0.1891 | 0.3215 | 0.2072 |
| 2 | Meta tags only | 464 | 20.5 | 0.3907 | 0.0776 | 0.4253 | 0.0865 |
| 3 | AutoAnnotator | 992 | 51.3 | 0.2284 | 0.1938 | 0.2471 | 0.2118 |
The first row of the table contains the results of the current method, described above. A description was suggested on 992 of the 1000 attempts, and the description had an average length of 36.2 words, much shorter than the expert-created records, which had an average length of 59.0 words. The stemmed results are uniformly better than the unstemmed equivalents, suggesting that the terminology used in the assigned descriptions is similar, but not identical, to the terminology used by INFOMINE's Editors. Rows 2 and 3 show the performance of assignment by Meta tags only, and the performance of the AutoAnnotator only.
Interestingly, the 992 assignments included three exact matches (two from Meta tags, one from AutoAnnotator), reflecting the occasional practice of creating a description by quoting directly from the resource.
There is an extensive body of literature on Description extraction, and many of the ideas in AutoAnnotator have been previously reported: sentence scoring and paragraph scoring are established methods, and the use of extracted key phrases as the basis of scoring has been described elsewhere [15]. Our implementation is informed by these precedents, and is robust and fast.
Several summarization tools, both commercial and free, are currently available. For example, the Open Text Summarizer is a Free Software tool that is similar to AutoAnnotator, but it works on plain text (not HTML) and uses a different set of important words for sentence scoring (the document's content words, each weighted according to its frequency in the text).