iVia Key phrase Metadata Assignment

This page describes the iVia Key phrase Assignment algorithm.

Key phrase Metadata

Key phrases are a set of short, uncontrolled phrases that describe the content of a document. Key phrases in technical documents typically consist of five to fifteen complementary phrases of between one and four words.

The Key phrase Assignment Algorithm

The iVia Key phrase assignment algorithm combines metadata from two sources to produce a final set of assigned key phrases. The number of phrases returned (N) is set in the configuration file; in our experiments N = 10.

Step 1: Key phrases are extracted from any HTML META tags whose name is keyword, subject, or dc:subject, or the plural form of one of these terms.

Step 2: Keith Humphreys' PhraseRate Key phrase identification and extraction algorithm is used to extract 2N key phrases. Let P be the number of phrases extracted.

Step 3: The two sets of phrases are canonicalized by normalizing the character sets, compressing whitespace, and trimming non-alphanumeric characters from the beginning and end.

Step 4: A score is then calculated for each phrase in the following manner:

  1. If the phrase appears in the META tags, add 1.0 to the score.
  2. If the phrase appears in the PhraseRate output, add (P+1-R)/(P+1), where R is the rank of the phrase in the PhraseRate results (first result = 1, etc).
  3. Add L / 1,000,000, where L is the length of the phrase in letters.

The phrases are then sorted by score (highest scores first).

Step 5: Any phrases that appear in the Key phrase blacklist are eliminated from contention.

Step 6: The N highest-scoring phrases are assigned.
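Step 1 above could be sketched roughly as follows. This is a minimal illustration, not the iVia implementation: the class and function names are hypothetical, and splitting the META content on commas and semicolons is an assumption, since the page does not say how a tag's content is divided into phrases.

```python
import re
from html.parser import HTMLParser

# META names from Step 1, plus their plural forms.
META_NAMES = {"keyword", "keywords", "subject", "subjects",
              "dc:subject", "dc:subjects"}


class MetaPhraseParser(HTMLParser):
    """Collects candidate key phrases from matching META tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = attributes.get("content") or ""
        if name in META_NAMES:
            # Assumption: phrases are separated by commas or semicolons.
            for piece in re.split(r"[,;]", content):
                piece = piece.strip()
                if piece:
                    self.phrases.append(piece)


def extract_meta_phrases(html):
    """Return the phrases found in keyword/subject META tags, in document order."""
    parser = MetaPhraseParser()
    parser.feed(html)
    parser.close()
    return parser.phrases
```

The phrases returned here are still raw; they would be canonicalized in Step 3 before scoring.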

The scoring algorithm ensures that phrases appearing in both sources have the highest scores, followed by phrases only appearing in META tags, and finally by phrases appearing only in the PhraseRate list. The scores receive a tiny bonus based on phrase length; the only effect of this bonus is to break ties in favor of longer phrases.
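Steps 3 to 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the iVia code: the function names are hypothetical, character-set normalization is approximated with Unicode NFKC plus lowercasing, and L is taken to be the number of alphabetic characters in the phrase.

```python
import re
import unicodedata


def canonicalize(phrase):
    """Step 3 (assumed details): normalize characters, compress whitespace, trim the ends."""
    text = unicodedata.normalize("NFKC", phrase).lower()  # lowercasing is an assumption
    text = re.sub(r"\s+", " ", text).strip()
    start, end = 0, len(text)
    while start < end and not text[start].isalnum():
        start += 1
    while end > start and not text[end - 1].isalnum():
        end -= 1
    return text[start:end]


def assign_key_phrases(meta_phrases, phraserate_phrases, n=10, blacklist=()):
    """Steps 4 to 6: score, sort, drop blacklisted phrases, keep the top n."""
    meta = {canonicalize(p) for p in meta_phrases}
    meta.discard("")

    # Keep the PhraseRate order; rank 1 is the first result.
    ranked = []
    for p in phraserate_phrases:
        c = canonicalize(p)
        if c and c not in ranked:
            ranked.append(c)
    p_count = len(ranked)  # P in the description above

    scores = {}
    for phrase in meta:
        scores[phrase] = scores.get(phrase, 0.0) + 1.0
    for rank, phrase in enumerate(ranked, start=1):
        scores[phrase] = scores.get(phrase, 0.0) + (p_count + 1 - rank) / (p_count + 1)
    for phrase in scores:
        # Tiny tie-breaking bonus in favor of longer phrases.
        scores[phrase] += sum(ch.isalpha() for ch in phrase) / 1_000_000

    blocked = {canonicalize(b) for b in blacklist}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [p for p in ordered if p not in blocked][:n]


print(assign_key_phrases(
    ["Digital Libraries", "metadata"],
    ["metadata extraction", "digital libraries", "key phrase assignment"],
    n=3,
))
# ['digital libraries', 'metadata', 'metadata extraction']
```

In this example "digital libraries" appears in both sources and scores about 1.5, "metadata" appears only in the META tags and scores about 1.0, and the PhraseRate-only phrases score below 1.0, which matches the ordering described above.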

Key phrase Assignment Evaluation

We evaluate the Key phrase assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.

Evaluation Measures

Metadata evaluation metrics are explained on the iVia metadata evaluation page.

Key phrase metadata is used to summarize the important content of a document in retrieval and display. Precision is important because the Key phrase field is relatively highly weighted in INFOMINE, and because the potential for inaccurate key phrases to misrepresent the content of the document is exaggerated by the fact that relatively few phrases are used to represent entire documents. Recall is also important, as it approximates the degree of coverage of the document subject matter that has been attained. Our evaluation will therefore focus on subfield precision and recall, and content-word precision and recall. Since the test data often contains plural forms which are removed in the assigned metadata, the stemmed content-word recall is the best measure of coverage.

Running an evaluation

The evaluate_metadata_assignment program will evaluate Key phrase assignment if the configuration file has the variable keywords = "Keyphrases" defined in the [Fields] section. An example of the output is shown below.

Key phrases
Field name: keywords
Number of examples:      1000
Number of passes:        6
Number of attempts:      994
Number of exact matches: 1
Exact match accuracy:    0.0010
Average length of expert metadata in letters:   350.1
Average length of assigned metadata in letters: 188.0
Average length of expert metadata in words:   42.5
Average length of assigned metadata in words: 24.3
Total number of expert content words:   28323
Total number of assigned content words: 13746
Total number of matching content words: 4744
Content word precision: 0.3451
Content word recall:    0.1675
Content word f-measure: 0.2255
Total number of expert stemmed content words:   25605
Total number of assigned stemmed content words: 13362
Total number of matching stemmed content words: 4909
Stemmed content word precision: 0.3674
Stemmed content word recall:    0.1917
Stemmed content word f-measure: 0.2520
Total number of expert subfields:   25090
Total number of assigned subfields: 9381
Total number of matching subfields: 1637
Subfield precision: 0.1745
Subfield recall:    0.0652
Subfield f-measure: 0.0950
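
The precision, recall and f-measure figures in this output are consistent with the standard definitions applied to the three totals printed above each group. The sketch below assumes those standard definitions (the evaluation tool's own definitions are on the metadata evaluation page) and uses a hypothetical function name.

```python
def precision_recall_f(expert_total, assigned_total, matching):
    """Standard precision, recall and balanced f-measure from raw counts."""
    precision = matching / assigned_total if assigned_total else 0.0
    recall = matching / expert_total if expert_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Content word totals from the sample output above.
p, r, f = precision_recall_f(expert_total=28323, assigned_total=13746, matching=4744)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.3451 0.1675 0.2255
```

The same function reproduces the stemmed content word and subfield figures when given their respective totals.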

Results

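Row  Method              Tries  Length  CWP     CWR     SCWP    SCWR
1    Current             992    36.2    0.2977  0.1891  0.3215  0.2072
2    Meta tags only      464    20.5    0.3907  0.0776  0.4253  0.0865
3    AutoAnnotator only  992    51.3    0.2284  0.1938  0.2471  0.2118

CWP and CWR are content word precision and recall; SCWP and SCWR are stemmed content word precision and recall.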

The first row shows the performance of the current key phrase assignment process, described above. Row 2 of Table 3 shows the performance when PhraseRate is ignored and only HTML Meta tags are used, while row 3 shows the effects of using only PhraseRate and ignoring the Meta tags. These rows demonstrate that although Meta tags tend to contain higher-quality metadata, they are frequently unavailable, and when they are available fewer than four phrases are assigned (on average), resulting in low recall. PhraseRate, on the other hand, makes assignments in over 99% of cases, and assigns more than nine values (on average) to each document.

Discussion

Most Key phrase extraction algorithms follow the same pattern: a set of candidate phrases is extracted from a document, the candidates are ranked, and the best candidates are assigned. They are evaluated by comparison to the "keywords" assigned to technical reports by their authors.

Early work by Peter Turney used GenEx, a hybrid genetic algorithm, to learn the optimal parameters for the Extractor key phrase extraction tool. The New Zealand Digital Library Project developed a very simple algorithm, Kea, based on a Naive Bayes classifier, that offers similar performance and has been extended in several ways. Other approaches use natural language processing techniques to extract noun phrase heads for use as key phrases.

PhraseRate (and the iVia Key phrase extraction system as a whole) cannot easily be compared directly to these other systems because PhraseRate only assigns phrases of two or more words, which significantly affects the measurement of performance in automatic evaluations. However, Keith Humphreys has provided some side-by-side comparisons of earlier versions of PhraseRate, Kea and GenEx.