iVia Title Metadata Assignment

This page describes the iVia Title assignment algorithm.

Title Metadata

Titles are short, uncontrolled text passages that appear at the beginning of many types of documents. They are usually chosen to identify the document and summarize its content. Although both roles are important, the latter is most useful when assessing an unknown document. Most applications require a single Title value for each resource.

The Title Assignment Algorithm

Title assignment is an extraction process: Title values are simply read from appropriate parts of the HTML document, such as the Title tag. Although this method might appear to guarantee a 100% success rate, many HTML authors supply no Title, or supply values too poor to be useful.

A list of potential titles is built up by extracting text from the following sections of the HTML document, in order:

  1. The content of any META tag whose name is title or dc:title.
  2. The Title tag.
  3. All H1 tags.
  4. The sequence of words in the first 50 letters of body text.

The initial list is then post-processed, preserving its order, to remove duplicate entries, discard blacklisted values (e.g. Homepage, Untitled Document), and strip unwanted prefixes (e.g. Welcome to, Homepage of). The values remaining in the list are assumed to be in order of decreasing quality, so when a single Title is required, the first is used.
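The extraction and post-processing steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the iVia implementation: the class and function names are hypothetical, the blacklist and prefix lists contain only the examples quoted above, and several details are assumptions (case-insensitive matching, stripping prefixes before the blacklist and duplicate checks, and reading "the first 50 letters" as the whole words that fit in the first 50 characters).

```python
from html.parser import HTMLParser

# Illustrative values only: the text above gives these as examples, not full lists.
BLACKLIST = {"homepage", "untitled document"}
PREFIXES = ("welcome to ", "homepage of ")


class TitleCandidateParser(HTMLParser):
    """Collects text from META title tags, the Title tag, H1 tags and the body."""

    def __init__(self):
        super().__init__()
        self.meta, self.title, self.h1, self.body = [], [], [], []
        self._open = []  # stack of currently open tags that we track

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            if (attr.get("name") or "").lower() in ("title", "dc:title"):
                self.meta.append(attr.get("content") or "")
        elif tag in ("title", "h1", "body", "script", "style"):
            self._open.append(tag)
            if tag == "h1":
                self.h1.append("")  # start a new H1 entry

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to (and including) the matching open tag.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._open or self._open[-1] in ("script", "style"):
            return
        top = self._open[-1]
        if top == "title":
            self.title.append(data)
        elif top == "h1":
            self.h1[-1] += data
        if "body" in self._open:
            self.body.append(data)


def first_words(text, limit=50):
    """Whole words that fit within the first `limit` characters (assumed reading)."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Drop a trailing partial word unless the cut fell exactly on a word boundary.
    if text[limit] != " " and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.strip()


def clean(value):
    """Normalise whitespace and strip one unwanted prefix, if present."""
    value = " ".join(value.split())
    lowered = value.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            return value[len(prefix):].strip()
    return value


def candidate_titles(html):
    """Ordered list of candidate Titles: META, Title tag, H1 tags, body text."""
    parser = TitleCandidateParser()
    parser.feed(html)
    parser.close()
    raw = (parser.meta + ["".join(parser.title)] + parser.h1
           + [first_words(" ".join(parser.body))])
    result, seen = [], set()
    for value in map(clean, raw):
        key = value.lower()
        if value and key not in BLACKLIST and key not in seen:
            seen.add(key)
            result.append(value)
    return result


sample = """<html><head>
<meta name="dc:title" content="Welcome to Example Archive">
<title>Untitled Document</title>
</head><body>
<h1>Example Archive</h1>
<p>This archive collects sample records for testing.</p>
</body></html>"""

candidates = candidate_titles(sample)
print(candidates[0])  # Example Archive
```

In this sample the META value loses its "Welcome to" prefix, the Title tag is discarded by the blacklist, and the H1 is dropped as a duplicate of the META value, so the first remaining entry is the one used when a single Title is required.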

Title Assignment Evaluation

We evaluate the Title assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.

Evaluation Measures

Metadata evaluation metrics are explained on the iVia metadata evaluation page.

Both recall and precision are important in Title assignment, and as Titles contain uncontrolled text values, content-word-based statistics are very useful. Broadly speaking, a high recall score indicates the important words in the Title are identified correctly, while high precision suggests we are not assigning incorrect words that summarize unimportant parts of the document and skew search results. By convention, Titles are often short and predictable, so exact matches are frequently possible. For these reasons, the primary measures in our evaluations will be content word precision, content word recall and exact match accuracy.
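To make the three primary measures concrete, the following Python sketch scores a single assigned Title against an expert Title. It is an illustration under stated assumptions, not the evaluation tool's code: the stop-word list is a small hypothetical stand-in, tokenisation is a simple lower-cased alphanumeric split, and exact matching is assumed to ignore case and surrounding whitespace. The authoritative definitions are those on the iVia metadata evaluation page.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list used by the real tool is not given here.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on"}


def content_words(title):
    """Lower-cased alphanumeric tokens with stop words removed."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


def score(expert, assigned):
    """Return (content word precision, content word recall, exact match)."""
    expert_counts = Counter(content_words(expert))
    assigned_counts = Counter(content_words(assigned))
    # Multiset intersection: each word matches at most as often as it occurs in both.
    matching = sum((expert_counts & assigned_counts).values())
    n_assigned = sum(assigned_counts.values())
    n_expert = sum(expert_counts.values())
    precision = matching / n_assigned if n_assigned else 0.0
    recall = matching / n_expert if n_expert else 0.0
    exact = expert.strip().lower() == assigned.strip().lower()
    return precision, recall, exact


p, r, exact = score("Journal of Machine Learning Research",
                    "Welcome: Machine Learning Research Home")
print(p, r, exact)  # 0.6 0.75 False
```

Here three of the five assigned content words appear in the expert Title (precision 0.6), and three of the four expert content words are recovered (recall 0.75).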

Running an Evaluation

The evaluate_metadata_assignment program will evaluate Title assignment if the configuration file has the variable title = "Title" defined in the [Fields] section. An example of the Title output is shown below.

Title
Field name: title
Number of examples:           1000
Number of non-empty examples: 1000
Number of passes:             1
Number of attempts:           999
Number of exact matches:      225
Exact match accuracy:         0.2250
Average length of expert metadata in letters:   41.3 +/- 22.9
Average length of assigned metadata in letters: 41.8 +/- 27.4
Average length of expert metadata in words:   5.7 +/- 3.2
Average length of assigned metadata in words: 5.8 +/- 3.8
Total number of expert content words:   4341
Total number of assigned content words: 4536
Total number of matching content words: 2979
Content word precision: 0.6567
Content word recall:    0.6862
Content word f-measure: 0.6712
Total number of expert stemmed content words:   4326
Total number of assigned stemmed content words: 4507
Total number of matching stemmed content words: 3007
Stemmed content word precision: 0.6672
Stemmed content word recall:    0.6951
Stemmed content word f-measure: 0.6809
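The summary figures in this output can be reproduced from the raw counts. The short Python check below assumes the usual definitions (precision = matching / assigned, recall = matching / expert, f-measure = harmonic mean of the two); the function name is hypothetical. The reported exact match accuracy agrees with 225 divided by the number of examples (1000) rather than by the number of attempts (999).

```python
def precision_recall_f(matching, assigned, expert):
    """Precision, recall and f-measure from aggregate word counts."""
    precision = matching / assigned
    recall = matching / expert
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts copied from the example output above.
plain = precision_recall_f(matching=2979, assigned=4536, expert=4341)
stemmed = precision_recall_f(matching=3007, assigned=4507, expert=4326)

print(" ".join(f"{x:.4f}" for x in plain))    # 0.6567 0.6862 0.6712
print(" ".join(f"{x:.4f}" for x in stemmed))  # 0.6672 0.6951 0.6809
print(f"{225 / 1000:.4f}")                    # 0.2250
```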

Results

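Row  Method      Tries  EMA     CWP     CWR     Length
1    Current     1000   0.2290  0.6200  0.6410  5.5
2    META only   21     0.0050  0.4783  0.0132  7.2
3    Title only  992    0.2340  0.6295  0.6434  5.5
4    H1 only     165    0.0350  0.7289  0.0786  3.6
5    Text only   919    0.0000  0.2390  0.2931  8.0

Tries = number of attempts; EMA = exact match accuracy; CWP = content word precision; CWR = content word recall; Length = average length of the assigned Titles (in words, judging by the averages in the example output above).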

Discussion

Rows 2 to 5 of the table above show the effect of using only one source of Title metadata at a time (with standard post-processing). Document META tags (row 2), though rare, are less precise than Title tags (row 3), so arguably Title tags should appear before META tags in the list of preferred sources above. An automatic evaluation shows that this change does not affect the overall performance, and a side-by-side comparison reveals that the META tags are usually either identical to Title tags, or very similar but longer and more descriptive. Although the last two sources (rows 4 and 5) perform poorly compared to the META and Title tags, they are useful for PDF and other non-HTML documents.