This page describes the iVia Title assignment algorithm.
Titles are short, uncontrolled text passages that appear at the beginning of many types of documents. They are usually chosen to identify the document and summarize its content. Although both roles are important, the latter is most useful when assessing an unknown document. Most applications require a single Title value for each resource.
Title assignment is an extraction process: Title values are simply read from appropriate parts of the HTML document, such as the Title tag. Although this method might appear to guarantee a 100% success rate, many HTML authors supply no Title, or supply values too poor to be useful.
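As a rough illustration of the extraction step, the sketch below reads the text of the first Title tag using Python's standard html.parser module. It is not the iVia implementation; the class name TitleTagParser and the function extract_title_tag are hypothetical names chosen for this example. It returns None when the author supplied no Title or an empty one.

```python
from html.parser import HTMLParser


class TitleTagParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def title(self):
        # Collapse runs of whitespace; an empty Title counts as missing.
        text = " ".join("".join(self._chunks).split())
        return text or None


def extract_title_tag(html):
    """Return the cleaned <title> text, or None if there is none."""
    parser = TitleTagParser()
    parser.feed(html)
    parser.close()
    return parser.title()
```

A caller would need to fall back to another source whenever extract_title_tag returns None.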
A list of potential titles is built up by extracting text from the following sections of the HTML document, in order: META tags, the Title tag, H1 headings, and the document text.
The initial list is post-processed to remove duplicate entries, blacklist undesirable values (e.g. Homepage, Untitled Document), and remove unwanted prefixes (e.g. Welcome to, Homepage of) while preserving the order of the list. The values remaining in the list are assumed to be in order of decreasing quality, so that when a single Title is required, the first is used.
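The post-processing described above can be sketched as follows. This is a minimal illustration, not the iVia code: the blacklist and prefix lists contain only the examples quoted in the text, and the order of the three steps, the case-insensitive comparison, and the function names (strip_prefix, postprocess, best_title) are assumptions made for this example.

```python
# Example values taken from the text; the real lists are presumably longer.
BLACKLIST = {"homepage", "untitled document"}
PREFIXES = ("welcome to", "homepage of")


def strip_prefix(value):
    """Remove one unwanted leading phrase, if present (case-insensitive)."""
    lowered = value.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix + " "):
            return value[len(prefix):].strip()
    return value


def postprocess(candidates):
    """Clean a candidate list while preserving its original order."""
    seen = set()
    result = []
    for raw in candidates:
        # Normalize whitespace, then strip unwanted prefixes (assumed order).
        value = strip_prefix(" ".join(raw.split()))
        key = value.lower()
        # Drop empty values, blacklisted values, and duplicates.
        if not value or key in BLACKLIST or key in seen:
            continue
        seen.add(key)
        result.append(value)
    return result


def best_title(candidates):
    """Return the first surviving candidate, or None if none survive."""
    cleaned = postprocess(candidates)
    return cleaned[0] if cleaned else None
```

For instance, given the candidates "Homepage", "Welcome to Riverside Bird Atlas", "Riverside Bird Atlas", and "Bird Atlas" (made-up values), the sketch keeps "Riverside Bird Atlas" and "Bird Atlas", in that order, and best_title returns the first.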
We evaluate the Title assignment algorithm using the iVia metadata evaluation tool and the 1,000 most-recently-modified INFOMINE records.
Metadata evaluation metrics are explained on the iVia metadata evaluation page.
Both recall and precision are important in Title assignment, and as Titles contain uncontrolled text values, content-word-based statistics are very useful. Broadly speaking, a high recall score indicates the important words in the Title are identified correctly, while high precision suggests we are not assigning incorrect words that summarize unimportant parts of the document and skew search results. By convention, Titles are often short and predictable, so exact matches are frequently possible. For these reasons, the primary measures in our evaluations will be content word precision, content word recall and exact match accuracy.
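To make the three primary measures concrete, the sketch below scores one assigned Title against one expert Title. It rests on several assumptions that this page does not confirm: the stopword list is illustrative, words are matched as a multiset after lowercasing, and exact matching ignores case and surrounding whitespace. The evaluation tool may differ on all of these points.

```python
import re
from collections import Counter

# Illustrative stopword list only; the tool's real list is not given here.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def content_words(title):
    """Return a multiset of lowercase words with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def score(expert, assigned):
    """Compare an assigned Title with the expert Title for one record."""
    expert_words = content_words(expert)
    assigned_words = content_words(assigned)
    # Multiset intersection counts each shared word at most as often as it
    # occurs in both titles.
    matching = sum((expert_words & assigned_words).values())
    n_expert = sum(expert_words.values())
    n_assigned = sum(assigned_words.values())
    return {
        "exact": expert.strip().lower() == assigned.strip().lower(),
        "precision": matching / n_assigned if n_assigned else 0.0,
        "recall": matching / n_expert if n_expert else 0.0,
    }
```

With the made-up pair "Atlas of Riverside Birds" (expert) and "Riverside Birds Atlas Online Edition" (assigned), all three expert content words are found, so recall is 1.0, while two of the five assigned content words are extra, so precision is 0.6.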
The evaluate_metadata_assignment program will evaluate Title assignment if the configuration file has the variable title = "Title" defined in the [Fields] section. An example of the Title output is shown below.
    Title
    Field name: title
    Number of examples: 1000
    Number of non-empty examples: 1000
    Number of passes: 1
    Number of attempts: 999
    Number of exact matches: 225
    Exact match accuracy: 0.2250
    Average length of expert metadata in letters: 41.3 +/- 22.9
    Average length of assigned metadata in letters: 41.8 +/- 27.4
    Average length of expert metadata in words: 5.7 +/- 3.2
    Average length of assigned metadata in words: 5.8 +/- 3.8
    Total number of expert content words: 4341
    Total number of assigned content words: 4536
    Total number of matching content words: 2979
    Content word precision: 0.6567
    Content word recall: 0.6862
    Content word f-measure: 0.6712
    Total number of expert stemmed content words: 4326
    Total number of assigned stemmed content words: 4507
    Total number of matching stemmed content words: 3007
    Stemmed content word precision: 0.6672
    Stemmed content word recall: 0.6951
    Stemmed content word f-measure: 0.6809
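The derived figures in this output can be reproduced from the reported totals. The sketch below assumes the usual definitions (precision = matching / assigned, recall = matching / expert, f-measure = harmonic mean of the two), which are consistent with the numbers shown; the function name prf is hypothetical.

```python
def prf(matching, assigned, expert):
    """Return (precision, recall, f-measure) from aggregate word counts."""
    precision = matching / assigned
    recall = matching / expert
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Totals copied from the example output above.
plain = prf(matching=2979, assigned=4536, expert=4341)
stemmed = prf(matching=3007, assigned=4507, expert=4326)
exact_match_accuracy = 225 / 1000  # exact matches / number of examples

print(["%.4f" % v for v in plain])    # ['0.6567', '0.6862', '0.6712']
print(["%.4f" % v for v in stemmed])  # ['0.6672', '0.6951', '0.6809']
print("%.4f" % exact_match_accuracy)  # 0.2250
```

Note that the reported exact match accuracy of 0.2250 corresponds to dividing by the 1000 examples; dividing by the 999 attempts would give 0.2252, so the accuracy appears to be computed over all examples.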
| Row | Method | Tries | Exact match accuracy (EMA) | Content word precision (CWP) | Content word recall (CWR) | Length (words) |
|---|---|---|---|---|---|---|
| 1 | Current | 1000 | 0.2290 | 0.6200 | 0.6410 | 5.5 |
| 2 | META only | 21 | 0.0050 | 0.4783 | 0.0132 | 7.2 |
| 3 | Title only | 992 | 0.2340 | 0.6295 | 0.6434 | 5.5 |
| 4 | H1 only | 165 | 0.0350 | 0.7289 | 0.0786 | 3.6 |
| 5 | Text only | 919 | 0.0000 | 0.2390 | 0.2931 | 8.0 |
Rows 2 to 5 of the table above show the effect of using only one source of Title metadata at a time (with standard post-processing). Document META tags (row 2) are rare and less precise than Title tags (row 3), so arguably Title tags should appear before META tags in the list of preferred sources above. An automatic evaluation shows that this change does not affect overall performance, and a side-by-side comparison reveals that the META tags are usually either identical to the Title tags or very similar but longer and more descriptive. Although the last two sources (rows 4 and 5) perform poorly compared to the META and Title tags, they are useful for PDF and other non-HTML documents.