The iVia Project has developed an automatic metadata evaluation tool to measure the effectiveness of its automatic metadata assignment algorithms. This work is described in a paper at JCDL 2005.
The automatic metadata evaluation tool is used to produce the metadata assignment evaluations on the following pages:
The remainder of this document describes the automatic metadata evaluation process, the test data used in our experiments, the evaluation metrics used in our experiments, and how to run the iVia automatic metadata evaluation tool.
Different metadata fields have different characteristics and are measured in different ways. For example, some fields are only assigned a single metadata value, in which case accuracy is a good performance measure, while others are assigned multiple values, and precision and recall are more informative. Some fields are even more complex. For example, accuracy is not a useful way to measure Description assignment. Consequently, the iVia evaluations use a range of different performance measures.
The simplest statistic used in our evaluations is the exact match accuracy, or EMA. EMA is the proportion of times the automatic assignment for a field exactly matches the expert's assignment for that field (after simple normalizations). Matches apply to all the values in multiple-value fields, so in the case of key phrase assignment (for example), an exact match only occurs when the automatically-assigned set is identical to the expert-assigned set. Because of this strict interpretation, exact match accuracy is most useful in controlled fields where a single value is assigned, such as the Language and Media Type fields.
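As a minimal sketch of the EMA calculation described above, the following Python compares value sets record by record. The normalize function is an assumption standing in for the "simple normalizations" mentioned in the text (the source does not say what they are), and all names here are hypothetical rather than part of iVia.

```python
def normalize(value):
    """Hypothetical normalization: trim, collapse whitespace, lowercase."""
    return " ".join(value.split()).lower()


def exact_match_accuracy(expert, assigned):
    """Proportion of records whose assigned value set equals the expert set.

    `expert` and `assigned` are parallel lists. Each item is the list of
    values for one record (a single-value field is a one-element list).
    """
    if len(expert) != len(assigned):
        raise ValueError("expert and assigned must have the same length")
    if not expert:
        return 0.0
    matches = sum(
        1
        for e, a in zip(expert, assigned)
        if {normalize(v) for v in e} == {normalize(v) for v in a}
    )
    return matches / len(expert)


# Made-up example data, not drawn from the iVia test set.
expert = [["English"], ["metadata", "digital libraries"], ["Spanish"]]
assigned = [["english "], ["metadata"], ["French"]]
ema = exact_match_accuracy(expert, assigned)
print(f"EMA = {ema:.4f}")
```

Only the first record counts as a match here: the second has one correct key phrase but not the complete set, which illustrates why EMA is strict for multiple-value fields.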
In multiple-value fields like Key phrases, EMA does not directly handle the case where some, but not all, of the values are correctly assigned. To remedy this shortcoming, we split each multiple-value field into its individual "subfields", and then count the number of "expert" subfields (i.e. drawn from expert metadata), "assigned" subfields (i.e. assigned by an algorithm), and "matching" subfields (i.e. appearing in both sources). We can then measure performance using subfield precision and subfield recall.
Subfield precision (SFP) is defined as the number of matching subfields divided by the number of assigned subfields.
Subfield recall (SFR) is defined as the number of matching subfields divided by the number of expert subfields.
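The two definitions above can be sketched in Python as follows. This sketch assumes that subfield counts are pooled across all records before dividing. The function name and the sample key phrases are hypothetical.

```python
def subfield_precision_recall(records):
    """Return (precision, recall) over an iterable of
    (expert_values, assigned_values) pairs, pooling counts across records."""
    n_expert = n_assigned = n_matching = 0
    for expert_values, assigned_values in records:
        expert_set, assigned_set = set(expert_values), set(assigned_values)
        n_expert += len(expert_set)
        n_assigned += len(assigned_set)
        n_matching += len(expert_set & assigned_set)
    # Guard against division by zero when nothing was assigned or expected.
    precision = n_matching / n_assigned if n_assigned else 0.0
    recall = n_matching / n_expert if n_expert else 0.0
    return precision, recall


# Made-up key phrase data for two records.
records = [
    (["metadata", "digital libraries", "evaluation"],
     ["metadata", "evaluation", "web"]),
    (["stemming"], ["stemming", "stop words"]),
]
sfp, sfr = subfield_precision_recall(records)
print(f"SFP = {sfp:.2f}, SFR = {sfr:.2f}")
```

In this example there are 4 expert subfields, 5 assigned subfields, and 3 matching subfields, so precision is 3/5 and recall is 3/4.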
Uncontrolled text fields like Description and Title are more difficult to measure because assignment tools for generating textual summaries will rarely generate exactly the same description as a human expert. To evaluate textual fields, we have introduced two new metrics: content-word precision and content-word recall. Both metrics compare two passages of text by analyzing the set of unique content words in each passage (ignoring case). Content words are words that are not common stop words like "of" and "the".
Content-word precision (CWP) is the proportion of unique content words in the automatically-assigned metadata that are also present in the expert-assigned metadata.
Content-word recall (CWR) is the proportion of unique content words from the expert-assigned metadata that also appear in the automatically-assigned metadata.
Stemmed content-word precision (SCWP) is the proportion of unique content-word stems in the automatically-assigned metadata that are also present in the expert-assigned metadata.
Stemmed content-word recall (SCWR) is the proportion of unique content-word stems in the expert-assigned metadata that are also present in the automatically-assigned metadata.
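The content-word metrics and their stemmed variants might be sketched like this for a single pair of passages. The stop list and the toy_stem function are assumptions made for illustration: the source does not say which stop list or stemming algorithm iVia uses, and a real stemmer would replace the crude plural stripper shown here. The sample output later in this document reports totals, which suggests the tool pools counts over all records instead of scoring one pair at a time.

```python
import re

# Hypothetical stop list; the list iVia actually uses is not given here.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def content_words(text):
    """Unique lower-cased words in `text` that are not stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips one trailing 's'."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def precision_recall(expert_text, assigned_text, stem=None):
    """Return (precision, recall) over unique content words, or over
    unique content-word stems when a `stem` function is supplied."""
    expert = content_words(expert_text)
    assigned = content_words(assigned_text)
    if stem is not None:
        expert = {stem(w) for w in expert}
        assigned = {stem(w) for w in assigned}
    matching = len(expert & assigned)
    precision = matching / len(assigned) if assigned else 0.0
    recall = matching / len(expert) if expert else 0.0
    return precision, recall


# Made-up title pair.
expert_title = "Maps of California Rivers"
assigned_title = "California river map"
cwp, cwr = precision_recall(expert_title, assigned_title)
scwp, scwr = precision_recall(expert_title, assigned_title, stem=toy_stem)
print(f"CWP = {cwp:.4f}, CWR = {cwr:.4f}")
print(f"SCWP = {scwp:.4f}, SCWR = {scwr:.4f}")
```

Without stemming only "california" matches, while with stemming "maps"/"map" and "rivers"/"river" also match, which shows why the stemmed scores are never lower than the unstemmed ones for this kind of variation.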
The metadata evaluation tool is implemented in the iVia Virtual Library Software. To run it, you need a working iVia installation that contains a set of reference metadata created by human experts. The evaluation process is explained in detail below.
The metadata evaluation tool is in the iVia/src/programs/evaluate_metadata_assignment directory of your iVia installation, and is called evaluate_metadata_assignment. Its behavior is controlled by the evaluate_metadata_assignment.conf configuration file in the iVia etc directory.
In order to evaluate metadata assignment algorithms with iVia, a configuration file like the one below is required. It is in the standard iVia IniFile format. You may also need to configure your libiViaMetadata configuration file, as described in the libiViaMetadata manual.
# evaluate_metadata_assignment.conf
[Logging]
verbosity = 4
[Test Data]
no_of_records = 1000
html_documents_only = true
where_clause = "foreign_source='infomine.ucr.edu' AND expert_created='true' \
AND access='free' AND url NOT LIKE 'http://lib.ucr.edu%' \
AND url NOT LIKE 'http://infomine.ucr.edu%'"
order_by_clause = "last_modified_at DESC"
batch_size = 2000
[Fields]
title = "Title"
ivia_description = "Description"
authors = "Creators"
categories = "INFOMINE Categories"
keywords = "Keyphrases"
LCC = "Library of Congress Classification"
#media_type = "Media Type (unverified test data)"
subjects = "Library of Congress Subject Headings"
[Metadata Assigner]
download_timeout = 30000 # milliseconds
download_mode = "DOWNLOAD_PAGE"
min_no_of_urls = 0
min_qty_of_text = 0
max_no_of_urls = 0
max_qty_of_text = 0
The first section, [Logging], is used to set the verbosity of the log comments (0 = none, 1 = quiet, 3 = normal, 4 = detailed, 5 = very detailed).
The [Test Data] section is used to identify the iVia metadata records that will be used as a "gold standard" in the evaluation. The no_of_records variable specifies how many records to use, and the html_documents_only variable is used to force the evaluator to use only HTML documents. The where_clause variable contains an SQL WHERE clause that will be used to select records from the iVia record_info table, while the order_by_clause variable contains an SQL ORDER BY clause that controls the order in which the records are selected for evaluation. Finally, the batch_size variable is used to control how many records the evaluator requests from the database on any given query.
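To illustrate how the where_clause, order_by_clause, and batch_size settings relate to one another, here is a rough Python sketch that assembles one batch query against the record_info table. This is not iVia's actual query code: the function name, the SELECT * column list, and the use of LIMIT and OFFSET for batching are all assumptions.

```python
def build_batch_query(where_clause, order_by_clause, batch_size, offset=0):
    """Assemble one batch query (hypothetical; not iVia's actual SQL)."""
    return (
        "SELECT * FROM record_info "
        f"WHERE {where_clause} "
        f"ORDER BY {order_by_clause} "
        f"LIMIT {int(batch_size)} OFFSET {int(offset)}"
    )


# Values shortened from the sample configuration file.
query = build_batch_query(
    where_clause="expert_created='true' AND access='free'",
    order_by_clause="last_modified_at DESC",
    batch_size=2000,
)
print(query)
```

Under this reading, batch_size only limits how many rows are fetched per query, while no_of_records limits how many records are evaluated in total.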
The [Fields] section specifies the fields to be included in the evaluation. Each variable in this section is assumed to be an iVia record_info database field name, and the corresponding value to be a well-formatted English-language description of that field. For example, the line ivia_description = "Description" specifies that we will evaluate the ivia_description field from the database, and will describe this as "Description" metadata in our output. (Note that in this example the media_type field has been commented out with the "#" character.)
Finally, the [Metadata Assigner] section describes the basic settings to be used for the libiViaMetadata Metadata Assigner object. The download_timeout is the maximum amount of time (in milliseconds) to allow for Web page downloads. The remaining settings are defined as per the Metadata Assigner documentation.
Once you have iVia installed and configured, you can run an evaluation by changing to the iVia/src/programs/evaluate_metadata_assignment directory of your iVia installation, and running the command ./evaluate_metadata_assignment.
The evaluation progress will be recorded in the log file. This can normally be found in your installed log directory, usually in $HOME/iVia-installed/log/evaluate_metadata_assignment.log, though some iVia installations may specify a different location.
The program output will be written to STDOUT. It will begin with a header, as follows:
Evaluate Metadata Assignment Results
Start date: 2005-08-08 21:13:07
End date: 2005-08-08 21:14:52
Installation: afcrawler.ucr.edu
Parameters:
WHERE clause: foreign_source='infomine.ucr.edu' AND expert_created='true' AND access='free' AND url NOT LIKE 'http://lib.ucr.edu%' AND url NOT LIKE 'http://infomine.ucr.edu%'
ORDER BY clause: last_modified_at DESC
target number of records: 1000
Records:
records evaluated: 1000
records skipped: 92
The start and end dates are the times for the entire evaluation, and may include the evaluation of several fields. Be aware that the absence or presence of the test documents in the Page Cache may be the most significant factor in these times. The Installation is the name of the iVia installation where the evaluation was performed, and the Parameters are those set in the configuration file. The Records section notes how many records were used in the evaluation (not including skipped records). If the records skipped value is non-zero, some of the records were not suitable for use in the evaluation (e.g. they were not HTML, or could not be downloaded). The reasons are explained in the log file.
A results section will then be output for each of the fields listed in the [Fields] section of the configuration file. Here is an example from an evaluation of the Title assignment algorithm by comparing it to the values in the title field in the iVia record_info table for 1000 records.
Title
Field name: title
Number of examples: 1000
Number of non-empty examples: 1000
Number of passes: 1
Number of attempts: 999
Number of exact matches: 225
Exact match accuracy: 0.2250
Average length of expert metadata in letters: 41.3 +/- 22.9
Average length of assigned metadata in letters: 41.8 +/- 27.4
Average length of expert metadata in words: 5.7 +/- 3.2
Average length of assigned metadata in words: 5.8 +/- 3.8
Total number of expert content words: 4341
Total number of assigned content words: 4536
Total number of matching content words: 2979
Content word precision: 0.6567
Content word recall: 0.6862
Content word f-measure: 0.6712
Total number of expert stemmed content words: 4326
Total number of assigned stemmed content words: 4507
Total number of matching stemmed content words: 3007
Stemmed content word precision: 0.6672
Stemmed content word recall: 0.6951
Stemmed content word f-measure: 0.6809
The top six rows describe the data used in the comparisons. In this case, the evaluation considered 1000 records (all with non-empty Title values) and the assignment algorithm attempted an assignment in 999 cases (and made no attempt, or "passed", in 1 case).
The remaining rows report various statistics that compare the "expert" metadata (i.e. the expert-created metadata drawn from the database, which we assume is correct) with the "assigned" metadata (i.e. the values assigned by the assignment algorithm). Some of these are observed values, and some are calculated from them. For example, the number of exact matches statistic records the number of times the "expert" and "assigned" values are exactly the same, while the exact match accuracy is found by dividing this number by the number of examples.
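As a check on how the calculated rows follow from the observed ones, the short Python sketch below recomputes several values from the counts in the Title results above. The f-measure formula is an assumption, since this document does not define it: the sketch uses the balanced harmonic mean of precision and recall, which reproduces the reported figure.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Observed counts copied from the Title results.
examples, exact_matches = 1000, 225
expert_words, assigned_words, matching_words = 4341, 4536, 2979

ema = exact_matches / examples
cwp = matching_words / assigned_words
cwr = matching_words / expert_words
cwf = f_measure(cwp, cwr)

print(f"Exact match accuracy:   {ema:.4f}")
print(f"Content word precision: {cwp:.4f}")
print(f"Content word recall:    {cwr:.4f}")
print(f"Content word f-measure: {cwf:.4f}")
```

The printed values agree with the reported 0.2250, 0.6567, 0.6862, and 0.6712. The stemmed rows can be checked the same way from their own totals.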