The iVia Project has developed a set of automatic metadata assignment algorithms which it uses to assign descriptive metadata to Internet resources such as Web pages. The algorithms have been developed in conjunction with an automatic metadata evaluation tool, which has been used to measure each algorithm's effectiveness. This work was described in a paper at JCDL 2005.
iVia uses different assignment algorithms for different fields. These algorithms are described on the following pages:
The remainder of this document describes how to use the automatic metadata assignment algorithms through the libiViaMetadata library and the command-line tool, and introduces some of the supporting technologies shared by the different assignment algorithms.
The automatic metadata assignment tools are distributed in the libiViaMetadata package. They are implemented as a library (called libiViaMetadata), and distributed with a command-line program called iViaMetadata-assign and binary classifiers.
libiViaMetadata is a C++ library created by the iVia Project and distributed under the terms of the GNU Lesser General Public License.
The principal class in libiViaMetadata is the MetadataAssigner. A MetadataAssigner instance is used to assign metadata to a Web resource. For example, the following code assigns Title and Key phrase metadata based on a Web page:
TimeLimit time_limit(30000 /* milliseconds */);
MetadataAssigner assigner("http://ivia.ucr.edu", time_limit, MetadataAssigner::DOWNLOAD_PAGE);
if (assigner.anErrorOccurred())
    MsgUtil::Error("MetadataAssigner Error: " + assigner.getErrorMessage());

std::list<std::string> titles, keyphrases;
assigner.assignTitles(&titles);
// Method name assumed by analogy with assignTitles; check the library headers.
assigner.assignKeyphrases(&keyphrases);
The MetadataAssigner is usually applied to HTML documents downloaded from the Internet, though it can also be applied to local files and to documents in other formats (such as PDF files).
More precise control over the MetadataAssigner is possible by passing a MetadataAssigner::Params structure. In the default mode (DOWNLOAD_PAGE, used above), a single Web page is downloaded from the Internet. The exception is an HTML frameset: when one is encountered, its constituent frames are composed into a single document that approximates the content a Web browser would display.
The iViaMetadata-assign program is installed with libiViaMetadata. It is a command-line utility that assigns metadata to Internet documents. It takes one parameter, a URL or filename, and outputs the extracted metadata to STDOUT.
If the libiViaMetadata bin directory is in your PATH environment variable, then the command iViaMetadata-assign 'http://infomine.ucr.edu' will produce output in the following format:
Derived content:
Derived HTML (min: 0 bytes of alphanumeric plain text requested):
9799 bytes of HTML, including
1172 bytes of plain text, including
389 bytes of alphanumeric plain text.
Derived URLs (min: 0 URLs requested):
http://infomine.ucr.edu
Title:
INFOMINE: Scholarly Internet Resource Collections
Description:
INFOMINE is a comprehensive virtual library and reference tool for academic and scholarly Internet resources, including Web sites,
databases, electronic journals, bulletin boards, listservs, online library card catalogs, articles, directories of researchers, and other types of
information.
Key phrases:
online library card catalogs
electronic journals
bulletin boards
scholarly resources
scholarly communications
bibliographic databases
educational technology
information technology
government information
educational resources
Language:
en
Media Type:
text/html
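Because the output is a sequence of headings followed by value lines, it is easy to post-process. The following is a minimal sketch, not part of libiViaMetadata: the function name ParseAssignOutput is hypothetical, and it assumes that any non-empty line ending in a colon is a heading and that every other line is a value belonging to the most recent heading.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of libiViaMetadata): groups the lines of
// captured iViaMetadata-assign output under the heading that precedes them.
// Assumption: a heading is any non-empty line whose last character is ':'.
std::map<std::string, std::vector<std::string>> ParseAssignOutput(const std::string &output) {
    std::map<std::string, std::vector<std::string>> fields;
    std::istringstream input(output);
    std::string current_heading, line;
    while (std::getline(input, line)) {
        // Strip leading and trailing whitespace; skip blank lines.
        const std::string::size_type first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos)
            continue;
        const std::string::size_type last = line.find_last_not_of(" \t\r");
        line = line.substr(first, last - first + 1);

        if (line.back() == ':') {
            // Start a new field, even if no values end up following it.
            current_heading = line.substr(0, line.size() - 1);
            fields[current_heading];
        } else if (!current_heading.empty()) {
            fields[current_heading].push_back(line);
        }
    }
    return fields;
}
```

With the sample output above, the "Key phrases" entry would hold ten strings and "Language" a single "en". A description that wraps across several lines is stored as several values, so callers may want to join them with spaces.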
The assignment algorithms can be fine-tuned using the MetadataAssigner.conf configuration file. This is located in the user's $HOME/.iViaMetadata directory.
# MetadataAssigner.conf

[Models]
model_dir = "" # defaults to libiViaMetadata SHARE directory.

[Categories]
enabled = false
use_govpub_rule = true
use_classifier = true

[Keyphrases]
no_to_assign = 10

[LCC Outlines]
enabled = false
min_cutoff_probability = 0.7
classification_method = "pachinko" # options: "all", "best", "pachinko".

[Field Aliases]
authors = "creator"
lcc_outline_auto = "none"
subjects = "lcsh"
annotation = "description"
ivia_description = "description"
ucla_description = "none"
ucr_description = "none"
lc_description = "none"
Most of the configuration items are field-specific, and are described on the corresponding field assignment page. The INFOMINE Category and LCC Outline assignment algorithms depend on stored model files, which are located in the libiViaMetadata share directory unless the model_dir variable is set. The model files are described on the assignment algorithm pages.
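To make the file layout concrete, here is a rough sketch of how a file in this format could be read into nested maps. It is not the parser used by libiViaMetadata; the names ReadConfig, TrimSpaces, Config, and ConfigSection are hypothetical, and it assumes that '#' always starts a comment (never appears inside a quoted value) and that values are optionally wrapped in double quotes.

```cpp
#include <map>
#include <sstream>
#include <string>

using ConfigSection = std::map<std::string, std::string>;
using Config = std::map<std::string, ConfigSection>;

// Removes leading and trailing spaces, tabs, and carriage returns.
std::string TrimSpaces(const std::string &s) {
    const std::string::size_type first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos)
        return "";
    const std::string::size_type last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Hypothetical reader for MetadataAssigner.conf-style text.
// Assumption: '#' never occurs inside a quoted value.
Config ReadConfig(const std::string &text) {
    Config config;
    std::istringstream input(text);
    std::string section, line;
    while (std::getline(input, line)) {
        const std::string::size_type hash = line.find('#');
        if (hash != std::string::npos)
            line.erase(hash);                       // drop the comment
        line = TrimSpaces(line);
        if (line.empty())
            continue;
        if (line.front() == '[' && line.back() == ']') {
            section = TrimSpaces(line.substr(1, line.size() - 2));
            continue;                               // e.g. [LCC Outlines]
        }
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos)
            continue;                               // not a key = value line
        const std::string key = TrimSpaces(line.substr(0, eq));
        std::string value = TrimSpaces(line.substr(eq + 1));
        if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
            value = value.substr(1, value.size() - 2);  // strip the quotes
        config[section][key] = value;
    }
    return config;
}
```

All values are kept as strings here; a caller would convert items such as no_to_assign or min_cutoff_probability to numbers as needed.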
The [Field Aliases] section is used to give the metadata assigner clues about which assignment algorithm to use in situations where the algorithm is not explicitly requested. In this case, the MetadataAssigner is asked to assign metadata and told the name of the field to assign. The MetadataAssigner knows the names of many common metadata fields, but not all. For example, if it is asked to assign "creator" metadata, it knows to use its Creator assignment algorithm. However, it would not know which algorithm to use if "authors" metadata were requested. To solve this problem, the [Field Aliases] section contains the line authors = "creator", which tells the assigner that a request for authors metadata is equivalent to a request for creator metadata.
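The lookup described above can be sketched as follows. This is an illustration only, not the library's implementation: ResolveField and its parameters are hypothetical names, and two behaviors are assumptions, namely that a field the assigner already knows takes precedence over any alias, and that an alias value of "none" means no algorithm should be run.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical sketch (not the libiViaMetadata API).
// Returns the field whose assignment algorithm should handle a request for
// `field`, or an empty string if no algorithm applies.
std::string ResolveField(const std::string &field,
                         const std::set<std::string> &known_fields,
                         const std::map<std::string, std::string> &aliases) {
    // Assumption: a field the assigner already knows is used directly.
    if (known_fields.count(field) > 0)
        return field;

    // Otherwise consult the [Field Aliases] entries.
    const auto alias = aliases.find(field);
    if (alias == aliases.end())
        return "";                      // unknown field, no alias
    if (alias->second == "none")
        return "";                      // assumption: "none" disables assignment

    // Only accept the alias if it points at a field with an algorithm.
    return known_fields.count(alias->second) > 0 ? alias->second : "";
}
```

Given aliases loaded from the sample configuration, a request for "authors" would resolve to "creator", while "ucr_description" would resolve to nothing.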
Several of our metadata assignment algorithms use a Logistic Regression text classification algorithm to assign metadata to Internet resources.
iVia can boost metadata assignment by using Rich Text Identification to find "high-aboutness" pages that describe a resource, and exploiting that text in its assignment algorithms.