TEST INFO:
The Test Candidates
-
Kea
-
Kea is a component of the
New Zealand Digital Library,
designed for automatically extracting keywords/keyphrases
from text documents.
Briefly, it is a naive Bayesian classifier trained on a
document/author-keyword corpus.
-
Turney's Keyphrase Extractor
-
Extractor is a commercial product that attempts to mimic the author's
selection of keywords directly, by comparing its results to the
author's. It uses a complex genetic algorithm to learn the parameters
of keyphrases.
-
Dublin Core Metadata Editor
-
The Dublin Core is a specification of a small set of metadata elements
for describing information resources (see RFC 2731).
The Dublin Core Metadata Editor is a service that generates this metadata
from the pages it processes, for the benefit of catalogers.
The "subject or keywords" results were extracted for test
comparisons.
-
PhraseRate
-
PhraseRate was designed to provide a list of keyphrases from a web page
that would aid librarians in selecting keywords for the page.
As such, it was relieved of the responsibility of precision in
its selection of keywords. Instead, it focuses on briefly
imparting a sense of a web page, as well as including the predominant
keywords for librarian selection.
While all of these programs deal with keywords/keyphrases,
their intended uses and integration
have led to distinct specializations, which are apparent in their responses.
Test Sites
Test Method
The test web pages were selected by submitting queries on
subjects and properties to
Google,
and then using the URLs returned on the response page.
The reasons for using properties in some of the queries were to
provide a diversity in responses and to obtain some pages that were
not strongly subject oriented.
The test URLs were then submitted to the web demos in the case of
Dublin Core Metadata Editor,
Turney's Keyphrase Extractor Demo,
and PhraseRate.
In the case of Kea, as its web presence was down at the time,
the web pages were processed to text (which is the input format for Kea),
and submitted as a tar file to Gordon Paynter, who applied Kea 2.0 to them.
To process the HTML, a custom HTML-to-text processor was written that
would include the title, meta description, meta subject, and meta keywords
from the test pages.
This was necessary to provide a fair comparison for Kea, as a number
of the web pages were very weak in content.
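The original processor is not shown here, so the following is only a minimal sketch of what such an HTML-to-text step might look like, using Python's standard `html.parser` module. The names `PageTextExtractor`, `html_to_text`, and `META_NAMES` are hypothetical, and the output order (title, then meta fields, then body text) is an assumption rather than a description of the tool actually used.

```python
from html.parser import HTMLParser

# Hypothetical sketch; the processor used in the test is not shown in the text.
# Meta fields the text says were carried over into the plain-text output.
META_NAMES = ("description", "subject", "keywords")


class PageTextExtractor(HTMLParser):
    """Collects the title, selected meta fields, and visible text of a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.meta = {}
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in META_NAMES and content:
                self.meta[name] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.body_parts.append(data)


def html_to_text(html):
    """Return title, meta fields, and body text, one per line (assumed layout)."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join("".join(parser.title_parts).split())]
    lines += [parser.meta[name] for name in META_NAMES if name in parser.meta]
    # Join text fragments with a space so words in adjacent elements stay apart,
    # then collapse runs of whitespace.
    lines.append(" ".join(" ".join(parser.body_parts).split()))
    return "\n".join(line for line in lines if line)
```

For a page whose body is nearly empty, the title and meta lines are then the only text an extractor working from plain text would see, which is the situation the paragraph above describes.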
After the test was run, to provide an awareness of Kea's testing context,
Dr. Paynter suggested that
"Two things I think you should note in your description of Kea is it
was trained on a corpus of web pages (Aliweb, from Frank et al., 1999)
and that it is restricted to phrases of 1 to 3 words."
The results from Kea, Extractor, and PhraseRate are listed in order of
perceived weight.
Gordon Paynter suggested that
"if I were doing this for presentation to users I would take
the first 7 of each set - this was a good tradeoff between noise and
accuracy in our experiments.",
and so the top 7 (if available) were extracted for the Kea results.
For Extractor's results, it was assumed that the entries returned in the demo
reflect the appropriate selection.
For PhraseRate, the top 9 (if available) were selected.
It was not clear that Dublin Core's results were ordered by perceived
relevance, so we included them en masse.
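The selection rules above can be summarized in a short sketch. The cut-off values come from the text; the names `TOP_N` and `select_phrases`, and the assumption that each extractor's output arrives as a list already ordered by perceived weight, are illustrative only.

```python
# Cut-offs taken from the text; None means "keep everything that was returned".
TOP_N = {
    "Kea": 7,             # first 7, following Gordon Paynter's suggestion
    "Extractor": None,    # the demo's own selection is assumed appropriate
    "PhraseRate": 9,      # top 9
    "Dublin Core": None,  # ordering unclear, so all results are kept
}


def select_phrases(extractor, ranked_phrases):
    """Return the phrases kept for comparison (hypothetical helper)."""
    limit = TOP_N[extractor]
    if limit is None:
        return list(ranked_phrases)
    # Slicing returns fewer items when fewer are available ("if available").
    return list(ranked_phrases[:limit])
```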
Any web page that was not accepted by all of the keyphrase extractors
was dropped from the test; this amounted to about 5 or 6 pages.
These failures occurred when a candidate could not process
frames, follow forward references, or handle non-standard characters
in the URL.
Otherwise, all results are presented, including pages that generated no
keywords or meaningless results from the tested programs.
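The distinction drawn here, between a page an extractor rejected and a page for which it simply returned nothing useful, can be illustrated with a small sketch. The data layout (a dictionary per extractor mapping each URL to a list of phrases, with `None` marking a rejected page) and the function name `pages_accepted_by_all` are assumptions made for illustration.

```python
def pages_accepted_by_all(results):
    """Return the URLs that every extractor accepted (hypothetical helper).

    results maps extractor name -> {url: list of phrases, or None if rejected}.
    An empty list counts as accepted: the page was processed but produced
    no keywords, so it stays in the test.
    """
    if not results:
        return []
    # Only URLs submitted to every extractor can qualify.
    urls = set.intersection(*(set(by_url) for by_url in results.values()))
    return sorted(
        url for url in urls
        if all(results[name][url] is not None for name in results)
    )
```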
The weaker pages form "boundary cases" which stress the
extractors. These all-too-common pathological web sites need to be handled,
and while all the programs could use improvement, their output for such
pages was often admirably sparse.
The results presented below are grouped by the Google keyword queries,
with the keyword listed in the header in each section.
The number of words and characters in each page, both with and without
HTML (and excess space),
are listed to give a sense of the nature of the page.
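How these counts were produced is not stated; the sketch below shows one plausible way to compute them, assuming that "words" means whitespace-separated tokens and that "without HTML" means markup removed and runs of whitespace collapsed. The names `_TextOnly` and `page_size_stats` are hypothetical.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and discards tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Note: script and style content is also passed here and is not filtered.
        self.parts.append(data)


def page_size_stats(html):
    """Word and character counts with and without markup (assumed definitions)."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Remove excess space by collapsing every whitespace run to a single space.
    text = " ".join(" ".join(parser.parts).split())
    return {
        "words_with_html": len(html.split()),
        "chars_with_html": len(html),
        "words_without_html": len(text.split()),
        "chars_without_html": len(text),
    }
```

A large gap between the two pairs of counts would indicate a page that is mostly markup, which is one way a page can be "weak in content".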
NOTE: Both Extractor and PhraseRate can handle framesets; Dublin Core evidently
cannot, and Kea was not given the opportunity.
There are 10 framesets included in the tests and they are noted as such.