Research and Development: Phrase extraction

Keith Humphreys' PhraseRate

PhraseRate is a program, developed by Keith Humphreys, that extracts from a web page a set of meaningful, attractive keywords and key phrases describing the content of that page.

Examples

PhraseRate has been compared to two other key phrase extraction programs, Kea and Extractor. You can see a short list of examples, or a much longer list, on the archived PhraseRate web site.

How it works

PhraseRate is based on a set of heuristics tailored to the web environment, and a phrase selection algorithm.

To extract phrases from a web page, PhraseRate first downloads the page's text (additional pages may be crawled if the first is inadequate). It then extracts the phrases appearing in the page, along with a range of properties based on the page structure and phrase syntax. Finally, the best of these phrases are chosen by the phrase selection algorithm.
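The extraction step can be sketched roughly as follows. This is a minimal illustration and not PhraseRate's actual code: the stopword list, the set of "emphasis" tags, the word n-gram definition of a candidate phrase, and the names TextCollector and candidate_phrases are all assumptions made for the example. Downloading and crawling are left out, so the function takes the HTML as a string.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumptions for illustration only: PhraseRate's real heuristics are not shown here.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for", "from"}
EMPHASIS_TAGS = {"title", "h1", "h2", "h3", "b", "strong"}
SKIP_TAGS = {"script", "style"}


class TextCollector(HTMLParser):
    """Collects (text, is_emphasized) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.emphasis_depth = 0
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.emphasis_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS:
            self.emphasis_depth = max(0, self.emphasis_depth - 1)
        elif tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        # Ignore script/style content and whitespace-only text.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append((data, self.emphasis_depth > 0))


def candidate_phrases(html, max_words=3):
    """Return {phrase: {"count": ..., "emphasized": ...}} for word n-grams
    that neither start nor end with a stopword (a hypothetical syntax rule)."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    props = defaultdict(lambda: {"count": 0, "emphasized": 0})
    for text, emphasized in parser.chunks:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                entry = props[" ".join(gram)]
                entry["count"] += 1
                entry["emphasized"] += int(emphasized)
    return dict(props)
```

Each candidate carries two simple properties here: how often it occurs, and how many of those occurrences sit inside a title, heading, or bold element. A fuller version would feed properties like these into the selection step.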

The phrase selection algorithm is based on harmonic means. It was chosen to satisfy eight requirements, for example: independence from document length, a preference for longer phrases, and identical output when a document is repeated.
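To give a feel for how a harmonic-mean score can meet the repetition requirement, here is a small sketch. It is not the PhraseRate selection algorithm, whose details are in the research paper. The scoring rule (the harmonic mean of the relative frequencies of a phrase's words) and the names word_frequencies, phrase_score, and rank_phrases are assumptions chosen for illustration, and the sketch does not attempt the other requirements, such as favoring longer phrases.

```python
from collections import Counter
from statistics import harmonic_mean


def word_frequencies(words):
    """Relative frequency of each word (count divided by document length)."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def phrase_score(phrase, freqs):
    """Hypothetical score: harmonic mean of the phrase's word frequencies.

    Words that do not occur in the document give the phrase a score of 0.
    """
    values = [freqs.get(word, 0.0) for word in phrase.split()]
    if any(value == 0 for value in values):
        return 0.0
    return harmonic_mean(values)


def rank_phrases(candidates, words, top_n=3):
    """Return the top_n candidates, highest score first."""
    freqs = word_frequencies(words)
    ranked = sorted(candidates, key=lambda p: phrase_score(p, freqs), reverse=True)
    return ranked[:top_n]


words = "key phrase extraction finds key phrase candidates in web pages".split()
candidates = ["key phrase", "web pages", "phrase extraction"]

once = rank_phrases(candidates, words)
repeated = rank_phrases(candidates, words * 3)  # the same document, three times over
```

Because the score uses relative rather than absolute frequencies, repeating the document leaves every frequency unchanged, so once and repeated contain the same ranking. The harmonic mean also penalizes a phrase whose words are unevenly common: "phrase extraction" scores below "key phrase" because "extraction" is rarer than "phrase".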

Further Information

Keith's research paper on PhraseRate is now available. More information is available from Keith's archived research web site.