Keith Humphrey's Research Pages

PhraseRate
PhraseRate is a system designed to estimate key phrases in an HTML document. It approaches the task with the assumption that a reasonably written page attempts to tell you what it is about, especially at the beginning. Text position, frequency, HTML markup, and sequencing are used to estimate weights for various short phrases throughout the document.
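
As a rough illustration of the idea (not PhraseRate's actual algorithm, which is not spelled out here), the sketch below scores one- to three-word phrases by frequency, boosts occurrences inside emphasized markup, and gives extra weight to words near the start of the page. The tag boosts, the position decay, and the names `TAG_BOOST` and `rate_phrases` are all assumptions made for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical markup boosts; the real PhraseRate weights are not given in the text.
TAG_BOOST = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5}


class _TextRuns(HTMLParser):
    """Collect (text, boost) runs, where boost reflects the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.runs = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates sloppy HTML).
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        boost = max([TAG_BOOST.get(t, 1.0) for t in self.stack], default=1.0)
        self.runs.append((data, boost))


def rate_phrases(html, max_len=3, top=5):
    """Return the `top` highest-scoring phrases as (phrase, score) pairs."""
    parser = _TextRuns()
    parser.feed(html)
    parser.close()
    scores = defaultdict(float)
    position = 0  # running word index across the whole document
    for text, boost in parser.runs:
        words = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(words)):
            # Earlier words weigh more: the factor decays from 2.0 toward 1.0.
            early = 1.0 + 1.0 / (1.0 + (position + i) / 50.0)
            # Phrases are consecutive words within one text run (sequencing).
            for n in range(1, max_len + 1):
                if i + n <= len(words):
                    scores[" ".join(words[i:i + n])] += boost * early
        position += len(words)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]
```

A real system would at least filter stop words; this sketch leaves them in to stay short.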

The Focused Crawler
The basic goals of the focused crawler project are:
  1. To provide the means for a document adder to easily obtain a large percentage of the core authoritative documents related to subject words.

  2. To take existing subject categories from the database and scan the Internet for closely related authoritative documents. This in effect produces an automated adder.

Currently, attention is being directed toward the first goal. A request is processed through a series of steps to reach the desired objective:
  • Page Cacher
    Page Cacher is a helper program. It takes a list of URLs and attempts to fetch any available document not already in the document database. Page Cacher enters each acquired document into the database. It is highly parallelized!
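
The behaviour described for Page Cacher can be sketched roughly as follows. This is a minimal illustration, not the actual program: the document database is modelled as a plain dict, and the names `cache_pages` and `fetch_url`, the thread pool, and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Default fetcher: return the raw bytes of the document at `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def cache_pages(urls, database, fetch=fetch_url, workers=16):
    """Fetch every URL not already in `database` and store the result.

    `database` is a plain dict standing in for the document database.
    Returns the list of URLs that could not be fetched.
    """
    # Drop duplicates (keeping order) and anything already cached.
    missing = [u for u in dict.fromkeys(urls) if u not in database]

    def attempt(url):
        try:
            return url, fetch(url)
        except Exception:  # any failure just means "not available"
            return url, None

    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fetches run in parallel; results come back in input order.
        for url, body in pool.map(attempt, missing):
            if body is None:
                failed.append(url)
            else:
                database[url] = body  # written only from the calling thread
    return failed
```

Passing the fetcher in as a parameter keeps the sketch testable without network access.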

  • Seed_Set
    Seed_Set starts with some keywords (and some extra parameters). It then queries a search engine for documents that contain these keywords. Finally, it feeds the URL list it received to Snarf.

  • Purify_Seed_Set
    Purify_Seed_Set takes a collection of documents, those received from Seed_Set, and extracts the predominant clusters:

    Each document is transformed into a word frequency vector, then relativized against a standard corpus, and then L_2 normalized. With the images on the unit sphere in "word space", this induces a metric on the complete graph of documents via the arc distance in the image space. (Note: Some of this seems to be going into Page Cacher.)

    The edges are sorted, then added in order to an initially discrete graph (à la Tarjan disjoint sets) up to a statistically determined cutoff bound. This causes a precipitation of the documents into clusters, where related documents have a sequence of deformations between them.

    The predominant clusters have a large number of members, and grouping by this method adds precision. They are then extracted and returned.
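
The Purify_Seed_Set steps can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: "relativized" is taken to mean dividing each word's relative frequency by its frequency in the standard corpus, the cutoff is passed in as a parameter instead of being determined statistically, and all function names and the `floor` value for unseen words are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def to_unit_vector(text, corpus_freq, floor=1e-4):
    """Word-frequency vector, divided by corpus frequency, then L2-normalized."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    vec = {w: (c / total) / corpus_freq.get(w, floor) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()}


def arc_distance(u, v):
    """Angle between two unit vectors, i.e. arc length on the unit sphere."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error


def find(parent, i):
    """Disjoint-set find with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster(vectors, cutoff):
    """Add edges shortest-first up to `cutoff`; return clusters, largest first."""
    parent = list(range(len(vectors)))  # discrete graph: every node alone
    edges = sorted(
        (arc_distance(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    )
    for dist, i, j in edges:
        if dist > cutoff:
            break
        parent[find(parent, i)] = find(parent, j)  # merge the two components
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)
```

Each returned cluster is a list of document indices; the predominant clusters are simply the first few entries of the result.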

  • Crawl_Cluster
    Crawl_Cluster takes a set of documents, then
    1. fetches a number of back links for each page from a search engine,
    2. extracts all the non-self-referential, non-image forward links from the document set,
    3. feeds this list to Snarf.
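
The forward-link step of Crawl_Cluster might look something like the sketch below. It assumes that "self-referential" means a link back to the same page and that images can be recognized by file suffix; the suffix list and the name `forward_links` are hypothetical, and the back-link query to a search engine is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed list of image suffixes; adjust as needed.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".bmp")


class _Anchors(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def forward_links(page_url, html):
    """Absolute http(s) links in `html`, minus self-references and images."""
    parser = _Anchors()
    parser.feed(html)
    parser.close()
    own, _ = urldefrag(page_url)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        target, _ = urldefrag(urljoin(page_url, href))
        parts = urlparse(target)
        if parts.scheme not in ("http", "https"):
            continue  # mailto:, javascript:, etc.
        if target == own:
            continue  # self-referential
        if parts.path.lower().endswith(IMAGE_SUFFIXES):
            continue  # image
        if target not in links:
            links.append(target)
    return links
```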

  • Purify_Cluster
    This routine examines the new documents in context with the old set, and then expunges poor candidates from the recruits.

  • Rate_Cluster
    Rate_Cluster is a modified Kleinberg routine. The two primary modifications are
    1. a solution to the indiscriminate referee,
    2. a potential for iterated expansions guided by the core set.
    Rate_Cluster applies a sophisticated bibliometric analysis to the linkage structure of the cluster, which provides a rating of the set.
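
For reference, the unmodified Kleinberg hubs-and-authorities iteration that Rate_Cluster builds on can be sketched as below. This is the plain textbook routine only; it does not include either of the two modifications listed above, whose details are not given here. The function name `hits` and the fixed iteration count are assumptions.

```python
import math


def hits(links, iterations=50):
    """Plain hubs-and-authorities iteration.

    `links` maps each page to the list of pages it links to.
    Returns (hub, auth) dicts of L2-normalized scores.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

For example, `hits({"h1": ["a", "b"], "h2": ["a"]})` ranks `a` as the strongest authority and `h1` as the strongest hub.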

BlobCacher
BlobCacher is a blob caching system whose focus is to be fast, robust, and able to accommodate multiple processes requesting information. It emphasizes stability of the database structure.

Documentation: