"Aboutness" is the idea that metadata assignment for a Web page can be improved by taking into account any surrounding Web pages that are "about" the same topic. We call these "high-aboutness" pages.
iVia implements "aboutness" in the RichTextFinder class (and supporting classes). It can be used to analyze a Web page and find supporting pages that are directly linked to it and share its topic, so that these additional pages can contribute text to aid any classification task. Examples of aboutness pages include author-intended descriptions of the work, such as those found in introductions, "about" pages, and abstracts.
The quantity of additional "rich text" found by the RichTextFinder can be controlled by setting the minimum and maximum number of pages downloaded and the minimum and maximum number of bytes of text downloaded. For example, we might specify that we want between 10,000 and 15,000 bytes of rich text for a page, but that the text must come from fewer than four pages. The RichTextFinder will attempt to satisfy these constraints by downloading several pages and extracting text from them. When too much text is downloaded from a set of pages, the amount of text taken from each page will be proportional to the similarity of that page to the original page; in other words, it will be proportional to its aboutness, though this will be constrained by the amount of text appearing on each page.
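The proportional allocation described above can be sketched as follows. This is only an illustration and not the RichTextFinder code itself: the function name allocate_bytes, the (similarity, available_bytes) representation of a page, and the decision to redistribute any unused share among the remaining pages are all assumptions made for the example.

```python
def allocate_bytes(pages, max_bytes):
    """Split a byte budget across pages in proportion to their similarity.

    pages     -- list of (similarity, available_bytes) tuples (hypothetical layout)
    max_bytes -- upper limit on the total rich text to keep
    Returns a list of byte quotas, one per page, in the same order.
    """
    quotas = [0] * len(pages)
    # Only pages with some similarity and some text can contribute.
    active = [i for i, (sim, avail) in enumerate(pages) if sim > 0 and avail > 0]
    # Never try to hand out more bytes than the pages actually contain.
    remaining = min(max_bytes, sum(pages[i][1] for i in active))

    while remaining > 0 and active:
        total_sim = sum(pages[i][0] for i in active)
        # Pages whose proportional share meets or exceeds the text they hold.
        capped = [i for i in active
                  if remaining * pages[i][0] / total_sim >= pages[i][1]]
        if not capped:
            # Every page can supply its full share: assign and finish.
            for i in active:
                quotas[i] = int(remaining * pages[i][0] / total_sim)
            break
        # Take everything from the capped pages (assumption: their unused
        # share is redistributed among the remaining pages).
        for i in capped:
            quotas[i] = pages[i][1]
            remaining -= pages[i][1]
        active = [i for i in active if i not in capped]
    return quotas


# Three candidate pages; the second is similar but holds very little text.
candidates = [(0.5, 20000), (0.25, 2000), (0.25, 20000)]
quotas = allocate_bytes(candidates, 15000)
```

In this example the short second page contributes all 2,000 of its bytes, and the rest of the budget is divided between the other two pages in a 2:1 ratio, matching their similarity scores.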
Currently, we do not use the RichTextFinder in metadata assignment by default, though it is easy to enable it in the MetadataAssigner.conf file.
The main reason we do not enable it is that tests with the metadata evaluation tool show that it does not noticeably improve metadata assignment accuracy. However, we have not extensively tested the RichTextFinder to determine the best set of download parameters: our initial tests specified (from memory) between 5,000 and 15,000 bytes of text from up to three pages, and it may be possible, by experimenting with other parameters, to find a combination that improves performance. This work has not been carried out due to time constraints, though tools have been built for trialling different settings, particularly the iVia Metadata Assignment Evaluation Program and the Metadata Assignment Test page in the Adders Web Site.