iVia Notes on Internet Object Boundary Detection: Papers, Workshops, Examples.

Challenge: We need to identify author-intended, logical, whole Internet objects prior to classification/metadata generation. Otherwise, you classify fractions of a whole, or multiple Internet objects, as if they were a whole, which leads to inaccuracies. A problem is that the author-intended whole can simultaneously be the whole site, a whole complex document (a tech report + datasets + database on different sites) and/or each whole individual report, database and dataset making up the site or complex document. We are concerned with all these levels of granularity. This is why the concept of paths or cuts, in addition to concepts of traditional information objects, is interesting. The good news is that most Internet objects, including complex hypertexts and megasites, usually have an initial, author-intended entry point that is often detected through linkage analysis (McCurley, Untangling).


A related challenge is that within an author-intended whole info object, there are conventional parts (structures within documents or sites, or micro information units) that are concerned with "aboutness". They usually identify themselves (e.g., "about", "abstract", "introduction", "summary") and offer very rich, author-intended descriptive text. They can be the entry point but often are not. Identifying, extracting and classifying via the very rich aboutness text found in these areas is crucial, and is what such notions as micro-hubs and micro information units within info objects/docs concern.

Untangling Compound Documents on the Web
Kevin McCurley IBM.
http://www.almaden.ibm.com/cs/people/mccurley/pdfs/pdf.pdf
* Asked McCurley (4/28) if he still had his test data and whether he could send the complex docs to us
(hasn't replied as of 5/4).
* Types of challenges (complex doc types) enumerated in 2nd paper
* Key is finding the author-intended entry point of a doc or resource
* Crawling via linkage analysis usually does this; this is the good news.
* Flaw of study: wasn't it an intranet?
"Most text analysis is designed to deal with the concept of a 'document', namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of 'document' and 'web node' are not synonymous, and that authors often tend to deploy documents as collections of URLs, which we call 'compound documents'. In this paper we present new techniques for identifying and working with such compound documents, and the results of some large scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than individual hypertext nodes...."
As mentioned above, though, authors usually designate a primary entry point, regardless of type of information object, that is often found through linkage analysis.
Kevin McCurley IBM
mccurley@us.ibm.com
http://www.almaden.ibm.com/cs/people/mccurley/
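
A rough sketch of the "entry point via linkage analysis" idea noted above. This is not McCurley's algorithm; it is a simple heuristic under stated assumptions: pages sharing a host and directory are treated as one candidate compound document, and the page with the most in-links from outside that group is guessed to be the entry point. The function names and the example.* URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return (host, directory path) for a URL; a crude grouping key."""
    parts = urlsplit(url)
    return parts.netloc, posixpath.dirname(parts.path)


def guess_entry_points(links):
    """links: iterable of (source_url, target_url) pairs.

    Groups target pages by host and directory (an assumption), then picks,
    for each group, the page with the most in-links from outside the group.
    """
    external_inlinks = defaultdict(lambda: defaultdict(int))
    for source, target in links:
        group = directory_of(target)
        # Register every target, even if it has no external in-links.
        external_inlinks[group][target] += 0
        if directory_of(source) != group:
            external_inlinks[group][target] += 1
    # sorted() makes tie-breaking deterministic (first URL alphabetically).
    return {
        group: max(sorted(counts), key=counts.get)
        for group, counts in external_inlinks.items()
    }


# Hypothetical link data for illustration only.
links = [
    ("http://a.example/list.html", "http://docs.example/report/index.html"),
    ("http://b.example/links.html", "http://docs.example/report/index.html"),
    ("http://docs.example/report/index.html", "http://docs.example/report/ch1.html"),
    ("http://docs.example/report/ch1.html", "http://docs.example/report/ch2.html"),
    ("http://c.example/x.html", "http://docs.example/report/ch2.html"),
]
entry_points = guess_entry_points(links)
```

Grouping by directory will miss compound documents that span sites, so this only covers the single-site case.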

The W3C Workshop on Web Applications and Compound Documents. 1st and 2nd June 2004.
http://www.w3.org/2004/04/webapps-cdf-ws/

Finding Context Paths for Web Pages
Keishi Tajima
Proc. of ACM Hypertext, Darmstadt, Germany, Feb. 1999, pp 13-22
http://www.jaist.ac.jp/~tajima/papers/ht99www.pdf
"The contents of Web pages are often not self-contained. A page author often assumes all the readers of the page come through the same path, and he sometimes omits the information described in the pages on that path because the readers must already know it. Therefore, indexes used by search engines based on the contents of each page are also incomplete. In this paper, we propose a method of discovering those paths assumed by page authors, and of complementing the incomplete indexes with keywords extracted from the pages on those paths."
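
A minimal sketch of complementing a page's index terms with keywords from the pages leading to it. The abstract does not say how the paper discovers author-assumed paths; as a stand-in, this assumes the shortest link path from a known entry point. The graph, keyword sets and function names are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the pages from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = previous[page]
            return path[::-1]
        for neighbor in graph.get(page, ()):
            if neighbor not in previous:
                previous[neighbor] = page
                queue.append(neighbor)
    return None


def complemented_keywords(graph, keywords, entry, page):
    """Union of the page's own keywords with those of pages on the path to it.

    Falls back to the page's own keywords if no path from entry exists.
    """
    path = shortest_path(graph, entry, page) or [page]
    combined = set()
    for node in path:
        combined |= keywords.get(node, set())
    return combined


# Hypothetical three-page site: home -> courses -> cs101.
graph = {"home": ["courses"], "courses": ["cs101"], "cs101": []}
keywords = {
    "home": {"university"},
    "courses": {"computer", "science"},
    "cs101": {"syllabus"},
}
terms = complemented_keywords(graph, keywords, "home", "cs101")
```

Here the syllabus page, which never mentions the department itself, also gets indexed under the terms of the pages above it.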

Discovery and Retrieval of Logical Information Units in Web
K. Tajima, K. Hatano, T. Matsukura, R. Sano, K. Tanaka,
Invited, Proc. of Workshop on Organizing Web Space (in conjunction with ACM Conference on Digital Libraries '99), pp. 13-23, 1999
http://www.jaist.ac.jp/~tajima/papers/wows99www.pdf
In ordinary search engines for Web pages, the data unit for query processing is individual pages. Indexes are produced for each page in accordance with the words appearing in it. In actual Web data, however, a logical document discussing one topic is often organized into a set of pages connected via links provided by the page author as "standard navigation routes". In such a situation, conjunctive queries with multiple keywords may fail to retrieve an appropriate document if those keywords appear in different pages within that document. Therefore, a data unit for Web data retrieval should not be a page but should be a connected subgraph corresponding to one logical document. In this paper, we develop new techniques for discovering and retrieving the logical information units in Web data. As in some previous researches, we adopt minimal subgraph semantics for conjunctive queries. In our approach, when given a conjunctive query, we try to approximate information units including all the given keywords in the following three steps: (1) we distinguish standard route links from the others, (2) we find minimal subgraphs connected via those links and including all the keywords, and (3) we compute the score of each subgraph based on the locality of the keywords within it in order to examine whether it is really a logical information unit relevant to the query.
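
A simplified sketch of steps (2) and (3) above, assuming step (1) has already produced the set of standard route links. It is not the paper's algorithm: a connected set of pages is grown breadth-first from each candidate root until all query keywords are covered, and the number of pages is used as a crude locality score. The result is connected but not guaranteed minimal. All names and data are hypothetical.

```python
from collections import deque


def covering_unit(route_links, page_words, root, query):
    """Grow a connected list of pages from root (breadth-first, along route
    links only) until every query keyword has been seen. Returns the list of
    pages, or None if the keywords cannot all be reached from root."""
    missing = set(query)
    unit = []
    seen = {root}
    queue = deque([root])
    while queue and missing:
        page = queue.popleft()
        unit.append(page)
        missing -= page_words.get(page, set())
        for nxt in route_links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return unit if not missing else None


def best_units(route_links, page_words, query):
    """Try every page as a root; rank candidate units by size (fewer pages
    is taken to mean the keywords sit closer together)."""
    candidates = []
    for root in page_words:
        unit = covering_unit(route_links, page_words, root, query)
        if unit is not None:
            candidates.append((len(unit), unit))
    return [unit for _, unit in sorted(candidates)]


# Hypothetical logical document: a table of contents with two chapters.
route_links = {"toc": ["ch1", "ch2"], "ch1": [], "ch2": []}
page_words = {"toc": {"hypertext"}, "ch1": {"retrieval"}, "ch2": {"links"}}
units = best_units(route_links, page_words, {"hypertext", "retrieval"})
```

No single page contains both keywords, so a page-level conjunctive query would return nothing, while the unit made of the table of contents plus the first chapter matches.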

A Characterization of Compound Documents on the Web
Eyal de Lara, et al.
http://www.cs.toronto.edu/~delara/papers/compdoc.pdf
Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites.

Browsing intricately interconnected paths
Dave, Pratik, et al.
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
http://www.ht03.org/papers/pdfs/13.pdf
Graph-centric and node-centric browsing are the two commonly identified hypertext-browsing paradigms. We believe that path-centric browsing, the browsing behavior exhibited by path interfaces, is an independent browsing paradigm that combines useful aspects of the two commonly supported cases.

Web Search Based on Micro Information Units
Xiaoli Li, et al.
http://www2002.org/CDROM/poster/78.pdf
"A Web page is often populated with a number of small information units, which we call micro information units (MIU). Each unit focuses on a specific topic and occupies a specific area of the page."
Internet search is one of the most important applications of the Web. One shortcoming of existing search techniques is that they do not give due consideration to the micro-structures of a Web page. A Web page is often populated with a number of small information units, which we call micro information units (MIU). Each unit focuses on a specific topic and occupies a specific area of the page. During the search, if all the keywords in the user query occur in a single MIU of a page, the top ranking results returned by a search engine are generally relevant and useful. However, if the query words scatter at different MIUs in a page, the pages returned can be quite irrelevant. The reason for this is that although a page has information on individual MIUs, it may not have information on their intersections. In this paper, we propose a technique to solve this problem. At the off-line preprocessing stage, we segment each page to identify the MIUs in the page, and index the keywords of the page according to the MIUs in which they occur. In searching, our retrieval and ranking algorithm utilizes this additional information to return those most relevant pages. Experimental results show that this method is able to dramatically improve the search precision.
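
A small sketch of the indexing idea in the abstract above: keywords are indexed by the MIU in which they occur, and pages where all query words fall in a single MIU are ranked ahead of pages where the words are scattered. Page segmentation is assumed to be done already (each page arrives as a list of MIU texts). This is not the paper's segmentation or ranking algorithm; the names and sample pages are hypothetical.

```python
from collections import defaultdict


def build_miu_index(pages):
    """pages: {page_id: [text of MIU 0, text of MIU 1, ...]}.
    Returns {word: set of (page_id, miu_number)}."""
    index = defaultdict(set)
    for page_id, mius in pages.items():
        for number, text in enumerate(mius):
            for word in text.lower().split():
                index[word].add((page_id, number))
    return index


def search(index, query):
    """Return (pages with all words in one MIU, pages with words scattered)."""
    words = [w.lower() for w in query.split()]
    postings = [index.get(w, set()) for w in words]
    if not postings:
        return [], []
    # Same (page, MIU) pair must appear in every posting list.
    same_miu = {page for page, _ in set.intersection(*postings)}
    # Page-level match ignores which MIU each word came from.
    page_sets = [{page for page, _ in posting} for posting in postings]
    any_miu = set.intersection(*page_sets)
    return sorted(same_miu), sorted(any_miu - same_miu)


# Hypothetical pages, each already segmented into two MIUs.
pages = {
    "p1": ["cheap flights to paris", "hotel deals in rome"],
    "p2": ["cheap hotel deals", "flights to rome"],
}
index = build_miu_index(pages)
focused, scattered = search(index, "cheap hotel")
```

Both pages contain both words, but only p2 has them in the same unit, which is the case the abstract says tends to be relevant.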

Retrieving and Organizing Web Pages by "Information Unit".
W.-S. Li, K. S. Candan, Q. Vu and D. Agrawal.
WWW10, Hong Kong, 2001.
http://www10.org/cdrom/papers/466/
Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.


A. Heterogeneous Metasite Challenges:

Large collections and collections of collections of large collections.
Do you want to index the site as a whole; the database component, ebooks component, or links collection component; or all components of the site?

For tons of examples of this type:
Go to: http://infomine.ucr.edu/cgi-bin/search
Turn off all fields except keyword
Search for: digital libraries or virtual libraries


Online Books

http://digital.library.upenn.edu/books/
challenge: create record for whole site + records for several thousand constituent books.
E.g.,
index entry: http://onlinebooks.library.upenn.edu/webbin/book/authorstart?M
entry: Maas, Julie, illust.: Arguments With the Thought Police, by John Bart Gerald (HTML at nightslantern.ca)
link: http://www.nightslantern.ca/book/entry.htm
Whole site/collection > books

Others of this type include:

Electronic Text Center

http://etext.lib.virginia.edu/
Whole site/meta-collection > collections by language > English language collections > English online resources > Subject: Native Americans > Abbott, Jacob: Aboriginal America > divided into chapters > whole book
Challenge: record for the whole site; records for individual collections in several languages; records for individual works in each collection while eliminating several intervening pointer/index pages w/o



B. Compound Document Challenges

Internet objects that consist of multiple parts/objects that are intended by the author to be used as a logical whole but which are not necessarily organized in a single site and/or are not organized hierarchically and/or are heterogeneous. E.g., a gov report that consists of a report, spreadsheet, data sets and link list in separate places.


Compound Documents - Hopping Domains:

The Works of William Shakespeare

at the Electronic Text Center, University of Virginia Library
http://etext.virginia.edu/shakespeare/works/
Measure for Measure
Act I, Scene II

The Middle English Collection

http://etext.lib.virginia.edu/mideng.browse.html
The Harley Lyrics
Lyric 1

Compound Documents - Resource Fractionation (this a word?):


The Yî King [Book of Changes]

http://www.sacred-texts.com/ich/ictp.htm
Table of Contents
http://www.sacred-texts.com/ich/ictoc.htm
book chapters/sections (e.g.):
Hsu http://www.sacred-texts.com/ich/ic05.htm#page_67
Sung http://www.sacred-texts.com/ich/ic06.htm#page_69
challenge: create record for the whole work and do not drill into chapters
Whole book > avoid separate chapters
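
A sketch of one way to keep a single record for the whole work rather than drilling into chapters. It assumes the entry point of each work is already known and that all parts of a work live in the same host and directory (true of the URLs listed above, but not of works that hop domains). Fragments such as #page_67 are dropped. The function names and the example.org URL are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit
import posixpath


def work_key(url):
    """Host plus directory, with any #fragment dropped."""
    bare, _fragment = urldefrag(url)
    parts = urlsplit(bare)
    return parts.netloc, posixpath.dirname(parts.path)


def collapse_to_works(entry_points, crawled_urls):
    """Map each known entry point to the crawled URLs that fall under it.
    URLs outside every known work are returned separately."""
    by_key = {work_key(entry): entry for entry in entry_points}
    works = {entry: [] for entry in entry_points}
    unassigned = []
    for url in crawled_urls:
        entry = by_key.get(work_key(url))
        if entry is None:
            unassigned.append(url)
        else:
            works[entry].append(url)
    return works, unassigned


entry = "http://www.sacred-texts.com/ich/ictp.htm"
crawled = [
    "http://www.sacred-texts.com/ich/ictoc.htm",
    "http://www.sacred-texts.com/ich/ic05.htm#page_67",
    "http://www.sacred-texts.com/ich/ic06.htm#page_69",
    "http://www.example.org/other.htm",  # hypothetical unrelated page
]
works, unassigned = collapse_to_works([entry], crawled)
```

The table of contents and both chapter pages collapse under the one entry point, so only the title page would get a record.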

Abbott, Jacob. Aboriginal America
Electronic Text Center, University of Virginia Library
accessible as chapters http://etext.lib.virginia.edu/toc/modeng/public/AbbAbor.html
or entire book:
entire book.