iVia Notes on Internet Object Boundary Detection: Papers, Workshops, Examples.
Challenge: We need to identify author-intended, logical, whole Internet objects prior to
classification/metadata generation. Otherwise, we classify fractions of a whole, or multiple
Internet objects as if they were one whole, which leads to inaccuracies. A problem is that the
author-intended whole can simultaneously be the whole site, a whole complex document (a tech
report + datasets + database on different sites) and/or each individual report, database and
dataset making up the site or complex document. We are concerned with all these levels of
granularity. This is why the concept of paths or cuts, in addition to concepts of traditional
information objects, is interesting. The good news is that most Internet objects, including
complex hypertexts and megasites, usually have an initial, author-intended entry point that is
often detected through linkage analysis (McCurley, Untangling).
A related challenge is that within an author-intended whole info object, there are conventional
parts (structures within documents or sites, or micro information units) that are concerned with
"aboutness". They usually identify themselves (e.g., "about", "abstract", "introduction",
"summary") and offer very rich, author-intended descriptive text. They can be the entry point
but often are not. Identifying these areas, then extracting and classifying via the very rich
aboutness text found in them, is crucial and is what such notions as micro-hubs and
micro information units within info objects/docs concern.
Untangling Compound Documents on the Web
Kevin McCurley IBM.
http://www.almaden.ibm.com/cs/people/mccurley/pdfs/pdf.pdf
* Asked McCurley (4/28) if he still had his test data and whether he could send the complex docs
to us (hasn't replied as of 5/4).
* Types of challenges (complex doc types) enumerated in 2nd paper
* Key is finding the author intended entry point of a doc or resource
* Crawling via linkage analysis usually does this; this is the good news.
* Flaw of study: wasn't it an intranet?
"Most text analysis is designed to deal with the concept of a "document", namely a cohesive
presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web
tend to have a much smaller granularity than text documents. We claim that the notions of
"document" and "web node" are not synonymous, and that authors often tend to deploy documents as
collections of URLs, which we call "compound documents". In this paper we present new
techniques for identifying and working with such compound documents, and the results of some
large scale studies on such web documents. The primary motivation for this work stems from the
fact that information retrieval techniques are better suited to working on documents than
individual hypertext nodes...."
As mentioned above, though, authors usually designate a primary entry point, regardless of type of
information object, that is often found through linkage analysis.
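A rough sketch of how an entry point might be guessed from links alone. This is not McCurley's actual method; it is a simplified heuristic assumed here for illustration: group pages by host and directory, then pick the page in each group that receives the most links from outside the group. The names directory_of and guess_entry_points, and the sample URLs, are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return host + directory path, used here as a crude 'same object' grouping key."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(parts.path)


def guess_entry_points(links):
    """links: iterable of (source_url, target_url) pairs.

    Returns {group: url}, where url is the page in that group with the most
    in-links coming from outside the group (ties broken alphabetically).
    """
    external_in = defaultdict(lambda: defaultdict(int))
    for source, target in links:
        group = directory_of(target)
        if directory_of(source) != group:
            external_in[group][target] += 1

    entry_points = {}
    for group, counts in external_in.items():
        entry_points[group] = min(counts, key=lambda url: (-counts[url], url))
    return entry_points


# Hypothetical link data: two outside pages point at index.html, one at ch2.html.
sample_links = [
    ("http://b.example/links.html", "http://a.example/report/index.html"),
    ("http://c.example/x.html", "http://a.example/report/index.html"),
    ("http://c.example/x.html", "http://a.example/report/ch2.html"),
    ("http://a.example/report/index.html", "http://a.example/report/ch1.html"),
    ("http://a.example/report/ch1.html", "http://a.example/report/ch2.html"),
]
print(guess_entry_points(sample_links))
```

Grouping by directory is only one possible boundary guess; compound documents that hop domains (section B) would need a different grouping key.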
Kevin McCurley IBM
mccurley@us.ibm.com
http://www.almaden.ibm.com/cs/people/mccurley/
The W3C Workshop on Web Applications and Compound Documents. 1st and 2nd June 2004.
http://www.w3.org/2004/04/webapps-cdf-ws/
Finding Context Paths for Web Pages
Keishi Tajima
Proc. of ACM Hypertext, Darmstadt,
Germany, Feb. 1999, pp 13-22
http://www.jaist.ac.jp/~tajima/papers/ht99www.pdf
"The contents of Web pages are often not self-contained. A page author often assumes all the
readers of the page come through the same path, and he sometimes omits the information described
in the pages on that path because the readers must already know it. Therefore, indexes used by
search engines based on the contents of each page are also incomplete. In this paper, we propose a
method of discovering those paths assumed by page authors, and of complementing the incomplete
indexes with keywords extracted from the pages on those paths."
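The index-complementing idea can be sketched as follows, assuming the context paths have already been discovered (path discovery itself is the hard part and is not shown). The parent map, the keyword sets, and the function names are hypothetical illustrations, not the paper's data structures.

```python
def context_path(page, parent):
    """Walk from a page back to its assumed entry point.

    parent maps each page to the page its author assumes readers came from.
    Returns the path ordered from entry point down to the page itself.
    """
    path = [page]
    seen = {page}
    while path[-1] in parent:
        previous = parent[path[-1]]
        if previous in seen:  # guard against cycles in the assumed paths
            break
        path.append(previous)
        seen.add(previous)
    path.reverse()
    return path


def augmented_keywords(page, parent, keywords):
    """Union of the page's own keywords and those of every page on its context path."""
    combined = set()
    for node in context_path(page, parent):
        combined |= keywords.get(node, set())
    return combined


# Hypothetical three-page document: index -> toc -> ch1
parent = {"ch1.html": "toc.html", "toc.html": "index.html"}
keywords = {
    "index.html": {"aboriginal", "america"},
    "toc.html": {"contents"},
    "ch1.html": {"climate"},
}
print(sorted(augmented_keywords("ch1.html", parent, keywords)))
```

A real system would probably weight inherited keywords lower than the page's own; this sketch treats them equally.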
Discovery and Retrieval of Logical Information Units in Web
K. Tajima, K. Hatano, T. Matsukura, R. Sano, K. Tanaka,
Invited, Proc. of Workshop of Organizing Web Space (in
conjunction with ACM Conference on Digital Libraries
'99), pp. 13-23, 1999
http://www.jaist.ac.jp/~tajima/papers/wows99www.pdf
In ordinary search engines for Web pages, the data unit for query processing is individual pages.
Indexes are produced for each page in accordance with the words appearing in it. In actual Web
data, however, a logical document discussing one topic is often organized into a set of pages
connected via links provided by the page author as "standard navigation routes." In such a
situation, conjunctive queries with multiple keywords may fail to retrieve an appropriate document
if those keywords appear in different pages within that document. Therefore, a data unit for
Web data retrieval should not be a page but should be a connected subgraph corresponding to one
logical document. In this paper, we develop new techniques for discovering and retrieving the
logical information units in Web data. As in some previous researches, we adopt minimal subgraph
semantics for conjunctive queries. In our approach, when given a conjunctive query, we try to
approximate information units including all the given keywords in the following
three steps: (1) we distinguish standard route links from the others, (2) we find minimal
subgraphs connected via those links and including all the keywords, and (3) we compute the score
of each subgraph based on the locality of the keywords within it in order to examine whether it is
really a logical information unit relevant to the query.
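The three steps can be roughed out as below. This is only an approximation under stated assumptions: step (1) is taken as given (route_links is assumed to hold only the links already judged to be standard routes), step (2) uses breadth-first search from each candidate root rather than an exact minimal-subgraph computation, and step (3) uses page count as a stand-in for the paper's locality score. All names are hypothetical.

```python
from collections import deque


def bfs_parents(start, adjacency):
    """Breadth-first search; returns {page: parent}, inserted in order of distance."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent


def candidate_unit(root, adjacency, page_words, query):
    """Connect root to the nearest page holding each query word; None if a word is unreachable."""
    parent = bfs_parents(root, adjacency)
    unit = {root}
    for word in query:
        # dicts keep insertion order, so the first match is a nearest one
        hit = next((p for p in parent if word in page_words.get(p, ())), None)
        if hit is None:
            return None
        while hit is not None:
            unit.add(hit)
            hit = parent[hit]
    return unit


def best_unit(route_links, page_words, query):
    """Smallest candidate unit over all roots (fewer pages = keywords more local)."""
    adjacency = {}
    for a, b in route_links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    best = None
    for root in sorted(page_words):
        unit = candidate_unit(root, adjacency, page_words, query)
        if unit is not None and (best is None or len(unit) < len(best)):
            best = unit
    return best


# Hypothetical four-page document; no single page holds both query words.
route_links = [("toc", "ch1"), ("toc", "ch2"), ("ch2", "ch2a")]
page_words = {
    "toc": {"hypertext"},
    "ch1": {"retrieval"},
    "ch2": {"index"},
    "ch2a": {"retrieval", "subgraph"},
}
print(best_unit(route_links, page_words, ["index", "subgraph"]))
```

The point of the example is that a page-level conjunctive query for both words finds nothing, while the two-page unit ch2 + ch2a satisfies it.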
A Characterization of Compound Documents on the Web
Eyal de Lara, et al.
http://www.cs.toronto.edu/~delara/papers/compdoc.pdf
Recent developments in office productivity suites make it easier for users to publish rich
compound documents on the Web. Compound documents appear as a single unit of information
but may contain data generated by different applications, such as text, images, and spreadsheets.
Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a
publication medium, we expect that in the near future these compound documents will become an
increasing proportion of the Web's content. As a result, the content handled by servers,
proxies, and browsers may change considerably from what is currently observed. Furthermore, these
compound documents are currently treated as opaque byte streams, but future Web infrastructure may
wish to understand their internal structure to provide higher-quality service. In order to guide
the design of this future Web infrastructure, we characterize compound documents currently found
on the Web. Previous studies of Web content either ignored these document types altogether or did
not consider their internal structure. We study compound documents originated by the three most
popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study
encompasses over 12,500 documents retrieved from 935 different Web sites.
Browsing intricately interconnected paths
Dave, Pratick, et al.
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
http://www.ht03.org/papers/pdfs/13.pdf
Graph-centric and node-centric browsing are the two commonly identified hypertext-browsing
paradigms. We believe that path-centric browsing, the browsing behavior exhibited by path
interfaces, is an independent browsing paradigm that combines useful aspects of the two commonly
supported cases.
Web Search Based on Micro Information Units
Xiaoli Li, et al.
http://www2002.org/CDROM/poster/78.pdf
Internet search is one of the most important applications of the Web. One shortcoming of existing
search techniques is that they do not give due consideration to the micro-structures of a Web
page. A Web page is often populated with a number of small information units, which we call
micro information units (MIU). Each unit focuses on a specific topic and occupies a
specific area of the page. During the search, if all the keywords in the user query occur in a
single MIU of a page, the top ranking results returned by a search engine are generally relevant
and useful. However, if the query words scatter at different MIUs in a page, the pages returned
can be quite irrelevant. The reason for this is that although a page has information on individual
MIUs, it may not have information on their intersections. In this paper, we propose a technique to
solve this problem. At the off-line preprocessing stage, we segment each page to identify the
MIUs in the page, and index the keywords of the page according to the MIUs in which they occur. In
searching, our retrieval and ranking algorithm utilizes this additional information to return
those most relevant pages. Experimental results show that this method is able to dramatically
improve the search precision.
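A minimal sketch of MIU-level indexing, assuming page segmentation has already been done (each page arrives as a list of MIU text strings; the segmentation step is not shown). The function names, the two-tier ranking, and the sample pages are hypothetical simplifications, not the paper's algorithm.

```python
from collections import defaultdict


def build_miu_index(pages):
    """pages: {url: [miu_text, ...]}. Returns {word: {(url, miu_number), ...}}."""
    index = defaultdict(set)
    for url, units in pages.items():
        for number, text in enumerate(units):
            for word in text.lower().split():
                index[word].add((url, number))
    return index


def search(index, query):
    """Return (focused, scattered) page lists for a conjunctive keyword query.

    focused:   pages where all query words co-occur in one MIU
    scattered: pages that contain all words, but only across different MIUs
    """
    words = [w.lower() for w in query.split()]
    if not words:
        return [], []
    postings = [index.get(w, set()) for w in words]
    same_unit = set.intersection(*postings)
    focused = sorted({url for url, _ in same_unit})
    pages_per_word = [{url for url, _ in posting} for posting in postings]
    scattered = sorted(set.intersection(*pages_per_word) - set(focused))
    return focused, scattered


# Hypothetical pages, each already split into two MIUs.
pages = {
    "a.html": ["digital library software", "weather forecast today"],
    "b.html": ["digital cameras", "library hours"],
}
index = build_miu_index(pages)
print(search(index, "digital library"))
```

Both pages match "digital library" at page level, but only a.html has the two words in the same unit, so it would be ranked first.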
Retrieving and Organizing Web Pages by "Information Unit".
W.-S. Li, K. S. Candan, Q. Vu and D. Agrawal.
WWW10, Hong Kong, 2001.
http://www10.org/cdrom/papers/466/
Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors
tend to create documents that are composed of multiple pages connected with hyperlinks or
frames. A Web document may be authored in multiple ways, such as (1) all information in one
physical page, or (2) a main page and the related information in separate linked pages. Existing
Web search engines, however, return only physical pages. In this paper, we introduce and
describe the use of the concept of information unit, which can be viewed as a
logical Web document consisting of multiple physical pages as one atomic retrieval
unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform
progressive query processing over a Web index by considering both document semantic similarity and
link structures. Experimental results on synthetic graphs and real Web data show the
effectiveness and usefulness of the proposed information unit retrieval technique.
A. Heterogeneous Metasite Challenges:
Large collections and collections of collections of large collections.
Do you want to index the site as a whole; the database component, ebooks component, or links
collection component; or all components of the site?
For tons of examples of this type:
Go to: http://infomine.ucr.edu/cgi-bin/search
Turn off all fields except keyword
Search for: digital libraries or virtual libraries
Online Books
http://digital.library.upenn.edu/books/
challenge: create record for whole site + records for several thousand constituent books.
E.g.,
index entry: http://onlinebooks.library.upenn.edu/webbin/book/authorstart?M
entry: Maas, Julie, illust.: Arguments With the Thought Police, by John Bart Gerald (HTML at
nightslantern.ca)
link: http://www.nightslantern.ca/book/entry.htm
Whole site/collection > books
Others of this type include:
Electronic Text Center
http://etext.lib.virginia.edu/
Whole site/meta-collection > collections by language > English language collections > English
online resources > Subject: Native Americans > Abbott, Jacob: Aboriginal America > divided into
chapters > whole book
Challenge: record for the whole site; records for individual collections in several languages;
records for individual works in each collection while eliminating several intervening
pointer/index pages w/o
B. Compound Document Challenges
Internet objects that consist of multiple parts/objects that are intended by the author to be used
as a logical whole but which are not organized necessarily in a single site and/or are not
organized hierarchically and/or are heterogeneous. E.g., a government report that comprises a
report, spreadsheet, data sets, and link list in separate places.
Compound Documents - Hopping Domains:
The Works of William Shakespeare
at the Electronic Text Center, University of Virginia Library
http://etext.virginia.edu/shakespeare/works/
Measure for measure
Act I, Scene II
The Middle English Collection
http://etext.lib.virginia.edu/mideng.browse.html
The Harley Lyrics
Lyric 1
Compound Documents - Resource Fractionation (this a word?):
The Yî King [Book of Changes]
http://www.sacred-texts.com/ich/ictp.htm
Table of Contents
http://www.sacred-texts.com/ich/ictoc.htm
book chapters/sections (e.g.):
Hsu http://www.sacred-texts.com/ich/ic05.htm#page_67
Sung http://www.sacred-texts.com/ich/ic06.htm#page_69
challenge: create record for the whole work and do not drill into chapters
Whole book > avoid separate chapters
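One way the "whole book, not chapters" rule might be expressed, as a sketch only. It assumes the entry point and table of contents are already known and that the chapter links have been harvested from the table of contents; the record layout and function names are hypothetical.

```python
from urllib.parse import urldefrag


def whole_work_record(entry_url, toc_url, chapter_links):
    """Build one record for the whole work; chapter files are listed as parts, not indexed separately."""
    # Drop #fragment anchors so each chapter file is listed once.
    parts = sorted({urldefrag(link).url for link in chapter_links})
    return {"url": entry_url, "toc": toc_url, "parts": parts}


def urls_to_skip(record):
    """URLs the crawler should not create separate records for."""
    return {record["toc"], *record["parts"]}


record = whole_work_record(
    "http://www.sacred-texts.com/ich/ictp.htm",
    "http://www.sacred-texts.com/ich/ictoc.htm",
    [
        "http://www.sacred-texts.com/ich/ic05.htm#page_67",
        "http://www.sacred-texts.com/ich/ic06.htm#page_69",
    ],
)
print(record["parts"])
print(sorted(urls_to_skip(record)))
```

The open question from the notes remains: deciding automatically that a set of pages is a fractionated single work, rather than a collection of separate works, is not addressed here.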
Abbott, Jacob. Aboriginal America
Electronic Text Center, University of Virginia Library
accessible as chapters http://etext.lib.virginia.edu/toc/modeng/public/AbbAbor.html
or as the entire book.