Partial List of iVia Features, Functions, Uses, Technologies
September 2004
Contents:
- Introduction and Background
- List of Basic Features, Functions, and Uses of iVia
- General and Basic
- iVia User Features
- Content Development and Management -- Features, Tools and Machine
Assistance for Institutional Collaborators and Expert Content Builders:
- Support for Institutional Collaborators
- Support for Expert Content Builders
- Targeted Metadata Generation for Existent Collections (component of NSDL
iVia)
- Major Technologies List
- Technologies for Internet Resource Discovery/Identification
- Mostly Automated Crawling
- Semi-automated Mode -- Expert Interactions with Web Crawlers
- Technologies for Metadata Generation and Rich Text Identification/Extraction
Systems - Author Supplied and Automatically Created Metadata
- Rich Text and Key Phrase Identification and Extraction
- Classification and Classifiers
- Semi-automated Mode -- Expert Interactions in Classification
I.) Introduction and Background:
iVia is a powerful and flexible open source Internet finding tool and collection building
system. It is used to build collections of metadata and full-text data representing resources
on the Internet. It can also generate metadata for existent collections composed of resources
both on and off the open Web. The metadata generated includes library-standard subject schema.
iVia supports single or multiple subject focuses. It supports both single-institution and
multi-institution efforts and is intended to aid multi-institutional collaborations. User
retrieval options are numerous for both fielded and full-text data and support both beginning
and advanced searchers. iVia supports custom branding, interfaces and data views for those
accessing its collections. Numerous modes of content building are possible, featuring varying
levels of editorial review, styles of indexing and divisions of labor. iVia is noteworthy
because it saves resources and labor by integrating fully automated, semi-automated and fully
manual modes of record building. Resource discovery on the open Web through various iVia Web
crawlers, together with metadata generation through iVia classifiers (and other means), yields
collections that require fewer resources and less expert labor to reach significant size. iVia
is intended as a systems platform for collection building that emphasizes and empowers the
expert through the use of machine learning technology. It enables expert-built collections to
scale better and to meet user expectations for Internet finding tool content.
In summary, iVia is an innovative systems development effort built on new machine-learning
technologies and emphasizing automated and semi-automated Internet resource discovery and
metadata generation. These technologies can be characterized as:
- Enabling Technologies which:
- Facilitate cooperative service, finding tool and collection building
- Support multiple modes of collaboration, collection building, access
- Emphasize machine-assistance in collection development (both in resource
discovery and metadata generation)
- Participatory Technologies that:
- Define and create synergies emphasizing points of expert/machine interaction
to augment and amplify the performance of both
- Communityware Technologies Supportive of Academic Community Expertise, Values and
Effort such as:
- Excellence, objectivity and service orientation
- In-depth knowledge of subject domain
- In-depth knowledge of researcher needs for:
- Powerful interfaces and sophisticated access
- Rich, consistent and well-organized metadata
- Machine assistance to enhance expert performance and scalability
II.) List of Basic Features, Functions, and Uses of iVia
- General and Basic:
- iVia End Uses - Brief:
- National level finding tool and collection - Metadata management and
creation: INFOMINE and NSDL iVia
- Collection development for other collections
- Subject guides/Pathfinders for libraries
- Integrating Web content with courses
- Major iVia Applications:
- Major Cooperators:
- NSDL
- Library of Congress
- Wake Forest University Library
- UCLA Library
- California State University, Sacramento, Library
- Content Managed:
- Metadata and representative rich, full-text for free or fee-based
Internet "objects" such as Web sites, books, databases, eprint archives and others
- Hardware:
- Public search interface server hosting the end-user and content-builder
(incl. expert guided crawler) interfaces.
- Public search interface backup server.
- Database server (both the metadata and full-text databases are
here).
- Database server backup.
- Crawler/classifier processes server (e.g., vlcrawler, Nalanda
iVia).
- OAI-PMH import/export server.
- Additional mass storage equipment: 2 terabytes of storage
including a RAID array (1 terabyte of storage) accessible via NFS (networked storage)
- Software:
- Open source code base: Size - 10 MB / 230k lines
- Debian Linux operating system
- Apache Web server software
- Most code in C++
- Some interface code in Java, PHP
- MySQL database (most database management functions)
- Berkeley DB database (for fast indexing functions)
- Standards:
- Dublin Core field structure
- Library of Congress Subject Headings (LCSH)
- Library of Congress Classifications (LCC)
- OAI-PMH data transfer
- MARC translation to Dublin Core
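The MARC translation to Dublin Core listed above can be pictured as a table-driven crosswalk. The following is a minimal sketch, not iVia's actual converter: the record layout (a list of tag/subfield pairs), the name marc_to_dc and the choice of five mappings are illustrative assumptions. The tag-to-element pairs follow the commonly used MARC to Dublin Core correspondences (245 title, 100 creator, 650 subject, 520 description, 856 $u identifier).

```python
# Hypothetical crosswalk table: (MARC tag, subfield code) -> Dublin Core element.
MARC_TO_DC = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("650", "a"): "subject",
    ("520", "a"): "description",
    ("856", "u"): "identifier",
}


def marc_to_dc(fields):
    """Translate a simplified MARC record into Dublin Core elements.

    fields is a list of (tag, {subfield_code: value}) pairs. This is an
    assumed in-memory form for illustration, not a real MARC parser.
    """
    dc = {}
    for tag, subfields in fields:
        for code, value in subfields.items():
            element = MARC_TO_DC.get((tag, code))
            if element is None:
                continue  # subfield has no mapping in this small table
            if element == "title":
                # MARC titles often carry trailing ISBD punctuation.
                value = value.rstrip(" /:")
            # Dublin Core elements are repeatable, so collect lists.
            dc.setdefault(element, []).append(value)
    return dc


record = [
    ("100", {"a": "Smith, Jane"}),
    ("245", {"a": "Soil science guide /", "c": "Jane Smith."}),
    ("650", {"a": "Soils"}),
    ("650", {"a": "Agriculture"}),
    ("856", {"u": "http://example.edu/soil/"}),
]
print(marc_to_dc(record))
```

Keeping the mapping in a table rather than in code makes it easy to extend toward the 40+ other fields mentioned in this list.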
- Major Fields Supported:
- URL
- Local URL (if different from above)
- Creator
- Subject - LCSH
- Subject - LCC
- Keywords
- Description
- Selected full-text (1-3 pages of rich text)
- MyI
- Other fields (40+)
- iVia User Features:
- Multiple collections or categories supported
- Search Performance:
- Fast (MySQL with Berkeley DB indexing functions)
- Scales to millions of records and associated full-text (1-3 pages per
record)
- Search Features:
- Rich, full-text and/or fielded metadata searchable
- Boolean and proximity operators supported
- Search modification on results page
- Spell check and spelling suggestions for queries
- Limit searching:
- Any standard field (e.g., title, subject, keywords, etc.)
- Origin -- expert or robot created
- General subject category (e.g., Gov Info or BioAgMed)
- Resource type
- Resource access (fee, free, in-between)
- Audience supported (among expert created records) (e.g.,
academic, K-12, lifelong learner)
- Browse Indexes Supported:
- All general categories or combinations of general categories (e.g.,
Gov. Info. and/or Bio./Ag./Med)
- Subject indexes (e.g., LCSH or LCC)
- Keywords
- Specific research disciplines hierarchy (under development; e.g.,
Medicine >Cardiology...)
- Other indexes (e.g., Megatopics, a key-word-in-context index)
- Results display: Titles only, Regular (ti, su, creator, description,
expert/robot created), All
- Results ranking: relevance or title
- Linked indexing to expand or narrow searches
- Multiple languages support (under development, both for content and
interface)
- User Feedback/Awareness Tools:
- Email alert service for new resources added to the collection
- New resource suggestions
- Comment on this resource
- Content Development and Management -- Features, Tools and Machine Assistance for
Institutional Collaborators and Expert Content Builders:
- Support for Institutional Collaborators
- Custom Data Views and Access Supported:
- Your front-end supported:
- Users access through your front-end
- Point of access auto-detected and interface profiles
and data views you have created are activated
- Theming: modular interface building functions for both
users and content builders
- MyI field for customizing user access and data views
- Choice of parallel fields for display (e.g., different
lengths/styles in annotation)
- Meta-searching access to external collections (as a user choice
when setting up a search, or as an option when a search returns zero results)
- Meta-searching by other collections: CDL Searchlight and Ex
Libris access INFOMINE
- Multiple Modes of Content Building Supported:
- Multiple means of ramping up and contributing
- Multiple content building styles supported
- Multiple levels of editorial review
- Multiple levels of allowable content builder activity
controlled through password system
- Use of parallel and custom fields
- Pending record database
- Record history tracking (creation and edits)
- Hybrid Collections of Heterogeneous Metadata - Support for Multiple
Incoming Data Streams and Types of Records:
- Different streams of metadata:
- Expert created within INFOMINE: UCR and
collaborators (e.g., Wake Forest, UCLA);
- Expert created and imported to INFOMINE: U.C.
Shared Cataloging Project (MARC import); LexisNexis; OAI-PMH (NSDL and other
collections).
- Automatically created within INFOMINE: robot
selected and created
- Semi-automatically created within INFOMINE: robot
selected and with expert augmentation
- Heterogeneous metadata cohabitation
- Title only (LexisNexis data)
- Metadata only: both Dublin Core (DC) and MARC
translated to DC for all records
- Representative rich, full-text plus fielded metadata
for most records (both robot and expert created)
- Record merging/overlays supported
- Duplicate check for automatically generated/imported
records.
- Support for Expert Content Builders:
- Machine-assistance for Automated and Semi-automated Resource
Discovery:
- Automatically built, crawler created collection of links,
metadata, and representative rich full-text for useful resources
- Monitor/adjust thresholds for resource acceptance
weightings in fully automated crawler
- Monitor list of potential duplicates found through fuzzy
matches
- Semi-automated expert guided crawler with drill down/drill
out.
- Semi-automated focused crawling: subject and project
focused crawlers.
- Crawler suggestions (highest weighted resources) flagged
for expert review and refinement
- System/database usage statistics used as suggestions
(i.e., the most user-linked or accessed robot records flagged) to experts
- Machine-assistance for Automated Metadata Generation or Record
Import/Export:
- Crawler/classifier built metadata collection
- Semi-automated "foundation record" (= basic "ore" or
metadata) created for expert augmentation; it can be created
through fully automated or expert guided processes
- Batch Import/Export:
- OAI-PMH: for both importing and exporting records
(primarily from NSDL)
- MARC records (UC Shared Cataloging Project)
- LexisNexis records
- Machine Augmentation of Expert Created Records:
- Rich-text harvested to augment human created metadata
- Generation of specific metadata to augment/round out existent
metadata records for targeted collections
- Specific Machine-assistance to Experts in Record Building:
- Duplicate check (exact and fuzzy URL and title matching)
- Cloning records
- Multiple record editing (batch editing)
- Pull down menus of various controlled vocabularies (e.g.,
resource types, keywords)
- URL canonicalization
- Missing data alert for content building form
- URL checking and new URL identification (looks for broken
URLs and suggests possible new ones)
- User corrections/suggestions/new content forms
- Online and point-of-need guidance via manuals and style
guides
- Assistance in collection development for other
collections
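Two of the record-building aids above, URL normalization and the duplicate check with exact and fuzzy matching, lend themselves to a short illustration. This is a sketch under stated assumptions, not iVia's implementation: the function names, the record layout (dicts with "url" and "title" keys), the normalization rules and the 0.9 similarity threshold are all hypothetical. Fuzzy matching here uses the Python standard library's difflib; the source does not say which measure iVia uses.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped: it does not identify a different resource.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles, ignoring case and spacing."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def find_duplicates(candidate, records, threshold=0.9):
    """Return (kind, record) pairs, kind being 'exact' or 'fuzzy'."""
    hits = []
    cand_url = canonicalize_url(candidate["url"])
    for rec in records:
        if canonicalize_url(rec["url"]) == cand_url:
            hits.append(("exact", rec))
        elif title_similarity(candidate["title"], rec["title"]) >= threshold:
            hits.append(("fuzzy", rec))
    return hits
```

In a workflow like the one listed, exact hits could block record creation outright, while fuzzy hits would go onto the list of potential duplicates for an expert to review.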
- Targeted Metadata Generator for Existent Collections (component of NSDL iVia):
- Expert guided crawler metadata generation for existent, relatively
homogeneous collections (e.g., eprint archives)
- Metadata assignment accuracy increased (rich text is present and easy
to identify, e.g., abstracts)
- Purpose: assignment of new metadata for existent collections
III.) Major Technologies List
- Technologies for Internet Resource Discovery/Identification:
- Mostly Automated Crawling:
- Virtual Library (VL) crawler: crawling the community of academic,
expert-vetted VLs
- Focused Crawling and Crawlers: Nalanda iVia Focused Crawler with
Apprentice/Monitor:
- Focused or Topic Specific Crawling:
- Linkage or Web graph analysis
- Content/text similarity analysis
- Preferential Focused Crawling:
- More efficient focused crawling
- Link/page clues used to identify best links to
crawl
- Semi-automated Mode -- Expert Interactions with Web Crawlers:
- Seed Set generation from vetted and/or unvetted sources, e.g.:
- Vetted: all expert, academic VLs
- Unvetted: Google, Teoma, and other large engines
- Interactive topic distillation -- content builder feedback on interim
crawl results to improve crawl:
- Positive Example Based Learning
- Web graph visualization using data visualization
techniques
- Lifting and gradient ascent/descent to modify weightings/Web
graph
- Community blacklists
- Expert guided crawlers with drill down and drill out -- crawlers to which experts
feed specific URLs and which crawl to expert-specified depths and along
expert-specified lineages out from the original URL.
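The expert guided crawler described above can be illustrated with a bounded breadth-first traversal. The sketch below rests on assumptions of its own: it reads "drill down" as a limit on link depth from the seed URL and "drill out" as a limit on how many times a path may move to a different host, and the names guided_crawl, max_down, max_out and get_links are hypothetical. Link extraction is passed in as a function so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urlsplit


def guided_crawl(seed, get_links, max_down=2, max_out=1):
    """Breadth-first crawl outward from an expert-supplied seed URL.

    max_down caps the link depth from the seed; max_out caps the number
    of host changes ("drill out" hops) along any path. get_links(url)
    must return the URLs linked from the page at url.
    """
    seen = {seed}
    visited = []
    queue = deque([(seed, 0, 0)])  # (url, depth, off-host hops so far)
    while queue:
        url, depth, hops = queue.popleft()
        visited.append(url)
        if depth >= max_down:
            continue  # depth limit reached: do not expand this page
        for link in get_links(url):
            changed_host = urlsplit(link).hostname != urlsplit(url).hostname
            next_hops = hops + int(changed_host)
            if link in seen or next_hops > max_out:
                continue
            seen.add(link)
            queue.append((link, depth + 1, next_hops))
    return visited


# Toy link graph standing in for fetched pages.
graph = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://b.example/": ["http://c.example/"],
    "http://a.example/p2": ["http://a.example/p3"],
}
print(guided_crawl("http://a.example/", lambda u: graph.get(u, [])))
```

With the default limits, the toy crawl stops before http://c.example/ (a second host change) and before http://a.example/p3 (a third level of depth).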
- Technologies for Metadata Generation and Rich Text Identification/Extraction -
Automatically Created and Author Supplied Metadata:
- Rich Text and Key Phrase Identification and Extraction:
- Improvement of aboutness rich text detection (aboutness text = that
rich text intended by the author to succinctly describe
the theme of the resource)
- Improved PhraseRate (our key phrase identification and extraction
utility)
- Classification and Classifiers:
- General on subject generation
- Classifier training on millions of MARC records (for Internet and other
resources)
- Classification algorithms:
- Improved Support Vector Machines (SVM)
- k-Nearest Neighbor (kNN)
- Improved Naïve Bayes (NB)
- Logistic Regression (LR) (emphasis)
- Ensembles of classifiers (possible)
- Description generation: author supplied descriptions or extracted key phrases
- Semi-automated Mode -- Expert Interactions in Classification:
- Expert review and refinement of highly weighted crawler/classifier
created records
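To make the classification step concrete, here is a minimal sketch of one of the listed algorithm families, Naive Bayes, assigning a Library of Congress Classification class to a piece of rich text. It is the textbook multinomial form with add-one smoothing, not the improved variant the list refers to. The function names, the whitespace tokenization and the four training examples are illustrative assumptions; S (Agriculture) and R (Medicine) are real top-level LCC classes.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Train on (text, label) pairs; returns the counts the scorer needs."""
    class_docs = Counter()              # documents seen per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify_nb(model, text):
    """Return (best_label, log_scores) for text under the trained model."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores


training = [
    ("soil crops irrigation", "S"),
    ("crops harvest soil", "S"),
    ("heart cardiology surgery", "R"),
    ("surgery heart disease", "R"),
]
model = train_nb(training)
label, scores = classify_nb(model, "soil irrigation methods")
print(label, scores)
```

The gap between the top two log scores is one possible signal for the semi-automated mode: records with a narrow gap could be the ones routed to experts for review and refinement.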