Partial List of iVia Features, Functions, Uses, Technologies
September 2004
Contents:
- Introduction and Background
- List of Basic Features, Functions, and Uses of iVia
- General and Basic
- iVia User Features
- Content Development and Management -- Features, Tools and Machine Assistance for Institutional Collaborators and Expert Content Builders:
- Support for Institutional Collaborators
- Support for Expert Content Builders
- Targeted Metadata Generation for Existent Collections (component of NSDL iVia)
- Major Technologies List
- Technologies for Internet Resource Discovery/Identification
- Mostly Automated Crawling
- Semi-automated Mode -- Expert Interactions with Web Crawlers
- Technologies for Metadata Generation and Rich Text Identification/Extraction Systems - Author Supplied and Automatically Created Metadata
- Rich Text and Key Phrase Identification and Extraction
- Classification and Classifiers
- Semi-automated Mode -- Expert Interactions in Classification
I.) Introduction and Background:
iVia is a flexible open source Internet finding tool and collection building system. It builds collections of metadata and full-text data representing resources on the Internet, and it can also generate metadata for existent collections composed of resources both on and off the open Web. The metadata generated includes library-standard subject schema. iVia supports single or multiple subject focuses and single or multiple institutional efforts, and it is intended to aid multi-institutional collaborations.
Retrieval options for both fielded and full-text data are numerous and serve both beginning and advanced searchers. iVia supports custom branding, interfaces and data views for those accessing its collections. Numerous modes of content building are possible, featuring varying levels of editorial review, styles of indexing and divisions of labor.
iVia is noteworthy because it saves resources and labor by integrating fully automated, semi-automated and fully manual modes of record building. Resource discovery on the open Web through the various iVia Web crawlers, together with metadata generation through iVia classifiers (and other means), yields collections that require fewer resources and less expert labor to reach significant size. iVia is intended as a systems platform for collection building that emphasizes and empowers the expert through the use of machine learning technology. It enables expert-built collections to scale better and to meet user expectations of Internet finding tool content.
In summary, iVia is about innovative systems development involving new technologies based in machine-learning approaches and emphasizing automated and semi-automated Internet resource discovery and metadata generation. These technologies can be characterized as:
- Enabling Technologies which:
- Facilitate cooperative service, finding tool and collection building
- Support multiple modes of collaboration, collection building, access
- Emphasize machine-assistance in collection development (both in resource discovery and metadata generation)
- Participatory Technologies that:
- Define and create synergies emphasizing points of expert/machine interaction to augment and amplify the performance of both
- Communityware Technologies Supportive of Academic Community Expertise, Values and Effort such as:
- Excellence, objectivity and service orientation
- In-depth knowledge of subject domain
- In-depth knowledge of researcher needs for:
- Powerful interfaces and sophisticated access
- Rich, consistent and well-organized metadata
- Machine assistance to enhance expert performance and scalability
II.) List of Basic Features, Functions, and Uses of iVia
- General and Basic:
- iVia End Uses - Brief:
- National level finding tool and collection - Metadata management and creation: INFOMINE and NSDL iVia
- Collection development for other collections
- Subject guides/Pathfinders for libraries
- Integrating Web content with courses
- Major iVia Applications:
- INFOMINE
- NSDL iVia
- Major Cooperators:
- NSDL
- Library of Congress
- Wake Forest University Library
- UCLA Library
- California State University, Sacramento, Library
- Content Managed:
- Metadata and representative rich, full-text for free or fee-based Internet "objects" such as Web sites, books, databases, eprint archives and others
- Hardware:
- Server for the public search interface and the content builder interfaces (including the expert guided crawler).
- Public search interface backup server.
- Database server (both the metadata and full-text databases are here).
- Database server backup.
- Crawler/classifier processes server (e.g., vlcrawler, Nalanda iVia).
- OAI-PMH import/export server.
- Additional mass storage equipment: 2 terabytes of storage including a RAID array (1 terabyte of storage) accessible via NFS (networked storage)
- Software:
- Open source code base: 10 MB, 230k lines of code
- Debian Linux operating system
- Apache Web server software
- Most code in C++
- Some interface code in Java, PHP
- MySQL database (most database management functions)
- Berkeley DB database (for fast indexing functions)
- Standards:
- Dublin Core field structure
- Library of Congress Subject Headings (LCSH)
- Library of Congress Classification (LCC)
- OAI-PMH data transfer
- MARC translation to Dublin Core
- Major Fields Supported:
- URL
- Local URL (if different from above)
- Creator
- Subject - LCSH
- Subject - LCC
- Keywords
- Description
- Selected full-text (1-3 pages of rich text)
- MyI
- Other fields (40+)
- iVia User Features
- Multiple collections or categories supported
- Search Performance:
- Fast (MySQL with Berkeley DB indexing functions)
- Scales to millions of records and associated full-text (1-3 pages per record)
- Search Features:
- Rich full-text and/or fielded metadata searchable
- Boolean and proximity operators supported
- Search modification on results page
- Spell check and spelling suggestions for queries
- Limit searching:
- Any standard field (e.g., title, subject, keywords, etc.)
- Origin -- expert or robot created
- General subject category (e.g., Gov Info or BioAgMed)
- Resource type
- Resource access (fee, free, in-between)
- Audience supported (among expert created records)(e.g., academic, K-12, lifelong learner)
- Browse Indexes Supported:
- All general categories or combinations of general categories (e.g., Gov. Info. and/or Bio./Ag./Med)
- Subject indexes (e.g., LCSH or LCC)
- Keywords
- Specific research disciplines hierarchy (under development; e.g., Medicine >Cardiology...)
- Other indexes (e.g., Megatopics, a key-word-in-context index)
- Results display: Titles only, Regular (title, subject, creator, description, expert/robot created), All
- Results ranking: relevance or title
- Linked indexing to expand or narrow searches
- Multiple language support (under development, both for content and interface)
- User Feedback/Awareness Tools:
- Email alert service for new resources added to the collection
- New resource suggestions
- Comment on this resource
- Content Development and Management -- Features, Tools and Machine Assistance for Institutional Collaborators and Expert Content Builders:
- Support for Institutional Collaborators
- Custom Data Views and Access Supported:
- Your front-end supported:
- Users access through your front-end
- Point of access auto-detected and interface profiles and data views you have created are activated
- Theming: modular interface building functions for both users and content builders
- MyI field for customizing user access and data views
- Choice of parallel fields for display (e.g., different lengths/styles in annotation)
- Meta-searching access to external collections (offered as a user choice when setting up a search, or as a fallback when a search returns zero results)
- Meta-searching by other collections: CDL Searchlight and Ex Libris access INFOMINE
- Multiple Modes of Content Building Supported:
- Multiple means of ramping up and contributing
- Multiple content building styles supported
- Multiple levels of editorial review
- Multiple levels of allowable content builder activity controlled through password system
- Use of parallel and custom fields
- Pending record database
- Record history tracking (creation and edits)
- Hybrid Collections of Heterogeneous Metadata - Support for Multiple Incoming Data Streams and Types of Records:
- Different streams of metadata:
- Expert created within INFOMINE: UCR and collaborators (e.g., Wake Forest, UCLA)
- Expert created and imported to INFOMINE: U.C. Shared Cataloging Project (MARC import); LexisNexis; OAI-PMH (NSDL and other collections).
- Automatically created within INFOMINE: robot selected and created
- Semi-automatically created within INFOMINE: robot selected and with expert augmentation
- Heterogeneous metadata cohabitation
- Title only (LexisNexis data)
- Metadata only: both Dublin Core (DC) and MARC translated to DC for all records
- Representative rich, full-text plus fielded metadata for most records (both robot and expert created)
- Record merging/overlays supported
- Duplicate check for automatically generated/imported records.
- Support for Expert Content Builders:
- Machine-assistance for Automated and Semi-automated Resource Discovery:
- Automatically built, crawler created collection of links, metadata, and representative rich full-text for useful resources
- Monitor/adjust thresholds for resource acceptance weightings in fully automated crawler
- Monitor list of potential duplicates found through fuzzy matches
- Semi-automated expert guided crawler with drill down/drill out.
- Semi-automated focused crawling: subject and project focused crawlers.
- Crawler suggestions (highest weighted resources) flagged for expert review and refinement
- System/database usage statistics used as suggestions (i.e., the most user-linked or accessed robot records flagged) to experts
- Machine-assistance for Automated Metadata Generation or Record Import/Export:
- Crawler/classifier built metadata collection
- Semi-automated "foundation record" (basic "ore" or metadata) created for expert augmentation; it can be created through fully automated or expert guided processes
- Batch Import/Export:
- OAI-PMH: for both importing and exporting records (primarily from NSDL)
- MARC records (UC Shared Cataloging Project)
- LexisNexis records
- Machine Augmentation of Expert Created Records:
- Rich-text harvested to augment human created metadata
- Generation of specific metadata to augment/round out existent metadata records for targeted collections
- Specific Machine-assistance to Experts in Record Building:
- Duplicate check (exact and fuzzy URL and title matching)
- Cloning records
- Multiple record editing (batch editing)
- Pull down menus of various controlled vocabularies (e.g., resource types, keywords)
- URL canonicalization
- Missing data alert for content building form
- URL checking and new URL identification (looks for broken URLs and suggests possible new ones)
- User corrections/suggestions/new content forms
- Online and point-of-need guidance via manuals and style guides
- Assistance in collection development for other collections
- Targeted Metadata Generator for Existent Collections (component of NSDL iVia):
- Expert guided crawler metadata generation for existent, relatively homogeneous collections (e.g., eprint archives)
- Metadata assignment accuracy is higher because rich text (e.g., abstracts) is present and easy to identify
- Purpose: assignment of new metadata for existent collections
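The URL canonization and duplicate check features listed above (exact and fuzzy URL and title matching) can be illustrated with a short sketch. This is not iVia's code, which is mostly C++. The function names `canonicalize_url`, `similarity` and `find_duplicates`, the record layout, the normalization rules and the 0.9 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource addressed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url):
    """Normalize a URL so trivially different spellings compare equal.

    Assumed rules for this sketch: lowercase scheme and host, drop the
    default port, use "/" for an empty path, drop the fragment.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(candidate, records, threshold=0.9):
    """Return (kind, record) pairs, where kind is "exact" or "fuzzy".

    "exact" means the canonical URLs are identical; "fuzzy" means the URL
    or the title is at least `threshold` similar (hypothetical cutoff).
    """
    cand_url = canonicalize_url(candidate["url"])
    matches = []
    for rec in records:
        rec_url = canonicalize_url(rec["url"])
        if rec_url == cand_url:
            matches.append(("exact", rec))
        elif (similarity(rec_url, cand_url) >= threshold
              or similarity(rec["title"], candidate["title"]) >= threshold):
            matches.append(("fuzzy", rec))
    return matches


# Hypothetical records, for illustration only.
records = [
    {"url": "http://example.org/guide/", "title": "Guide to Soil Science"},
    {"url": "http://other.example/x", "title": "Unrelated"},
]
```

In this sketch fuzzy matches are only reported, not merged, which mirrors the list above where potential duplicates are monitored by an expert rather than removed automatically.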
III.) Major Technologies List
- Technologies for Internet Resource Discovery/Identification:
- Mostly Automated Crawling:
- Virtual Library (VL) crawler: crawling the community of academic, expert-vetted VLs
- Focused Crawling and Crawlers: Nalanda iVia Focused Crawler with Apprentice/Monitor:
- Focused or Topic Specific Crawling:
- Linkage or Web graph analysis
- Content/text similarity analysis
- Preferential Focused Crawling:
- More efficient focused crawling
- Link/page clues used to identify best links to crawl
- Semi-automated Mode -- Expert Interactions with Web Crawlers:
- Seed Set generation from vetted and/or unvetted sources, e.g.:
- Vetted: all expert, academic VLs
- Unvetted: Google, Teoma, and other large engines
- Interactive topic distillation -- content builder feedback on interim crawl results to improve crawl:
- Positive Example Based Learning
- Web graph visualization using data visualization techniques
- Lifting and gradient ascent/descent to modify weightings/Web graph
- Community blacklists
- Expert guided crawlers with drill down and drill out -- a crawler that experts seed with specific URLs and that crawls to expert-specified depths and along expert-specified lineages out from the original URL.
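The expert guided crawler described above can be pictured as a bounded breadth-first traversal from an expert-supplied seed URL. The sketch below is an illustration only, not the iVia crawler. It assumes that "drill down" means following links that stay on the seed's host and "drill out" means following links that leave it. The names `guided_crawl`, `get_links`, `max_down` and `max_out` are hypothetical, and an in-memory link graph stands in for real page fetching.

```python
from collections import deque
from urllib.parse import urlsplit


def guided_crawl(seed, get_links, max_down=2, max_out=1):
    """Breadth-first crawl from an expert-supplied seed URL.

    max_down bounds the number of hops that stay on the seed's host
    (assumed meaning of "drill down"); max_out bounds the number of hops
    to other hosts (assumed meaning of "drill out").
    Returns the URLs in the order they were visited.
    """
    seed_host = urlsplit(seed).hostname
    seen = {seed}
    queue = deque([(seed, 0, 0)])  # (url, on-host hops, off-host hops)
    visited = []
    while queue:
        url, down, out = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            if urlsplit(link).hostname == seed_host:
                hops = (down + 1, out)
            else:
                hops = (down, out + 1)
            # Enqueue only links within the expert-specified limits.
            if hops[0] <= max_down and hops[1] <= max_out:
                seen.add(link)
                queue.append((link, hops[0], hops[1]))
    return visited


# Hypothetical link graph standing in for fetched pages.
graph = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://a.example/p2": ["http://a.example/p3"],
    "http://b.example/": ["http://c.example/"],
}


def links_from_graph(url):
    """Return the outgoing links recorded for a URL in the toy graph."""
    return graph.get(url, [])
```

Calling `guided_crawl("http://a.example/", links_from_graph)` visits the seed, two levels of pages on the same host, and one page on a neighbouring host, and skips anything beyond those limits.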
- Technologies for Metadata Generation and Rich Text Identification/Extraction - Automatically Created and Author Supplied Metadata:
- Rich Text and Key Phrase Identification and Extraction:
- Improvement of aboutness rich text detection (aboutness text = that rich text intended by the author to succinctly describe the theme of the resource)
- Improved PhraseRate (our key phrase identification and extraction utility)
- Classification and Classifiers:
- General on subject generation
- Classifier training on millions of MARC records (for Internet and other resources)
- Classification algorithms:
- Improved Support Vector Machines (SVM)
- k-Nearest Neighbor (kNN)
- Improved Naïve Bayes (NB)
- Logistic Regression (LR) (emphasis)
- Ensembles of classifiers (possible)
- Description generation: author supplied descriptions or extracted key phrases
- Semi-automated Mode -- Expert Interactions in Classification:
- Expert review and refinement of highly weighted crawler/classifier created records
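To make the classification step more concrete, the sketch below trains a plain multinomial Naive Bayes classifier with add-one smoothing and uses it to assign a general subject category to a piece of text. It is a textbook baseline, not the "Improved Naive Bayes" variant named in the list above, and it is not iVia's code. The function names `train` and `classify`, the whitespace tokenization and the four training examples are all assumptions made for illustration; the category labels are borrowed from the general subject categories mentioned earlier.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build a model from (text, label) pairs.

    Returns (documents per label, word counts per label, vocabulary).
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set; a real classifier would be trained on a
# large body of catalog records.
examples = [
    ("cardiology heart medicine clinical", "BioAgMed"),
    ("crop soil agriculture plant", "BioAgMed"),
    ("census federal agency statistics", "Gov Info"),
    ("congress legislation federal law", "Gov Info"),
]
model = train(examples)
```

In a semi-automated workflow like the one described above, the per-label scores could also be kept so that only highly weighted assignments are flagged for expert review.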