Chapter 7: Importing Records from OAI-PMH and MARC

This chapter describes the iVia record import systems. There are two main sources for imported records in iVia: MARC files such as library catalogs, and OpenArchives data providers.

The MARC (MAchine-Readable Cataloging) format is a standard defined by the Library of Congress' Network Development and MARC Standards Office. Many different types of record can be stored as MARC files. iVia's MARC_importer is a program for importing the records in a MARC file into an iVia database.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol for transferring metadata records between digital library collections. iVia’s OAI-PMH-importer is a program for importing records from external collections into the iVia database.

This chapter is in three parts. The first describes the iVia import system in general, the second explains how to configure and use the MARC importer, and the third explains how to configure and use the OAI-PMH importer.

iVia also contains an OAI-PMH data provider program, called OAI-PMH-server, which can be used to distribute iVia records to other sites. Used in conjunction with the importer, it allows federated collections to exchange records and stay "in sync" with each other. The OAI-PMH-server program is discussed in Chapter 8.

The iVia Import System

When records are imported into an iVia collection from a foreign source, several issues arise. First, we have to keep track of the foreign source and foreign id of the imported records. This is especially important if the imported records are likely to be updated in the future. Second, we need a way to check whether each imported record is a duplicate of an existing record, for example by searching for other records with the same URL. Third, we need a policy for deciding how to handle duplicate records. Finally, we need to log the changes made to the iVia database, especially when large batches of records are imported.

Tracking Imported Records

iVia’s main (record_info) database table contains three fields that are used to track the source and ownership of any imported records: foreign_source, foreign_id, and institutional_owner. The foreign_source field is a text field that identifies the source of the record. If it is empty, the record was created on the local installation; if it has a value like "NSDL", "SCP", or "INFOMINE", the record was imported from the named source. The foreign_id field is a text field that contains the unique identifier of the imported record at the foreign source. If foreign_source is empty, foreign_id must also be empty. If the record has been imported from an OAI-PMH server, then the foreign_id is likely to be an OAI identifier like oai:infomine.ucr.edu:128956. If the record was imported from a MARC tape, then the foreign_id will often be a MARC control number. The institutional_owner field is a text field that contains the name of the institution that "owns" this record. In cases of joint ownership, a semicolon-separated list may be used.

Existing Records

When a record is imported, two checks are immediately performed: first we check to see if this is an existing record. An existing record is a copy of the imported record that we have previously imported. We check by searching the iVia database for records with identical values in the foreign_source and foreign_id fields. In the cases where an existing record is found, we almost always want to update this record rather than create a new one.
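The existing-record check can be sketched in a few lines of Python. This is only an illustration of the rule described above, not iVia's own code: the Record class, the find_existing function and the sample data are hypothetical, and a real installation would query the record_info table rather than scan a list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Record:
    """Hypothetical in-memory stand-in for a row of the record_info table."""
    url: str
    foreign_source: str = ""  # empty means the record was created locally
    foreign_id: str = ""      # must be empty when foreign_source is empty

    def __post_init__(self) -> None:
        if not self.foreign_source and self.foreign_id:
            raise ValueError("foreign_id must be empty when foreign_source is empty")


def find_existing(imported: Record, database: Iterable[Record]) -> Optional[Record]:
    """Return the previously imported copy of `imported`, or None."""
    if not imported.foreign_source:
        return None  # a locally created record has no foreign copy to match
    for record in database:
        if (record.foreign_source == imported.foreign_source
                and record.foreign_id == imported.foreign_id):
            return record
    return None


# Made-up sample data for illustration only.
database = [
    Record("http://www.example.org/", "infomine.ucr.edu", "oai:infomine.ucr.edu:128956"),
    Record("http://www.example.org/local"),
]
incoming = Record("http://www.example.org/", "infomine.ucr.edu", "oai:infomine.ucr.edu:128956")
existing = find_existing(incoming, database)
```

Here existing is the first record in the list, so the importer would normally update it rather than create a new record.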

Duplicate Records

Next, we check to see if there are any duplicate records in the database. Duplicate records are those that describe the same resource. Suppose, for example, that we import a record with the URL http://www.nzdl.org/, but there is already a record in the database with that URL. In this case, the imported record is almost certainly a duplicate of the existing record.

More complex cases arise when it is not obvious whether one record is a duplicate of another. For example, suppose http://www.nzdl.org/ is imported, and the database contains http://nzdl.org/ and http://nzdl.org/welcome.html and http://www.nzdl.org/cstr/. The first is clearly a duplicate of the imported record; the second is probably a duplicate; the third is probably not a duplicate. The iVia importers handle these complexities by supporting several different match methods for detecting duplicates, and allowing the user to choose the method most appropriate for a particular import. The match methods include:

Usually, importers use the EXACT_URL or FUZZY_URL match methods.
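To make the difference between exact and fuzzy URL matching concrete, here is a minimal Python sketch. The normalization rules (lowercase the host, drop a leading "www.", drop a trailing slash) are assumptions chosen for illustration; this chapter does not specify how iVia's FUZZY_URL method actually compares URLs, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def exact_url_match(a: str, b: str) -> bool:
    """Two URLs match only if they are identical strings."""
    return a == b


def normalize(url: str) -> str:
    """Reduce a URL to host plus path, under the assumed fuzzy rules.

    Scheme, port, query string and fragment are ignored in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path


def fuzzy_url_match(a: str, b: str) -> bool:
    """Two URLs match if they are equal after normalization."""
    return normalize(a) == normalize(b)


imported = "http://www.nzdl.org/"
same_site = fuzzy_url_match(imported, "http://nzdl.org/")
sub_collection = fuzzy_url_match(imported, "http://www.nzdl.org/cstr/")
```

Under these assumed rules http://nzdl.org/ matches the imported URL while http://www.nzdl.org/cstr/ does not; an exact comparison would reject both.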

Duplicate records are split into three groups: higher-priority duplicates, same-priority duplicates and lower-priority duplicates. The "priority" of a duplicate is based on its foreign_source value and calculated using the "priority_list" configuration option in the iVia.conf file. For example, if the priority list for an installation is "SCP,vlcrawler,afcrawler", it means that records with SCP in the foreign_source field are higher-priority than records whose foreign_source is "vlcrawler" or "afcrawler". Suppose a record is imported with its foreign_source set to "vlcrawler"; then any duplicates found whose foreign_source is "SCP" are higher-priority duplicates, any duplicates whose foreign_source is "vlcrawler" are same-priority duplicates, and any duplicates whose foreign_source is "afcrawler" are lower-priority duplicates. Note that records whose foreign_source is empty (i.e. locally-created records) have the highest priority of all.
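The grouping rule can be expressed as a small Python sketch. The function names are hypothetical, and the treatment of sources that do not appear in the priority list (ranked below every listed source) is an assumption, since the text does not say how iVia handles them.

```python
def priority_rank(foreign_source: str, priority_list: str) -> int:
    """Return a rank for a source; a lower number means a higher priority."""
    if foreign_source == "":
        return -1  # locally created records outrank every listed source
    sources = [s.strip() for s in priority_list.split(",")]
    if foreign_source in sources:
        return sources.index(foreign_source)
    return len(sources)  # assumption: unlisted sources rank last


def classify_duplicate(imported_source: str, duplicate_source: str,
                       priority_list: str) -> str:
    """Classify a duplicate as 'higher', 'same' or 'lower' priority."""
    imported = priority_rank(imported_source, priority_list)
    duplicate = priority_rank(duplicate_source, priority_list)
    if duplicate < imported:
        return "higher"
    if duplicate == imported:
        return "same"
    return "lower"


PRIORITY_LIST = "SCP,vlcrawler,afcrawler"
groups = {
    source: classify_duplicate("vlcrawler", source, PRIORITY_LIST)
    for source in ["SCP", "vlcrawler", "afcrawler", ""]
}
```

For an imported "vlcrawler" record, groups maps "SCP" and the empty (local) source to "higher", "vlcrawler" to "same", and "afcrawler" to "lower", matching the example above.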

Import Policies

Once an importer has identified the existing record (if it exists) and the set of duplicate records (if any) for a newly imported record, it must decide what action(s) to take. For example, an existing record may be updated, or a new record may be created, or an existing record might be deleted, or the duplicate records might be deleted, or the imported record may simply be ignored. The importer decides what action or actions to take by following an Import Policy (also called a duplicate policy). An Import Policy is a set of rules for deciding how to handle imported records.

There are several import policies available, including:

Generally, for lower-value imported records (e.g. robot-created records, and simple OAI-PMH harvests) it is a good idea to use the KEEP_HIGHER or REMOVE_LOWER_KEEP_HIGHER policies. Note that the REPLACE policy is dangerous and should rarely be used: it deletes and replaces existing local records.

Summary

All iVia importers use the same underlying mechanism to import records. A matching method and import policy can be provided by the user, and are used to decide what action to take for each imported record. These actions update the database, and are recorded in a log file.

Importing Records with the MARC Importer

The iVia MARC importer is controlled by a configuration file that specifies what import policy should be used to handle duplicates, what matching method should be used to find duplicates, how metadata should be extracted from MARC fields and added to iVia record fields, and where the importer should log its actions.

The MARC Importer Configuration File

An example MARC_import configuration file is distributed with iVia in iVia-X-Y/etc/MARC/sample.conf. This section explains how the configuration file works.

The [Logging] section allows you to set the verbosity (amount of output) of the MARC importer. The MARC importer will create a directory called MARC_import under the iVia log directory, and will log its operations there in a file called MARC_importer.log. It may also create other log files in this directory, which will be referenced in the main log.

 [Logging]
 verbosity = 3 # level of output: 0 = none, 5 = too much, 3 = normal.

In the [Duplicates] section, the duplicate_policy and match_method variables are used to specify the duplicate policy and match method (see above). In a MARC importer, it is often a good idea to perform a duplicate check based on title rather than URL (unless you are sure your MARC records contain URL information in field 856).

 [Duplicates]
 duplicate_policy = KEEP_HIGHER
 match_method     = EXACT_TITLE

The [Settings] section contains global settings. The foreign_source, remote_user and expert_created variables are used to set the fields with those names in the iVia record. (The remote_user will be overridden when the MARC importer is used through the Web interface.) The foreign_id_fields variable contains a comma-separated list of the MARC fields to search in order to find the foreign_id for this record. The fields_to_merge variable is only relevant if the duplicate policy is one that merges records; in that case it contains a list of all the fields that will be merged when one record is merged into another (if empty, then all resource description metadata fields will be merged). In this example, three specific fields are named.

 [Settings]
 foreign_source    = "sample-marc"
 remote_user       = "john"
 expert_created    = "true"
 foreign_id_fields = "1,10,16,35"
 fields_to_merge   = "title,keywords,subjects"
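
As a rough illustration of how foreign_id_fields might be applied, the Python sketch below tries each listed MARC field and returns the first non-empty value. The left-to-right search order, the zero-padding of tags to three digits (so "1" means field 001), the function name and the sample record are all assumptions made for this example.

```python
from typing import Mapping, Optional


def find_foreign_id(marc_fields: Mapping[str, str],
                    foreign_id_fields: str) -> Optional[str]:
    """Return the first non-empty value among the listed MARC fields."""
    for tag in foreign_id_fields.split(","):
        tag = tag.strip().zfill(3)  # assumption: "1" refers to field 001
        value = marc_fields.get(tag, "").strip()
        if value:
            return value
    return None  # no candidate field held a usable identifier


# Made-up record with no 001 field, so field 010 supplies the identifier.
sample_record = {"010": "  2001012345 ", "035": "(OCoLC)1234"}
foreign_id = find_foreign_id(sample_record, "1,10,16,35")
```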

The remaining sections in the file are used to specify which MARC fields are imported by the MARC importer.

The [url] section is used to configure how the URL is extracted from the record. By default, the URL is extracted from field 856. The fallback_pattern (if set) generates a fake URL for records that do not have URL data in field 856. The clean_up variable specifies what post-processing is applied to clean up URLs. The "canonize" option is the most thorough, but can take several seconds per URL, so we recommend "clean" or "none".

 [url]
 fallback_pattern = "http://www.example.org/cgi-bin/view_record=CONTROL_NO"
 clean_up = "clean" # options: none, clean, canonize.
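
The fallback behaviour might look like the following Python sketch. It assumes that the CONTROL_NO token in the pattern is a placeholder that the importer replaces with the record's control number; the sample pattern suggests this, but the text does not state it, and the function name is hypothetical.

```python
from typing import Sequence


def extract_url(field_856_urls: Sequence[str], control_number: str,
                fallback_pattern: str = "") -> str:
    """Prefer a URL from field 856; otherwise build one from the pattern."""
    for url in field_856_urls:
        if url.strip():
            return url.strip()
    if fallback_pattern:
        # Assumption: CONTROL_NO is substituted with the control number.
        return fallback_pattern.replace("CONTROL_NO", control_number)
    return ""


PATTERN = "http://www.example.org/cgi-bin/view_record=CONTROL_NO"
fake_url = extract_url([], "12345", PATTERN)
real_url = extract_url(["http://www.nzdl.org/"], "12345", PATTERN)
```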

The [description] section describes how the description is extracted. There are several available methods: summary, standard, MARC, AllPossible, LexisNexis, LOC, and SCP. Usually, standard will be most appropriate (though MARC is good for debugging.) A fallback_value can be provided, which is the default used when no description can be extracted.

  [description]
  method         = "standard" # options: summary, standard, MARC, AllPossible, LexisNexis, LOC, SCP.
  fallback_value = "No description available" 

All the remaining sections contain instructions for extracting data from MARC record fields, and all are in the same "MarcFieldExtractor" format. Normally we include these from the "generic" sample files, as shown below.

  include generic_authors.conf
  include generic_coverage.conf
  include generic_keywords.conf
  include generic_lcc.conf
  include generic_my_infomine.conf
  include generic_resource_types.conf
  include generic_subjects.conf
  include generic_title.conf

Here is an example "MarcFieldExtractor" section for extracting "keywords" metadata.

  [keywords]

  initial_text = ""
  initial_serial_text = ";Ejournals;Electronic journals;" 
  initial_monograph_text = ";Electronic text;"
  fallback_value = ""

  field_join = ";"
  subfield_join = ";"

The initial_text variable holds default text that is always added to a field. The initial_serial_text is also added if the record is a serial (and initial_monograph_text if it is a monograph). The fallback_value is a default that is added only if no other text is extracted into this field. (NB: if you set one of the initial_text variables, then fallback_value will never be used.)

The field_join string is used as a delimiter if two or more MARC fields are joined together to form an iVia field. The subfield_join is used to join two (or more) MARC subfields before adding them to an iVia field. In this example, both are set to ";" because that is the iVia delimiter for the keywords field; and we want every MARC subfield to appear as an iVia keyword.
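
The way these variables combine can be sketched as follows. This Python function is a hypothetical model, not the MarcFieldExtractor implementation: it assumes the initial texts are simply concatenated in front of the extracted text, and it uses the variable names as spelled in the sample [keywords] section.

```python
from typing import Sequence


def build_field(marc_fields: Sequence[Sequence[str]], *,
                is_serial: bool = False, is_monograph: bool = False,
                initial_text: str = "", initial_serial_text: str = "",
                initial_monograph_text: str = "", fallback_value: str = "",
                field_join: str = ";", subfield_join: str = ";") -> str:
    """Combine extracted MARC subfields into one iVia field value.

    `marc_fields` holds one inner sequence of subfield strings per MARC field.
    """
    # Join the subfields of each MARC field, then join the fields together.
    joined = [subfield_join.join(subfields) for subfields in marc_fields]
    extracted = field_join.join(part for part in joined if part)

    text = initial_text
    if is_serial:
        text += initial_serial_text
    if is_monograph:
        text += initial_monograph_text
    text += extracted

    # The fallback applies only when nothing else produced any text.
    return text if text else fallback_value


keywords = build_field(
    [["Newspapers", "History"], ["Maps"]],
    is_serial=True,
    initial_serial_text=";Ejournals;Electronic journals;",
)
```

With the sample settings, a serial record yields ";Ejournals;Electronic journals;Newspapers;History;Maps", and the fallback_value would only appear for a record that produced no text at all.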

All other variables in the section are SearchField definitions. There can be any number of SearchField definitions in the section, and it doesn't matter what variable names they have (here we use field1, field2, field3...). They look like this:

  field1 = "650 * *-0 * /NEWSPAPERS/Newspapers/g,/[P|p]eriodicals/\ /g"
  field2 = "651 * *-0 * /[P|p]eriodicals/\ /g"
  field3 = "246 * *   a"
  field4 = "730 * *   a"
  field5 = "740 * *   a"
  field6 = "793 * *   *"

Each value consists of up to five parts (separated by spaces), which when combined specify how to extract metadata from MARC fields. The first part is a number, identifying a MARC field. The next two parts are expressions identifying acceptable values for the first and second indicator tags. Possible values are '*' (any value), a lowercase letter or number (a specific value), or an expression composed of these symbols and the '+' (add value) and '-' (remove value) characters, such as "a+b+1" (values 'a', 'b' and '1' are accepted) or "*-b-c-3" (every value except 'b', 'c' or '3' is accepted). The fourth part is an expression that identifies which subfields to extract, in the same format as the indicator tags (i.e. '*' means extract every subfield, 'a' means only extract subfield a, and so on). These first four parts are required, but the fifth is optional. It is a comma-separated list of Perl regular expressions that will be applied to the extracted MARC data to perform simple transformations. For example, "/NEWSPAPERS/Newspapers/g" changes the case of "NEWSPAPERS" to "Newspapers", and "/[P|p]eriodicals/\ /g" replaces the term "Periodicals" (and also "periodicals") with a single space, effectively deleting it.
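
To show how such a definition breaks down, here is a hedged Python sketch that parses a SearchField value, evaluates an indicator expression, and applies the substitutions. All names are hypothetical, Python's re module stands in for Perl's regular expressions, and the parsing is deliberately naive (for instance, it assumes no commas or extra slashes inside a pattern).

```python
import re
from typing import List, NamedTuple, Tuple


class SearchField(NamedTuple):
    tag: str
    indicator1: str
    indicator2: str
    subfields: str
    rules: List[Tuple[str, str, str]]  # (pattern, replacement, flags)


def parse_search_field(value: str) -> SearchField:
    """Split a definition into its four required parts and optional rules."""
    parts = value.split(None, 4)  # the fifth part may itself contain spaces
    if len(parts) < 4:
        raise ValueError("a SearchField needs at least four parts")
    rules = []
    if len(parts) == 5:
        for rule in parts[4].split(","):
            # "/pattern/replacement/flags" -> three pieces
            pattern, replacement, flags = rule[1:].rsplit("/", 2)
            rules.append((pattern, replacement.replace("\\ ", " "), flags))
    return SearchField(parts[0], parts[1], parts[2], parts[3], rules)


def matches(expression: str, value: str) -> bool:
    """Evaluate an expression such as '*', 'a+b+1' or '*-b-c-3'."""
    allowed = False
    for token in re.findall(r"[+-]?[^+-]", expression):
        symbol = token[-1]
        if symbol == "*" or symbol == value:
            allowed = not token.startswith("-")
    return allowed


def apply_rules(text: str, rules: List[Tuple[str, str, str]]) -> str:
    """Apply each substitution in order; 'g' means replace every match."""
    for pattern, replacement, flags in rules:
        count = 0 if "g" in flags else 1
        text = re.sub(pattern, lambda _m, r=replacement: r, text, count=count)
    return text


field1 = parse_search_field(
    r"650 * *-0 * /NEWSPAPERS/Newspapers/g,/[P|p]eriodicals/\ /g")
cleaned = apply_rules("NEWSPAPERS", field1.rules)
```

In this sketch field1 accepts any first indicator, rejects a second indicator of '0', extracts every subfield, and carries two substitution rules.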

Running the Importer

MARC importer configuration files are stored in the etc/MARC directory. Assuming you have installed iVia in /home/fred, and your MARC file is /home/fred/catalog.mrc, then the import command is MARC_import /home/fred/etc/MARC/sample.conf /home/fred/catalog.mrc. No output (other than errors) is sent to the screen; instead, it is written to the log files described in the [Logging] section.

Importing Records with the OAI-PMH Importer

The iVia OAI-PMH-importer program is controlled by a configuration file, called OAI-PMH-importer.conf, that specifies where records should be imported from, what import policy should be used to handle duplicates, what matching method should be used to find duplicates, what OAI-PMH metadata format should be used, how the imported records are transformed into iVia records, and where the importer should log its actions.

The OAI-PMH-importer.conf File

An example OAI-PMH-importer configuration file is distributed with iVia in etc/OAI-PMH-importer.conf. This section explains how the configuration file works.

The [Logging] section sets up the OAI-PMH-importer logging files. The verbosity variable specifies how much output should be displayed in the log (0 = none, 3 = normal, 4 = detailed, 5 = too much). The importer will create a series of log files in the directory specified by the directory setting.

  [Logging]
  verbosity = 3
  directory = /home/fred/log/OAI-PMH-importer

In the [Repositories] section, the harvest_list variable specifies the list of OpenArchives repositories that will be imported. In this example, we will import records from two repositories: first, all the available records will be imported from INFOMINE, and then the set of CalTech Electronic Thesis Database records will be downloaded from the NSDL.

  [Repositories]
  harvest_list = "infomine.ucr.edu,CaltechETD"

Each repository listed in the harvest list has an associated section describing that input source in detail. In this example, there are two: INFOMINE and CaltechETD, and their configuration is shown below:

[infomine.ucr.edu]
base_url                = "http://infomine.ucr.edu/cgi-bin/OAI-PMH-server"
sets                    = ""
metadata_prefix         = "ivia_internal"

harvest_mode            = "INCREMENTAL"

foreign_source          = "infomine.ucr.edu"
remote_user             = "infomine-importer"

duplicate_policy        = "KEEP_HIGHER"
duplicate_check         = "FUZZY_URL"

pre_process_section     = "ivia_internal_pre_process"
field_defaults_section  = "ivia_internal_field_defaults"
field_map_section       = "ivia_internal_field_map"
post_process_section    = "ivia_internal_post_process"


[CaltechETD]
base_url                = "http://services.nsdl.org:8080/nsdloai/OAI"
sets                    = "CalTechETD"
metadata_prefix         = "nsdl_dc"

harvest_mode            = "INCREMENTAL"

foreign_source          = "CalTechETD"
remote_user             = "caltech-importer"

duplicate_policy        = "KEEP_HIGHER"
duplicate_check         = "EXACT_URL"

pre_process_section     = "nsdl_dc_pre_process"
field_defaults_section  = "nsdl_dc_field_defaults"
field_map_section       = "nsdl_dc_field_map"
post_process_section    = "nsdl_dc_post_process"

In each section, the following OAI-PMH settings must be specified. The base_url is the URL of the OAI-PMH server. The sets variable is a comma-separated list of OAI-PMH sets to harvest from this server, or "" to harvest all available records. The metadata_prefix is the OAI-PMH metadataPrefix argument. When transferring data between iVia installations you should use ivia_internal; when transferring records from the NSDL you should use nsdl_dc; other sites may have their own formats. If none of these is available, you can use oai_dc (every OAI-PMH data provider must support oai_dc).

The harvest_mode variable can be set to FULL or INCREMENTAL; the former will perform a full OAI-PMH harvest, while the latter will only harvest the records that have changed since the last update. You will almost always want to set this variable to INCREMENTAL.

The foreign_source variable contains the string that will be stored in the foreign_source field of the iVia record_info database table. In this case infomine.ucr.edu and CalTechETD will be used. The remote_user is the iVia Adder user name that will be credited with making changes to the iVia database.

The duplicate_policy and duplicate_check variables describe the iVia duplicate handling policy and duplicate matching methods, whose functions and possible values were described above. The record_must_be_valid variable can be set to true to force the importer to ignore any record that lacks required fields. (When records are rejected, the import log will explain why.)

Finally, the field_defaults_section, field_map_section, pre_process_section and post_process_section variables contain the names of four other configuration sections, which explain how the imported metadata will be translated from the XML formats used by OAI-PMH into iVia records. These sections are stored in supporting configuration files and are included into OAI-PMH-importer.conf using the following (or similar) lines:

  include OAI-PMH-importer.ivia_internal.conf
  include OAI-PMH-importer.nsdl_dc.conf

The Supporting Configuration Files

The etc directory contains supporting OAI-PMH importer configuration files like OAI-PMH-importer.nsdl_dc.conf and OAI-PMH-importer.ivia_internal.conf. These specify how the imported metadata will be added to the iVia database for each OAI-PMH metadata format. For each format, we need four sections: a pre-process section, a field defaults section, a field map section, and a post-process section.

Examples of each of these sections from OAI-PMH-importer.nsdl_dc.conf are shown below:

[nsdl_dc_pre_process]
discard_invalid_urls    = true
verify_single_valid_url = true


[nsdl_dc_field_defaults]
access              = "free"
audience_levels     = "academic"
categories          = ";unknown;"
expert_created      = "true"
institutional_owner = ""
LCC                 = ""
my_infomine         = ";OAI-PMH-importer;NSDL;"
restricted_to       = ""
subjects            = ""


[nsdl_dc_field_map]
dc:contributor  = contributor
dc:creator      = creator
dc:description  = ivia_description
dc:identifier   = url
dc:publisher    = publisher
dc:subject      = keywords
dc:title        = title
dc:rights       = ivia_description
dc:type         = keywords


[nsdl_dc_post_process]
canonize_all_metadata = true
augment_title         = false
augment_categories    = false
augment_creators      = false
augment_description   = false
augment_keywords      = false
augment_lcsh          = false
must_be_valid         = false
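
The effect of the field defaults and field map sections can be approximated with the Python sketch below. It is an illustration under stated assumptions rather than the importer's real logic: the function name is hypothetical, unmapped elements are assumed to be dropped, mapped values are assumed to override defaults, and values that share a target field (such as dc:description and dc:rights) are assumed to be joined with a space.

```python
from typing import Dict, Iterable, List, Tuple

# A subset of the sample sections, copied into plain dictionaries.
FIELD_DEFAULTS = {"access": "free", "audience_levels": "academic"}
FIELD_MAP = {
    "dc:title": "title",
    "dc:identifier": "url",
    "dc:description": "ivia_description",
    "dc:rights": "ivia_description",
    "dc:subject": "keywords",
    "dc:type": "keywords",
}


def to_ivia_record(elements: Iterable[Tuple[str, str]],
                   field_map: Dict[str, str],
                   field_defaults: Dict[str, str],
                   joiner: str = " ") -> Dict[str, str]:
    """Translate (element name, value) pairs into an iVia-style record."""
    record = dict(field_defaults)  # start from the defaults
    mapped: Dict[str, List[str]] = {}
    for name, value in elements:
        target = field_map.get(name)
        if target is None:
            continue  # assumption: elements without a mapping are ignored
        mapped.setdefault(target, []).append(value)
    for target, values in mapped.items():
        record[target] = joiner.join(values)  # assumption: join with a space
    return record


# Made-up harvested metadata for illustration only.
harvested = [
    ("dc:title", "A Sample Thesis"),
    ("dc:identifier", "http://www.example.org/thesis/1"),
    ("dc:description", "A study."),
    ("dc:rights", "Public domain."),
    ("dc:date", "2004"),
]
ivia_record = to_ivia_record(harvested, FIELD_MAP, FIELD_DEFAULTS)
```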

Running the Importer

Once the configuration file is created and saved in the etc directory, an import can be attempted by running the OAI-PMH-importer command. The importer should then harvest records from each of the repositories in the harvest_list in turn. No output (other than errors) is sent to the screen; instead, it will be stored in the log files as specified in the [Logging] configuration.

The OAI-PMH is designed for incremental harvesting, and OAI-PMH-importer exploits this ability when the harvest_mode variable for a repository is set to INCREMENTAL. The first time you run the importer, it will harvest all the records (this can take a long time for large repositories). However, on subsequent runs, it will request only the records that have changed or been added since the previous run. We recommend you set up the importer to run every night in incremental mode.
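
At the protocol level, the difference between the two harvest modes comes down to the "from" argument of the OAI-PMH ListRecords request. The Python sketch below builds such a request URL; the verb and argument names come from the OAI-PMH specification, but the function name and the way the date of the previous run is supplied are assumptions, since this chapter does not describe how OAI-PMH-importer records it.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str, set_spec: str = "",
                     last_harvest: Optional[str] = None,
                     harvest_mode: str = "INCREMENTAL") -> str:
    """Build an OAI-PMH ListRecords request URL for one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    # An incremental harvest asks only for records changed since the last
    # run; on the first run there is no date yet, so everything is requested.
    if harvest_mode == "INCREMENTAL" and last_harvest:
        params["from"] = last_harvest
    return base_url + "?" + urlencode(params)


BASE_URL = "http://services.nsdl.org:8080/nsdloai/OAI"
first_run = list_records_url(BASE_URL, "nsdl_dc", "CalTechETD")
later_run = list_records_url(BASE_URL, "nsdl_dc", "CalTechETD",
                             last_harvest="2005-01-31")
```

The first request omits the "from" argument and so retrieves the whole set, while the later request is limited to records changed on or after the given date.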

Finally, your Adders (and you!) will almost certainly be interested to see what the OAI-PMH-importer is doing. We recommend you set up the iVia Log File Viewer to give them quick access to the OAI-PMH-importer logs.