Chapter 8: Exporting Records with OAI-PMH

The iVia OAI-PMH-server is a program for exporting records using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The server is easily added to an iVia installation, and will then be found at

http://[your host name]/cgi-bin/OAI-PMH-server

The server distributes records using OAI-PMH version 2. It can be used to transfer records between iVia installations, and to export records from iVia to other metadata management systems.
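As a rough illustration of what a harvesting client sends to this address, the sketch below builds two OAI-PMH version 2 request URLs. The verb, metadataPrefix and set arguments come from the OAI-PMH protocol; the host name and the helper function build_request_url are hypothetical and only serve the example.

```python
from urllib.parse import urlencode

# Hypothetical host name; substitute the address of your own installation.
BASE_URL = "http://www.example.org/cgi-bin/OAI-PMH-server"


def build_request_url(verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return BASE_URL + "?" + urlencode(query)


# Ask the server to describe itself.
identify_url = build_request_url("Identify")

# Ask for the records of one set in the minimal Dublin Core format.
list_url = build_request_url("ListRecords", metadataPrefix="oai_dc", set="sample")
```

Fetching either URL with any HTTP client should return an XML response, provided the server is installed and the requested set is available to the client.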

Overview

The OAI-PMH-server program is controlled by configuration files, and is activated by adding a file called OAI-PMH-server.conf to the etc directory. Almost any metadata can be exported from iVia if it is identified in this configuration file.

Almost any OAI-PMH metadataPrefix can be supported. Configuration files are provided for oai_dc (the minimal, required format), nsdl_dc (the NSDL's qualified DC format) and ivia_internal, a format created by INFOMINE for moving sets of records between iVia installations.

The server can be configured to return different sets of records to different OAI-PMH clients based on their IP addresses. This allows you to control which sets of records are available to different groups.

The OAI-PMH-server.conf file

Some example configuration files are supplied with iVia. The OAI-PMH-server.conf-template file, for example, is a well-documented configuration file.

To make the configuration process a little simpler, subsidiary files can be included in a configuration file with the include command. You may see commands like the following in the main configuration file:

include OAI-PMH-server.oai_dc.conf

This command tells the server to include the named subsidiary file. Generally, any configuration file named OAI-PMH-server.*.conf is a subsidiary file, and may be included by the main file. For example, the OAI-PMH-server.oai_dc.conf file contains configuration information describing how to support the oai_dc metadataPrefix.

The remainder of this chapter will examine the various parts of the configuration file.

Number of documents per request

The configuration file begins with a brief description of the file, and a single global setting: the max_records_per_request variable controls how many records are returned in response to each request.

# OAI-PMH-server.conf -- Configuration of the iVia Open Archives data provider.

max_records_per_request = 200 

For any given request, if there are more than max_records_per_request records that should be returned, the server will generate an OAI-PMH resumption token that the client can use to retrieve the next batch of records.
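The batching behaviour can be pictured with the following sketch. It is not the server's implementation: the format of iVia's resumption tokens is not described here, so the sketch simply assumes that a token encodes the offset of the next record. The function name next_batch is hypothetical.

```python
MAX_RECORDS_PER_REQUEST = 200  # mirrors max_records_per_request above


def next_batch(record_ids, offset=0, limit=MAX_RECORDS_PER_REQUEST):
    """Return (batch, resumption_token); the token is None for the last batch.

    Assumption: the token is just the offset of the next record, as a string.
    """
    batch = record_ids[offset:offset + limit]
    next_offset = offset + len(batch)
    token = str(next_offset) if next_offset < len(record_ids) else None
    return batch, token


# A client harvesting 450 records needs three requests.
records = list(range(450))
first, token = next_batch(records)
second, token2 = next_batch(records, offset=int(token))
third, token3 = next_batch(records, offset=int(token2))
```

The client keeps sending the token it received until a response arrives without one, which marks the end of the list.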

Self-description provided through the Identify verb

The [Identity] section of the file identifies the server. It is used both to respond to the OAI-PMH Identify verb and to set some global configuration items.

[Identity]
repositoryIdentifier = "www.example.org"
repositoryName       = "The www.example.org iVia installation"
baseURL              = "http://www.example.org/cgi-bin/OAI-PMH-server"
adminEmail           = "oai@www.example.org"
earliestDatestamp    = "1978-01-01T00:00:00Z"

The repositoryIdentifier variable identifies this repository. It supplies the [site_prefix] part of the OAI-PMH resource identifiers, which take the form "oai:[site_prefix]:[record_id]".
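A minimal sketch of how such identifiers are put together and taken apart, assuming the three-part form given above. The function names and the record id "12345" are hypothetical.

```python
def make_oai_identifier(repository_identifier, record_id):
    """Build an identifier of the form oai:[site_prefix]:[record_id]."""
    return "oai:{}:{}".format(repository_identifier, record_id)


def split_oai_identifier(identifier):
    """Return (repository_identifier, record_id) for an oai: identifier."""
    scheme, repository_identifier, record_id = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError("not an OAI identifier: " + identifier)
    return repository_identifier, record_id


identifier = make_oai_identifier("www.example.org", "12345")
parts = split_oai_identifier(identifier)
```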

The repositoryName, baseURL, adminEmail and earliestDatestamp variables contain the elements named and described in the specification of the OAI-PMH Identify verb, and should be updated to reflect your iVia installation.

The Metadata Formats supported by the server

The [Metadata] section specifies the formats in which iVia records are exported using OAI-PMH.

[Metadata]
formats = "oai_dc,nsdl_dc,ivia_internal"

include OAI-PMH-server.oai_dc.conf   # The standard Dublin Core metadataPrefix.
include OAI-PMH-server.nsdl_dc.conf  # The NSDL's qualified DC metadataPrefix.
include OAI-PMH-server.ivia_internal.conf  # The iVia internal metadata format.
include OAI-PMH-server.ivia_about.conf     # The iVia "about" section format.

The [Metadata] section contains a single variable, formats, that is a comma-delimited list of the metadata formats supported by this configuration file. In this example, the oai_dc, nsdl_dc and ivia_internal metadataPrefix values are supported. The ivia_about file is included as well, but it does not define a metadataPrefix: it describes the about container that can accompany the records.

For each supported metadataPrefix, we have to provide additional sections that describe how the metadata values stored in fields in the record_info database are mapped to the XML tags specified by the metadataPrefix. In this case, this additional specification is stored in supporting configuration files and included in the current file. These supporting files will be described below.

Defining and controlling access to sets

The next section specifies the location and presentation of the sets of records that are exported by the OAI-PMH server.

The records available to a client can be varied based on the IP address blocks of the clients. This is implemented by defining a list of available subsets for each IP address block in the [Sets] section. When a machine connects and requests all the available records, it will be offered the records corresponding to the first IP block that matches. Further, when a client requests a particular set, it will only be provided if the set is available to the client's IP address block.

[Sets]
match138.23.89.35/32    = "all_records,sample,expert_records,NSDL,sample"
match138.23.88.00/24    = "sample,expert_records,NSDL,sample"

match132.236.180.52/24  = "NSDL,sample"
match128.173.49.52/24   = "sample"

# Note: there are no sets for other (unknown) clients.

When a client connects and requests records but does not provide a set argument, the set is deduced from the [Default Sets] section.

[Default Sets]
match138.23.89.35/32    = "all_records"    # Infomine.ucr.edu can see all records.
match138.23.88.00/24    = "expert_records" # Machines on our subnet see expert records.

match132.236.180.52/24  = "NSDL"           # The National Science Digital Library.
match128.173.49.52/24   = "sample"         # The OpenArchives.org repository explorer gets a sample.
match0.0.0.0/0          = "small_sample"   # All other machines get a small sample.
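The first-match rule used by both sections can be sketched with Python's standard ipaddress module. This is an illustration only, not the server's code. The tables copy the examples above with two assumptions: the block 138.23.88.00/24 is written as 138.23.88.0/24, and the duplicated "sample" entries are listed once. The function names first_match and resolve_set are hypothetical.

```python
import ipaddress

# (address block, sets available to clients in that block), in file order.
SETS = [
    ("138.23.89.35/32", ["all_records", "sample", "expert_records", "NSDL"]),
    ("138.23.88.0/24", ["sample", "expert_records", "NSDL"]),
    ("132.236.180.52/24", ["NSDL", "sample"]),
    ("128.173.49.52/24", ["sample"]),
]

# (address block, set used when the client names no set), in file order.
DEFAULT_SETS = [
    ("138.23.89.35/32", "all_records"),
    ("138.23.88.0/24", "expert_records"),
    ("132.236.180.52/24", "NSDL"),
    ("128.173.49.52/24", "sample"),
    ("0.0.0.0/0", "small_sample"),
]


def first_match(rules, client_ip):
    """Return the value of the first block containing client_ip, or None."""
    address = ipaddress.ip_address(client_ip)
    for block, value in rules:
        # strict=False accepts blocks written with host bits set,
        # such as 132.236.180.52/24.
        if address in ipaddress.ip_network(block, strict=False):
            return value
    return None


def resolve_set(client_ip, requested_set=None):
    """Return the set to serve, or None if the request is not permitted."""
    if requested_set is None:
        return first_match(DEFAULT_SETS, client_ip)
    allowed = first_match(SETS, client_ip) or []
    return requested_set if requested_set in allowed else None
```

With these tables, a machine at 138.23.88.7 that names no set receives expert_records, and its request for all_records is refused because that set is not listed for its block.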

The special subset name "DYNAMIC" enables dynamic subsets. Dynamic subsets do not have to be formally defined: any set name is valid. If the requested set is not formally specified, a specification is created automatically that contains all records matching HARVEST:[setname] in the my_infomine field of the record_info table in the iVia master database, where [setname] is the set requested.
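The sketch below shows one way such an automatic specification could be derived from a set name. The exact matching rule the server applies to my_infomine is not spelled out here, so the LIKE pattern is an assumption, as are the function name dynamic_set_spec and the conservative restriction on set names.

```python
import re


def dynamic_set_spec(set_name):
    """Return a set specification for an undeclared (dynamic) set name.

    Assumption: a record belongs to the set when its my_infomine field
    contains the text HARVEST:<set_name>.
    """
    # Accept only letters, digits and hyphens so that the name can be placed
    # in the clause without quoting problems (a restriction of this sketch).
    if not re.fullmatch(r"[A-Za-z0-9-]+", set_name):
        raise ValueError("unsupported set name: " + set_name)
    return {
        "name": set_name,
        "where_clause": "my_infomine LIKE '%HARVEST:{}%'".format(set_name),
    }


spec = dynamic_set_spec("geology")
```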

Specifying which records belong to sets

Each set declared in the [Sets] and [Default Sets] sections above must be specified in a section with the same name. Each set must have a name parameter (the OAI-PMH setName) and a where_clause parameter (which describes how the set should be selected from the record_info table in the database). An optional description may be provided.

[all_records]
name         = "All iVia Records"
where_clause = ""

[expert_records]
name         = "Expert-created Records"
description  = "Records that were discovered and described by human experts."
where_clause = "expert_created='true'"

[infomine]
name         = "Expert-created INFOMINE Records"
description  = "Records discovered and created by INFOMINE librarians."
where_clause = "expert_created='true' AND foreign_source='' AND access != 'local'"

[NSDL]
name         = "INFOMINE Records exported to the NSDL"
description  = "A subset of the INFOMINE Collection, focusing on Government Publications and Maps."
where_clause = "expert_created='true' AND foreign_source='' \
                AND access != 'local' AND site_checker_status != 'failed' \
                AND (categories LIKE '%;govpub;%' OR categories LIKE '%;maps;%')"

[sample]
name         = "INFOMINE Sample records"
description  = "A subset of the INFOMINE collection, focused on Government Publications and Maps."
where_clause = "expert_created='true' AND foreign_source='' \
                AND access != 'local' AND my_infomine LIKE '%;NSDL-MAP-GOVPUB-SAMPLE;%'"
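To see how a where_clause narrows the records in a set, the following sketch runs two of the clauses above against a tiny in-memory table. SQLite stands in for the iVia master database, the three sample rows are invented, and the helper build_select is hypothetical; the query the server really issues is not documented here.

```python
import sqlite3


def build_select(where_clause, table="record_info"):
    """Turn a where_clause value into a SELECT statement (sketch only)."""
    sql = "SELECT id FROM " + table
    if where_clause.strip():
        # An empty clause, as in [all_records], selects every record.
        sql += " WHERE " + where_clause
    return sql + " ORDER BY id"


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE record_info "
    "(id INTEGER, expert_created TEXT, foreign_source TEXT, access TEXT)"
)
connection.executemany(
    "INSERT INTO record_info VALUES (?, ?, ?, ?)",
    [
        (1, "true", "", "public"),
        (2, "false", "", "public"),
        (3, "true", "", "local"),
    ],
)


def select_ids(where_clause):
    """Return the ids of the invented rows that satisfy the clause."""
    return [row[0] for row in connection.execute(build_select(where_clause))]


all_ids = select_ids("")
expert_ids = select_ids("expert_created='true'")
infomine_ids = select_ids(
    "expert_created='true' AND foreign_source='' AND access != 'local'"
)
```

Each additional condition joined with AND can only shrink the set, which is why [infomine] is a subset of [expert_records].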

Metadata locations

To do.

Supporting files for metadataPrefix configuration

The [Metadata] section described above included several subsidiary files. These files describe where the metadata is located in the table, and how it is translated into XML elements and output by the OAI-PMH server. This section describes the nsdl_dc schema file.

Note that, as a rule, you do not need to edit these files. Just include them if you need them.

The first part of the definition of the nsdl_dc schema is a section called [nsdl_dc].

# OAI-PMH-server.nsdl_dc.conf
#
# To support nsdl_dc metadataPrefix, add the line "include OAI-PMH-server.nsdl_dc.conf"
# to your OAI-PMH-server.conf file, and "nsdl_dc" to the "formats" list in the
# "[Metadata]" section.

[nsdl_dc]
schemaVersion              = "1.01.001"
xmlns:nsdl_dc              = "http://ns.nsdl.org/nsdl_dc_v1.01"
xmlns:dc                   = "http://purl.org/dc/elements/1.1/"
xmlns:dct                  = "http://purl.org/dc/terms/" 
xmlns:xsi                  = "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation         = "http://ns.nsdl.org/nsdl_dc_v1.01 \
                              http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.01.xsd"
schema                     = "http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.01.xsd"
container                  = "nsdl_dc"
default_namespace          = "dc"
fields                     = "nsdl_dc_identifier,nsdl_dc_title,nsdl_dc_format,\
                              nsdl_dc_keywords,nsdl_dc_LCSH,nsdl_dc_description"
associated_about_container = "ivia_about"

The schema and schemaVersion variables give the location and the version of the XML schema to use. Records in this format are output inside the XML element named by the container variable. The variables that begin with xmlns: and xsi: are the namespace and schema declarations for the container element, and default_namespace is the default namespace of the container element.

If the metadata records are to be returned with about elements, these about XML elements are defined by the associated_about_container variable.

Finally, the fields variable lists the fields that are returned by this metadataPrefix. It consists of a comma-separated list of field identifiers, each of which is defined by a section of the same name (below). Here are the sections defined by this metadataPrefix:

[nsdl_dc_identifier]
db_field  = "url"
xml_field = "identifier"
required  = "yes"

[nsdl_dc_title]
xml_field = "title"
db_field  = "title"
required  = "yes"

[nsdl_dc_format]
xml_field = "format"
db_field  = "media_type"
required  = "yes"

[nsdl_dc_keywords]
xml_field = "subject"
db_field  = "keywords"
delimiter = ";"

[nsdl_dc_LCSH]
xml_field      = "subject"
xml_attributes = "xsi:type=\"dct:LCSH\""
db_field       = "subjects"
suppress       = "-none-" # Perl regexp pattern.
delimiter      = ";"

[nsdl_dc_LCC]
xml_field      = "subject"
xml_attributes = "xsi:type=\"dct:LCC\""
db_field       = "LCC"
suppress       = "^unknown$" # Perl regexp pattern.

[nsdl_dc_description]
xml_field  = "description"
db_field   = "ivia_description"
strip_html = true

For each metadata field, we specify the database field where the metadata is found (db_field, which can be a comma-separated list of fields), the XML field where the metadata will be transmitted (xml_field), and a few other variables.

We can indicate that a field is required by setting the required variable to "yes". We can indicate that there is more than one metadata element in each database field with the delimiter variable. We can suppress certain metadata values with the suppress variable. We can provide a constant value (i.e. one that is returned for every record) for a metadata field with the literal variable. We can indicate that a field contains HTML and should be converted to text with the strip_html variable. And we can provide additional XML attributes (for example, to provide qualified DC) with the xml_attributes variable. The sections above contain examples of most of these options.
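The effect of these variables can be sketched as follows. This is an illustration written for this chapter, not the server's code: the function render_field and the sample record are hypothetical, HTML stripping is reduced to removing tags, and comma-separated db_field lists and the literal variable are left out for brevity.

```python
import re
from html import unescape
from xml.sax.saxutils import escape

# Three of the field sections above, written as Python dictionaries.
FIELDS = {
    "nsdl_dc_title": {"xml_field": "title", "db_field": "title",
                      "required": "yes"},
    "nsdl_dc_LCSH": {"xml_field": "subject", "db_field": "subjects",
                     "xml_attributes": 'xsi:type="dct:LCSH"',
                     "suppress": "-none-", "delimiter": ";"},
    "nsdl_dc_description": {"xml_field": "description",
                            "db_field": "ivia_description",
                            "strip_html": True},
}


def render_field(spec, record, namespace="dc"):
    """Return the XML elements produced by one field section for one record."""
    raw = record.get(spec["db_field"], "")
    values = raw.split(spec["delimiter"]) if "delimiter" in spec else [raw]
    elements = []
    for value in values:
        if spec.get("strip_html"):
            # Crude tag removal; enough for the sketch.
            value = unescape(re.sub(r"<[^>]+>", "", value))
        value = value.strip()
        if not value:
            continue
        if "suppress" in spec and re.search(spec["suppress"], value):
            continue  # the value matches the suppress pattern
        attributes = ""
        if "xml_attributes" in spec:
            attributes = " " + spec["xml_attributes"]
        elements.append("<{0}:{1}{2}>{3}</{0}:{1}>".format(
            namespace, spec["xml_field"], attributes, escape(value)))
    if spec.get("required") == "yes" and not elements:
        raise ValueError("required field is empty: " + spec["db_field"])
    return elements


record = {
    "title": "Maps & Atlases",
    "subjects": "Cartography;-none-;Geology",
    "ivia_description": "<p>A <b>map</b> archive.</p>",
}
title_xml = render_field(FIELDS["nsdl_dc_title"], record)
subject_xml = render_field(FIELDS["nsdl_dc_LCSH"], record)
description_xml = render_field(FIELDS["nsdl_dc_description"], record)
```

In this sketch the delimiter splits the subjects field into three values, the suppress pattern drops "-none-", and each remaining value becomes its own subject element carrying the xsi:type attribute.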

Note: we assume each database field contains text that is encoded using "iso-8859-15 (Latin-9)".