Introduction
===============================================================================
                      blobcacher/Docs/plan.initial.html
-------------------------------------------------------------------------------
 By: Keith Humphreys
 Last Modified: Fri Nov  2 16:14:58 PST 2001
===============================================================================

PURPOSE:
BlobCacher is a "blob cache database", whose focus is to be fast,
robust, and able to accompany multiple process requesting information.
It emphasizes stability of the database structure.

-------------------------------------------------------------------------------

OVERVIEW:
BlobCacher is managed by a server, which serialized writes and
allows parallel reads.  Blobs are sequentially appended to a data file,
and there is an independent extensible hash table that provides
the means for locating blobs within the data file.  Only the server 
has write capabilities, but all other programs have read access to the
data file, up to file permissions.

To access the actual blob itself, a client queries the server by submitting
a key for the blob. The server returns: (entry_time, begin_blob, end_blob),
where all values returned are zero if the blob has not been stored.
With this management information, the the client can
1) decide if the blob is stale, and embark on gathering a newer version
   to submit to the server.  Though this process, entries can be updated,
   but no entry can be deleted by an external program.
2) access any part of the blob from the data file using the location
   coordinates.  As the content of the file is not changing within the
   coordinate interval [begin_blob, end_blob], multiple programs can 
   simultaneously access and process blobs in a non-interfering manner.

Maintenance Overview
MAINTENANCE (for version I):
Periodically (about once per week) the server does maintenance work on the
database, during which time write access is denied.  A maintenance program
is activated, and a new cleaned and preened database is formed.
Optionally, material can be expired by omission.  Once the new database
is formed,
0) it detaches any read stream connected clients,
   (this is an implicant signal to the clients that they will need to
    re-attach to the datafile after re-establishing connection.
    This prevents programs from receiving new coordinates for blobs
    and trying to access it in the old file.)
1) the server goes inactive for about 1 minute,
   (this gives time for the programs who received blob coordinates to attach
    to the old data file.)
2) it makes sure there are no *.old files,
   (for space reasons, the *.old files should have been removed as
   the first step in the maintenance cycle.)
3) it renames the older database files:
   bcdb_data -> bcdb_data.old,
   bcdb_index -> bcdb_index.old
   (if we die now, we still have the database.  Moving the new on top
    of the old runs the risk of collapsing mid stream), 
4) it renames the newer database to the standard name,
   bcdb_data.new -> bcdb_data,
   bcdb_index.new -> bcdb_index
5) it reopens for business.
The actual read down time is a bit over 1 minute.

MAINTENANCE (for version II):
  When the database get large, updating time becomes substantial.
  To minimize down time, writing and access can be done concurrently with
  the updating:

  With a bit more sophisticated read policy (older/newer file flag),
  we can eliminate the write down time (at least to 1 minute).
  To do this, any write request below the current hash_key_update
  is placed in the newer file, and any hash key above the hash_key_update
  is integrated in the older file.
  Similarly, any read request below the current hash_key_update
  is returned coordinates to the newer file, and any above are returned
  the older coordinates.  Of course, a newer/older flag must be returned
  in the management packet, and this packet must be checked by the
  client.  Initially, we probably don't need this kind of enhancement,
  but when we get to a larger database, this may become an issue.

  In the crawler terra-byte stage of development, well need to go to raid...
  But this can all be done in one file stripped across multiple disks
  (as I understand).

Database Structure
DATABASE STRUCTURE:
Note: I'm intentially avoiding commitments to linear or extended hash at
this point.  The comments here are independent of that...

The cache information is stored in two sections:
Data file:
  This is an append only file.  It is readable by any program that has
  file access permissions, but writable only by the server.
  The data that is appended is verbatim data.  For compressed data,
  it is up to the write client to compress the data in any manner it
  sees fit.
Index File:
  This file is a hash file, and stores a constant length management structure.
  It is directly accessible only by the server.  The management component
  contains constant length fields and is 32 bytes in length:
  {
	time_t entry_time;   // in seconds.
	uint32_t hash_value; // the hash value for entry.
	uint64_t hash_id;    // a key in bucket identifier.
	loff_t begin_blob;   // starting byte offset in data file.
	loff_t end_blob;     // ending byte + 1 in data file.
  }
  Some explanation is in order here as there is some novelty here.
  In literature, hash tables contain the key, and our table does not.
  The problem with storing the key is they can be variable and long.
  This creates an number of inefficiencies.  The proposal for this hash
  table is to:

  1) Store a hash_id.  One of the main reasons to store the key is to serve
  as an identifier for selection of the correct object from the hash bucket
  or hash overflows.  If we use an "orthogonal" hash function to the indexing
  hash function to create a hash_id, and the size of hash_id is 64 bits,
  then the odds that a pair of entries in a hash bucket are the same is
  1/18,446,744,073,709,551,616.  We see that it is more likely that a
  meteor will land on the computer!  So by introducing the tiniest amount
  of error, we gain tremendous space saving as well as having the
  algorithmic benifits of a constant field size structure.  Also note that
  comparing hash_id's is MUCH faster than string comparisons.

  REMARK: In reality, it is _more_ likely that a transient error will be made
  in a key string comparison in the computer hardware than comparing two 64 bit
  numbers.  So an argument can be made that this hash_id style of identifying
  a key is actually less error prone then string comparing!  That is, while
  on a perfect (theoretical) computer there is the tiniest amount that it
  is more error prone, on real computers with electrical spikes and what not,
  the hash_id should be more secure.

  2) Store the hash value itself.  As we no longer store the key, we can't
  recompute the hash_value when we split buckets.  Thus we need to store the
  hash_value itself.  This costs a bit of space, but as the hash values
  don't need to be recomputed, it speeds up the splitting process.

  So using this alternate data structure, we
  1) use 12 bytes instead of the MUCH larger storage space required for
     key storage, saving storage.
  2) have the benifits of a constant field structure,
  3) have numerical comparisons which are much faster,
  4) don't need to recompute hash values when splitting.
  If our hash identifier is orthogonal to the hash value function, then
  on a real machine we loose no accuracy.

Limitations:
  1) As far as the database structure is concerned, there is no limit on the
     size of an entry's key.
  2) The size of the blob file is limited to 18,446,744,073,709,551,616 bytes.
  3) The size of any blob is limited to 18,446,744,073,709,551,616 bytes.
  4) The number of reasonable entries in the file system is 4,294,967,296,
     which is more than the number of web pages estimated on the Internet.
	 This can easily be enlarged.

  In a very real sense, as far as the database structure is concerned,
  the (hash_value,hash_id) pair are the key.

  To prepare access to the database, there is a library function
  struct bcdb_key bcdb_make_key (
	const char *key_string,  
	const size_t key_length
  );
  where we have
  struct bcdb_key {
	uint32_t hash_value; // the hash value for entry.
	uint64_t hash_id;    // a key in bucket identifier.
  };

  The key can be of any length and contain embedded zeros, but it must
  be less in size than key_length.
  If key_length == 0, then make_bcdb_key() assumes that you have passed
  a zero terminated "c-string" string.
  Note: A c++ version of make_bcdb_key() that accepts string objects will
  also be provided.

  Multiple instances of a key are not allowed in the database.
  To be more precise, if key_string_1 yields bcdb_key_1 and is in the hash
  table, and key_string_2 yields bcdb_key_2 == bcdb_key_1 is submitted,
  then the record for key_string_1 is updated for the data submitted for
  key_string_2, provided the submission for key_2 completed it's write
  phase and the hash table gets updated.

Client
CLIENT WRITE API/BEHAVIOR:
  To write, a client attaches to the write port, which is a unix stream socket.
  Only one process at a time is accepted off the port queue to write,
  which serializes input.  The writing has a time limit that is quite short.
  It is assumed that the write client is prepared to dump it's contents into
  the socket.  Any compression or general file formating is left to the client.

  The client first sends a struct bcdb_write packet:
  struct bcdb_write {
	uint32_t echo_identifier;
	struct bcdb_key;
  }
    
  The server responds with an OK packet (the echo_identifier)
  then closes it's write channel, leaving the read channel open on the
  write port.  (The echo_identifier in this case is just a convenience
  for programs that are multi-thread or have made multiple access
  attempts.  Whatever...)

  Subsequent submissions are sequential chunks of the blob.
  The server appends the blob onto the data file end.
  If the client times out, we return the file pointer to it's old location
  and wait for the next write client.
  If the client completes it's dump in time, then the hash index is updated
  after it finishes appending to the data file.

Remarks:
  If the plug is "pulled", then we can end up with "don't care" modifications
  to the datafile: there is unreferenced garbage on the end, which will get
  buried in the current version of the data file.  This will be removed
  during the maintenance procedure, as it's unreferenced.  The hash table
  is only modified after the data is stored.  As hash buckets have
  small constant length entries, updating the hash and dealing with
  overflows is robust.

CLIENT QUERY API/BEHAVIOR:
  A client queries the server by passing a struct bcdb_query to the server
  on a read port.
  struct bcdb_query {
	uint32_t echo_identifier;
	struct bcdb_key;
	// possibly port response info for domain datagram;
  };

  The server responds with a packet that is constant length:
  struct bcdb_responce {
	uint32_t echo_identifier;
	time_t entry_time;
	loff_t begin_blob;
	loff_t end_blob;
	// maintenance old/new info in version II
  };
  echo_identifier:
    This is a 32 bit segment that the client sent.  It purpose
	in life is to allow the client to match requests with responses,
	if for example it has multiple connections going.  These could be,
	for example, indices into a request array.
  entry_time:
    If the item is not in the database, then 0 is returned.
    Else, the entry time, in seconds is returned (32 bits).
    This allows the client to determine if the blob is stale,
    so that it can ignore it or update it.
  begin_blob:
    This is the beginning location of the blob in the data file.
  end_blob:
    This is the ending location of the blob in the data file.

  The entry_time, begin_blob, and end_blob are copied verbatim from the
  hash table entry.

  In version II, there will be an old/new file flag, to facilitate data
  access during maintenance.

  Any byte written to a referenced part of the data file will never be altered
  in the incarnation of the data file.  When any kind of clean up or expiration
  is done by the maintenance utilities, a new file is created, built, then moved
  to the active file.  Hence, programs that are accessing data will have
  unmodified though their access period.

  The server guarantees that the data store(s) will not be renamed
  within 60 seconds of a returned query.  So if a program finds info it likes,
  it can then open the data file and will be able to extract it.
  (This prevents a program with old coordinates from attempting to access
   the data in that location from new file.  In version II, it means
   that the file names will be as expected.)

  The server has two modes of "read" connections, designed for two types of
  clients.

Transient read: unix datagrams.
  Packet needs clients return "address" (using unix datagram sockets).
  This access method is designed for clients that need only one query.
  A typical task would be a process that is responsible for insuring that
  a current version of a web page is stored in the database.  It could
  first query the database to determine if the document is present and
  of recent vintage.  If not, it can engage on fetching the page and then
  enter the write queue.  The transient query facilitates multiple
  parallel process gathering web pages, in that it places little burden on
  the server, and that it reduces time for the client query the database.

Durable reads: unix streams.
  When extracting multiple data from the data file, the file should be
  opened first and a steam access to the server established.
  This reduces time for both queries and file access.
  In the event that the database has finished a maintenance routine
  and is preparing to rename file(s), the server will disconnect all stream
  connections and refuse all access for one minute.
  The clients, upon reestablishing, are expected to reopen the data file(s).
  So programs that attach to the data file(s) for multiple access are
  in effect being signaled to re-open the data file(s).
  (This prevents clients from receiving coordinates for data in the
   new file while attached to the old file.)

Note:
  There is a slight possibility that a client can receive a positive
  response on the existence and timeliness of data in the database,
  while at the same time a maintenance program has a request to to expire
  information with a different sense of timeliness, causing the data
  to be expired before the client actually gets around to accessing it.
  If it re-requests it's location, the server will claim it is not present.
  This is not considered a fault in the database, but in management and
  programming policy!

Misc.

DEVELOPMENT SCHEDULE:
Version II will not be needed until we get to crawling.

-------------------------------------------------------------------------------

BLOB SERVER:
To make a blob server, a standard server wrapper can be easily constructed.
Upon receiving keys, it just passes to the blob cacher, then relay between
the BlobCacher and the client.