Digital Preservation

This notebook is a place to record useful tidbits of information about digital preservation. The Wikipedia page on Digital Preservation provides quite a good overview of the field, but this notebook intends to dig a little deeper.

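Roughly speaking, the technical side of digital preservation can be broken down into two parts:

  • Keeping the actual binary data safe from bit-rot.
  • Preserving the meaning of digital objects, so that they remain accessible over time.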

Most of this notebook deals with the latter issue, i.e. with how to preserve the meaning of digital objects so that they remain accessible over time. Of course, all of that work is based on the assumption that we can keep the actual binary data safe from bit-rot, and so the first section will look at that issue.

See also items tagged as digital preservation

Bit Preservation

The other sections of this notebook generally assume that there will be some digital object storage system that can be relied upon to hold the bytes safely over time. Indeed, I believe it is sensible to deal with the bit-storage and access problems separately, in the sense that any access solution should be designed to work independently of the chosen bit-store solution.

I used to think that the bit-storage problem was essentially solved, but I have since realised that fighting bit-rot is not as easy as I originally thought.
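One common way to at least detect bit-rot is to record a checksum for every file and recompute it later. The following is a minimal sketch of that idea, not a description of any particular storage system: the function names and the in-memory manifest are my own assumptions, and a real archive would also need to protect the manifest itself and keep replicas to repair from.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map the relative path of each file under root to its digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest):
    """Return the relative paths that are missing or whose bytes changed."""
    root = Path(root)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

Calling build_manifest once at ingest and audit on a schedule only tells you that something has gone wrong; recovering the original bytes still depends on having an intact copy elsewhere.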

...tbc...

Keeping Bits Safe

Digital Object Formats

See also this related effort: Wikibook: Choosing The Right File Format.

File Format Registries

Documents

Databases

Preservation-Format Approach.

Images

Digital Object Properties

Significant Properties

See JISC: The Significant Properties of Digital Objects

Issues with this concept are legion.

Property Extraction Methods

Magic numbers

"Magic numbers" is the name for the standard UNIX mechanism used to identify file types. The approach is not limited to UNIX, but it is usually considered a UNIX-related practice that has since spread to many other platforms. The mechanism is based on a database that maps byte strings and their positions to file types. Common examples are GIF, JPEG and TIFF, which all contain reliable markers, at least for identifying the general file format. Minor version identification (i.e. format characterization rather than format identification) often requires a more sophisticated approach, e.g. parsing of header structures. On a UNIX system, the magic number mechanism can be accessed using the file command, e.g.:

kb005264:~/Code/Graphics andersjohansen$ file logo.png

logo.png: PNG image data, 64 x 64, 8-bit/color RGBA, non-interlaced

The advantages are speed and reliable identification for many file formats. The limitations are that minor version variations often cannot be determined (i.e. characterization is less than reliable), that the method cannot handle formats that lack reliable identification strings (such as text files), and that it accepts the evidence uncritically (e.g. if a text file begins with the magic number for a GIF file, the file command will identify it as a GIF file, regardless of other evidence).
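To make the mechanism concrete, here is a small sketch of a magic number lookup. It is not how the file command is implemented; the table holds only a handful of well-known signatures, and the names MAGIC_TABLE and identify are mine.

```python
# Each entry is (offset, byte signature, label). Only a few well-known
# signatures are listed; the real magic database is far larger.
MAGIC_TABLE = [
    (0, b"GIF87a", "GIF image (87a)"),
    (0, b"GIF89a", "GIF image (89a)"),
    (0, b"\xff\xd8\xff", "JPEG image"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"II*\x00", "TIFF image (little-endian)"),
    (0, b"MM\x00*", "TIFF image (big-endian)"),
]


def identify(data):
    """Return the label of the first matching signature, or None."""
    for offset, signature, label in MAGIC_TABLE:
        if data[offset:offset + len(signature)] == signature:
            return label
    return None


# A plain text note that happens to start with the GIF marker is
# reported as a GIF, which is the "uncritical" limitation in action.
print(identify(b"GIF89a is how this plain text note begins"))
# Ordinary text has no marker at all, so nothing is reported.
print(identify(b"Just some ordinary text"))
```

The first call prints a GIF label even though the bytes are clearly prose, and the second prints None, which illustrates both limitations with nothing more than a table scan.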

Property Extraction Tools

DROID

DROID identifies files using the magic numbers approach.

  • Refers to PRONOM, e.g. info:pronom/fmt/100 is HTML 4.01.
  • Required manual proxy configuration via text-file-hacking.
  • Use ProxySelector and other magic?
  • DROID 5

JHOVE

JHOVE (the JSTOR/Harvard Object Validation Environment) is an extensible, Java-based tool, developed by JSTOR (http://www.jstor.org/) and the Harvard University Library, aimed at validating digital objects. It builds on a magic number approach and adds much richer parsing functionality in order to extract more information and assess validity more thoroughly.

  • Apparently tends to fail awkwardly when the network latency is high.
  • Also uses its own system of type identifiers.
  • Adds attributes/metadata?
  • Validates?

Inferno

Inferno is a Java-based tool for inferring rules and applying the resulting rule sets. It is currently heavily biased towards characterization. It was developed at the Danish Royal Library as a proof-of-concept for using rule inference as a unifying approach to characterization, both directly to perform characterization tasks, and indirectly to integrate results from various existing characterization tools in an optimal way.

Inferno has been successfully applied to the characterization of files and text strings by text file encoding (Latin 1, UTF-8, UTF-16LE and UTF-16BE), by file type (text file encoding, JPEG and PNG) and by the language used in a text file (Danish, Swedish, English and Norwegian).

  • Implementation not available yet.
  • Is this done using one of the available rule engines?
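Since the Inferno implementation is not available, the following is only a sketch of the kind of text-encoding characterization described above, with the rules written by hand rather than inferred. The function name and the rule order are my own assumptions; the byte order marks come from the Python codecs module.

```python
import codecs


def guess_text_encoding(data):
    """Guess one of UTF-8, UTF-16LE, UTF-16BE or Latin 1 for some bytes.

    Hand-written rules, applied in order:
      1. A byte order mark decides the matter.
      2. Otherwise, bytes that decode strictly as UTF-8 are UTF-8
         (this includes plain ASCII).
      3. Otherwise fall back to Latin 1, which can decode any bytes.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "Latin 1"
    return "UTF-8"
```

Rule 3 is a fallback rather than a positive identification, and UTF-16 text without a byte order mark is not recognised at all, which hints at why a more systematic, evidence-weighing approach is attractive.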

Other Tools

Preservation Strategies

In many cases, a mixture.

Migration

Authenticity issues.

Emulation

Complexity, Russian dolls.

Living Archive

e.g. The Web.

Limited coverage.

Normalised Archive

Limited coverage and controlled input.

See Quality Control Methods.

Migration Tools & Pathways

Migrations

Could also merge in the CRiB service.

Quality Control Methods

Quality Assurance

This is the hard part.

e.g. the comparator that compares two sets of measured properties and evaluates the difference.
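As a rough illustration of such a comparator, the sketch below takes two dictionaries of measured properties (say, before and after a migration) and reports a verdict per property. The property names, the optional numeric tolerances and the three verdict strings are all assumptions made for the example, not part of any existing tool.

```python
def compare_properties(before, after, tolerances=None):
    """Compare two property sets and return a verdict per property.

    Verdicts are "equal", "different" or "missing" (present on one side
    only). Properties listed in tolerances are treated as numbers and
    count as equal when they differ by no more than the tolerance.
    """
    tolerances = tolerances or {}
    report = {}
    for name in sorted(set(before) | set(after)):
        if name not in before or name not in after:
            report[name] = "missing"
            continue
        old, new = before[name], after[name]
        tolerance = tolerances.get(name)
        if tolerance is not None:
            same = abs(old - new) <= tolerance
        else:
            same = old == new
        report[name] = "equal" if same else "different"
    return report
```

The hard part is not this loop but deciding which properties matter and how much difference is acceptable for each, which is where the notion of significant properties comes back in.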

Digital Object Storage

Here, we will not worry about the system that keeps the bits safe, but start with the assumption that we have some reliable digital object repository that supports one or more protocols, allowing items to be read and written.

Ideally, we want repositories of digital objects, with features like:

Repository Systems

Digital Object Storage Systems

Links

ADAPT ACE

ACE (Auditing Control Environment) is a system that incorporates a new methodology to address the integrity of long term archives using rigorous cryptographic techniques. ACE continuously audits the contents of the various objects according to the policy set by the archive, and provides mechanisms for an independent third-party auditor to certify the integrity of any object.

Acumen

About

Tools for turning a collection of xml metadata (MODS, METS, EAD) and digital assets into an online digital library with a minimum of effort. Includes server software for automatic indexing and presentation, and misc utilities.

Related Links

Archivematica

The Archivematica project is integrating a number of open source tools and applications using a micro-services design pattern to create a comprehensive digital preservation system. Archivematica is designed for compliance with the ISO-OAIS functional model and implements media type preservation plans based on an analysis of the significant characteristics of file formats.

Digital Preservation Software Platform

About

The Digital Preservation Software Platform (DPSP) is free and open source software developed by the National Archives of Australia. The DPSP is a collection of software applications which support the goal of digital preservation.

Content Access Protocols

The pipes that make the sources work.

OAI-PMH

See this overview.
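For a feel of what an OAI-PMH harvest looks like on the wire, here is a small sketch that builds a ListRecords request URL and pulls the record identifiers and resumption token out of a response. The base URL, the sample response and the function names are made up for illustration; the verb, the argument names and the XML namespace are those of OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    A resumption token is an exclusive argument: when one is supplied,
    metadataPrefix is left out of the request.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(
            ".//oai:record/oai:header/oai:identifier", OAI_NS
        )
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    token = None
    if token_element is not None and token_element.text:
        token = token_element.text
    return identifiers, token


# A hand-written, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would loop: request, parse, then request again with the returned token until no token comes back.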

AtomPub

Fedora

  • Supports a range of APIs, some Fedora-specific, and OAI-PMH too.
  • Stores and moves in FOXML

Others

List, from http://discerning.com/topics/standards/resource_management.txt

  • list(resource_path, query_expr, accept_mime_type) these kinds of formats
  • HXDLG http://hdlg.sourceforge.net/ xmlns=http://www.hdlg.info/XML/filesystem
  • manifest.xml xmlns=http://openoffice.org/2001/manifest
  • atom:feed "application/rss+xml revision=http://purl.org/rss/1.0/"
  • RMP (builtin)
  • Web Collections http://www.w3.org/TR/NOTE-XMLsubmit
  • OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/static-repository"
  • TODO: RDDL http://www.rddl.org/rddl2 (explain what namespaces mean) and http://www.w3.org/2001/tag/doc/nsDocuments/
  • simply lists all metadata objects for all immediate children, in an XML response wrapper

  <collection xmlns="http://gupe.org/rmp" path="some/resource_path">
    <collection name="fred" ...>
      <atom:entry>...</atom:entry>
    </collection>
    <data name="fred/xml">
    </data>
  </collection>

Standards for Document Repositories

Metadata Standards

PREMIS

Crosswalking

Migrations that convert between metadata formats, mapping elements of one into the other.
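At its simplest, a crosswalk is a lookup table from source elements to target elements. The sketch below assumes flat records held as dictionaries; the source field names are invented, the targets are Dublin Core element names, and real crosswalks must also cope with nesting, repeated values and lossy mappings.

```python
# Hypothetical source fields mapped to Dublin Core element names.
CROSSWALK = {
    "main_title": "title",
    "author": "creator",
    "issued": "date",
}


def crosswalk(record, mapping=CROSSWALK):
    """Map a flat record into target elements.

    Returns (mapped, unmapped): mapped collects values per target
    element, and unmapped keeps any fields the table does not cover so
    that the loss is visible rather than silent.
    """
    mapped, unmapped = {}, {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped[field] = value
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped
```

Returning the unmapped fields separately is a deliberate choice here: like any migration, a crosswalk can lose information, and that loss is easier to assess when it is reported.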

References

Digital Preservation Projects

Digital Preservation

Software Archives

The Digital Preservation Community

Physical Media

Notes on different physical media for storing digital data.

Disk Image Tools

All disks are block devices... blah... Volumes, partitions, file systems, ???

Imaging Tools

Software for pulling the bits off a disk. Note that forensicswiki.org has some useful info on this.

Drive Emulation

Converting between image formats

File System Support

How to interpret the bits of a disk image and turn it into a set of files. e.g. NTFS, FAT, ADFS, ISO, etc.
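The first step in interpreting an image is usually working out which file system it holds. The sketch below checks a few well-known signature offsets: the ISO 9660 volume descriptor marker, the NTFS OEM identifier, and the FAT type labels in the boot sector. The function and table names are mine, ADFS is left out because I do not know of a simple marker for it, and the FAT labels are informational only, so a match there is a hint rather than proof.

```python
# (offset, signature, label). The FAT labels are informational fields in
# the boot sector and are not guaranteed to be accurate.
FS_SIGNATURES = [
    (32769, b"CD001", "ISO 9660"),
    (3, b"NTFS    ", "NTFS"),
    (82, b"FAT32   ", "FAT32"),
    (54, b"FAT16   ", "FAT16"),
    (54, b"FAT12   ", "FAT12"),
]


def sniff_file_system(image_path):
    """Return a file system label for a raw volume image, or None.

    This expects an image of a single volume; an image of a whole
    partitioned disk would need its partition table read first.
    """
    with open(image_path, "rb") as image:
        for offset, signature, label in FS_SIGNATURES:
            image.seek(offset)
            if image.read(len(signature)) == signature:
                return label
    return None
```

Once the file system is known, the actual extraction of files is best left to an existing driver or library for that file system.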