This notebook is a place to record useful tidbits of information about digital preservation. The Wikipedia page on Digital Preservation provides quite a good overview of the field, but this notebook intends to dig a little deeper.
Roughly speaking, the technical side of digital preservation can be broken down into two parts: keeping the actual bits safe over time, and keeping those bits meaningful and accessible.
Most of this notebook deals with the latter issue, i.e. with how to preserve the meaning of digital objects so that they remain accessible over time. Of course, all of that work is based on the assumption that we can keep the actual binary data safe from bit-rot, and so the first section will look at that issue.
See also items tagged as digital preservation
The other sections of this notebook generally assume that there will be some digital object storage system that can be relied upon to hold the bytes safely over time. Indeed, I believe it is sensible to deal with the bit-storage and access problems separately, in the sense that any access solution should be designed to work independently of the chosen bit-store solution.
I used to think that the bit-storage problem was essentially solved, but I have since realised that fighting bit-rot is not as easy as I originally thought.
...tbc...
See also this related effort: Wikibook: Choosing The Right File Format.
Preservation-Format Approach.
See JISC: The Significant Properties of Digital Objects
Issues with this concept are legion.
"Magic numbers" is the name for the standard UNIX mechanism used to identify file types. The approach is usually considered a UNIX practice, but it has since spread to many other platforms. The mechanism is based on a database that maps byte strings and their positions to file types. Common examples are GIF, JPEG and TIFF, which all contain reliable markers, at least for identifying the general file format. Minor version identification (i.e. format characterization rather than format identification) often requires a more sophisticated approach, e.g. parsing of header structures. On a UNIX system, the magic number mechanism can be accessed using the file command, e.g.:
kb005264:~/Code/Graphics andersjohansen$ file logo.png
logo.png: PNG image data, 64 x 64, 8-bit/color RGBA, non-interlaced
The advantages are fast performance and reliable identification for many file formats. The limitations are that minor version variations often can’t be determined (i.e. characterization is less than reliable), that the method cannot handle formats that lack reliable identification strings (such as text files), and that it accepts the evidence uncritically (e.g. if a text file begins with the magic number for a GIF file, the file command will identify it as a GIF file, regardless of other evidence).
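As a rough illustration of the mechanism, here is a minimal Python sketch of a magic number lookup. The SIGNATURES table and the identify function are hypothetical names made up for this example, and the table holds only a handful of well-known signatures. The real database used by the file command is far larger and supports much richer tests.

```python
# Hypothetical, deliberately tiny signature table: (offset, byte string, label).
# The real magic database used by file(1) is far larger and more expressive.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (0, b"GIF87a", "GIF image data"),
    (0, b"GIF89a", "GIF image data"),
    (0, b"\xff\xd8\xff", "JPEG image data"),
    (0, b"II*\x00", "TIFF image data (little-endian)"),
    (0, b"MM\x00*", "TIFF image data (big-endian)"),
]


def identify(data: bytes) -> str:
    """Return the label of the first signature found at its expected offset."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


if __name__ == "__main__":
    print(identify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
    # Plain text that happens to start with a GIF marker is still called a GIF,
    # because nothing beyond the signature is examined.
    print(identify(b"GIF89a is how this text file happens to begin"))
    print(identify(b"Just some ordinary text"))
```

The second call shows the "uncritical" limitation in miniature: only the leading bytes are checked, so any other evidence in the file is ignored.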
DROID identifies files using the magic numbers approach.
JHOVE (the JSTOR/Harvard Object Validation Environment) is an extensible, Java-based tool, developed by JSTOR (http://www.jstor.org/) and the Harvard University Library, aimed at validating digital objects. It builds on a magic number approach and adds much richer parsing functionality in order to extract more information and assess validity more thoroughly.
Inferno is a Java-based tool for inferring rules and applying the resulting rule sets. It is currently heavily biased towards characterization. It was developed at the Danish Royal Library as a proof-of-concept for using rule inference as a unifying approach to characterization, both directly, to perform characterization tasks, and indirectly, to integrate results from various existing characterization tools in an optimal way.
Inferno has been successfully applied to the characterization of files and text strings by text file encoding (Latin 1, UTF-8, UTF-16LE and UTF-16BE), by file type (text file encoding, JPEG and PNG) and by the language used in a text file (Danish, Swedish, English and Norwegian).
In many cases, a mixture.
Authenticity issues.
Complexity, Russian dolls.
e.g. The Web.
Limited coverage.
Limited coverage and controlled input.
Could also merge in the CRiB service.
This is the hard part.
e.g. the comparator that compares two sets of measured properties and evaluates the difference.
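To make the comparator idea concrete, here is a minimal Python sketch, assuming that the measured properties arrive as plain dictionaries. The function name compare_properties, the status labels and the sample property names are all hypothetical, chosen for illustration only. Deciding which differences actually matter is the hard part and is not attempted here.

```python
import math


def compare_properties(before: dict, after: dict, rel_tol: float = 0.0) -> dict:
    """Compare two sets of measured properties and report a status per property.

    Numeric values are compared with an optional relative tolerance;
    everything else is compared for exact equality.
    """
    report = {}
    for name in sorted(set(before) | set(after)):
        if name not in after:
            report[name] = "missing-after"
        elif name not in before:
            report[name] = "missing-before"
        else:
            a, b = before[name], after[name]
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in (a, b)
            )
            if numeric:
                same = math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)
            else:
                same = a == b
            report[name] = "equal" if same else "different"
    return report


if __name__ == "__main__":
    # Hypothetical measurements taken before and after a migration.
    original = {"width": 64, "height": 64, "colour_depth": 8, "format": "PNG"}
    migrated = {"width": 64, "height": 64, "format": "TIFF", "compression": "LZW"}
    for prop, status in compare_properties(original, migrated).items():
        print(prop, status)
```

A real comparator would also need to know which properties are expected to change (here, the format) and which must be preserved, which is a policy question rather than a coding one.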
Here, we will not worry about the system that keeps the bits safe, but start with the assumption that we have some reliable digital object repository that supports one or more protocols, allowing items to be read and written.
Ideally, we want repositories of digital objects, with features like:
ACE (Auditing Control Environment) is a system that incorporates a new methodology to address the integrity of long term archives using rigorous cryptographic techniques. ACE continuously audits the contents of the various objects according to the policy set by the archive, and provides mechanisms for an independent third-party auditor to certify the integrity of any object.
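For comparison, the simplest form of integrity auditing is a plain checksum manifest. The Python sketch below is not how ACE works (its cryptographic scheme and third-party certification are considerably more rigorous). It only illustrates the basic idea of recording digests and re-checking them later. The function names and the shape of the report are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict) -> dict:
    """Re-hash everything under root and compare against a stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name
            for name in manifest
            if name in current and current[name] != manifest[name]
        ),
    }
```

Note that a manifest stored alongside the data can rot or be tampered with just like the data itself, which is one reason systems such as ACE involve an independent auditor.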
Tools for turning a collection of XML metadata (MODS, METS, EAD) and digital assets into an online digital library with a minimum of effort. Includes server software for automatic indexing and presentation, and miscellaneous utilities.
The Archivematica project is integrating a number of open source tools and applications using a micro-services design pattern to create a comprehensive digital preservation system. Archivematica is designed for compliance with the ISO-OAIS functional model and implements media type preservation plans based on an analysis of the significant characteristics of file formats.
The Digital Preservation Software Platform (DPSP) is free and open source software developed by the National Archives of Australia. The DPSP is a collection of software applications which support the goal of digital preservation.
The pipes that make the sources work.
See this overview.
List, from http://discerning.com/topics/standards/resource_management.txt
<collection xmlns="http://gupe.org/rmp" path="some/resource_path">
  <collection name="fred" ...>
    <atom:entry>...</atom:entry>
  </collection>
  <data name="fred/xml">
  </data>
</collection>
Migrations that convert between metadata formats, mapping elements of one into the other.
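As a sketch of what such a mapping might look like in practice, the Python example below applies a small crosswalk table to a flat metadata record. The CROSSWALK table is purely illustrative, loosely in the spirit of a Dublin Core to MODS mapping, and should not be read as an authoritative crosswalk. The migrate function and the dotted-path notation are assumptions made for this example.

```python
# Illustrative crosswalk only: source field -> dotted path in the target record.
CROSSWALK = {
    "title": "titleInfo.title",
    "creator": "name.namePart",
    "date": "originInfo.dateIssued",
}


def migrate(record: dict, crosswalk: dict) -> tuple:
    """Map the fields of a flat record into a nested target record.

    Returns (migrated, unmapped). Fields without a mapping are returned
    separately rather than silently dropped, so that any loss is visible.
    """
    migrated, unmapped = {}, {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped[field] = value
            continue
        *parents, leaf = target.split(".")
        node = migrated
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return migrated, unmapped


if __name__ == "__main__":
    source = {"title": "A Notebook", "creator": "A. N. Other", "rights": "CC-BY"}
    converted, left_over = migrate(source, CROSSWALK)
    print(converted)
    print(left_over)
```

Reporting the unmapped fields matters for preservation purposes, since elements that have no equivalent in the target format are exactly where a metadata migration loses information.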
Notes on different physical media for storing digital data.
All disks are block devices... blah... Volumes, partitions, file systems, ???
Software for pulling the bits off a disk. Note that forensicswiki.org has some useful info on this.
How to interpret the bits of a disk image and turn it into a set of files. e.g. NTFS, FAT, ADFS, ISO, etc.
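The first step in interpreting an image is usually working out which file system it holds. The Python sketch below checks a few well-known on-disk markers. It assumes a raw image of a single, unpartitioned volume, and the names VOLUME_SIGNATURES and guess_file_system are hypothetical. Note that the FAT type strings are informational fields rather than authoritative ones, so this is a hint and not a definitive identification.

```python
# (offset, byte string, label) for a few well-known volume markers.
# Assumes the image starts at the volume's first sector (no partition table).
VOLUME_SIGNATURES = [
    (3, b"NTFS    ", "NTFS"),        # OEM ID field in the NTFS boot sector
    (82, b"FAT32   ", "FAT32"),      # file system type string in the FAT32 boot sector
    (54, b"FAT16   ", "FAT16"),      # file system type string in the FAT12/16 boot sector
    (54, b"FAT12   ", "FAT12"),
    (32769, b"CD001", "ISO 9660"),   # standard identifier in the volume descriptor at sector 16
]


def guess_file_system(image) -> str:
    """Guess the file system held in a seekable binary file-like object."""
    for offset, magic, label in VOLUME_SIGNATURES:
        image.seek(offset)
        if image.read(len(magic)) == magic:
            return label
    return "unknown"
```

In use, the image would be opened in binary mode, e.g. with open(path, "rb"), and passed to guess_file_system. Actually turning the image into a set of files then requires a parser for the specific file system that was found.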