Long-term Archiving of Digital Content
October 28, 2003
I had never given much thought to long-term storage and archiving of content. Well, I did to the extent that I typically advocated XML and other standards-based technology—mainly so content and data would outlast any particular software system or application.
But I had never thought much about truly long-term storage—what some in the field like to refer to as the “100 year digital object”—until I was engaged by the State of Washington to help study the feasibility of creating a digital archive. Part of my analysis was to interview other organizations that had undertaken similar initiatives. The most interesting interview was with Harvard University. I have included a summary of the case study here.
The central administration of Harvard University is four years into a project, The Library Digital Initiative, to bring the handling, archiving, and preservation of digital material up to par with the handling of analog material. Harvard has a centralized library services organization that supports the university archives and over 20 independent research and specialized libraries on the Harvard campus and elsewhere in the world. This centralized library services organization, known as the Office for Information Systems, has been developing the physical infrastructure, storage, software, tools, and professional services for digital archiving.
To date, the project has focused on the core infrastructure and on "challenge grants" that have funded archiving efforts in various libraries. The infrastructure in place includes servers, storage, and a wide array of supporting hardware and software falling into three general categories:
• Hardware and software to support the collection of digital material. This includes hardware and software for digitizing and converting analog materials, software for cataloging the digital materials and attaching metadata, hardware and software to support the data repository, and software for indexing the digital text and metadata.
• Hardware and software to support access to digital material. This includes access tools such as portals, catalogs, and other finding aids, as well as delivery tools that allow users to download and view textual, image-based, multimedia, and cartographic data.
• Core software for functions such as authentication and authorization, name administration, and name resolution.
The projects completed and in development under the challenge grant program represent a wide variety of data and content types, reflective of the many research specialties at an organization such as Harvard. They include:
• A project to digitize and catalog biomedical images such as CAT scans and MRIs that have lasting value to researchers and educators. Images have been converted to high-resolution Tagged Image File Format (TIFF), with lower-resolution Joint Photographic Experts Group (JPEG) images used for Web distribution. Metadata is being captured in a specialized eXtensible Markup Language (XML) vocabulary that was designed by Harvard.
• A project to digitize, catalog, and index the annual reports of the University, going back to the 1800s, when it began producing formal annual reports. Over 100,000 pages have been scanned to high-resolution TIFF (again using JPEG for Web distribution), with structural metadata being captured using the METS (Metadata Encoding and Transmission Standard) XML vocabulary. The resulting reports can be viewed in a Web-based page-turning application that uses the METS metadata for navigation. Sections or entire reports can then be automatically collected into Portable Document Format (PDF) files for printing and distribution.
• A project to digitize and index older directories of United States and United Kingdom businesses. Printed directories are being re-keyed into an XML vocabulary that captures elements such as company name and address. The XML is then rendered into HTML for viewing through the use of an XSLT (Extensible Stylesheet Language Transformations) style sheet.
• Projects to digitize, catalog, and index primary field research materials in such diverse fields as botany, Slavic languages, and music. In most of these cases, image data is captured as TIFF, audio data as AIFF (Audio Interchange File Format) and metadata in an XML vocabulary such as METS.
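To make the business directory workflow above more concrete, here is a minimal sketch of turning re-keyed XML into HTML for viewing. Harvard does this with an XSLT style sheet; the sketch below swaps in Python's standard-library ElementTree to show the same idea. The element names (directory, entry, name, address) and the two sample entries are invented for illustration and are not Harvard's actual vocabulary or data.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical re-keyed directory data; the element names are assumptions.
SAMPLE = """<directory>
  <entry>
    <name>Example Trading Co.</name>
    <address>1 Sample Street, London</address>
  </entry>
  <entry>
    <name>Smith &amp; Sons</name>
    <address>22 High Road, Leeds</address>
  </entry>
</directory>"""


def directory_to_html(xml_text: str) -> str:
    """Render each <entry> as one row of a simple HTML table."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("entry"):
        name = entry.findtext("name", default="").strip()
        address = entry.findtext("address", default="").strip()
        # Escape the text so characters such as & stay valid in HTML.
        rows.append(
            f"<tr><td>{escape(name)}</td><td>{escape(address)}</td></tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    print(directory_to_html(SAMPLE))
```

The point of keeping the content in a neutral XML form is that the rendering step can be replaced (new style sheet, new output format) without touching the archived data.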
While the projects represent a wide variety of material, they share a number of attributes in how they handle digitization, data conversion, and the application of metadata. These attributes include:
• There is not a single, monolithic approach to conversion. Researchers and archivists have a number of options for content conversion methods and target formats. In some cases, document material is captured as page images only (always TIFF), sometimes as page images with full text captured through OCR for search and retrieval, and sometimes in XML as a neutral format for later reprocessing. The archivists make the decision on approach, often in consultation with technical staff from the centralized Office for Information Systems.
• The greater "intellectual" emphasis seems to be on metadata. Whatever format the data is captured in, Harvard is promulgating standards for metadata. As mentioned above, they favor the METS XML schema for structural, administrative, and technical metadata. For finding aids, they have based their tools on the EAD (Encoded Archival Description) DTD; all content contributed to the central repository and catalog is submitted with an EAD-based finding aid.
• Their approach is pragmatic. While they do have substantial resources for digital archiving, they are faced with practical decisions on every project. For example, the annual reports project uses TIFF pages supported by OCR-captured text for indexing. They use the OCR text "as is," without correction, knowing that the accuracy is somewhere around 95 percent. However, because the OCR text is used only for search and retrieval of pages, the retrieval results are better than the raw accuracy of the text would suggest. That is, OCR-scanned text with some errors will still successfully return 99 percent of pages, because hits are counted per page rather than per word. (For example, suppose a word that appears twice on a page is captured incorrectly once and correctly once. The page is still correctly returned on a full-text search.)
• They have relied heavily on commercial applications for the repository and indexing applications themselves, while the submission, discovery, and delivery applications have been largely custom Java applets and Java servlets accessed through browser interfaces. Thus, the full text indexes are managed in Oracle 9i's Text feature, and the XML metadata repository is managed in Software AG's Tamino.
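The OCR argument in the list above can be checked with a little arithmetic. The sketch below assumes that the 95 percent figure applies to each occurrence of a word and that errors are independent of one another; both are simplifications of mine, not claims from Harvard. Under those assumptions, a page is found as long as at least one occurrence of the search term was captured correctly, so the chance of a hit is 1 - (1 - p)^k for a word that appears k times.

```python
def page_hit_probability(word_accuracy: float, occurrences: int) -> float:
    """Chance that at least one occurrence of a word on a page was OCRed correctly.

    Assumes each occurrence is recognized independently with the same
    probability, which is a simplification of real OCR behavior.
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(page_hit_probability(0.95, k), 6))
```

With a per-occurrence accuracy of 0.95, a term that appears twice on a page is found about 99.75 percent of the time, which is in line with the 99 percent figure quoted above.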
It is worth noting that, while Harvard's efforts to date have been substantial, they are largely based on the conversion of non-digital legacy data. That is, there is very little "born digital" data in Harvard's Library Digital Initiative. In fact, the university archives still have an official policy of not accepting digital material, so the challenge grant projects have fallen outside this official purview of the archives.
Despite this, Harvard's efforts are still very instructive, and have positioned the institution very well to continue with digital archiving projects. To date, they have 350,000 digital objects under management, totaling some three terabytes of content. They also have impressive infrastructure for collecting, digitizing, cataloging, storing, and distributing digital material.
Harvard also looked at costs associated with fee-for-service archives, something other projects did not do. In particular, they looked at what the Online Computer Library Center, Inc. (OCLC) charges for storage. OCLC stores on a "bag of bytes" basis; they do not provide viewing technology, so the onus for that is on the customer. OCLC's charges are:
• $60 per GB if < 100 GB.
• $32 per GB if 101-1000 GB.
• $15 per GB if > 1000 GB.
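As a rough illustration of how these tiers play out, here is a small sketch that computes a charge for a given volume. Several things are assumptions on my part: that the rate for the tier applies to every gigabyte stored (rather than being applied marginally), that exactly 100 GB falls in the first tier, and that a terabyte is counted as 1,000 GB. The billing period is not modeled, and the function names are made up for the example.

```python
def oclc_rate_per_gb(total_gb: float) -> int:
    """Return the per-GB rate in dollars for a given total volume.

    Tier boundaries follow the list above; treating exactly 100 GB as
    first tier is an assumption.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if total_gb <= 100:
        return 60
    if total_gb <= 1000:
        return 32
    return 15


def oclc_charge(total_gb: float) -> float:
    """Total charge, assuming the tier rate applies to every GB stored."""
    return total_gb * oclc_rate_per_gb(total_gb)


if __name__ == "__main__":
    for gb in (50, 500, 3000):
        print(f"{gb} GB -> ${oclc_charge(gb):,.0f}")
```

Under these assumptions, a three-terabyte collection like Harvard's would fall in the lowest-rate tier, at $15 per GB, or $45,000 in total.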
Harvard also figured the cost per year of physical storage versus digital storage, based on a collection of 2,202 volumes (729,000 pages, 322 pages per volume).
The conclusion is to store only the ASCII data (including metadata markup) wherever possible. The metadata capture and/or creation will cost more up front, but the payoff in access and cost savings over the long run will be enormous.
Posted by Bill Trippe at October 28, 2003 4:42 PM








