Another Archiving Tale
November 20, 2003
While researching the use of metadata standards for the long-term archiving of information, I spent time looking into GILS, the Government Information Locator Service. The following is a brief case study of how the state of Texas is using GILS for government records.
The state of Texas built a GILS-based system to provide easy access to information from over 180 state agencies. Each agency had operated as an isolated information silo, unable to share information with the others. It is important to note that the Texas archivists were interested only in making sure that metadata would be created for the legacy data and that the data would be available from a central, searchable location. The project did not deal with converting legacy data from one format into another, but it is considered here because it was concerned with turning Web pages into resources that could be archived and repeatedly accessed over time, even if certain Web pages were no longer being maintained online. This gives a user who needs the material for legal or cultural reasons complete access to the collection of pages.
When the state of Texas first instituted the project, it relied on the participating agencies to alert archivists to any changes on their Web sites. In this manner, 8,000 records were harvested for archiving, each representing either a single Web page or a collection of Web pages (whatever the agency considered to be a Web "publication").
However, the Texas archivists have found that relying on the state agencies for notification is not dependable; many records are missed. They are currently upgrading their software to include a "harvesting" application that will automatically "crawl" the state's Web sites and identify any changed, deleted, or added publications that need to be archived and tracked. Kevin Marsh of the Texas State Library and Archives Commission (TSLAC), who is overseeing the project, said that a test of the new system harvested 32,000 records, a fourfold increase over the 8,000 gathered by the older, manually intensive process.
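To make the idea of identifying "changed, deleted, or added" publications concrete, here is a minimal sketch in Python. It is not how the Texas harvester works internally (the post does not say); it simply assumes that each crawl yields a mapping from URL to a digest of the page content, and compares two such snapshots. The function names, URLs, and page bodies are all invented for illustration.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable digest of a page body, used to detect changes."""
    return hashlib.sha256(content).hexdigest()


def diff_crawls(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two crawl snapshots, each mapping URL -> content digest."""
    added = sorted(url for url in current if url not in previous)
    deleted = sorted(url for url in previous if url not in current)
    changed = sorted(
        url for url in current
        if url in previous and current[url] != previous[url]
    )
    return {"added": added, "changed": changed, "deleted": deleted}


# Hypothetical snapshots: the URLs and page bodies are made up.
last_crawl = {
    "https://agency.example/reports/a.html": fingerprint(b"Report A, version 1"),
    "https://agency.example/reports/b.html": fingerprint(b"Report B"),
}
this_crawl = {
    "https://agency.example/reports/a.html": fingerprint(b"Report A, version 2"),
    "https://agency.example/reports/c.html": fingerprint(b"Report C"),
}

result = diff_crawls(last_crawl, this_crawl)
for status, urls in result.items():
    print(status, urls)
```

A real harvester would also have to decide what counts as a meaningful change (a new timestamp in a page footer probably should not trigger a new archival copy), which is one reason a content digest is only a starting point.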
Harvested publications are preserved on a TSLAC server and published on request through a server at the University of North Texas (UNT). The system is called TRAIL (Texas Records and Information Locator). TRAIL records for publications or Web sites that are currently online link directly to those sites. Non-current records are moved into TSLAC's Electronic Depository Program, and the matching publications are moved to the UNT server. Users can search by subject, agency, keyword, and descriptor fields, as well as by date range and full text. Additionally, MARC records are automatically generated and provided to UNT for its catalog.
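The combination of fielded search, date ranges, and full text can be illustrated with a small Python sketch. This is an in-memory toy under my own assumptions, not TRAIL's actual search interface: the `Record` fields, the `search` function, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    title: str
    agency: str
    subjects: list[str]
    published: date
    full_text: str


def search(records, agency=None, subject=None, start=None, end=None, text=None):
    """Return the records that satisfy every criterion that was supplied."""
    hits = []
    for r in records:
        if agency and r.agency.lower() != agency.lower():
            continue
        if subject and subject.lower() not in [s.lower() for s in r.subjects]:
            continue
        if start and r.published < start:
            continue
        if end and r.published > end:
            continue
        # Full-text criterion: a plain case-insensitive substring match.
        if text and text.lower() not in r.full_text.lower():
            continue
        hits.append(r)
    return hits


# Hypothetical records, invented for illustration only.
records = [
    Record("Water Quality Report", "Example Water Board", ["water", "environment"],
           date(2002, 6, 1), "Annual sampling results for rivers and lakes."),
    Record("Highway Plan", "Example Transport Dept", ["roads"],
           date(2003, 3, 15), "Ten-year schedule for river crossings and highways."),
    Record("Drought Bulletin", "Example Water Board", ["water"],
           date(2003, 8, 20), "Reservoir levels and drought outlook."),
]

by_field = search(records, agency="example water board", start=date(2003, 1, 1))
by_text = search(records, text="river")
blended = search(records, subject="water", text="river")
```

Here `by_field` narrows by agency and date, `by_text` matches on page content alone, and `blended` requires both a subject term and a full-text match.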
TRAIL is based on Blue Angel Technologies' MetaStar Enterprise Suite. This software provides the following functionality:
• Data entry.
• Database management.
• Database search and retrieval utilizing Z39.50 (an ANSI/NISO standard, also adopted as ISO 23950, defining a protocol for computer-to-computer information retrieval).
• User gateway design and management.
MetaStar Enterprise utilizes Oracle 8i as the underlying database. The data is formatted and manipulated with XML tagging. MetaStar Enterprise also utilizes PCDocs Fulcrum technology for harvesting data directly from web servers.
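For readers who have not worked with XML-tagged metadata, the following Python sketch shows the general idea of storing a record as tagged fields and reading it back. It says nothing about MetaStar's real schema: the element names (`record`, `title`, `originator`, `dateOfPublication`) and the sample values are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def record_to_xml(fields: dict[str, str]) -> str:
    """Serialize a flat metadata record as an XML string, one element per field."""
    root = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(root, name)
        child.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> dict[str, str]:
    """Parse the XML produced above back into a flat dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text or "" for child in root}


# Hypothetical record; field names and values are illustrative only.
sample = {
    "title": "Annual Report 2002",
    "originator": "Example Agency",
    "dateOfPublication": "2003-01-15",
}

xml_text = record_to_xml(sample)
round_trip = xml_to_record(xml_text)
print(xml_text)
```

Because every field carries its own tag, the same record can be validated, transformed, or loaded into a database without guessing at field boundaries.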
This software was selected for the following capabilities:
• Z39.50 compliance.
• Ability to work effectively with multiple metadata formats including Dublin Core.
• Capability to blend targeted record searching and full-text harvest and searching.
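Working with multiple metadata formats usually comes down to a crosswalk: a table that maps one scheme's fields onto another's. The Python sketch below maps a locator-style record onto Dublin Core. The Dublin Core element names and namespace are the standard ones, but the source field names on the left of the table, and the mapping itself, are my own assumptions for illustration and not the crosswalk TRAIL uses.

```python
# Standard namespace of the 15-element Dublin Core Metadata Element Set.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# Hypothetical crosswalk: invented local field names -> Dublin Core elements.
CROSSWALK = {
    "title": "title",
    "originator": "creator",
    "subject_terms": "subject",
    "abstract": "description",
    "date_of_publication": "date",
    "url": "identifier",
}


def to_dublin_core(record: dict[str, str]) -> tuple[dict[str, list[str]], list[str]]:
    """Map a flat record to Dublin Core; also report fields with no mapping."""
    dc: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for field, value in record.items():
        element = CROSSWALK.get(field)
        if element is None:
            unmapped.append(field)
        else:
            # Dublin Core elements are repeatable, so collect values in lists.
            dc.setdefault(f"dc:{element}", []).append(value)
    return dc, sorted(unmapped)


# Hypothetical input record.
source_record = {
    "title": "Annual Report 2002",
    "originator": "Example Agency",
    "url": "https://agency.example/reports/2002.html",
    "retention_code": "R-17",
}

dc_record, unmapped_fields = to_dublin_core(source_record)
```

Reporting the unmapped fields matters in practice: a crosswalk to a simpler scheme almost always loses something, and it helps to know what.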
TRAIL runs on two Sun Enterprise 450 servers under Solaris 7.
The State of Texas estimated that the Blue Angel Technologies solution cost less than one-quarter of what it would have cost to develop the system internally, and took far less time. Just as importantly, the cost to maintain the system is extremely low. Currently, managing TRAIL takes less than one full-time staff member at the Texas State Library and Archives Commission.
Lessons learned include:
• GILS metadata is difficult to capture.
• Limited updating and maintenance of GILS records is necessary.
• No clear agreement could be reached on the adequacy of GILS record data elements (perhaps the richer structure provided by EAD could allay this problem).
• Different types of resources are represented in GILS records, and the user community is sometimes confused. Reported problems include:
o Exploiting GILS required an inordinately high degree of user sophistication.
o Users were interested in, or expected to gain access to, full text.
o GILS records were hard to read, contained unnecessary information, and were not linked to the actual source identified.
o The extent of information contained in GILS records varied.
o The service seemed qualitatively and quantitatively unpredictable and uneven.
Posted by Bill Trippe at November 20, 2003 4:08 PM








