Look at all the people…

A few months ago, one of my colleagues, Paul Hammer, a software developer with the UMD Libraries’ Software Systems Development and Research (SSDR), stopped by my office and mentioned to me that something in one of my recent blog posts was bothering him. Specifically, it was these two sentences:

Unfortunately, unlike our dependable analog collections, keeping track of all of this digitized content can sometimes be unwieldy.   One of my big goals is to reach the point where an inventory of these digital collections can provide me with the equivalent of a “Shelf location” and statistics at the push of a button.

Paul reminded me that a lot of human effort, management, and coercion went into acquiring, tracking, cataloging, and circulating information in the analog world.  If staff, managers, and the profession had not diligently encouraged librarians, archivists, and other professionals to use similar standards and practices, no two collections would be remotely comparable.  He noted: “We need to recognize that this effort is just as big and difficult in the computer world.  Computers do not do all of this work for you, regardless of how much we wish it were otherwise.  Computers just offer a really big room of shelves on which to put things and the ability to program helpers.  Helpers who are only capable of doing *exactly* what you ask of them — at nearly the speed of light.”

I want to thank Paul for putting things in perspective.  First, his comments reminded me that Rome was not built in a day.  Second, as Paul, and many of the recent projects I have worked on, have shown, computers will only do exactly what you tell them to do, and they contain only as much logic as humans provide to them.  Third, I think it is safe to say that standards and best practices are even more important in the digital world than in the analog one.

Last year, the UMD Libraries received funding for a project to digitize a portion of correspondence written by the American author Katherine Anne Porter, whose papers reside at the University of Maryland.  What seemed at first to be a straightforward project turned into quite a complex and interesting one that is still not 100% complete.  At least a dozen UMD Libraries’ staff participated in some portion of the project, not to mention external parties such as our digitization vendor.  Joanne Archer in Special Collections and University Archives (SCUA) managed the project.  Two content specialists within SCUA (Librarian Emeritus Beth Alvarez and PhD candidate Liz DePriest) selected the approximately 2,000 letters for the first phase of digitization.  Robin Pike, Manager, Digital Conversion and Media Reformatting (DCMR), facilitated the contract negotiations with the digitization vendor.  The correspondence was digitized in eight batches, and Special Collections staff had to prepare metadata for every letter and assemble the packages for delivery.  Once digitization was complete, Eric Cartier (DCMR) performed QC on all of the deliverables (TIFF, JPG, OCR text, and hOCR XML).  Trevor Muñoz, Assistant Dean for Digital Humanities Research, used the raw data to develop several proof-of-concept possibilities for future data use and analysis.  Josh Westgard, graduate assistant for Digital Programs and Initiatives (DPI), facilitated transfer of the files for preservation.

And that is not all.  Fedora as a repository is an excellent example of a computer system that needs to be told exactly what to do.  We had not, up to that point, added any complex objects of the type these letters represent (digital objects composed of an image, an OCR text file, and an hOCR file).  DPI gathered the requirements for this new object type (UMD_CORRESPONDENCE) and delivered them to Software Systems Development and Research (SSDR).  Ben Wallberg, Manager, SSDR, and two developers, Irina Belyaeva and Paul Hammer, worked to translate those requirements into reality.  What followed was a period of testing and analysis.  Meanwhile, we currently add content to our Fedora repository in three ways: 1) one by one, using a home-grown web-based administrative interface; 2) using project-specific batch loading scripts that require developer mediation; and 3) using a batch loader developed by Josh Westgard in DPI that currently works only with audio and video content.  For the Katherine Anne Porter project, logic dictated that we go with Door #2 and use a project-specific batch loading process.  In this case, SSDR and DPI agreed to use this as an opportunity to develop and test an alternate method for batch ingest, with an eye towards developing a more generic, user-driven batch loader in the future.

Irina and Paul worked on the batch loader for Katherine Anne Porter, and, when it was ready for testing, we ran into a series of minor but educational complications.  First, it was necessary to massage and clean up the metadata much more than anticipated, since SCUA had been using the metadata spreadsheet to capture more information than was needed for ingest.  Second, other types of metadata errors caused the load to fail numerous times.  This led, however, to the development of more rigorous validation checks on the metadata prior to ingest.  After the load was complete, I worked with Josh Westgard to analyze the results, and we uncovered additional minor glitches, which we will account for in later loads.
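To give a sense of what “validation checks on the metadata prior to ingest” can look like, here is a minimal sketch in Python.  The field names, date format, and file-extension rule are illustrative assumptions, not the project’s actual schema; the idea is simply to catch problems in a spreadsheet export before a batch load fails partway through.

```python
import csv
import io
import re

# Hypothetical required columns; a real project schema would differ.
REQUIRED_FIELDS = ["identifier", "title", "date", "filename"]

def validate_row(row):
    """Return a list of error messages for one metadata row."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not row.get(field, "").strip():
            errors.append("missing required field: " + field)
    # Assume letters carry an ISO-style date (YYYY, YYYY-MM, or YYYY-MM-DD).
    date = row.get("date", "").strip()
    if date and not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", date):
        errors.append("malformed date: " + repr(date))
    # Assume master files for the project are TIFFs.
    filename = row.get("filename", "").strip()
    if filename and not filename.lower().endswith(".tif"):
        errors.append("unexpected file extension: " + repr(filename))
    return errors

def validate_spreadsheet(text):
    """Validate a CSV export of the metadata spreadsheet before ingest.

    Returns a dict mapping spreadsheet line numbers (header is line 1)
    to the list of errors found on that line.
    """
    problems = {}
    for lineno, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        errors = validate_row(row)
        if errors:
            problems[lineno] = errors
    return problems
```

Running every batch through a pass like this before handing it to the loader turns a failed ingest into a line-numbered error report the metadata creators can act on.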

The work is not complete.  The letters are ingested, but not yet viewable.  We still need to make changes to both our back-end administrative tool and our front-end public interface in order to accommodate this new content type.  And who knows what other user needs and requirements will necessitate additional work.  The data itself is rich and interesting.  Our hope is that it will be used both by scholars conducting traditional archival research and by digital humanists interested in deciphering and analyzing the texts by computer-driven means.

This spring, Digital Systems and Stewardship hired its first-ever Project Manager.  Ann Levin comes to the UMD Libraries with years of experience working on systems much more complex than our own.  As is obvious from the project description above, all of our work currently touches many different people with different skills and priorities within our organization.  It is our hope that we can start to formalize some of this work, develop more consistent workflows, and establish policies and procedures that ensure adherence to specified best practices and standards moving forward.  The work has already started.  As Paul correctly pointed out to me several months ago, working with computers requires just as much human involvement as our analog work, if not more.  Planning is key.  One reason the word “digital” causes instant anxiety for many people is that, just as access and indexing can move much more swiftly in a digital system than in an analog one, it is also possible to eliminate data entirely, and instantly.  Paul provided this analogy:

Imagine an archive where everyone working there had the power to empty and restock the shelves with a wave of their hand.  That any given shelf could suddenly disappear.  That a box that used to be really popular can still be taken off the shelf but we have forgotten how to open it.  All of these things are all too possible in digital storage.  Think of the extra vigilance necessary just to know that what you have is really what you have.

Scary. But my original sentiment remains the same. With every new project, we move closer towards trusting our work, and reaching a point where creating, managing, and providing access to digital content really can seem as simple as the “push of a button.”  We just need to recognize all of the work, effort, and vigilance that goes into creating that single button.

Where is all of our digital stuff?

I like to think that we at the University of Maryland are not unlike other university libraries: we have a lot of digital content, and, just as with books, we have it in a lot of different places.  Unfortunately, unlike our dependable analog collections, keeping track of all of this digitized content can sometimes be unwieldy.  One of my big goals is to reach the point where an inventory of these digital collections can provide me with the equivalent of a “Shelf location” and statistics at the push of a button.  One project I have been working on involves documenting and locating all of the UMD Libraries’ digital content, as a first step towards this goal.  I am focusing right now on things that we create or own outright, as opposed to content that comes to us in the form of a subscription database, which is a whole issue in itself.  We don’t have one repository to rule them all in a physical sense.  Rather, I like to think of our “repository” at present as an “ecosystem.”  Here are some parts of our digital repository ecosystem.

DRUM (DSpace) http://drum.lib.umd.edu

Stats: Close to 14,000 records.  Approximately 8,800 of these are University of Maryland theses and dissertations.

DRUM is the Digital Repository at the University of Maryland.  Currently, there are three types of materials in its collections: faculty-deposited documents, a Library-managed collection of UMD theses and dissertations, and collections of technical reports.  As a digital repository, DRUM maintains files for the long term.  Descriptive information on the deposited works is distributed freely to search engines.  Unlike the Web, where pages come and go and addresses to resources can change overnight, repository items have a permanent URL, and the UMD Libraries are committed to maintaining the service into the future.  In general, DRUM is format-agnostic, and strives to preserve only the bitstreams submitted to it in a file system and the metadata in a Postgres database.  DSpace requires the maintenance of a Bitstream Format Registry, but this serves merely as a method to specify allowable file formats for upload; it does not guarantee things like display, viewers, or emulation.  DSpace does provide some conversion services, for example, conversion of PostScript to PDF.  DRUM metadata may be harvested via OAI-PMH, and portions of it are sent to OCLC via the Digital Collections Gateway.  A workflow exists to place thesis and dissertation metadata into OCLC.  Most of DRUM is accessible via Google Scholar.
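OAI-PMH harvesting of the kind mentioned above is a simple HTTP-and-XML affair.  The sketch below shows the general shape in Python: building a ListRecords request and pulling Dublin Core titles out of the response.  The verb, parameter names, and XML namespaces come from the OAI-PMH and Dublin Core specifications; the endpoint path used in the usage note is an assumption, as DRUM’s actual OAI base URL may differ.

```python
import urllib.parse
import xml.etree.ElementTree as ET

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)

def titles_from_response(xml_text):
    """Extract the Dublin Core titles from a ListRecords response."""
    root = ET.fromstring(xml_text)
    # dc:title elements live in the Dublin Core namespace defined by the spec.
    return [el.text for el in
            root.iter("{http://purl.org/dc/elements/1.1/}title")]

# A trimmed illustration of what a ListRecords response looks like:
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>A sample dissertation</dc:title>
    </oai_dc:dc>
  </metadata></record></ListRecords>
</OAI-PMH>"""
```

A harvester would fetch something like `list_records_url("http://drum.lib.umd.edu/oai/request")` (hypothetical path), feed the body to `titles_from_response`, and follow the protocol’s resumption tokens for subsequent pages.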

Digital Collections (Fedora) http://digital.lib.umd.edu

Stats: 21,000 bibliographic units representing over 220,000 discrete digital objects.

Digital Collections is the portal to digitized materials from the collections of the University of Maryland Libraries.  It is composed primarily of content digitized from our analog holdings in Special Collections and other departments.  The University of Maryland’s Digital Collections support the teaching and research mission of the University by facilitating access to digital collections, information, and knowledge.  Content is presently limited to image files (TIFF/JPG), TEI, EAD, and streaming audio and video.  Fedora manages the descriptive metadata, the technical metadata, and the access derivative files.  While Fedora can be developed to accept any format, our implementation currently handles only TIFF and JPG images and TEI- or EAD-encoded XML documents with ease.  We are not currently using Fedora to inventory or keep track of our preservation TIFF masters.  Audiovisual records are essentially metadata pointers to an external streaming system.  Fedora metadata may be harvested via OAI-PMH, and portions of it are sent to OCLC via the Digital Collections Gateway.  Google does crawl the site, and many resources are available via a Google search.

Chronicling America (Library of Congress) http://www.chroniclingamerica.loc.gov

Stats: We have currently submitted approximately 25,000 newspaper pages to the Library of Congress, and anticipate a total of 100,000 pages by August 2014.

Chronicling America is the website that provides access to the files created and submitted as part of the National Digital Newspaper Project (NDNP) grants.  We submit all files (TIFF, JP2, PDF, ALTO XML) to the Library of Congress, and they archive a copy.  We are currently archiving a copy locally, in addition to the copies archived by LoC.  One complete copy of each batch is sent to UMD’s Division of IT for archiving. In addition, Digital Systems and Stewardship saves a copy of each batch to local tape backup, and retains the original batch hard drive in the server room in McKeldin Library.

HathiTrust http://www.hathitrust.org

Stats: Nothing yet!  We plan to begin submitting content in 2014.

HathiTrust provides long-term preservation and access services to member institutions.  For institutions with content to deposit, participation enables immediate preservation and access services, including bibliographic and full-text searching of the materials within the larger HathiTrust corpus, reading and download of content where available, and the ability to build public or private collections of materials. HathiTrust accepts TIFF images and OCR files in either ALTO XML or hOCR.  They provide conversion tools to convert TIFF masters into JPEG 2000 for access purposes.

Internet Archive http://www.archive.org

Stats: Almost 4,000 books, with over 840,000 pages

The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library.  Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format.  The UMD Libraries contribute content to the Internet Archive in two ways.  First, we submit material to be digitized at a subsidized rate as part of the Lyrasis Mass Digitization Collaborative.  The material must be relatively sturdy, and it must either be out of copyright or we must be able to prove that we have permission from the copyright holder.  We have also been adding content digitized in-house (usually rare or fragile material), uploading the access (PDF) files and metadata to the Internet Archive ourselves.  The Internet Archive produces JPEG 2000 and PDF files at the time of digitization, including both cropped and uncropped JPEG 2000 files for each volume.  The UMD Libraries save the cropped JPEG 2000 files and the PDFs locally and archive them with the UMD Division of IT.

***

I am already aware of other types of digital content that we will have to track: born-digital records and personal files from our Special Collections and University Archives; eBooks in PDF and other formats that we purchase for the collection and must determine how to serve to the public; publications, such as journals, websites, and databases; and research data.  I hope to return to this post in 2020 and smile at how confused, naive, and inexperienced we all were at all of this.  Until then, I will keep working to pull everything together as best I can.

National Agenda for Digital Stewardship

The National Digital Stewardship Alliance (NDSA), a voluntary membership organization of leading government, academic, and private sector organizations with digital stewardship responsibilities that collaborate to “establish, maintain, and advance the capacity to preserve our nation’s digital resources for the benefit of present and future generations,” has just released its 2014 National Agenda for Digital Stewardship.

In the Agenda, they recognize that “it has become increasingly difficult to adequately preserve valuable digital content because of a complex set of interrelated societal, technological, financial, and organizational pressures.”  Among the identified “pressures” are the usual suspects: lack of time, funding, staff, priorities, etc.  Here at the University of Maryland Libraries, we are just now formalizing our digital preservation policies and procedures, despite the fact that we have been creating and managing digital collections for close to a decade. We are not unusual.  Groups like the NDSA, who are actively communicating with each other, developing standards, and encouraging collaboration, are helping to demystify the complicated world of digital preservation and to make it seem an attainable goal.

The Agenda identifies four areas of digital content that they feel need special attention this year: electronic records, research data, web and social media, and moving image and recorded sound.  All of these content areas are first and foremost on our minds at the University of Maryland Libraries.  In the past year, we have joined forces with our colleagues in the Maryland Institute for Technology in the Humanities (MITH) to form a Born-Digital Working Group to develop policies and procedures for working with born-digital content, including electronic records.  In 2012, we purchased a FRED workstation.   We do not have everything figured out, not by a long shot, but the fact that we are taking incremental steps towards tackling this issue is important.  In 2012 we also hired a Research Data Librarian, who is in the process of working with a project team to develop a business case for research data services at the University of Maryland.  We have been archiving web content for several years using the Internet Archive’s Archive-It tool.  And in the past year, we have greatly increased our digitization of audio recordings, including creating a digitization lab for in-house work.

So we can pat ourselves on the back.  It is sometimes difficult to recognize and appreciate the work that we do when it seems like there is still so much left to be done. We need to develop better strategies for providing access to our digital content, for maintaining and preserving that content, and for planning into the future.  We are working on it.