I like to think that we, at the University of Maryland, are not unlike other university libraries, in that we have a lot of digital content, and, just like with books, we have it in a lot of different places. Unfortunately, unlike our dependable analog collections, keeping track of all of this digitized content can sometimes be unwieldy. One of my big goals is to reach the point where an inventory of these digital collections can provide me with the equivalent of a “Shelf location” and statistics at the push of a button. One project I have been working on has involved documenting and locating all of the UMD Libraries’ digital content, in a first step towards this goal. I am focusing right now on things that we create or that we own outright, vs. content that comes to us in the form of a subscription database, which is a whole issue in itself. We don’t have one repository to rule them all in a physical sense. Rather, I like to think of our “repository” at present as an “ecosystem.” Here are some parts of our digital repository ecosystem.
DRUM (DSpace) http://drum.lib.umd.edu
Stats: Close to 14,000 records. Approximately 8,800 of these are University of Maryland theses and dissertations.
DRUM is the Digital Repository at the University of Maryland. Currently, there are three types of materials in the collections: faculty-deposited documents, a Library-managed collection of UMD theses and dissertations, and collections of technical reports. Because DRUM is a digital repository, its files are maintained for the long term. Descriptive information on the deposited works is distributed freely to search engines. Unlike the Web, where pages come and go and addresses to resources can change overnight, repository items have a permanent URL, and the UMD Libraries are committed to maintaining the service into the future. In general, DRUM is format-agnostic: it strives to preserve only the bitstreams submitted to it, which are stored in a file system, and the metadata, which is stored in a Postgres database. DSpace requires the maintenance of a Bitstream Format Registry, but this serves merely as a method to specify allowable file formats for upload; it does not guarantee things like display, viewers, or emulation. DSpace does provide some conversion services, for example, conversion of PostScript to PDF. DRUM metadata may be harvested via OAI-PMH, and portions of it are sent to OCLC via the Digital Collections Gateway. A workflow exists to place thesis and dissertation metadata into OCLC. Most of DRUM is accessible via Google Scholar.
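For readers who have not seen OAI-PMH harvesting in practice, here is a minimal Python sketch of what a harvest request and a parsed response might look like. The `verb`, `metadataPrefix`, and `set` arguments and the XML namespaces come from the OAI-PMH and Dublin Core standards; the base URL, the identifier, and the sample record are made up for illustration and are not DRUM's actual endpoint or data.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: the real OAI-PMH base URL is not given in this post.
BASE_URL = "http://repository.example.edu/oai/request"

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL from standard OAI-PMH arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Made-up sample response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:123/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Dissertation</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(extract_titles(SAMPLE))
```

A real harvester would also need to fetch the URL and follow the `resumptionToken` that OAI-PMH uses for paging through large result sets; both are left out here to keep the sketch short.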
Digital Collections (Fedora) http://digital.lib.umd.edu
Stats: 21,000 bibliographic units representing over 220,000 discrete digital objects.
Digital Collections is the portal to digitized materials from the collections of the University of Maryland Libraries. It is composed primarily of content digitized from our analog holdings in Special Collections and other departments. The University of Maryland’s Digital Collections support the teaching and research mission of the University by facilitating access to digital collections, information, and knowledge. Content is presently limited to image files (TIFF/JPG), TEI, EAD, and streaming audio and video. Fedora manages the descriptive metadata, technical metadata, and the access derivative file. While Fedora can be developed to accept any format, at present our implementation readily accepts only TIFF and JPG images and TEI- or EAD-encoded XML documents. We are not currently using Fedora to inventory or keep track of our preservation TIFF masters. Audiovisual records are basically metadata pointers to an external streaming system. Fedora metadata may be harvested via OAI-PMH, and portions of it are sent to OCLC via the Digital Collections Gateway. Google does crawl the site, and many resources are available via a Google search.
Chronicling America (Library of Congress) http://www.chroniclingamerica.loc.gov
Stats: We have currently submitted approximately 25,000 newspaper pages to the Library of Congress, and anticipate a total of 100,000 pages by August 2014.
Chronicling America is the website that provides access to the files created and submitted as part of the National Digital Newspaper Program (NDNP) grants. We submit all files (TIFF, JP2, PDF, ALTO XML) to the Library of Congress, and they archive a copy. We are currently archiving a copy locally, in addition to the copies archived by LoC. One complete copy of each batch is sent to UMD’s Division of IT for archiving. In addition, Digital Systems and Stewardship saves a copy of each batch to local tape backup, and retains the original batch hard drive in the server room in McKeldin Library.
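With copies of each batch living in several places, one natural question is whether the copies still match. The post does not describe a verification procedure, so the following is only a sketch of one common approach: build a SHA-256 manifest for each copy and compare them. The function names and the example paths in the usage note are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute a file's SHA-256 checksum, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(batch_dir):
    """Map each file's path (relative to the batch root) to its checksum."""
    root = Path(batch_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_copies(reference, other):
    """Return (missing, changed): files absent from, or different in, `other`."""
    ref, oth = manifest(reference), manifest(other)
    missing = sorted(set(ref) - set(oth))
    changed = sorted(
        name for name in ref.keys() & oth.keys() if ref[name] != oth[name]
    )
    return missing, changed
```

Calling `compare_copies("/mnt/batch_drive/batch_01", "/mnt/archive/batch_01")` (both paths invented for the example) would return two lists: files that never made it into the second copy, and files whose contents differ. Two empty lists mean every file in the reference copy is present and identical in the other copy.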
HathiTrust http://www.hathitrust.org
Stats: Nothing yet! We plan to begin submitting content in 2014.
HathiTrust provides long-term preservation and access services to member institutions. For institutions with content to deposit, participation enables immediate preservation and access services, including bibliographic and full-text searching of the materials within the larger HathiTrust corpus, reading and download of content where available, and the ability to build public or private collections of materials. HathiTrust accepts TIFF images and OCR files in either ALTO XML or hOCR. They provide conversion tools to convert TIFF masters into JPEG 2000 for access purposes.
Internet Archive http://www.archive.org
Stats: Almost 4,000 books, with over 840,000 pages
The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library. Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. The UMD Libraries contribute content to the Internet Archive in two ways. First, we submit material to be digitized at a subsidized rate as part of the Lyrasis Mass Digitization Collaborative. The material must be relatively sturdy, and it must either be out of copyright or be something for which we can prove that we have permission from the copyright holder. Second, we have been adding content digitized in-house (usually rare or fragile material), uploading the access (PDF) files and metadata to the Internet Archive ourselves. The Internet Archive produces JPEG 2000 and PDF files at the time of digitization, including both cropped and uncropped JPEG 2000 files for each volume. The UMD Libraries save the cropped JPEG 2000 files and the PDFs locally and also archive them with the UMD Division of IT.
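Taken together, the systems described above could be captured in a very simple inventory structure. The Python sketch below is one hypothetical shape for it, not a description of any tool we actually run: the class and function names are invented, and the counts are the rounded figures quoted in the stats lines above.

```python
from dataclasses import dataclass


@dataclass
class RepositoryEntry:
    name: str
    platform: str
    location: str  # the "shelf location": where the content lives
    unit: str      # what is being counted
    count: int     # approximate figure taken from the stats above


# Counts are rounded ("close to", "almost", "approximately") in the source.
INVENTORY = [
    RepositoryEntry("DRUM", "DSpace", "http://drum.lib.umd.edu",
                    "records", 14_000),
    RepositoryEntry("Digital Collections", "Fedora",
                    "http://digital.lib.umd.edu", "digital objects", 220_000),
    RepositoryEntry("Chronicling America", "Library of Congress",
                    "http://www.chroniclingamerica.loc.gov",
                    "newspaper pages", 25_000),
    # Nothing deposited yet, so there is no location to record.
    RepositoryEntry("HathiTrust", "HathiTrust", "(not yet deposited)",
                    "items", 0),
    RepositoryEntry("Internet Archive", "Internet Archive",
                    "http://www.archive.org", "books", 4_000),
]


def shelf_location(name, inventory=INVENTORY):
    """Look up where a collection lives, ignoring case in the name."""
    for entry in inventory:
        if entry.name.lower() == name.lower():
            return entry.location
    raise KeyError(name)


def report(inventory=INVENTORY):
    """Produce one summary line per system."""
    return [
        f"{e.name} ({e.platform}): about {e.count:,} {e.unit} at {e.location}"
        for e in inventory
    ]


if __name__ == "__main__":
    print("\n".join(report()))
```

The counts are deliberately not summed, because each system counts a different kind of thing (records, objects, pages, books), and a single total would be misleading.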
I am already aware of other types of digital content that we will have to track: born-digital records and personal files from our Special Collections and University Archives; eBooks in PDF and other formats that we purchase for the collection and must determine how to serve to the public; publications, such as journals, websites, and databases; and research data. I hope to return to this post in 2020 and smile at how confused, naive, and inexperienced we all were about all of this. Until then, I will keep working to pull everything together as best I can.