In my first post, I gave a general introduction to the Historic Maryland Newspapers Project. Today’s post will give you an overview of the NDNP technical specifications and metadata.
As of October 30, 2013, there are 6,673,511 newspaper pages available on Chronicling America. Metadata is essential to serving those pages to Chronicling America’s users, and there is a lot of it. The NDNP technical specifications document is nearly 80 pages long and draws on a plethora of metadata standards. Why are all of these standards necessary? Some of them ensure that NDNP projects follow best practices for digital preservation; others ensure access and easy searching by our users; still others embed information directly into the files that users can download.
Here’s a summary of the metadata schemes used by NDNP projects:
- XML – Extensible Markup Language – XML is (according to Google) “a metalanguage that allows users to define their own customized markup languages.” This makes it the perfect vessel to encompass all of the metadata standards used by the NDNP, which is why it’s first on the list.
- Structural metadata
  - METS – Metadata Encoding and Transmission Standard – METS is the glue that holds all of the components of a newspaper issue together. It uses file structures and maps to connect all of the newspaper page files (archival TIFFs, internet-browser-friendly JPEG2000s, full-text searchable PDFs, and plain-text OCR output files for each and every page) to their structural, descriptive, and administrative metadata, contained within XML files.
- Descriptive metadata
  - MODS – Metadata Object Description Schema – Issue- and page-specific information, such as date of publication, volume, issue number, edition, printed page number, and so on, is stored within MODS objects.
  - Dublin Core – Dublin Core is a set of elements and vocabularies intended to minimally describe digital objects. It is the standard the NDNP uses for metadata that is embedded within the JPEG2000 and PDF files. This means when Chronicling America users download JP2s or PDFs of newspaper pages, they get some basic metadata about the newspaper, who digitized it, and the source of the microfilm used for digitization.
- Administrative metadata
  - Preservation metadata
    - PREMIS – Preservation Metadata: Implementation Strategies – The PREMIS data model and Data Dictionary were designed to capture preservation metadata for virtually any digital archiving system and method of metadata creation. Each newspaper page file has a PREMIS record with a SHA-1 hash value, the file size, a format designation, and the application that created the file.
  - Technical metadata
    - MIX – Metadata for Images in XML – MIX was designed to be a template for creators of digital images so that they could comply with ANSI/NISO Z39.87-2006, a U.S. national standard for technical metadata for digital still images. MIX captures information such as the source of the digital images (microfilm in our case), image producer, the scanner’s manufacturer and model, and so on.
    - ALTO – Analyzed Layout and Text Object – Nested within METS, ALTO describes the layout and contents of OCR outputs.
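To make the "glue" idea a little more concrete, here is a minimal Python sketch of how a METS file section and structural map tie a page to its files. The XML fragment is a heavily trimmed, hypothetical example written for this post, not a copy of a real NDNP issue file: the file names, the `ID` values, and the `TYPE` labels are all assumptions, and the function name `files_by_page` is made up for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, trimmed METS fragment for a one-page issue.
# IDs, TYPE values, and file names are illustrative assumptions.
SAMPLE = f"""
<mets xmlns="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <fileSec>
    <fileGrp>
      <file ID="tiff1"><FLocat LOCTYPE="URL" xlink:href="0001.tif"/></file>
      <file ID="jp2_1"><FLocat LOCTYPE="URL" xlink:href="0001.jp2"/></file>
      <file ID="pdf1"><FLocat LOCTYPE="URL" xlink:href="0001.pdf"/></file>
      <file ID="ocr1"><FLocat LOCTYPE="URL" xlink:href="0001.xml"/></file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="issue">
      <div TYPE="page" ORDER="1">
        <fptr FILEID="tiff1"/>
        <fptr FILEID="jp2_1"/>
        <fptr FILEID="pdf1"/>
        <fptr FILEID="ocr1"/>
      </div>
    </div>
  </structMap>
</mets>
"""


def files_by_page(mets_xml: str) -> dict:
    """Map each page's ORDER number to the file locations it points at."""
    ns = {"m": METS_NS}
    root = ET.fromstring(mets_xml.strip())
    href_attr = f"{{{XLINK_NS}}}href"

    # First pass: index every <file> in the file section by its ID.
    locations = {}
    for file_el in root.iterfind(".//m:file", ns):
        flocat = file_el.find("m:FLocat", ns)
        locations[file_el.get("ID")] = flocat.get(href_attr)

    # Second pass: follow each page's <fptr> references into that index.
    pages = {}
    for div in root.iterfind(".//m:div[@TYPE='page']", ns):
        pointers = div.iterfind("m:fptr", ns)
        pages[int(div.get("ORDER"))] = [locations[p.get("FILEID")] for p in pointers]
    return pages


if __name__ == "__main__":
    print(files_by_page(SAMPLE))
```

The structural map never names a file directly. It only points at IDs in the file section, which is what lets one issue file describe four different renditions of every page.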
Metadata standards are not the only standards used to ensure that NDNP awardees are providing uniform data to the Library of Congress. As previously mentioned, the MIX technical metadata is designed to enable the exchange and/or storage of data as laid out in ANSI/NISO Z39.87-2006. Some MODS metadata is pulled from MARC records. Newspapers’ MARC records are created or updated prior to digitization according to the CONSER (Cooperative Online Serials) cataloging standard. XMP (Extensible Metadata Platform) is an ISO standard that allows Dublin Core elements to be embedded within our JPEG2000 and PDF files according to the RDF (Resource Description Framework) data model. Finally, NDNP data are saved on external hard drives according to the BagIt hierarchical file packaging format.
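For the curious, the BagIt idea is simple enough to sketch in a few lines of Python: a bag is a directory with a `data/` payload folder, a `bagit.txt` declaration, and a manifest listing a checksum for every payload file. The sketch below is only an illustration under some assumptions. It is not the tooling NDNP awardees actually use; the function names, the choice of a SHA-1 manifest, and the default version string `0.97` are my own guesses for the example.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_bag_tags(bag_dir: Path, version: str = "0.97") -> list:
    """Write bagit.txt and manifest-sha1.txt for files already in bag_dir/data.

    Assumes the payload has been copied into bag_dir/data beforehand.
    Returns the manifest lines that were written.
    """
    data_dir = bag_dir / "data"
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Manifest paths are relative to the bag root and use forward slashes.
        rel = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha1_of(path)}  {rel}")

    declaration = f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n"
    (bag_dir / "bagit.txt").write_text(declaration, encoding="utf-8")
    (bag_dir / "manifest-sha1.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Whoever receives the hard drive can recompute the checksums and compare them against the manifest to confirm that nothing was damaged or lost in transit.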
I tried to make this post exhaustive of all of the standards used by the NDNP, but instead I just feel exhausted (and probably so do you)! When we’ve all recovered in a few weeks, check back for the Historic Maryland Newspapers Project’s next post on the difficulties of achieving accurate OCR results when digitizing historic newspapers.