In November I attended the Mid-Atlantic Regional Archives Conference semi-annual meeting, where I sat in on the session “Refining Archival Data.” The presenters, Jefferson Bailey, Maureen Callahan, and Alex Duryee, demonstrated OpenRefine (formerly Google Refine), showing how they have used it to clean up data, restructure data, and enhance data with linked open data, such as GIS coordinates.
Seeing this demonstration made me think about how much time my colleagues and I at UMD Libraries have spent completing and refining legacy data to turn it into good metadata, or figuring out how to merge data from a finding aid or several spreadsheet inventories into the metadata we require for Digital Collections. UMD Libraries, through the National Public Broadcasting Archives, is a partner in the American Archive of Public Broadcasting project. Over the summer, I spent more than 11 hours refining and mapping more than 2,800 legacy metadata records, a portion of our contribution, for ingest into the content management system, the first step in preparing the content for digitization. Editing even that portion of our records took so much time that we had to map the remaining fields without cleaning up the data. We are now working on refining the remaining 5,000-some records. It is clear that we need a more efficient method.
I have also been thinking about how we ingest the digital assets that come back from vendor-based digitization projects. We typically include just enough metadata from the finding aid or an item-level inventory in a tracking spreadsheet or shipping box-level inventory for both the vendor and us to track the items. On occasion we have a very detailed inventory to pull information from, but most often we don’t. Whenever deliverables come back from a vendor on a hard drive, we have problems ingesting the digital assets. Digital Programs and Initiatives and Software and Systems Development and Research are working on improved, standardized batch ingest processes for multiple file types and digital object types, which will be in place in early 2014, but the issue of incomplete or unrefined data still exists.
I decided to experiment with OpenRefine to add and clean up data for 16 diaries that we sent to a vendor for digitization. While I was only working with 16 metadata records, it was a good chance to experiment with two of the features I might use to enhance vendor-based spreadsheets: “Fill down” and “Apply to all identical cells.” Both functions batch-edit whole columns or sets of selected or identical cells, and they helped me complete 66 columns of data for 16 records in about 20 minutes. I saved the OpenRefine project, in case I needed to go back to the data, and also exported it as a CSV file that we can repurpose for ingest.
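For readers who want to see what those two functions do in code, here is a minimal pandas sketch of the equivalent operations, assuming the project has been exported as a CSV; the file name and column names are hypothetical stand-ins, not our actual inventory fields.

```python
import pandas as pd

# Hypothetical CSV export of the tracking spreadsheet.
df = pd.read_csv("diaries.csv")

# "Fill down": carry the last non-blank value into the empty cells
# below it, e.g. a collection title typed only on the first row.
df["Collection"] = df["Collection"].ffill()

# "Apply to all identical cells": correct one cell, then apply the
# same edit to every cell in the column holding the identical value.
df["Format"] = df["Format"].replace({"bound diary": "Diary"})

df.to_csv("diaries-refined.csv", index=False)
```

In OpenRefine itself both operations are point-and-click commands on a column, which is what made them so fast for a small batch like these 16 records.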
In the future, I also intend to try additional features (sketched in the code below): clustering similar terms so I can batch-edit them into a single term (like an authorized heading), removing trailing spaces (whitespace), batch-editing inconsistent capitalization, converting Microsoft Excel’s wonky date formats to the ISO standard, and splitting cells that combine multiple types of information into separate cells, a problem often found in legacy inventories. I hope OpenRefine will continue to work well for us as we manipulate data and create refined metadata.
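Each of those tasks is a built-in facet or transform in OpenRefine, but as a sketch of what they do, here is a pandas version, again with hypothetical file and column names; the fingerprint() function only approximates the keying idea behind OpenRefine’s clustering.

```python
import re
import pandas as pd

df = pd.read_csv("legacy-inventory.csv")  # hypothetical export

def fingerprint(value: str) -> str:
    # Rough approximation of a "fingerprint" clustering key: trim,
    # lowercase, strip punctuation, then sort and dedupe tokens so
    # variants like "Smith, John " and "john Smith" share a key.
    tokens = re.sub(r"[^\w\s]", "", str(value).strip().lower()).split()
    return " ".join(sorted(set(tokens)))

# Cluster similar terms: group name variants by key, then pick one
# authorized form per cluster and batch-edit the rest to match it.
clusters = df.groupby(df["Creator"].map(fingerprint))["Creator"].unique()

# Trailing whitespace and inconsistent capitalization.
df["Genre"] = df["Genre"].str.strip().str.title()

# Excel's wonky dates -> ISO 8601 (YYYY-MM-DD); unparseable cells
# become blanks to review rather than silently wrong values.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce").dt.strftime("%Y-%m-%d")

# Split a cell that mixes two types of information into two columns.
df[["Box", "Folder"]] = df["Location"].str.split("/", n=1, expand=True)
```

The point is not to replace OpenRefine with scripts; it is that these cleanups are mechanical enough to batch, which is exactly why the tool can handle thousands of legacy records so much faster than editing by hand.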