A Brief Introduction to APIs

These are the notes accompanying a presentation to the UMD Libraries Emerging Technologies Discussion Group on April 22, 2014.

The Wikipedia definition for API, or Application Programming Interface, is

In computer programming, an application programming interface (API) specifies how some software components should interact with each other.

This is a very broad definition, but it does emphasize the primary feature of an API: it is a computer-to-computer interface, rather than the human-computer interface with which we are most familiar. Keyboard, mouse, and display are generally used to create a visually oriented human-computer experience, such as a GUI or web browser. Those interfaces are difficult for computers to interact with, however, so separate APIs are created that allow programs to interact with each other.

One traditional type of API is the code library. A code library consists of a set of well-documented function calls that a program can use to interface with another application. For example, see these excerpts from the DSpace 4.1 API documentation for the org.dspace.content package and the org.dspace.content.Item class. DSS uses this API for an automated load of Electronic Theses and Dissertations from ProQuest into DRUM (see the EtdLoader class).
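The pattern is the same in any language: the program imports the library and calls its documented functions. As a trivial illustration in Python (the language used for the examples later in this post), here is the same interaction pattern with the standard json code library:

# import the json code library, then call its documented functions --
# the same pattern EtdLoader follows with the DSpace Java API
import json

record = json.loads('{"title": "A Brief Introduction to APIs"}')
print(record['title'])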

Web services as APIs have become very popular with the advent of network-based services. A web service is typically served over HTTP, the same protocol your browser uses to request web pages. Most web services return data in the form of either XML or JSON. JSON was developed as a lighter-weight alternative to XML when programs began to be executed in the web browser using JavaScript, though it is now widely used outside of JavaScript programs.
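For example, here is the same made-up record expressed in each format; the JSON version carries the same information with noticeably less markup:

<record>
  <title>A Brief Introduction to APIs</title>
  <year>2014</year>
</record>

{"title": "A Brief Introduction to APIs", "year": 2014}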

See these examples using the Google Maps API:

In many cases a service will provide both a Web Service API and a code library API that calls the Web Service for you.

Getting started with the Digital Public Library of America (DPLA) API

This section is by Karl Nilsen, Research Data Librarian.

DPLA has a Web Service API that provides programmatic access to over 7 million metadata records collected from a variety of libraries, archives, and museums. While the query process can seem complicated at first, the basic actions are pretty simple: you submit queries in HTTP and receive responses in JSON-LD. The responses won’t be especially readable, but it’s important to keep in mind that JSON (and HTTP) will typically be consumed by programs rather than humans. If you build an application for non-technical users, you probably won’t show them the HTTP queries or the JSON responses at all—instead, you’ll create an interface that makes query design and response visualization more user-friendly. That being said, you have to understand the query design and response structure if you want to produce applications that satisfy your users’ expectations and support their research methods. To help you understand the possibilities and limitations of their API, DPLA provides a detailed guide to query design and response structure.

Before you can use the API, you need to get a personal API key from DPLA. Your key acts as a unique username, and you have to include your key in every HTTP query. DPLA uses the API key as a mechanism for protecting their system against abusive or excessive use. For example, if your queries burden their system, they can block your API key. As a rule, you shouldn’t share your API key with anyone.

At ETDG, I demonstrated a few queries written in Python, but you can write the same queries in other programming languages. The code is merely a set of instructions for sending the HTTP query, receiving the JSON response, and manipulating the results.

Here’s a simple script that submits a query for “bicycle” in any metadata field, requests only 10 results, and prints the response:

# import the code libraries for fetching URLs and parsing JSON
import urllib, json

# design your query and submit it to the API
api_call = urllib.urlopen('http://api.dp.la/v2/items?q=bicycle&page_size=10&api_key=YOUR_API_KEY_GOES_HERE')

# parse the JSON response
results = json.load(api_call)

# print the response
print(results)

To improve the readability somewhat, you could print the results with this command:

print(json.dumps(results, indent=4)) 

If we run this code, we receive 10 results (as we requested). Before we consider a more complex example, it’s important to understand which 10 records, of all the relevant items in DPLA, we received. There are close to 2500 items in DPLA that contain “bicycle” in the metadata, so why did we receive these particular records? Are they the earliest 10 records in the database by date of publication, the most recent, a random sample, the latest 10 additions to the database, or another set? We should probably contact DPLA to find out exactly how their system works, but given that we don’t know just yet, we wouldn’t want to draw any conclusions from the results. (Even if we requested all 2500 items, we should still ask questions about the provenance, scope, and representativeness of the results.) DPLA provides various parameters for limiting and sorting data, and these techniques can help us make our results interpretable.
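As a hedged sketch of what that looks like (the page and sort_by parameter names come from DPLA’s API guide; verify the exact field names there before relying on them), this version makes the ordering explicit and walks through the result set page by page:

import urllib, json

# sort the results by a metadata field and fetch them page by page, so
# that "which 10 records did we get" has a deterministic answer
base = 'http://api.dp.la/v2/items?q=bicycle&page_size=10&sort_by=sourceResource.date.begin'

for page in range(1, 4):  # fetch the first three pages of results
    api_call = urllib.urlopen(base + '&page=' + str(page) + '&api_key=YOUR_API_KEY_GOES_HERE')
    results = json.load(api_call)
    for item in results.get('docs', []):
        print(item.get('id'))  # each doc includes a DPLA record identifier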

Here’s a script that builds a more complex query. It submits a query for “bicycle” in any metadata field, restricts the results to photographs, returns only the text description that accompanies each item, returns up to 400 results, and saves the text descriptions to a file. The script skips any items that return 0 (no description found in the metadata), so the actual number of results may be fewer than 400. (Code revised 2014-05-21)

import urllib, json

# open a file to hold the text descriptions
txt_file = open("descriptions.txt", "w")

# design your query, submit it to the API, and parse the JSON response
api_call = urllib.urlopen('http://api.dp.la/v2/items?q=bicycle&sourceResource.format=Photographs&fields=sourceResource.description&page_size=400&api_key=YOUR_API_KEY_GOES_HERE')
results = json.load(api_call)

# write each description to the file, skipping items with no description
for item in results.get('docs', []):
    text = item.get('sourceResource.description', 0)
    if text != 0:
        text = text.encode('utf-8')
        txt_file.write(text + '\n')

txt_file.close()

Since we constrained the responses to a particular metadata element (text descriptions), we can easily retrieve only the information that interests us and skip the rest. Moreover, we can retrieve hundreds or thousands of results in seconds. Imagine how long it would take to copy text descriptions by hand from DPLA’s user interface! Here are three examples from the descriptions:

Three men, one in uniform (police?), adjusting the wheel of a bicycle on a dirt track, with onlookers on the bleachers in background. Probably a gathering of students at the University of North Carolina-Chapel Hill.

The reverend Mirko Mikolasek rides a bicycle which had been made by the Evangelical Church of Cameroon. He is surrounded by children.; Mirko Mikolasek is a missionary of the Société des missions évangéliques de Paris (Paris evangelical mission society).

3 images. Bicycle trip, 3 September 1958. Gary Swanson–22 years (California Institute of Technology fellowship winner, returns from 4500-mile bicycle trip). ; Caption slip reads: “Photographer: Mack. Date: 1958-09-03. Reporter: Farrell. Assignment: Cyclist. Special instructions: Early Friday. 29-30-4: Gary Swanson, 22, Caltech Fellowship winner returns from 4500-mile bicycle trip.

Having retrieved these descriptions and saved them in a plain text file, we could proceed to analyze them in various ways. The descriptions may tell us something about bicycles and bicycling in America and elsewhere. Content analysis or natural language processing could be productive approaches.

RSS and Atom

RSS (Rich Site Summary) and Atom are APIs for syndication (or feeds) of published content. But rather than being specific to a vendor or its services, they are standards designed to be reused by multiple applications and services.

The UMD Libraries website publishes news on a regular basis. It offers a nice interface for a human to visit and read the latest news:

[Screenshot: UMD Libraries news page]

But what if I don’t want to visit this page regularly to get the latest news? What if I want a program to do it for me, and also to aggregate this news with news from other sites? The main website interface is not easy for a computer to parse, so we also publish the news using the RSS API at http://www.lib.umd.edu/news/feed

[Screenshot: UMD Libraries news RSS feed]
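Because an RSS feed is just XML served over HTTP, even a short script can play the role of that program. Here is a minimal sketch using only the Python standard library to fetch and parse the feed above:

import urllib
import xml.etree.ElementTree as ET

# fetch the RSS feed and parse the XML
feed = urllib.urlopen('http://www.lib.umd.edu/news/feed')
tree = ET.parse(feed)

# RSS 2.0 nests one <item> per news story inside <channel>
for item in tree.findall('./channel/item'):
    print(item.findtext('title') + ' - ' + item.findtext('link'))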

Now someone can create a program or service, like Feedly, that regularly checks for news updates and presents them to me whenever new content is published:

[Screenshot: UMD Libraries news in Feedly]

Google Calendar

When we needed to add a calendar function for library open/close hours to the website, we didn’t have a stock solution within Hippo CMS. Google Calendar offers a nice interface for creating events and makes it especially easy to create repeating events with exceptions, e.g., McKeldin Library is open every Monday 9-11 during the Spring semester except on Labor Day. We use Google Calendar to maintain hours information and then use the provided API to create Hippo documents to display on the website:

[Screenshot: Google Calendar]

[Author’s Note: while researching this post I discovered that we use Google Calendar API v2 which will be deprecated in favor of the new v3 API based on JSON data objects instead of the GData format. I’ll refer to v2 since that is what we currently use, but any new code should use the new API v3.]

The Google Calendar API v2 is a web service built on top of, and extending, the standard Atom protocol. The documentation provides all you need to extract your data, so you can of course write your own custom code, but it could be a bit of work to do from scratch:

[Screenshot: Google Calendar GData feed]
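To give a sense of what that from-scratch work involves, here is a rough sketch (not our production code) that fetches a public calendar’s v2 Atom feed and prints event titles and start times. CALENDAR_ID is a placeholder for a real calendar’s ID, and the feed structure assumed here (Atom entries containing GData gd:when elements) should be checked against the v2 documentation:

import urllib
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
GD = '{http://schemas.google.com/g/2005}'

# fetch the public Atom feed for a calendar (CALENDAR_ID is a placeholder)
feed_url = 'https://www.google.com/calendar/feeds/CALENDAR_ID/public/full'
tree = ET.parse(urllib.urlopen(feed_url))

# each Atom <entry> is one event; its start time lives in a <gd:when> element
for entry in tree.findall(ATOM + 'entry'):
    when = entry.find(GD + 'when')
    if when is not None:
        print(entry.findtext(ATOM + 'title') + ': ' + when.get('startTime'))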

Fortunately, Google also provides code libraries that handle the parsing for you. See for example the Java Guide, which we use since Hippo CMS is implemented in the Java language. We call the Google Calendar GData API to get the list of calendar events, convert them to Hippo documents, and then make them available for navigation on the website:

[Screenshot: library hours on the website]

It would be possible to query Google Calendar in real time, when the hours page request is made, but for performance and availability reasons we chose to sync the data to Hippo, so Google Calendar is only consulted when we initiate a pull of the latest data.

OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a standard API for harvesting metadata. We expose our DRUM and Digital Collections metadata for harvesting via OAI-PMH. OCLC uses this API to harvest our metadata for inclusion in WorldCat (see WorldCat Digital Collection Gateway).
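OAI-PMH requests are ordinary HTTP GETs built from a small set of standard verbs and parameters, which keeps harvesters simple. A minimal sketch (the base URL below is a hypothetical example, not one of our actual endpoints):

import urllib

# ListRecords is one of the six standard OAI-PMH verbs; oai_dc (Dublin Core)
# is the metadata format every OAI-PMH repository must support
base_url = 'http://example.edu/oai/request'  # hypothetical endpoint
response = urllib.urlopen(base_url + '?verb=ListRecords&metadataPrefix=oai_dc')

# the response is XML containing the harvested metadata records
print(response.read())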

Additional Resources

Here are some additional resources you may be interested in exploring.

OCLC

Library of Congress

Google Scholar

Google has not provided an API for Google Scholar.

Compilations

Stew of the Month: March 2014

Welcome to a new issue of Stew of the Month, a monthly blog from Digital Systems and Stewardship (DSS) at the University of Maryland Libraries. This blog will provide news and updates from the DSS Division. We welcome comments, feedback and ideas for improving our products and services.

General Announcements

We have been working diligently on security updates and complying with our campus security policies. We are currently facing some AV equipment issues in our Special Events Room. We apologize for the problems this has caused and are treating the fix as a high priority.

Department Updates

Consortial Library Application Services (CLAS)

In support of UMBC’s plans to participate in Rapid ILL, David Wilt extracted UMBC’s serials and microfilm holdings information from Aleph and uploaded the file to the RapidILL.org site.

David also completed work on implementing the materials booking function in Aleph for Towson. He is now working on configuring and implementing the booking function for Shady Grove, where it will be used for booking equipment.

Hans Breitenlohner is working with Salisbury on implementing Single Sign-on (SSO).

Linda Seguin got Ex Libris to fix a problem with the way SFX sends title searches to the Aleph catalog, a glitch in formatting that added plus signs (+) to the search string, causing the searches to fail. She has tested the fix and confirms that the problem is corrected.

Linda also made changes and re-indexed 30,000 records in the Aleph Test OPAC in support of a USMAI Cataloging Policy Committee (CPC) proposal to display and index uncontrolled subject headings to compensate for the lack of LC subject headings in Ebook Library (EBL) records.

Heidi Hanson and Ingrid Alie will be attending the ELUNA 2014 Annual Meeting in Montréal, Canada, April 29 – May 2.

Digital Conversion and Media Reformatting (DCMR)

On March 31, Robin Pike attended the Society of American Archivists Accessioning and Ingest of Electronic Records one-day workshop, hosted by UMD Libraries. It covered accessioning and ingesting born-digital records into an archive; what submission agreements and donor agreements may look like for born-digital records or mixed collections of paper and born-digital records; and tools that may be helpful during ingest, for tasks such as transferring files or creating a disk image, validating files and file formats, scanning for personally identifiable information, and converting or normalizing files.

The majority of Robin’s time in March was dedicated to FY15 project planning. As the chair of the Digitization Initiatives Committee, she collaborated with Joanne Archer, Heather Foss, Eileen Harrington, and Carla Montori to analyze project proposals and compile a draft budget for outsourced digitization projects across UMD Libraries. Resources Group will be discussing the proposed budget in April. Robin reviewed notes from December through February digitization stakeholder meetings and began to compile a list of potential in-house projects. She will be working with collection managers to solidify this list of FY15 in-house projects and clarify all the projects over the coming months.

Henry Borchers completed video digitization setups for VHS and Betacam and has made considerable progress with creating procedures and workflows. He has now tested digitization for DVCAM/MiniDV, VHS, and Betacam formats. Henry has focused on creating streamlined equipment configurations that increase automation and decrease manual configuration when switching between different format equipment. DCMR will test this setup on pilot projects in the coming months.

On March 27-29, Eric Cartier attended the Sound+ conference hosted by the UMD English Department. The event featured scholars who discussed the “relationship between sound and text,” emphasizing the interdisciplinary nature of sound studies. Eric found the session “Sounding the Humanities, Sounding the Sciences,” which featured discussion of auditory scene analysis and research results on how brains parse foreground and background sounds, particularly interesting.

Eric worked with John Schalow and Joe Carrano to develop a more streamlined process to review digital object metadata to expedite the approval of digital images created in-house. DCMR’s long-term goal is to dramatically decrease the amount of time to perform quality assurance on the file and metadata between an object’s digitization and when it becomes public.

Students in the Hornbake Digitization Center worked on digitizing numerous requests and small projects including digitizing audio cassettes and 1/4″ open reel tapes from the Katherine Anne Porter papers, expanding upon the effort to digitize large portions of the correspondence in the collection. Publicly-available digitized materials are linked from the finding aid and can also be found by searching digital.lib.umd.edu. Students also digitized photographs and book illustrations for the upcoming Special Collections Bladensburg exhibit, which will open in the fall.

Digital Programs and Initiatives (DPI)

Jennie Knies attended the Library Publishing Coalition Forum in Kansas City, MO, from March 5-6. The UMD Libraries are members of the Library Publishing Coalition, whose mission is “to foster collaboration, share knowledge and develop common practices, all in service of publishing and distributing academic and scholarly works.” This useful and interesting meeting featured plenary speakers, panel discussions, and work groups devoted to articulating the role of academic libraries in digital publishing. Slides are also available for select presentations. A discussion session on “Beyond the Article” was particularly interesting; it highlighted that we are not alone in grappling with the blurred lines between digital projects, data, and born-digital records, especially with regard to humanities data. Digital Programs and Initiatives is in the process of drafting a plan for digital publishing at the UMD Libraries, and the information obtained at this forum greatly informs our work in that area.

On March 7, Liz Caringola attended the Digital Maryland Conference 2014, held by the Maryland Digital Cultural Heritage (MDCH), the Maryland State Library Resource Center (SLRC), and the Enoch Pratt Free Library. Liz presented on the progress of the Historic Maryland Newspapers Project thus far and its plans for the future. Other presentations focused on the Digital Public Library of America (DPLA), Artstor, CONTENTdm, and a sampling of the many digital projects that local institutions are currently working on. See the conference website for the agenda and list of speakers.

The Historic Maryland Newspapers Project has extended the deadline to apply for the Wikipedian-in-Residence position until April 18, 2014. For more information, see a previous Digistew blog post or the job posting.

Marlin Olivier joined Research Data Services as our Data Curation Assistant. Marlin is in the digital curation specialization at the iSchool and has a bachelor’s degree in biology and religion. In addition to working with Research Data Services, Marlin works for a non-profit that manages digital performance royalties.

We are working with the Center for Agricultural and Natural Resource Policy – CANRP (https://agresearch.umd.edu/canrp) to include their publications in DRUM.  Located in the Department of Agricultural and Resource Economics at the University of Maryland, CANRP provides research, education and outreach on public policies facing Maryland, the US, and the world.  A wide variety of the center’s publications will be deposited in DRUM including extension bulletins, fact sheets, monographs, policy reports, and research briefs.  Check out some of their research at http://hdl.handle.net/1903/14189.

The UMD Libraries is now a member of the MDPI (Multidisciplinary Digital Publishing Institute), a publisher of more than 100 scientific peer-reviewed, open access journals.  As a member, UMD authors receive a 10% discount on article processing fees for submissions to any MDPI journal.

Response to the UMD Libraries Open Access Publishing Fund has been so good that funds have already been exhausted for this fiscal year.  Thanks to Dean Steele and the Library Resources Group, an additional $5,000 has been added to the fund which will hopefully sustain the service until the end of June.

Josh Westgard spent much of March archiving files for permanent preservation. Besides backing up the usual monthly output of the digitization center, he helped to inventory and archive a large number of .warc files from the Libraries’ web crawling program, as well as several thousand images from a special digitization project on the correspondence of Katherine Anne Porter. In the context of the Prange Collection digitization project, he helped to inventory and consolidate records relating to files that were first created and archived nearly a decade ago. In addition, he helped to prepare and validate the metadata for the Katherine Anne Porter project for ingest into the Libraries’ Fedora-based digital collections repository, and drew up procedures for applying access controls to audio and video assets in the Libraries’ digital collections.

Software Systems Development and Research (SSDR)

Irina Belyaeva began work on the DRUM upgrade to DSpace 4, a rather large three-version jump since we last upgraded in July 2011. This upgrade not only keeps us current with the latest DSpace bug fixes, security updates, and new features but also paves the way to begin adding new DRUM-based services for Research Data.

Shian Chang and Cindy Zhao have been working with Laura Cleary in Special Collections and University Archives to begin adding a new Exhibits feature to Hippo.  This project will use a new Hippo 7.7 feature called Blueprints to allow SSDR staff to routinely create new Exhibits for Special Collections without any custom programming.  The Exhibit website template will feature Responsive Web Design for display on desktops, tablets, and phones using the Bootstrap web toolkit.

Development and support for Drupal-based sites has been on the rise recently in SSDR, so Paul Hammer used the Lynda video instruction site to get initial training and then configured a new sandbox environment for SSDR and CLAS use. Paul is now set to join Cindy Zhao and Shian Chang as the development support team for Libi, the USMAI public website, and the USMAI staff website.

Ben Wallberg, Jennie Knies, and Joshua Westgard attended the March 10 meeting of the Washington, D.C. Fedora User Group. We received general updates on the Fedora community and DuraSpace and on the development of Fedora 4. Area Fedora users reported on their current activities, with Ben providing the UMD Libraries’ update. Upon disclosure that we are running the ancient Fedora 2.2.2 version, there was discussion of possible upgrade paths. There was some dissent, but the general consensus was that we should avoid a two-step upgrade (2 to 3 followed by 3 to 4) and jump straight to Fedora 4, given that we could defer implementation until after Fall 2014. We also discussed how we might begin Fedora 4 training for SSDR staff through the process of Fedora 4 beta testing.

Sneaking in on the last day of the month, Mohamed Abdul Rasheed rejoined SSDR as our newest Software Developer.  See a previous post for a Research Study he worked on the last time around.  Welcome back Mohamed!

User Systems and Support

The team has been busy updating our security software and procedures in compliance with campus policies and procedures. Over the past month, the team also planned technology procurement and the budget for FY15.

Expanding Audio Digitization Capacity: Introducing PAADS

In 2012, the UMD Libraries Digitization Center expanded from three Epson 10000XL flatbed scanners and one Zeutschel OS12000 to include one more Epson 10000XL, an Epson V700 Perfection, and the beginnings of an audio digitization station. Two years later, we have two operational audio digitization workstations and one video digitization workstation nearing completion. Over the last several months, we have been planning the next stage in developing digitization capacity at UMD Libraries: reorganizing, updating, and staffing the former digitization lab in the Michelle Smith Performing Arts Library (MSPAL), now known as the Performing Arts Audio Digitization Studio (PAADS).

Rationale
Under the management of DCMR, PAADS will be an extension of audio digitization efforts in the UMD Libraries. The studio is located within MSPAL and will primarily serve the digitization requests and projects for collections within MSPAL, including the International Piano Archives at Maryland (IPAM) and Special Collections in Performing Arts (SCPA). With the increasing volume of requests from these collection areas, we realized the value of expanding digitization operations into the library. The studio will allow DCMR to digitize these collections without coordinating the transfer of physical materials between libraries, and will provide faster access to digitization-on-demand for requests.

Creating a digitization studio in this space is a smart decision for increasing access to collections and a good financial option for expanding our operations. The space was originally designed as an audio studio with optimum sound absorption and minimal vibration interference, making it an ideal space for preservation-level digitization work. Most of the legacy media players are already housed in the space (though a few pieces will need to be repaired). We will update the digital converter, interface, and a few auxiliaries. We also plan to update the wiring and connections to reflect a more streamlined audio digitization workstation, which relies on multi-purpose equipment and more flexible software. The proposed PAADS configuration is very similar to the configuration of the audio workstations in the Hornbake Digitization Center; the similar setups will enable us to train students more quickly.

Project Timeline
After a conversation with MSPAL staff, Robin Pike, Henry Borchers, and Eric Cartier met to discuss expanding audio digitization in October 2013. Over the next two months, Borchers and Cartier documented and assessed the equipment in the digitization studio. Borchers delivered an analysis report to Pike in January 2014, which she integrated into the Analysis and Proposal sections of a larger plan. Pike delivered this plan to MSPAL staff in February and they collaborated to complete it throughout March. The Associate Deans of Public Services and DSS recently approved this plan.

Parts of the Plan
The plan addresses the business case and operational need in the Introduction. Borchers’s Analysis includes a list of the equipment and their operational status, and the overall operational and functional status of the studio in its current configuration. The Proposal features a list of new equipment to purchase, a budget, and details of the plan to reconfigure the space. The Installation Timeline follows the Proposal and is based on the hours Borchers estimated it took him to set up one audio digitization workstation in the Hornbake Digitization Center. It also includes a list of dependencies or restrictions on the timeline, such as the availability of new equipment and unforeseen issues with legacy equipment currently in the studio. Pike also included a Staffing Plan stating who would be in the workspace and when, with DCMR’s goal being to staff the space approximately 20 hours/week, if budgets permit. The plan concludes with a plan for staff hours and access, a communication plan between MSPAL and DCMR during the setup phase and during production operations, and a commitment to maintenance of the space.

Implementation
We ordered the new equipment this week and hope to start installing it at the end of April. Borchers and Cartier plan to photograph the progress of the space so we can share the phases towards completion on this blog.