These are the notes accompanying a presentation to the UMD Libraries Emerging Technologies Discussion Group on April 22, 2014.
The Wikipedia definition for API, or Application Programming Interface, is
This is a very broad definition, but it does emphasize the primary feature: an API is a computer-computer interface, rather than the human-computer interface with which we are most familiar. Keyboard, mouse, and display are generally used to create a visually based human-computer interaction experience, as with a GUI or web browser. But these interfaces are generally difficult for computers to interact with, so separate APIs are created which allow programs to interact with each other.
One traditional type of API is the code library. A code library consists of a set of well-documented function calls that a program uses to interface with another application. For example, see these excerpts from the DSpace 4.1 API documentation for the org.dspace.content package and the org.dspace.content.Item class. DSS uses this API for an automated load of Electronic Theses and Dissertations from Proquest into DRUM (see EtdLoader class).
See these examples using the Google Maps API:
- XML response: http://maps.googleapis.com/maps/api/geocode/xml?address=20742&sensor=false
- JSON response: http://maps.googleapis.com/maps/api/geocode/json?address=20742&sensor=false
In many cases a service will provide both a Web Service API and a code library API which calls the Web Service for you.
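The two Google Maps requests above differ only in the output format segment of the path. As a rough sketch (Python 3, standard library only), a program might build such a URL and pick the coordinates out of the JSON reply. The sample reply is a hand-made, abbreviated stand-in for a real response, and the helper names `geocode_url` and `first_location` are illustrative, not part of any Google library.

```python
import json
from urllib.parse import urlencode

BASE = "http://maps.googleapis.com/maps/api/geocode/"

def geocode_url(address, fmt="json"):
    """Build a geocoding request URL; fmt is 'json' or 'xml'."""
    return BASE + fmt + "?" + urlencode({"address": address, "sensor": "false"})

def first_location(reply_text):
    """Return (lat, lng) of the first result, or None if there is none."""
    reply = json.loads(reply_text)
    results = reply.get("results", [])
    if reply.get("status") != "OK" or not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-made sample, abbreviated; real replies carry many more fields.
SAMPLE_REPLY = """
{"status": "OK",
 "results": [{"geometry": {"location": {"lat": 38.99, "lng": -76.94}}}]}
"""

print(geocode_url("20742"))
print(first_location(SAMPLE_REPLY))
```

A code library API would typically wrap both steps, so the caller passes an address and gets coordinates back without seeing the URL or the JSON.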
Getting started with the Digital Public Library of America (DPLA) API
This section is by Karl Nilsen, Research Data Librarian.
DPLA has a Web Service API that provides programmatic access to over 7 million metadata records collected from a variety of libraries, archives, and museums. While the query process can seem complicated at first, the basic actions are pretty simple: you submit queries in HTTP and receive responses in JSON-LD. The responses won’t be especially readable, but it’s important to keep in mind that JSON (and HTTP) will typically be consumed by programs rather than humans. If you build an application for non-technical users, you probably won’t show them the HTTP queries or the JSON responses at all—instead, you’ll create an interface that makes query design and response visualization more user-friendly. That being said, you have to understand the query design and response structure if you want to produce applications that satisfy your users’ expectations and support their research methods. To help you understand the possibilities and limitations of their API, DPLA provides a detailed guide to query design and response structure.
Before you can use the API, you need to get a personal API key from DPLA. Your key acts as a unique username, and you have to include it in every HTTP query. DPLA uses the API key as a mechanism for protecting their system against abusive or excessive use. For example, if your queries burden their system, they can block your API key. As a rule, you shouldn’t share your API key with anyone.
At ETDG, I demonstrated a few queries written in Python, but you can write the same queries in other programming languages. The code is merely a set of instructions for sending the HTTP query, receiving the JSON response, and manipulating the results.
Here’s a simple script that submits a query for “bicycle” in any metadata field, returns only 10 results, and prints the result:
# activate additional functionality in Python
import urllib, json

# design your query
api_call = urllib.urlopen('http://api.dp.la/v2/items?q=bicycle&page_size=10&api_key=YOUR_API_KEY_GOES_HERE')

# submit your query to the API
results = json.load(api_call)

# print the response
print(results)
To improve the readability somewhat, you could print the results with this command:
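One way to do this, offered as a sketch rather than the exact command used in the presentation, is the standard library's `json.dumps` with an `indent` argument. The `results` dictionary here is a small made-up stand-in for a real response.

```python
import json

# Made-up stand-in for the dictionary returned by json.load(api_call).
results = {"count": 1, "docs": [{"sourceResource": {"title": "Bicycle"}}]}

# indent adds line breaks and nesting; sort_keys makes the key order predictable.
pretty = json.dumps(results, indent=4, sort_keys=True)
print(pretty)
```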
If we run this code, we receive 10 results (as we requested). Before we consider a more complex example, it’s important to understand which 10 records, of all the relevant items in DPLA, we received. There are close to 2500 items in DPLA that contain “bicycle” in the metadata, so why did we receive these particular records? Are they the earliest 10 records in the database by date of publication, the most recent, a random sample, the latest 10 additions to the database, or another set? We should probably contact DPLA to find out exactly how their system works, but given that we don’t know just yet, we wouldn’t want to draw any conclusions from the results. (Even if we requested all 2500 items, we should still ask questions about the provenance, scope, and representativeness of the results.) DPLA provides various parameters for limiting and sorting data, and these techniques can help us make our results interpretable.
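To make the ordering explicit rather than implicit, a query can name a sort field and step through pages. The sketch below only builds the URLs. It is written for Python 3, where the URL helpers live in `urllib.parse`. The parameter names `sort_by`, `sort_order`, and `page`, and the sort field `sourceResource.date.begin`, reflect my reading of DPLA's guide and should be checked against it. `DPLA_API_KEY` is a hypothetical environment variable name, used so the key stays out of the script.

```python
import os
from urllib.parse import urlencode

ITEMS_ENDPOINT = "http://api.dp.la/v2/items"

def page_urls(term, sort_field, pages, page_size=100, api_key=None):
    """Yield one query URL per page, all sharing an explicit sort order."""
    # DPLA_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("DPLA_API_KEY", "YOUR_API_KEY_GOES_HERE")
    for page in range(1, pages + 1):
        params = [
            ("q", term),
            ("sort_by", sort_field),   # assumed parameter name
            ("sort_order", "asc"),     # assumed parameter name
            ("page_size", page_size),
            ("page", page),            # assumed parameter name
            ("api_key", key),
        ]
        yield ITEMS_ENDPOINT + "?" + urlencode(params)

for url in page_urls("bicycle", "sourceResource.date.begin", pages=2,
                     api_key="YOUR_API_KEY_GOES_HERE"):
    print(url)
```

With a stated sort order, "the first 10 records" has a meaning we can report alongside the results.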
Here’s a script that builds a more complex query. It submits a query for “bicycle” in any metadata field, restricts the results to photographs, returns only the text description that accompanies each item, returns up to 400 results, and saves the text descriptions to a file. The script removes any items that return 0 (no description found in the metadata), so the actual number of results may be less than 400. (Code revised 2014-05-21)
# activate additional functionality in Python
import urllib, json

# open a text file to hold the descriptions
txt_file = open("descriptions.txt", "w")

# design and submit your query
api_call = urllib.urlopen('http://api.dp.la/v2/items?q=bicycle&sourceResource.format=Photographs&fields=sourceResource.description&page_size=400&api_key=YOUR_API_KEY_GOES_HERE')
results = json.load(api_call)

# write out each description, skipping items that have none
for item in results.get('docs', []):
    text = item.get('sourceResource.description', 0)
    if text != 0:
        text = text.encode('utf-8')
        txt_file.write(text+'\n')

txt_file.close()
Since we constrained the responses to a particular metadata element (text descriptions), we can easily retrieve only the information that interests us and skip the rest. Moreover, we can also retrieve hundreds or thousands of results in seconds. Imagine how long it would take to copy text descriptions by hand from DPLA’s user interface! Here are three examples from the descriptions:
Three men, one in uniform (police?), adjusting the wheel of a bicycle on a dirt track, with onlookers on the bleachers in background. Probably a gathering of students at the University of North Carolina-Chapel Hill.
The reverend Mirko Mikolasek rides a bicycle which had been made by the Evangelical Church of Cameroon. He is surrounded by children.; Mirko Mikolasek is a missionary of the Société des missions évangéliques de Paris (Paris evangelical mission society).
3 images. Bicycle trip, 3 September 1958. Gary Swanson–22 years (California Institute of Technology fellowship winner, returns from 4500-mile bicycle trip). ; Caption slip reads: “Photographer: Mack. Date: 1958-09-03. Reporter: Farrell. Assignment: Cyclist. Special instructions: Early Friday. 29-30-4: Gary Swanson, 22, Caltech Fellowship winner returns from 4500-mile bicycle trip.
Having retrieved these descriptions and saved them in a plain text file, we could proceed to analyze them in various ways. The descriptions may tell us something about bicycles and bicycling in America and elsewhere. Content analysis or natural language processing could be productive approaches.
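As a very small first step toward content analysis, a word-frequency count is one option. This is a sketch in Python 3 using only the standard library. The stopword list is a tiny illustrative one, the function name `word_counts` is made up for this example, and the two sample lines are shortened from the descriptions quoted above.

```python
import re
from collections import Counter

# A few common words to ignore; a real analysis would use a fuller list.
STOPWORDS = {"a", "an", "and", "at", "by", "from", "in", "is", "of", "on",
             "the", "to", "with"}

def word_counts(lines):
    """Count lowercase words across an iterable of description strings."""
    counts = Counter()
    for line in lines:
        # ASCII letters only; accented characters are dropped in this sketch.
        words = re.findall(r"[a-z]+", line.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts

# With the saved file you could pass
# open("descriptions.txt", encoding="utf-8") instead of this list.
sample = [
    "Three men, one in uniform, adjusting the wheel of a bicycle on a dirt track.",
    "Gary Swanson, 22, Caltech Fellowship winner returns from 4500-mile bicycle trip.",
]
counts = word_counts(sample)
print(counts.most_common(5))
```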
RSS and Atom
RSS (Rich Site Summary) and Atom are APIs for syndication (or feeds) of published content. Rather than being specific to a vendor or its services, they are standards designed to be reused by multiple applications and services.
The UMD Libraries website is used to publish news on a regular basis. See this nice interface for a human to visit the website and get the latest news.
But what if I don’t want to visit this page regularly to get the latest news? What if I want a program to do it for me and also aggregate this news with news from other sites? The main website interface is not easy for a computer to parse, so we also publish the news using the RSS API at http://www.lib.umd.edu/news/feed
Now someone can create a program or service, like Feedly, to regularly check for news updates for me and present them to me whenever new content is published.
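To give a feel for what such a program does with a feed, here is a minimal sketch in Python 3 that pulls titles and links out of an RSS 2.0 document. The sample feed is hand-made with placeholder headlines and example.org links. It is not the content of the UMD Libraries feed, and `feed_items` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hand-made sample in RSS 2.0 shape; titles and links are placeholders.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Library News</title>
    <item>
      <title>First headline</title>
      <link>http://example.org/news/1</link>
      <pubDate>Mon, 21 Apr 2014 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/news/2</link>
      <pubDate>Tue, 22 Apr 2014 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def feed_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in feed_items(SAMPLE_FEED):
    print(title, link)
```

An aggregator would fetch the feed text over HTTP on a schedule, run something like `feed_items` on it, and show only the entries it has not seen before.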
When we needed to add a calendar function for library open/close hours to the website, we didn’t have a stock solution within Hippo CMS. Google Calendar offers a nice interface for creating events, and it is especially easy to create repeating events with exceptions, e.g. McKeldin Library is open every Monday 9-11 during the Spring semester except on Labor Day. We use Google Calendar to maintain hours information and then use the provided API to create Hippo documents to display on the website.
[Author's Note: while researching this post I discovered that we use Google Calendar API v2 which will be deprecated in favor of the new v3 API based on JSON data objects instead of the GData format. I'll refer to v2 since that is what we currently use, but any new code should use the new API v3.]
The Google Calendar API v2 is a web service which is built on top of and extends the standard Atom protocol. The documentation provides all you need to extract your data, so you can of course write your own custom code, but it could be a bit of work to do from scratch.
Fortunately Google additionally provides code libraries which handle the parsing for you. See for example the Java Guide which we use since Hippo CMS is implemented using the Java language. We call the Google Calendar GData API to get the list of calendar events, convert them to Hippo documents, and then make them available for navigation on the website.
It would be possible to query Google Calendar in real-time, when the hours page request is made, but for performance and availability reasons we choose to sync the data to Hippo so Google Calendar is only consulted when we initiate a pull of the latest data.
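The trade-off can be illustrated with a small sketch in Python 3. This is not our Hippo implementation, which is written in Java. The class name `HoursCache` and the `fetch_events` callable are hypothetical, and the stand-in fetch function returns made-up events instead of calling Google Calendar.

```python
import time

class HoursCache:
    """Keep a local copy of calendar events, refreshed only on request."""

    def __init__(self, fetch_events):
        # fetch_events is any callable returning a list of events; in
        # production it would wrap the remote calendar API call.
        self._fetch_events = fetch_events
        self._events = []
        self.last_synced = None

    def sync(self):
        """Pull the latest events from the remote service into the local copy."""
        self._events = list(self._fetch_events())
        self.last_synced = time.time()
        return len(self._events)

    def events(self):
        """Serve page requests from the local copy, not the remote service."""
        return list(self._events)

remote_calls = []

def fake_fetch():
    # Stand-in for the remote call; records each invocation.
    remote_calls.append(1)
    return [{"library": "McKeldin", "open": "09:00", "close": "23:00"}]

cache = HoursCache(fake_fetch)
cache.sync()
for _ in range(3):
    cache.events()  # three page requests, no further remote calls
print(len(remote_calls))
```

Page requests keep working from the local copy even if the remote service is slow or unavailable. The cost is that changes appear only after the next sync.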
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a standard API for the harvesting of metadata. We expose our DRUM and Digital Collections metadata for harvesting using OAI-PMH. OCLC uses this API to harvest our metadata for inclusion in WorldCat (see WorldCat Digital Collection Gateway).
- DRUM: http://drum.lib.umd.edu/oai-pmh/?verb=Identify
- Digital Collections: http://digital.lib.umd.edu/oaicat/ (includes a built-in stylesheet for human-computer interaction in the web browser)
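An OAI-PMH request is an ordinary HTTP request with a `verb` argument, and the reply is XML in the OAI-PMH namespace. The sketch below (Python 3) builds request URLs and reads the repository name from an `Identify` reply. The sample reply is hand-made in the shape the protocol defines, with placeholder values. It is not DRUM's actual response, and the helper names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def oai_url(base_url, verb, **arguments):
    """Build a request URL, e.g. verb='ListRecords', metadataPrefix='oai_dc'."""
    params = [("verb", verb)] + sorted(arguments.items())
    return base_url + "?" + urlencode(params)

def repository_name(identify_xml):
    """Return the repositoryName from an Identify reply, or None if absent."""
    root = ET.fromstring(identify_xml)
    return root.findtext("oai:Identify/oai:repositoryName", namespaces=OAI_NS)

# Hand-made sample reply with placeholder values.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2014-04-22T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

print(oai_url("http://drum.lib.umd.edu/oai-pmh/", "Identify"))
print(oai_url("http://drum.lib.umd.edu/oai-pmh/", "ListRecords", metadataPrefix="oai_dc"))
print(repository_name(SAMPLE_IDENTIFY))
```

A harvester such as OCLC's would typically start with `Identify`, then page through `ListRecords` replies to collect the metadata.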
Here are some additional resources you may be interested in exploring.
- OCLC Developer Network, http://www.oclc.org/developer/home.en.html (see Web Services)
- WorldCat Digital Collection Gateway, http://oclc.org/digital-gateway.en.html
Library of Congress
- SRU Search/Retrieval via URL, http://www.loc.gov/standards/sru/
- Chronicling America, http://chroniclingamerica.loc.gov/about/api/
- Name Authorities: http://lccn.loc.gov/no97025481, http://lccn.loc.gov/no97025481/marcxml
Google has not provided an API for Google Scholar.
- Towards a Google Scholar API, http://wowter.net/2014/02/26/towards-google-scholar-api/
- User app written to scrape data from the Google Scholar HTML, https://github.com/ckreibich/scholar.py
- MIT Libraries: APIs for Scholarly Resources, http://libguides.mit.edu/apis