Leave a comment

Solr System

We are in the process of integrating Apache Solr to work with our Fedora-based digital repository on the back-end.  I am not going to pretend that I know and understand all of the technical details about Solr, as listed on their home page.  My layperson interpretation of its features are as follows:

  1. Solr is a standalone enterprise search server with a REST-like API. I think this means that Solr runs on its own and can be accessed via a URL in a web browser.
  2. You put documents in it (called “indexing”) via XML, JSON, CSV or binary over HTTP. At UMD, our “documents” are the FOXML xml files where Fedora stores our metadata
  3. You query it via HTTP GET and receive XML, JSON, or CSV results. We can use a web browser and a URL to construct queries.

At UMD, we ingest content into our Fedora repository via two methods: a home-grown web-based administrative interface for adding images and via batch ingest (which currently requires developer assistance). We use the administrative interface to manage the metadata for our digital objects. However, the administrative interface has always lacked robust reporting capabilities. Solr includes a robust administrative interface of its own that allows for the construction of complex queries and reporting outputs. For me, as a user, this is Solr’s greatest benefit for us. Our Software Systems Development and Research team try whenever possible to put as much knowledge in the hands of the users.  It is a win-win situation.  For them, it eliminates having to answer and investigate really basic questions for me, and for me, it enables me to achieve results and do my work without having to depend on others.

Solr requires first the development of a schema, which is essentially a file that explains what we want to index and how.  Understanding how to read and interpret the schema is a first step to understanding how Solr works. First, you define fields, and these fields are related to our metadata.  In a simple example, a “Title” field in Solr is an index on the <title> tag in our Fedora metadata.  Within a field, we can define how the field acts.  For example, we have defined a field type of “umd_default” that runs a series of filters on our data.  These filters are the key to understanding how searching works in Solr. I’m going to use the following piece of correspondence as an example: Letter by Truman M. Hawley to his brother describing Civil War battle. Includes envelope, September 26, 1862. When Solr indexes this title it does a number of things.  Many of these things are customizable, and this is what is important to understand.

  1. It separates and analyzes each word and assigns locations to them. “Letter” is in location 1 and takes up spaces 0-5 (the space at the end of the word is included in the word)
  2. It determines the type of word (is it alphanumeric? Or just a number? 1864 is just a number)
  3. It removes punctuation. Finally, a place where no one cares about commas.
  4. It removes stopwords. We apply a “StopFilterFactory” filter to remove stopwords. These can be customized. In our system, “by,” and “to,” are considered stopwords and we do not index them.
  5. It converts everything to lower case. Solr does not have to do this. We apply a “LowerCaseFilterFactory”  with the assumption that our users will not need to place emphasis or relevancy on case in searches.
  6. We apply an “AsciiFoldingFilterFactory” that converts alphabetic, numeric, and symbolic Unicode characters which are not in the “Basic Latin” Unicode block into their ASCII equivalents, if one exists. So, for example, a search on “Munoz” will match on “Muñoz”
  7. We apply the “PorterStemFilter” to the Title field.  This filter applies an algorithm that essentially truncates words based on assumptions about endings. In the example above, “describing” becomes “describ” and “battle” becomes “battl.”
What we are left with is indexing on the following terms:

letter truman m hawlei hi brother describ civil war battl includ envelop septemb 24 1864 This means that I could run the following query in Solr q= Title:(truman AND civil AND describ AND battl) and receive this letter as a hit.  Solr still allows for the capability of phrase queries (“Letter by Truman M. Hawley”), or for wildcard searches: (Truman AND Hawl*). Our implementation of Solr currently assumes a boolean “OR” as the default operator in a search string. So, if I thought to myself, I am interested in looking for content having to do with the Civil War in the month of September, I might type into a search box something like “civil september.” How this translates based on our configuration is “Search the Title field for anything containing the term “civil” OR “septemb.” Here are just a few examples out of my over 300 results:

  • The Greek beginning,Classical civilization
  • The Classical age,Classical civilization
  • Ancient civilizations The Vikings
  • Ancient civilizations The Aztecs
  • Ancient civilizations The Mayans
  • The Civil War in Maryland Collection
  • The Celts,Ancient civilizations
  • Acts of faith,Jewish civilization in Spain
  • Brick by brick: a civil rights story

How is this possible? Well, if I investigate how our PorterStemFilter analyses “civilization,” I discover that it becomes “civil.”  Also, as a user, in my brain, I am thinking that I want results that have to do both with the Civil War AND September, and Solr is returning results that have to do with either.  If I manually adjust my search to be a boolean “AND” search – Title:(civil AND September), I only see three relevant results. This might lead me to believe that we should instantly change our default search to “AND” instead of “OR” since obviously, if I type a search into a box and it has two terms, I want to see records with both those terms.  Our current default in our public interface is “AND.” And also, we should turn off the PorterStemFilter because all of those “civilization” hits are annoying. If I want to search for “Civil*” I will search for “Civil*.”

But is it so simple? What is best for the user? What default settings will be most useful for our users? This is a different discussion and I will be working with my colleagues on the Collections side of things to try to answer some of these questions. Solr is so robust, and can be used to fit so many different situations, that truly configuring it in the most effective way is overwhelming, but also exciting.

About these ads

About Jennie Levine Knies

Jennie Levine Knies is the Manager, Digital Programs and Initiatives, University of Maryland Libraries.

Discuss!

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,785 other followers

%d bloggers like this: