Biodiversity Heritage Library for Europe: BHL and its developers: Andreas Kohlbecker

The second BHL-Europe developer to be featured on the blog is Andreas Kohlbecker from the Botanic Garden and Botanical Museum in Berlin. Read on to find out more about his involvement in BHL!

So Andreas, what do you do for BHL-Europe?

Currently, I’m one of the main developers of the search facilities of BHL-Europe. For example, I dedicate much of my time to the Solr server, which is our search engine. This server is responsible for indexing the content stored in the BHL-Europe system and is thus the technological basis for making the content accessible from the portal via search and browse functionalities. Simply put, it’s the connection between the portal, where all information and scans are made available via the web, and the archived and stored data behind it.

When I came to the project in November 2009, I coordinated the integration of BHL and BHL-Europe with the EDIT project (European Distributed Institute of Taxonomy). After that, I dealt with architectural questions regarding the taxonomic intelligence tools of BHL-Europe. I could fill this position because I am both a biologist and a computer scientist. Later, I took care of setting up some core services for the BHL-Europe infrastructure, as well as services that support the development process, such as Jenkins, our continuous integration server.

Can you tell us some more about Solr?

Solr is a very fast open source enterprise search platform written in Java. It offers full-text search, spell checking, faceted search and hit highlighting, and it has tools for dealing with umlauts and special characters. It can even find the correct hits if you don't type the words exactly as they are stored in the index; we call this feature 'fuzzy search'. The full-text indexing and search core of Solr is the 'Lucene' search engine. Solr extends the capabilities of Lucene and has REST-like HTTP/XML and JSON APIs that allow Solr to be used remotely and make its use independent of specific programming languages.
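To illustrate the HTTP API and fuzzy search mentioned above, here is a minimal Python sketch that only builds the URL of a Solr select request. It is not the BHL-Europe code: the host, the core name `bhle` and the field names `title` and `language` are assumptions made for the example. The trailing `~` is the fuzzy operator of the standard query parser, and `wt`, `hl`, `facet` and `facet.field` are common Solr request parameters.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_fuzzy_search_url(base_url, term, facet_field="language", rows=10):
    """Build a Solr select URL for a fuzzy search on a hypothetical 'title' field."""
    params = {
        "q": f"title:{term}~",       # '~' asks for fuzzy matching of the term
        "wt": "json",                # return the response as JSON
        "rows": rows,                # number of hits per page
        "hl": "true",                # enable hit highlighting
        "facet": "true",             # enable faceting
        "facet.field": facet_field,  # hypothetical facet field
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance with a core called 'bhle'.
    url = build_fuzzy_search_url("http://localhost:8983/solr/bhle", "Quercus")
    print(url)
    print(parse_qs(urlsplit(url).query)["q"])
```

Because the request is just a URL, any language with an HTTP client can send it, which is what makes Solr independent of a specific programming language.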

One of the advantages of using Solr is that it can not only create an index but also store the information that has been indexed. We use this feature as a caching mechanism to speed things up in the BHL-Europe portal. For example, if you run a search in the portal, all the information displayed in the result lists comes directly out of Solr, so we don’t have to ask our core system for details on the matching entities every time. The Fedora system is only accessed when you, for example, click on a link to the actual metadata or the book viewer.

The linking element between this core system, Fedora (the system that stores all the scans and the associated metadata), and Solr is gSearch. If you put a new document or book into Fedora, gSearch recognizes this and transforms the internal Fedora data into a Solr document. Then it tells Solr to index this document. This means the Solr index is always up to date with the latest changes in Fedora.
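The transformation step can be pictured with a small Python sketch. It is not how gSearch itself is implemented; it only shows what a Solr XML update message (`<add><doc><field name="...">`) looks like when it is built from a metadata record. The record layout and the field names `id`, `title` and `author` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def to_solr_add_message(record):
    """Turn a flat metadata record (dict) into a Solr XML <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        # A multi-valued field is written as several <field> elements.
        values = value if isinstance(value, list) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record; identifiers and field names are illustrative only.
    record = {
        "id": "example:1",
        "title": "Flora of Example",
        "author": ["First Author", "Second Author"],
    }
    print(to_solr_add_message(record))
```

A message like this would be posted to Solr's update handler, followed by a commit, so that the new document becomes searchable.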

Has the development of the BHL-Europe search facilities been difficult?

For the development process in general, one of the biggest difficulties was the fluctuation of staff, which is very common in projects like this.

Another big challenge, more specific to search, was to make the search functionality of the portal as smooth and effective as possible. We had to align the different search options that are offered by the portal: simple search, advanced search, faceted search and browsing by several categories. Improving these aspects of the user interface was and still is an iterative process. In the beginning, we had wrong expectations of what users wanted, and it was difficult to foresee all the hurdles on the way to a functional system.

Other difficulties were due to a design principle of Solr/Lucene which prevents you from reliably using wildcards and umlauts in combination. This might be OK for the Anglo-American linguistic area, but it is effectively a bug if you have to deal with languages full of diacritic characters, as in Europe.
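One common way to work around this kind of problem is to fold diacritics in the client before a wildcard query is sent, so that the query term looks like the terms in the index. The Python sketch below shows that idea only; it is not described in the interview as the BHL-Europe solution, and it assumes that the indexed field was lowercased and folded to plain ASCII in the same way, which is a configuration choice and not a given.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks, e.g. 'ü' becomes 'u' and 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_wildcard_term(term):
    """Fold and lowercase a wildcard term; '*' and '?' pass through unchanged."""
    return fold_diacritics(term).lower()


if __name__ == "__main__":
    print(normalize_wildcard_term("Müll*"))
    print(fold_diacritics("Linné"))
```

Characters that have no decomposition into a base letter plus a combining mark, such as 'ß', are left untouched by this approach and would need an explicit mapping.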

You mentioned Jenkins before. Can you tell us something about this integration environment?

In the previous interview, Chris Sleep talked about GitHub, the source code repository and version control system. Jenkins is the next step in the development workflow. It helps us automate repetitive and crucial tasks: all code changes made by developers need to be built and deployed to testing servers. The integration server holds the whole BHL-Europe system and always reflects the most recent state of development. Once the system there has proven to be mature enough to be tested, the BHL-Europe system components are deployed to the next stage, the testing servers. There, in-depth testing can then take place in a stabilized environment. Once testing has proven the system to be ready for release, it is deployed to the last stage in the chain, the production servers, and the new features and bug fixes become public. Jenkins is responsible for performing these build and deployment tasks. For example, it monitors our GitHub repository for changes and triggers specific deployment jobs in turn.
