
Conference Report


D-Lib Magazine
November/December 2008

Volume 14 Number 11/12

ISSN 1082-9873

The Use of Digital Object Repository Systems in Digital Libraries (DORSDL2)

ECDL 2008 Workshop Report

 

Gert Schmeltz Pedersen
Technical Information Center of Denmark
Technical University of Denmark
<gsp@dtic.dtu.dk>

Kåre Fiedler Christiansen
The State and University Library, Denmark
<kfc@statsbiblioteket.dk>

Matthias Razum
FIZ Karlsruhe
<Matthias.Razum@FIZ-Karlsruhe.de>


The 2nd European Workshop on the use of Digital Object Repository Systems in Digital Libraries (DORSDL2) <http://dorsdl2.cvt.dk/>, held in conjunction with ECDL 2008 <http://www.ecdl2008.org>, took place on 18 September 2008 in Aarhus, Denmark. The workshop was attended by 40 people (including speakers) from 10 countries. (The 1st DORSDL Workshop, <http://www.lib.uoa.gr/dorsdl/>, took place on 21 September 2006 in Alicante, Spain; see <http://www.dlib.org/dlib/october06/pedersen/10pedersen.html>.)

Digital libraries and digital repositories are – in many ways – two sides of the same coin. The DORSDL2 workshop brought together researchers and practitioners from both fields, and aimed to transfer knowledge and build connections between them. The target group for the workshop consisted primarily of repository researchers, developers, and managers. The workshop addressed both experiences and novel concepts with a technological and/or organizational stance.

The workshop covered a variety of practical digital library development issues and how their resolution can (or cannot) be carried out in the context of the digital object repository at hand. The full-day workshop comprised three sessions: "Applications", "Architectures", and "Search", each with an invited speaker. These sessions were followed by a concluding discussion.

The first session ("Applications") was opened by the invited speaker Sandy Payette (Executive Director, Fedora Commons, USA). In her presentation "Repositories: Disruptive Technology or Disrupted Technology?", Sandy considered the role of digital object repositories in the context of cloud computing and highly distributed systems. After a first generation of systems focusing on institutional repositories and digital library applications, and a second generation embracing Web 2.0 techniques like annotations or collaborative filtering, we are currently on the verge of a third generation, which will cover the data-intensive aspects of e-Science and e-Research. This requires a shift from "repository islands" towards distributed, web-oriented, open, and interoperable infrastructures. In such an environment, "the repository" might no longer be a well-defined place in the library, but rather an entry point to a highly distributed fabric of storage and services. The recently released Fedora 3.0 software provides a logical next step in this direction.

The next speaker was Lodewijk Bogaards (Data Archive and Networked Services, The Netherlands). In his presentation "Easy On Fedora – Using eSciDoc; turnkey access?", Lodewijk pointed out that current repository systems lack a middleware layer with higher-level services that ease the implementation of more complex applications. His use case, archival software for research data sets from the arts, humanities and social sciences, showed that adding middleware to the software stack helps developers concentrate on their domain-specific business logic instead of having to "re-invent the wheel". The middleware of their choice, the freely available eSciDoc Infrastructure, provided them with a set of predefined content models, integration into their authentication system, elaborate search capabilities, and more.

The last speaker in this session was Elsebeth Kirring (The State and University Library, Denmark). In her talk "Building a User Oriented Automatically Generated Custom User Interface", she described the challenge of ingesting non-textual objects into their repository and describing them in an efficient and adaptive process. Based on the latest release of the Fedora Commons software, her team created content-model-driven software that generates user interfaces for capturing metadata. Each content model represents one of the different object types in the repository.

The second session ("Architecture") began with the invited speaker Herbert Van de Sompel (Los Alamos National Laboratory) with a talk titled "What to do with a million books: aDORe for storage and access". Herbert presented the impressive work done in Los Alamos to combine standards-based components into a system that was hugely scalable and extensible. The fact that the system is based on open standards made it very well suited for interoperability with other systems, and thus the talk tied in well with the talks from the previous session. In today's world of ever-increasing amounts of data, storing, referencing and retrieving data is becoming a huge challenge, which aDORe addresses well. At the end of the talk, Ryan Chute presented djatoka, an impressive open source JPEG 2000 image server developed for, but independent of, aDORe. The djatoka server provides on-the-fly dissemination of zoomed, scaled, and watermarked images, and much more.

The next presenter was Dave Tarrant (University of Southampton), speaking about "Applying Open Storage to Institutional Repositories". He presented an architectural view on preservation. Rather than looking at individual pieces of software, the Preserv2 project tried to look at an overall architecture that facilitated preservation. This has led to the concept of "Open Storage", a self-checking, self-healing file-system, based on open components, both for hardware and software. Having Open Storage under your repository, whatever software you choose, will provide the benefits of a reliable storage component, and will give you a single point where you can do your storage-related preservation tasks.
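The "self-checking, self-healing" idea can be illustrated with a minimal sketch. It is not the Preserv2 design, which the talk summary does not detail; it merely assumes that several replicas of an object are kept alongside a recorded SHA-256 checksum, and that a damaged replica is overwritten with an intact one. The function names `sha256_hex` and `scrub` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: list[bytes], expected: str) -> list[bytes]:
    """Self-check every replica against the recorded checksum and
    self-heal damaged ones by copying an intact replica over them.

    Assumption: at least one replica must still be intact; otherwise
    nothing can be repaired and an error is raised.
    """
    good = next((r for r in replicas if sha256_hex(r) == expected), None)
    if good is None:
        raise ValueError("no intact replica left to heal from")
    return [r if sha256_hex(r) == expected else good for r in replicas]


# Usage: one of three replicas has been silently corrupted.
original = b"contents of a deposited object"
recorded = sha256_hex(original)
healed = scrub([original, b"bit-rotted bytes", original], recorded)
```

Running such a check below the repository layer is what lets the storage component stay independent of the repository software chosen on top of it.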

The third speaker was Asger Blekinge-Rasmussen (The State and University Library, Denmark) on "Digital Repositories and Data Models". He presented work done on Fedora to describe content models in a precise and machine-readable manner. Fedora provides great possibilities for having data of very different kinds in the same repository, but only with Fedora 3.0 has it begun to scratch the surface of a uniform way of describing what kinds of data you actually have. Asger presented work on providing a description based on the OWL and XML Schema standards, which extends the Fedora description of Content Models. Use cases are validation, automated applications (like the user interface described by Elsebeth Kirring in the previous session), and the exchange of Content Model Descriptions between institutions.

As the last speaker in the second session, Alex Wade (Microsoft Research) presented "An Introduction to the Microsoft Research-Output Repository Platform". The Microsoft Research-Output Platform is software in development for registering, describing, and accessing research output. Research output was broadly defined, and included, for instance, publications, files, and datasets. Perhaps the most interesting part of the software was the widespread use of RDF (Resource Description Framework) for describing relations of various kinds, such as citations and authorship. These relations were indexed in a Microsoft SQL Server with RDF extensions (Famulus) that was reported to perform much better than traditional triple stores while providing much of the same functionality. This was used to browse data and their relationships graphically. The platform implemented many standards to facilitate interoperability, and implementation of more standards is planned, including OAI-ORE.

The third session of the workshop ("Search") began with the invited speaker Robert Tansley (Google Inc.), who talked about "Science in the Cloud: Google and Sharing Huge Datasets". He discussed how to bring large-scale results from e-Science to the attention of a wider audience, with an approach similar to cloud computing.

Christian Kohlschütter (iSearch IT Solutions GmbH) talked about "Enhanced Federated Search for Digital Object Repositories (DOR)". The aim of this project is to enable users to perform searches in multiple DORs simultaneously, with all the features of a single DOR system, and to enable search between heterogeneous/incompatible DORs without changing the underlying workflow. A reference implementation based upon Lucene has shown good results with efficient faceted browsing functionality.

Finally, Gert Schmeltz Pedersen (Technical University of Denmark) talked about solutions to filtering of search results by access constraints, as defined by XACML policies, in order to show only those search hits that the user is actually permitted to read. Post-search filtering requires a request to the XACML mechanism for each hit, and the total number of permitted hits is only known at the end, a costly procedure especially when few hits are permitted out of a large number. In-search filtering requires additional index fields and query rewriting, that is, a logical partitioning of the index. Pre-search filtering requires a physical partitioning of the index and selection of the pertinent index at query time. Both in-search and pre-search filtering face the challenge of exact correspondence between the filtering mechanism and the XACML policies. A preliminary implementation within the Fedora Generic Search Service ("GSearch") facilitates further evaluation.
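The three strategies can be contrasted in a small sketch. It is a toy model, not the GSearch implementation: the in-memory `DOCS` list, the role-based `allowed` field, and the `policy_permits` function (a stand-in for one request to the XACML mechanism) are all assumptions made for illustration.

```python
# Hypothetical toy index: each document carries the roles allowed to read it.
DOCS = [
    {"id": "doc1", "text": "fedora repository search", "allowed": {"staff", "public"}},
    {"id": "doc2", "text": "fedora access policies", "allowed": {"staff"}},
    {"id": "doc3", "text": "image server", "allowed": {"public"}},
    {"id": "doc4", "text": "fedora content models", "allowed": {"admin"}},
]

policy_calls = 0  # counts simulated requests to the policy mechanism


def matches(doc, term):
    """Toy full-text match: the term equals one of the document's words."""
    return term in doc["text"].split()


def policy_permits(role, doc):
    """Stand-in for one XACML request (assumption: role-based read access)."""
    global policy_calls
    policy_calls += 1
    return role in doc["allowed"]


def post_search(term, role):
    """Search first, then ask the policy mechanism once per hit."""
    hits = [d for d in DOCS if matches(d, term)]
    return [d["id"] for d in hits if policy_permits(role, d)]


def in_search(term, role):
    """Rewrite the query to 'term AND allowed:role' over an extra index field."""
    return [d["id"] for d in DOCS if matches(d, term) and role in d["allowed"]]


def build_partitions(docs):
    """Physically split the index into one partition per role."""
    partitions = {}
    for d in docs:
        for r in d["allowed"]:
            partitions.setdefault(r, []).append(d)
    return partitions


PARTITIONS = build_partitions(DOCS)


def pre_search(term, role):
    """Select the role's partition at query time, then search only there."""
    return [d["id"] for d in PARTITIONS.get(role, []) if matches(d, term)]


post_hits = post_search("fedora", "public")
calls_used = policy_calls
in_hits = in_search("fedora", "public")
pre_hits = pre_search("fedora", "public")
```

In this toy data, post-search filtering issues three policy requests to return a single permitted hit, which mirrors the cost noted above. The other two variants issue none at query time, but only because the `allowed` field and the partitions duplicate the policy information, and keeping that duplicate exactly in step with the policies is the correspondence challenge the talk raised.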

The concluding discussion

Repositories are evolving from stand-alone systems to nodes in an academic knowledge network. Interoperability not only at the metadata level but also at the object level is becoming more and more important. Protocols like OAI-ORE and standardized deposit interfaces will allow for the tight integration of repositories with known products.

Going from publications and dissertations to research data sets and other, more demanding data types, as well as to large-scale repositories, requires a more flexible storage strategy. Content-addressable storage systems (e.g., Sun's ST5800 "Honeycomb"), the Grid, and cloud computing (e.g., Amazon's EC2 and S3, but also a local cloud at your university) are surfacing as options for repository setups. In the future, a repository may just be a service overlay to a distributed storage architecture.

At the same time, it is getting more and more important to link objects across institutional and repository boundaries. The publication may be deposited in your institutional repository, whereas your visualized data is stored in a specialized image repository, and the raw data resides in the grid. Allowing users to seamlessly navigate from one repository to another while maintaining the context of the objects and the meaning of the relations will be one of the next big challenges.

Digital Object Repositories have a bright future. However, tomorrow's systems will differ substantially from the systems we know today.

Copyright © 2008 Gert Schmeltz Pedersen, Kåre Fiedler Christiansen and Matthias Razum

doi:10.1045/november2008-pedersen