Size Isn't Everything: Sustainable Repositories as Evidenced by Sustainable Deposit Profiles


D-Lib Magazine
July/August 2007

Volume 13 Number 7/8

ISSN 1082-9873

Leslie Carr and Tim Brody
{lac, tdb01r}@ecs.soton.ac.uk
University of Southampton

Abstract

The key to a successful repository is sustained deposits, and the key to sustained deposits is community engagement. This article looks at deposit profiles automatically generated from OAI harvesting information and argues that a profile characterised by occasional large-volume deposits is a sign that the repository has failed to embed in institutional processes. The ideal profile for a successful repository is discussed, and a new service that ranks repositories based on these criteria is implemented.

The Problem of Evaluating Repositories

The definition of an institutional repository as "a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members" (Lynch 2003) has remained an accurate reference point for technical researchers and IT managers alike in the four years since it was coined. Whether the objective is facilitating open access to research publications, building scholarly collections, creating learning objects, archiving scientific data or preserving content for the long term, the key is to offer these services to the members of the university community. One of the measures of repository success should therefore be the university community's take-up of these services.

However, at the time of writing, the most common way to measure the relative success of repositories is to compare the gross number of items that they hold. Registry services such as ROAR (the Registry of Open Access Repositories, roar.eprints.org) and OpenDOAR (Directory of Open Access Repositories, www.opendoar.org) record various attributes of repositories (their location, scope and platform), but the most obvious attribute to measure success is the number of items in a repository.¹ Davis and Connolly (2007) identify a problem with this strategy: a repository can exhibit respectable overall growth that is attributable mainly to special-case batch imports.

If it is true that community take-up is the foundation of the repository (without staff using the repository's services there would only be an empty repository), then it would be preferable to find a simple way to measure and report that take-up, a way that is achievable automatically and from outside the institution (so that it can be easily and frequently applied to all repositories). Deposits must be fundamental to this measure, as take-up is evidenced by members of the community depositing their materials (be they publications, lecture notes, scholarly items, scientific datasets...), whereas a lack of engagement is evidenced by an absence of deposits. Although a lack of deposits is frequently discussed in the context of an Open Access agenda (e.g., as a failing of the Self-archiving methodology), it is an equal problem for any repository, whether or not the repository is primarily intended to deliver Open Access.

Xia and Sun (2007) attempt to develop such an evaluation of repositories, but they base it on depositor identity (which conflates author and editorial processes) and full text percentages (difficult to determine), and they selectively apply these criteria to a small number of repositories. This article attempts to develop some simple metrics of "community take-up" that are available to external observers by analyzing the results of OAI-PMH harvesting. The metrics are demonstrated by embedding them into the ROAR registry of Institutional Repositories.

Large Repositories

Figure 1 charts the number of items in institutional repositories over a threshold of 10,000 records, as listed by ROAR on February 1, 2007. The largest (Cambridge University, UK) contains almost 180,000 digital items. These are all repositories that have achieved an obvious measure of success, featuring in the top 11% (by number of items held) of the institutional repositories catalogued by that registry.

Figure 1: Repositories containing more than 10,000 records

ROAR takes its data from Celestial, an OAI-PMH harvesting proxy that caches the latest version of every metadata record that is harvested from each repository in the world, including information about when each record first appeared.² It is therefore possible not only to determine the size of each repository at any instant, but also to build up a picture of its growth over time. In particular, the pattern of daily deposits can be analysed for each institution, and from that information some understanding of faculty-repository engagement can be gained.
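As a rough illustration of how such a profile could be derived, the sketch below counts deposits per day from a list of first-seen dates. The function names and the sample dates are hypothetical; the data format and interface that Celestial actually uses are not described here.

```python
from collections import Counter
from datetime import date


def daily_deposit_profile(first_seen_dates):
    """Map each calendar day to the number of records first seen on that day."""
    return Counter(first_seen_dates)


def active_days(profile, year):
    """Count the days in the given year on which at least one deposit was made."""
    return sum(1 for day, count in profile.items() if day.year == year and count > 0)


# Hypothetical first-seen dates for five records (not real harvest data).
sample = [
    date(2006, 1, 3),
    date(2006, 1, 3),
    date(2006, 1, 4),
    date(2006, 6, 19),
    date(2007, 1, 2),
]
profile = daily_deposit_profile(sample)
print(active_days(profile, 2006))  # 3 distinct deposit days in 2006
```

The per-day counts give the deposit profile, and the number of active days is the quantity plotted for each repository in the figures that follow.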

Figure 2: Days in 2006 in which any items were deposited in large repositories

In Figure 2 the ordering of repositories along the horizontal axis is the same as in Figure 1 (largest to the left), while the vertical axis shows deposit activity as the number of days on which deposits were made into the repository between January 1 and December 31, 2006. This graph reveals a big disparity in the use of these repositories for deposit: some of those with the biggest headline numbers are used relatively infrequently. In fact, half of these large repositories are used for deposit less than half of the year (100 days or fewer). Comparing all 236 institutional repositories rather than just the largest (Figure 3), we can see that many of the smallest repositories are as active as some of the largest, although there is a general trend for smaller repositories to be used (i.e., to receive deposits) on fewer days. Of course, if they had more deposits on more days then they would be larger!

Figure 3: Days in 2006 in which items were deposited in all repositories

But Figure 4 shows that larger repositories do not necessarily receive deposits more often. Each chart shows a separate repository, with the days of the year across the horizontal axis and the number of deposits received per day on the vertical axis. In these charts the number of deposits is plotted on a logarithmic scale so that the occasional huge deposits don't swamp the more frequent small ones. Two of the repositories have very 'gappy' deposit records, indicating many days of inactivity between (often numerically high) deposits, while the others have more continuous daily deposit activity.

Figure 4-1: Daily deposits in DEEPBLUE.LIB.UMICH.EDU

Figure 4-2: Daily deposits in DSPACE.LIBRARY.UU.NL

Figure 4-3: Daily deposits in DSPACE.MIT.EDU

Figure 4-4: Daily deposits in EPRINTS.SOTON.AC.UK

Figures 4-1, 4-2, 4-3 and 4-4: Daily deposit rates in four large repositories

Repository Deposit Activity

Some repositories receive infrequent but high-volume deposits (many hundreds or thousands in an individual day), whereas others benefit from more regular but lower-volume inputs. Is there any significant difference between the two cases? Does it matter if a repository receives a daily fillip or a monthly boost if the numbers in both cases average out to provide a healthy year-on-year growth? Is there any significance in the fact that deposits appear only intermittently?

Since individuals do not create lectures or papers to fit in with repository timetables, it is likely that deposits would naturally arrive on an apparently random schedule. If we accept the Lynch (2003) definition of a repository as a set of services offered to the whole community within an institution, then we would expect to see evidence of whole-community engagement within the daily deposits. So unless some behind-the-scenes scheduling were controlling users' interactions with the repository (e.g., physicists devote Mondays to the repository), deposits would also appear randomly spread across the whole community and the whole subject range of the repository.

It is possible to make some back-of-the-envelope estimates of the expected deposit rate for an ideal 'average' institutional repository: an institution will have on the order of 1,000 faculty,³ each of whom might create 10 items per working year, e.g., four articles, two presentations, a poster, a set of research data and two teaching resources. That makes a not-unreasonable figure of 10,000 items to be deposited into the institutional repository over the course of a whole year. If there are approximately 220 working days per year, then an average of about 45 items (roughly 50) would need to be deposited per working day to achieve the target of 10,000 items per year. (In fact, many repositories seem to attract deposits on almost every day of the year, whether a weekend, a national holiday or part of a seasonal break.)
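Written out, the estimate runs as follows. The symbols N (items per year) and d (mean deposits per working day) are introduced here purely for illustration:

```latex
N = 1000 \times 10 = 10\,000 \ \text{items per year}, \qquad
\bar{d} = \frac{N}{220} \approx 45.5 \ \text{items per working day}
```

This is of the same order as the round figure of 50 deposits per day.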

Without an intimate statistical knowledge of institutional staffing and management practices across the world, it may be difficult to come up with a more concrete estimate for an expected deposit rate. Such a figure could be determined for a specific institution, but without global agreement on terms like 'faculty', these measurements would be difficult to compare meaningfully. In a well-known science fiction comedy (The Hitchhiker's Guide to the Galaxy) the author Douglas Adams coined a similarly vague unit of measurement: "R is a velocity measure, defined as a reasonable speed of travel that is consistent with health, mental wellbeing and not being more than say five minutes late". In the same spirit we offer the following: D is a deposit measure, defined as a reasonable rate of ingest that is consistent with capturing the community's scientific and scholarly output. Given the very approximate estimates used to come up with a figure for D, we can make some broad statements about the expected properties of an active repository, one that is embedded into institutional processes and used by a broad range of staff. Such a repository should exhibit daily deposit activity whose graph (above) has the daily bars mainly concentrated in the central (10-100 deposits/day) region on the vertical axis. If the repository had reached the state of maturity where a thousand individuals were randomly depositing items independently of each other, and each depositor had a probability of 10/220 of depositing an item on any given day, then the Poisson distribution would predict extreme daily deposits outside the range 25-75 only once per decade.
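The 'once per decade' figure can be checked with a short calculation. The sketch below assumes a Poisson model with a mean of 1,000 × 10/220 deposits per working day, treats the range 25-75 as inclusive, and takes a decade to be 2,200 working days. These readings, and the function names, are assumptions made for illustration rather than details given in the text.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    # Work in log space so that large factorials do not overflow.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def prob_outside(low, high, lam):
    """Probability that a Poisson count falls outside the inclusive range [low, high]."""
    inside = sum(poisson_pmf(k, lam) for k in range(low, high + 1))
    return 1.0 - inside


depositors = 1000           # faculty, from the estimate in the text
p_daily = 10 / 220          # chance that one person deposits on a given working day
lam = depositors * p_daily  # mean deposits per working day, about 45.5

p_extreme = prob_outside(25, 75, lam)
days_per_decade = 10 * 220  # assumed: 220 working days in each of ten years
print(f"P(outside 25-75) on one day: {p_extreme:.2e}")
print(f"Expected extreme days per decade: {p_extreme * days_per_decade:.2f}")
```

Under these assumptions the expected number of extreme days per decade comes out at roughly one, in line with the claim.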

To complicate this simple model, repositories based on software such as DSpace and EPrints are designed to receive individual deposits and then marshal them into a workflow for editorial inspection and acceptance. Not all EPrints repositories insist on this; some institutions adopt the policy that visible responsiveness to faculty submissions is more important than editorial oversight that can be applied after the fact (or not at all). It may be that any system of editorial management means that deposits are inevitably going to be "batched up" to give a less-than-continuous profile in which daily deposits are dominated by one or another editor's subject specialty. This is a potential explanation for the difference between a continuous and a 'gappy' deposit profile. A repository may be partitioned into a number of communities, each of which has its own editorial processes. But in a well-embedded repository, the deposits will be randomly spread across the whole institution and the whole year; that is, shared out across all the individuals and departments in an institution, and hence all the communities and collections in the repository. As such, the overall total would not be subject to the delay of any one editor in particular or to any one school's processes. Of course, each component of that total will be subjected to some delay or frustration, but taken together the repository will be subject to a range of unpredictable workflow timings whose net effect is to militate against very short, very high peaks (that are dozens of times greater in size than a normal day).

By contrast to the effects of 'normal' repository operation, batch inputs of legacy collections (for example, existing multimedia collections or historical sets of pre-digitised Ph.D. theses) may inflate the daily figures. These pre-digitised and pre-catalogued resources can be easily adapted for high-throughput ingest and are often thought of as "low hanging fruit" as they give a repository the opportunity to easily gain in size. Such opportunities are a positive encouragement for users and managers of the repository, but they are not a replacement for genuine, broad-spectrum self- or mediated-deposits from a wide range of schools, departments, topics, and users. Infrequent, high volume deposits may make up the numbers in the early stages of a repository, but they expose potential weakness if, as special cases (existing digitised collections), they substitute for (or occlude the need for) popular (self- or mediated-) deposit on a regular basis.

Self-archiving is a term commonly associated with Open Access, but even if the agenda that motivates a repository is Scholarly Collections (or Preservation, Teaching or Data Archiving), then a broad-spectrum buy-in by the faculty and research staff is a necessity to fulfill the objectives of the repository. Collecting the intellectual output of an institution's staff requires a focus on their current activities and current output, and an engagement by the staff to use the repository services to start curating and depositing their current work on a systematic basis.

Monitoring Repository Deposits with ROAR

In order to examine the performance of repositories according to the criteria established above, ROAR has been extended to allow examination of the daily activity of any of its registered repositories. Figure 5 shows the main adjustment, a histogram of instantaneous daily deposits (blue) superimposed on each graph of cumulative repository sizes (green) on the main repository listing pages.

Figure 5: ROAR reports enhanced with daily deposit data

As well as linking to each repository's cumulative data as a graph or table, ROAR now offers the user several ways of examining deposit activity. First, a six-year history bar chart is superimposed on the cumulative graph (as described above). Second, the days in the previous year on which deposits were made are counted under three categories: days with 1-9 deposits, 10-99 deposits and 100+ deposits respectively. These three categories roughly correspond to "weak", "healthy" and "batch imports" as discussed above. They have also been added to the repository-ranking menu (Figure 6), to enable a comparison of repositories on these bases. (Note that cross-institutional, thematic and departmental repositories serve communities of different sizes and should not be judged in the same way.)
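A minimal sketch of this three-way banding is shown below. The band boundaries come from the text; the function names, the labels used as dictionary keys, and the sample counts are hypothetical.

```python
from collections import Counter


def deposit_band(count):
    """Return the band for one day's deposit count, or None for a day with no deposits."""
    if count >= 100:
        return "100+"
    if count >= 10:
        return "10-99"
    if count >= 1:
        return "1-9"
    return None


def band_days(daily_counts):
    """Count how many days fall into each band."""
    bands = Counter({"1-9": 0, "10-99": 0, "100+": 0})
    for count in daily_counts:
        band = deposit_band(count)
        if band is not None:
            bands[band] += 1
    return dict(bands)


# Hypothetical daily deposit counts for one week (not real ROAR data).
print(band_days([0, 3, 12, 45, 7, 1500, 0]))
# {'1-9': 2, '10-99': 2, '100+': 1}
```

Days with no deposits fall into none of the three bands, so the band totals also give the number of active days.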

Figure 6: Sort by Deposit Activity

Further links provide access to a static histogram of the deposit profile for the previous year (with enough space for individual days to be clearly seen and weekend breaks to be noticeable) and to a table listing each deposit on each day in the last year (together with the OAI sets in which it appears) in tab-separated text format for further analysis as a spreadsheet.

Figure 7: Clickable SVG graph showing an individual day's deposit breakdown

Finally, there is a link to a separate page containing an interactive graph that allows the user to select an individual day to see its OAI records and containing sets (Figure 7). On that page, each OAI identifier is linked to its harvested OAI record and also to the repository abstract page describing that OAI resource. This information is provided by Celestial, the proxy OAI-PMH harvesting service (celestial.eprints.org) that maintains the databases of OAI holdings upon which ROAR, Citebase and other services are built. Celestial has previously been used as an invisible part of the OAI infrastructure for these services, but the data that it holds is very valuable. Thus far, ROAR has relied on Celestial to create the graphs of repository sizes, and now it has been extended to allow examination of these collections of deposits in ways not normally provided by the repositories themselves.

The report in Figure 7 shows that on October 19, 2006, 8 records were added to the 'CSAIL Technical Reports' set in the MIT DSpace repository. It further shows that before the start of this year there were 213 items already deposited in this set, and that during this year 83 further items were added to the set, of which 8 were added on this specific day.

A Note on OAI sets

Most repositories provide a mechanism for showing subject classifications or the institution's organisational structure as a prominent part of the user interface. By contrast, the OAI-PMH protocol allows a repository to divide its total collection into named 'sets' that can be seen by software harvesters (OAI service providers). The meaning of these sets is not defined by the OAI protocol, and developers are free to interpret them as they wish. In particular, individual items may appear in many sets, or in no sets. DSpace repositories tend to use sets to reflect their collections structure, while EPrints repositories expose both the subject classifications and institutional structure. Other repositories simply maintain sets of 'published' or 'fulltext' deposits. Although sets are not a conclusive indication of the spread of deposit items, with some care in interpretation they allow the stories behind deposit peaks and troughs to be investigated, helping to determine common practice in large repositories. For example, they reveal when a large peak (or repeated peaks) results from importing items into a single (or narrow range of) topic(s) or collection(s).
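One way to apply this in practice is to tally a single day's deposits by set and see how much of a peak falls into the most common set. The sketch below assumes each deposited record comes with a list of OAI set names; the function names, record identifiers and set names are hypothetical. Because an item may belong to several sets or to none, the tallies need not add up to the number of records.

```python
from collections import Counter


def set_breakdown(records):
    """Tally deposits per OAI set for one day; records maps identifier -> list of set names."""
    tally = Counter()
    for set_names in records.values():
        # An item may sit in several sets, or in none; count each distinct set once per item.
        tally.update(set(set_names))
    return tally


def dominant_set_share(records):
    """Return (set name, fraction of the day's records in it) for the most common set."""
    tally = set_breakdown(records)
    if not tally:
        return None, 0.0
    name, count = tally.most_common(1)[0]
    return name, count / len(records)


# Hypothetical deposits on a single day.
day = {
    "oai:example:1": ["theses"],
    "oai:example:2": ["theses", "physics"],
    "oai:example:3": ["theses"],
    "oai:example:4": [],
}
print(dominant_set_share(day))  # ('theses', 0.75)
```

A share close to 1 on a peak day suggests a batch import into a single collection rather than deposits spread across the institution, though set semantics vary between platforms and the result needs the same care in interpretation.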

Using Deposit Measures to Understand Repositories

We applied the deposit criteria presented above to the twenty largest institutional repositories listed by ROAR to determine whether there is evidence of double-digit daily deposits spread across the whole institution during the twelve months from March 2006. In doing so, we augmented the automated statistics provided by ROAR with a manual inspection of the repositories, particularly listings of their collections (or equivalent). Each repository is categorised against the double-digit daily deposits criterion (DDDD; values are Yes, No or Partial) and the topical spread criterion (SPREAD; values are Yes, No, Partial or Unknown). The results are presented in the extended table below.

Table 1

Location and Assessment

Deposit Graph

Comments

DSpace at Cambridge


DDDD:	N
SPREAD:	N