Robert H. McDonald
This article summarizes findings from a study of author/depositor distribution patterns within scholarly digital repositories. At the moment, evaluative frameworks are in short supply for institutional and disciplinary repositories (Kim & Kim, 2006). After a review of issues surrounding scholars' participation in digital repositories, author/depositor distribution is analyzed as one possible technique for judging the success of a repository. This statistical technique was used to evaluate participation patterns among more than 30,000 author/depositors whose works were found in various categories of digital repositories. Findings from this analysis, including comparisons of participation patterns across three categories of scholarly repositories, are presented along with an explanation of the questions and challenges that arose during the study. The article concludes with an evaluation of the analytical technique and its potential as one metric for judging a repository's success.
Digital repositories may assume many forms. The Andrew W. Mellon Foundation's operating definition of a repository is a good starting point:
"A repository is a networked system that provides services pertaining to a collection of digital objects. Example repositories include: institutional repositories, publisher's repositories, dataset repositories, learning object repositories, cultural heritage repositories, etc." (Mellon, 2006)
The Coalition for Networked Information's Executive Roundtable characterized repository categorization as "very problematic" (CNI, 2003). While content within them varies substantially, repositories can be categorized in terms of who funds and administers them. The two main categories in this classification are "disciplinary" and "institutional" (Ibid.). Within the realm of institutional repositories, a further important sub-division is emerging rapidly: those whose sponsoring institutions do not mandate faculty/scholar participation (referred to as "voluntary-deposit institutional" in this article) and those that do (referred to as "mandatory-deposit institutional" in this article). Regardless of a repository's category, securing participation by scholars, meaning their willingness to deposit copies of their research output, cannot be taken for granted. In fact, achieving significant participation rates, particularly in institutional repositories, is cited repeatedly as the most challenging aspect of establishing scholarly repositories (Lynch & Lippincott, 2005).
Participation by contributors is one of the most important indicators of a scholarly digital repository's success. Some repository sponsors try to recruit contributors by emphasizing the need to preserve and measure use of research output (Day, 2004), but the factors that actually motivate scholars to deposit their work are more complex. Recent studies (Swan, et al., 2005; Foster & Gibbons, 2005; Kennan & Wilson, 2006) confirm earlier suspicions that a desire to enhance an institution's prestige or to enable more systematic and automated assessment of scholarly productivity within an organization is most certainly not a motivating factor for repository participants. Instead, the choice to deposit research output usually stems from desires for personal recognition and impact among one's peers (Ibid). Unfortunately, gauging the impact of a deposited paper, report or similar work is not easy for either an individual scholar or a repository manager. Yeomans (2006) reminds us that even the most mature repositories, such as the physics community's arXiv, may generate impressive statistics, but offer little to help anyone know what kind of "success" those figures measure. Some have suggested utilizing the criticized but widely-employed personal "impact analysis" techniques that grew from citation analysis theory (Day, 2004). However, even the creator of modern citation analysis theory warned such techniques "are easily misinterpreted or inadvertently manipulated for improper purposes," and should take into account differing publishing and citation customs across disciplines, the practice of self-citation, and reasons why individual works are cited (Garfield, 1983).
Furthermore, desires for recognition and impact still are not enough to ensure scholars will participate in institutional repositories. Even institutions with mandates requiring faculty deposits face the enduring task of encouragement and mandate enforcement (Sale, 2006). Sale (2007) estimates only 15-20% of faculty will ever choose to participate in voluntary-deposit repositories of any type. Imagine, therefore, the difficulties involved in encouraging or predicting participation in voluntary-deposit institutional and disciplinary repositories!
Because scholarly contributions are a vital part of successful repositories, finding a meaningful measure of participation is an important step in developing comprehensive repository evaluation frameworks. One such measurement is to compare the actual number of contributors and their actual numbers of deposits against the total universe of possible depositors and their total research output. Sale (2006) employed this technique to track one university's successes in requiring faculty to deposit copies of each paper they publish. Ongoing measurement in this manner presents challenges. It requires detailed tracking of individual identities and each scholar's output. Additionally, comparing actual participation rates against the benchmark of possible participation rates is very likely to yield discouraging results, particularly for repositories that cannot leverage an institutional or other mandate to gather content.
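The benchmark comparison described above reduces to two ratios. The following is a minimal sketch only; the function name and the figures are hypothetical and are not taken from Sale (2006) or from this study.

```python
def participation_rates(depositors, eligible_scholars, items_deposited, total_output):
    """Return (share of eligible scholars who deposited, share of output deposited)."""
    if eligible_scholars <= 0 or total_output <= 0:
        raise ValueError("benchmark totals must be positive")
    return depositors / eligible_scholars, items_deposited / total_output


# Hypothetical figures for a single institution, for illustration only.
scholar_rate, output_rate = participation_rates(
    depositors=120,
    eligible_scholars=800,
    items_deposited=450,
    total_output=3000,
)
print(f"{scholar_rate:.1%} of scholars, {output_rate:.1%} of output")
```

Both denominators are the hard part in practice: they require the detailed tracking of individual identities and output that the paragraph above describes.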
Another measurable facet of participation is the distribution of a repository's total content, per contributor. Would a simple analysis of the frequencies and distributions of authors and papers contributed offer a useful perspective on a digital repository? Because this measurement involves only what is already in a repository, would the data for analysis be simple to obtain from many repositories? The study described in the remaining portion of this article addressed the following three research questions:
A simple research methodology was devised, as described below.
The study was conducted using the following definitions, processes and guidelines:
Repository: The study focused on repositories containing mainly research papers suitable for publication in serial literature like scholarly journals, professional society newsletters, or related modes of dissemination. Repositories containing significant quantities of learning or instructional objects, institutional records, or other ancillary components of the research and teaching process were disqualified from analysis. Additionally, because Electronic Theses and Dissertations (ETDs) skew author:items ratios toward a 1:1 value, repositories containing ETDs were not considered.
Depositor: Though one cannot always assume the author of a paper is the same person who actually deposits it into a digital repository, this study assumes creators of research output generally self-deposit or authorize a proxy to do the task for them. The terms author, depositor and participant are used interchangeably in this article.
Items: Discrete manifestations of intellectual creation as described for the term "Repository" above. Typically, an item is equivalent to a research paper, technical report, or similar object.
Participation: Allowing one's intellectual creation to be deposited and made available through a disciplinary or institutional repository. This is assumed to be a conscious choice for all depositors.
Processes and Guidelines
1. Select a group of repositories for analysis.
2. Classify each repository for analysis as "voluntary-deposit institutional", "mandatory-deposit institutional" or "disciplinary".
3. Obtain reports from each repository on the total number of items in the repository, the total number of contributing authors, and the corresponding number of items created by each author.
4. Calculate author:contribution frequencies by repository.
5. Aggregate repository frequencies into voluntary, mandatory, and disciplinary groupings.
6. Analyze distribution patterns within and across categories.
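Steps 3 through 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tabulation actually used in the study: the repository names, author names and counts are hypothetical, and each repository report is assumed to be a simple mapping from author to number of items.

```python
from collections import Counter


def contribution_frequencies(items_per_author):
    """Step 4: map 'number of items' to 'number of authors with that many items'."""
    return Counter(items_per_author.values())


def aggregate(frequencies_by_repo, category_by_repo):
    """Step 5: sum per-repository frequencies into category groupings."""
    grouped = {}
    for repo, freq in frequencies_by_repo.items():
        grouped.setdefault(category_by_repo[repo], Counter()).update(freq)
    return grouped


def single_item_share(freq):
    """Step 6 (one possible measure): fraction of authors with exactly one item."""
    total = sum(freq.values())
    return freq[1] / total if total else 0.0


# Step 3: hypothetical author -> item-count reports, one per repository.
reports = {
    "repo_a": {"Author 1": 1, "Author 2": 1, "Author 3": 4},
    "repo_b": {"Author 4": 1, "Author 5": 2},
    "repo_c": {"Author 6": 1, "Author 7": 3, "Author 8": 3},
}
# Step 2: hypothetical classification of each repository.
categories = {
    "repo_a": "voluntary-deposit institutional",
    "repo_b": "voluntary-deposit institutional",
    "repo_c": "disciplinary",
}

by_repo = {name: contribution_frequencies(report) for name, report in reports.items()}
by_category = aggregate(by_repo, categories)
for category, freq in by_category.items():
    print(category, dict(sorted(freq.items())), f"{single_item_share(freq):.0%}")
```

Keeping the per-repository frequencies separate from the category totals allows the same data to be examined both within and across categories.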
During this study, the following issues had to be considered:
1. Identifying the scope, content and context of individual repositories is an imprecise and subjective exercise. Categorizing the types of content within any repository is difficult. Though the content of both institutional and disciplinary digital repositories certainly will be an important part of future "distributed libraries" (Brogan, 2006), even the most mature sites have trouble providing clear categories and indicators for measurement (Yeomans, 2006). The initial selection of archives for evaluation was one of the most time-consuming aspects of this study.
2. Most repositories do a poor job of maintaining standard forms of names for contributing authors, so the same author may be listed under multiple name variants and treated as separate people. Responsibility for name consistency in most repositories seemed to rest with the depositors themselves. This study could do little to correct for such variations within the 29,388 contributors listed by the repositories analyzed. However, as the results of this study will show, a large number of authors have contributed only one item to any particular repository, so perhaps the issue is not yet important enough for most repository managers to notice.
3. Multi-author papers are common in all of the repositories analyzed. Most repositories could provide a list of the total number of items deposited, and the total list of authors, but none were able to easily identify or count the total number of multi-author papers. This is a significant shortcoming in the reporting functions for repository software, because knowing such publishing patterns is key to understanding the impact of organizations and individual scholars. Customized reports would have been necessary to determine what percentage of each repository's total deposits are multi-author papers. Fortunately, this issue did not affect the research questions asked in this study, for an author's total number of contributions to a repository is still available, whether as sole author or as part of a group. However, knowing whether a paper has a sole creator or was authored by a group would be a useful metric for evaluation. Sale (2006) also identified this issue as a complication when the creators of a multi-author paper must decide where it should be deposited, especially if authors work for different organizations.
4. Only a small number of institutions require scholars to deposit copies of all their research output like papers and technical reports. Of five mandatory-deposit institutional repositories considered for this study, two were too small and another contained dissertations and other content that would have skewed distributions toward a 1:1 author:items ratio. Only two mandatory-deposit institutional repositories were consequently analyzed. The patterns identified within these two repositories indicate a possible and intuitively logical difference from disciplinary and voluntary-deposit institutional repositories, but the sample size is too small to be reliable.
5. The age of a repository, and the age of items it contains, can significantly confuse any analysis and comparisons of scholarly repositories.
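Issues 2 and 3 above can be illustrated with a short sketch. It assumes item-level records are available as lists of author name strings, which most repositories in this study could not readily supply. The matching key (surname plus first initial) is a deliberately crude, hypothetical rule that was not used in this study, and it would wrongly merge distinct people who share a surname and initial.

```python
from collections import Counter


def name_key(name):
    """Crude matching key: lowercased surname plus first initial.

    Assumes names look like 'Surname, Given' or 'Given Surname'.
    """
    name = name.strip()
    if not name:
        return ""
    if "," in name:
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname.lower()}, {given[:1].lower()}"


def summarize(items):
    """Return (number of multi-author items, items credited to each name key)."""
    multi_author = sum(1 for authors in items if len(authors) > 1)
    per_author = Counter()
    for authors in items:
        # A set ensures two variants of one name on the same item count once.
        for key in {name_key(a) for a in authors}:
            per_author[key] += 1
    return multi_author, per_author


# Hypothetical item records: one list of author names per deposited item.
items = [
    ["Smith, John", "Lee, Ann"],
    ["J. Smith"],
    ["Lee, A."],
    ["Garcia, Maria"],
]

multi_author, per_author = summarize(items)
raw_names = {a for authors in items for a in authors}
print(multi_author, len(raw_names), dict(per_author))
```

In this toy data the raw name strings suggest five authors with one item each, while the merged keys suggest three authors, two of whom have two items. This shows how name variants can inflate the apparent share of single-item authors.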
Analysis of the selected repositories yielded the following results:
1. Cumulative Findings
In the core group of 11 repositories, aggregated data showed:
The following sections detail the study's findings by each repository category, with separate analysis devoted to the AgEcon Search and arXiv.org repositories.
2. Voluntary-Deposit Institutional Repository Findings
Data from 6 repositories, collectively containing 14,829 deposited items by 18,326 authors, were analyzed. On an aggregated basis:
Nearly three-fourths (74%) of the authors listed in these repositories had each contributed only one paper to an individual digital archive. The remaining 26% of authors were each responsible for between 2 and 156 papers.
3. Mandatory-Deposit Institutional Repository Findings
Data from 2 repositories, collectively containing 5,920 deposited items by 4,167 authors, were analyzed. On an aggregated basis:
61% of the authors in these 2 repositories were responsible for only one title each. The remaining 39% contributed from 2 to 215 items. Compared to the data from voluntary-deposit institutional repositories, Figures 2a and 2b indicate a significantly greater tendency for authors listed in mandatory-deposit institutional repositories to have contributed more than one item to the repository. Expressed another way, these authors are much less likely to have only one item in an individual repository, and are more likely to be represented in the positive skew of the distribution. The distribution curve in Figure 2b shows more authors distributed on the positive side of the curve.
4. Disciplinary Repository Findings
Data from 3 disciplinary repositories, collectively containing 6,773 deposited items by 6,895 authors, were analyzed. On an aggregated basis:
Data from the three analyzed disciplinary repositories show the vast majority of authors (nearly 74%) are responsible for only one item. In a manner similar to the other repository categories discussed earlier in this article, the remaining 26% of listed authors contributed from 2 to 96 items, with the number of contributions per author dropping off rapidly after two papers to a "long-tail" distribution (Anderson, 2004).
5. Additional Analyzed Disciplinary Repositories Findings
In addition to the 11 repositories described in sections 1-4 of this study's findings, the authors collected and analyzed data from two other large, well-established disciplinary repositories. Because of each repository's size, longevity and other distinguishing characteristics, data from each archive was analyzed separately using the same techniques applied to the 11 core repositories. Following are the results of each analysis.
AgEcon Search Repository Analysis
The AgEcon Search repository contains research papers and reports in the broad field of agricultural economics. Author and contributions data was obtained with the cooperation of the repository's managers. On the date this repository was analyzed, it contained 24,569 deposited items by 19,700 authors. According to this data:
As with other digital archives examined for this study, the majority (nearly 65%) of authors listed in AgEcon Search have only 1 item in the repository. The remaining 35% of authors were responsible for between 2 and 166 items. Compared to the other disciplinary repositories described in section 4, however, AgEcon Search authors were significantly more likely to have 2 or more contributions listed.
arXiv.org Repository Analysis
The arXiv.org repository contains current research papers and related materials from the communities of researchers in Physics, Mathematics, Computer Science and Quantitative Biology. Author and contributions data was obtained with the cooperation of the repository's managers. Data from this repository were extensive and represented 406,857 deposited items by 105,131 authors. According to this data:
Administrators of arXiv.org were very responsive to requests for data, but warned us that categories such as "author", "depositor" and "item" are less clear in the complex group of research materials and depositors represented within this repository. Comparing arXiv.org data with other repositories introduced new areas of uncertainty into this study, and such comparisons should therefore be considered with caution.
Keeping this caveat in mind, one can see in the arXiv.org Figure 4b a pattern noticeably different from other analyzed repositories, with only 40% of listed authors having only 1 item in the repository. Nearly one-fourth (24%) of listed authors, in fact, are each responsible for 5 or more titles in arXiv.org, with the number of contributions per author ranging from 1 to 446 items.
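For side-by-side comparison, the aggregated totals and single-item shares reported in the findings above can be collected in one place. The sketch below only restates figures already given in the text; the shares are the rounded percentages quoted there. The items-to-authors ratio is offered as a rough indicator only, since multi-author papers may be counted once as items but credited to several listed authors.

```python
# (deposited items, listed authors, approximate share of authors with one item),
# copied from the aggregated figures reported in the findings above.
reported = {
    "voluntary-deposit institutional": (14829, 18326, 0.74),
    "mandatory-deposit institutional": (5920, 4167, 0.61),
    "disciplinary (3 core)": (6773, 6895, 0.74),
    "AgEcon Search": (24569, 19700, 0.65),
    "arXiv.org": (406857, 105131, 0.40),
}

# Rough indicator only; not the mean number of contributions per author.
items_to_authors = {
    name: items / authors for name, (items, authors, _) in reported.items()
}

for name, (items, authors, single_share) in reported.items():
    print(
        f"{name:32s} items/authors={items_to_authors[name]:5.2f} "
        f"single-item={single_share:.0%} multi-item={1 - single_share:.0%}"
    )

# Totals for the 11 core repositories (the first three groupings).
core = list(reported.values())[:3]
core_items = sum(items for items, _, _ in core)
core_authors = sum(authors for _, authors, _ in core)
print(core_items, core_authors)
```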
Discussion & Future Work
The findings presented in the preceding sections would tell a manager several useful facts about a repository, including:
According to data from the 11 core analyzed repositories, contributions to non-mandatory institutional repositories and disciplinary repositories can best be characterized as widespread but shallow. The distribution patterns for both of these datasets were surprisingly similar, and raise more questions. Would other disciplinary repositories show similar results? Unfortunately, obtaining this kind of data for analysis was very difficult; many repositories provide automated reports for online users to browse repository contents by year, topical group, or even author, but seem to have made an extra effort not to provide correlations between individual authors and their total number of papers deposited. Responses by many repository managers to requests for author:items reports indicated their concern over releasing such information. Nonetheless, examination of a greater range of disciplinary repositories was warranted, and led the authors to acquire and analyze the data described in this article for the AgEcon Search and arXiv.org repositories. Data from AgEcon Search tend to support the validity of the patterns discovered in the 11 core repositories. Data from arXiv.org show a very different pattern, but the reasons for this variance are likely explained by one of the many issues identified by the authors with counting and comparing figures without the benefit of common categories, terminology, or reporting standards among digital scholarly repositories.
As for mandatory-deposit repositories, the limited available data indicate authors represented in such repositories tend to contribute more of their intellectual output. Sale (2006) predicted institutions establishing deposit mandates were likely to see such results within three years of implementing these policies. Harnad (2006) cited surveys showing 95% of scholars comply if their university mandates depositing in an institutional repository. This study's findings only reinforce such predictions and arguments favoring institutional mandates. As the data in this article show, a mandate is arguably the "tipping point" described by Gladwell (2000) that can make depositing behavior among scholars not just widespread, but also more of an ingrained and complete behavior.
Mandates for Open Access and deposit are proliferating internationally, and are sure to create a noticeable impact upon institutional and disciplinary repositories in terms of the characteristics analyzed in this study. Intuitively, one would expect the average number of contributions per author to increase in many repositories. However, the overall number of participants is likely to increase also. What effect will this have in overall distributions and relative measures of participation? This question will be a fertile area for evaluation in the future. These studies, along with recent articles like Carr & Brody's (2007) report on deposit profiles among digital repositories, affirm the need for continued research such as:
This study revealed a heterogeneous universe of repositories. Many of these repositories defy easy classification into the groups defined for this study. Of the repositories listed in the OpenDOAR registry, many are less transparent than one might prefer when trying to analyze characteristics such as participant distributions. The large number of complications encountered during this study indicates the Open Repository community might benefit by endorsing some sort of standard set of harvestable reports for all repositories, similar to those emerging for other scholarly databases.
Despite the difficulties in categorizing repositories and their content, and obtaining needed datasets, the participant distribution analytical techniques used in this study were valuable for the new perspective they provided on individual and grouped repositories. At a time when too few measures of success are available for repository implementers, the patterns shown here should be of great interest to local repository managers. One also can imagine the additional precision and certainty that would be gained if these techniques and statistics were not tabulated manually for a limited group of repositories, but instead were part of routine analysis reports run on massive repository metadata harvests using the OAI-PMH protocol. However, for such efforts to be possible, many of the uncertainties involved in comparing repositories need to be addressed first.
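As a rough illustration of what such a routine report might involve, the sketch below tallies items per creator from an OAI-PMH ListRecords response in the oai_dc format. It is a sketch under stated assumptions rather than a tested harvester: the sample XML is invented, no network request is made, resumption tokens are not handled, and dc:creator strings are taken at face value, so the name-variant problem discussed earlier remains.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response, trimmed to the elements used below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>First paper</dc:title>
        <dc:creator>Author, A.</dc:creator>
        <dc:creator>Author, B.</dc:creator>
      </oai_dc:dc>
    </metadata></record>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Second paper</dc:title>
        <dc:creator>Author, A.</dc:creator>
      </oai_dc:dc>
    </metadata></record>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Third paper</dc:title>
        <dc:creator>Author, C.</dc:creator>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""


def items_per_creator(xml_text):
    """Count how many harvested records list each dc:creator string."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for record in root.iterfind(".//oai:record", NS):
        creators = {
            c.text.strip() for c in record.iterfind(".//dc:creator", NS) if c.text
        }
        counts.update(creators)
    return counts


per_creator = items_per_creator(SAMPLE)
distribution = Counter(per_creator.values())  # items -> number of creators
print(dict(per_creator), dict(distribution))
```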
Anderson, C. (2004). "The long tail." Wired (Oct.).
CNI, Executive Roundtable. (2003). "Summary report of the December 8, 2003 CNI Executive Roundtable on institutional repositories." <http://www.cni.org/projects/execroundtable/fall2003summary.html>.
Day, M. (2004). "Institutional repositories and research assessment: A supporting study for the ePrints UK Project." <http://eprints-uk.rdn.ac.uk/project/docs/studies/rae/rae-study.pdf>.
Garfield, E. (1983). "How to use citation analysis for faculty evaluations, and when is it relevant? Part 1." Current Contents 44 (Oct.), pp. 5-13.
Gladwell, M. (2000). The Tipping point: how little things can make a big difference. (NY: Hachette Book Group).
Kennan, M.A. & Wilson, C.S. (2006). "Institutional repositories: review and an information systems perspective." Library Management 27:4/5, pp. 236-248.
Kim, H. & Kim, Y. (2006). "An Evaluation model for the National Consortium of Institutional Repositories of Korean Universities." Presentation at ASIS&T Annual Meeting, Austin, TX.
Lopez-Fernandez, L., Robles, G., Gonzalez-Barahona, J.M. (2004). "Applying social network analysis to the information in CVS repositories." MIT Free/Open Source Research Community. <http://opensource.mit.edu/home.html>.
Sale, A. (2006). "The Acquisition of open access research articles." First Monday 11:10 (Oct.). <http://www.firstmonday.org/issues/issue11_10/sale/index.html>.
Swan, A., Needham, P., Probets, S., Muir, A., Oppenheim, C., O'Brien, A., Hardy, R., Rowland, F., and Brown, S. (2005). "Developing a model for e-prints and open access journal content in UK further and higher education." Learned Publishing 18:1, pp. 25-40.
Copyright © 2007 Chuck Thomas and Robert H. McDonald