Technical Debt as an Indicator of Library Metadata Quality
Metadata production is a significant cost for any digital library program. As such, care should be taken to ensure that metadata is created in a way that its quality does not become a liability to current and future library applications. Agile software development uses the concept of "technical debt" to quantify the ongoing costs associated with low-quality code and to determine resource allocation for strategies to improve it. This paper applies the technical debt metaphor in the context of metadata management, identifying common themes across the qualitative and quantitative literature related to technical debt, and connecting them to similar themes in the literature on metadata quality assessment. It concludes with areas of future research in technical debt and metadata management, and ways in which the metaphor may be integrated into other current avenues of metadata research.
Keywords: Metadata Quality, Technical Debt, Agile Development
Metadata production is a significant cost for any digital library program. As libraries and archives manage more digital content, maintenance costs of metadata describing that content, in the form of cleanup of purchased vendor records, conversions of metadata to different representations, and migrations across systems and applications, constitute a more significant share of these costs. Exacerbated by changes in metadata standards, tools for preservation and access, and institutional changes, over time these maintenance costs may begin to have a detrimental effect on the ability of metadata to meet the needs of discovery, access, and management functions of digital library operations.
Within the software development community, a metaphor of "technical debt" has emerged as a way of explaining these issues to non-technically-oriented stakeholders. Generally, technical debt refers to the deferment of long-term development priorities in order to enable short-term gains: for example, quickly pushing a demanded feature into production at the expense of building a sustainable, well-documented code base for the application. This debt can be paid back with future feature development or code re-writes; it also accrues interest over time, through maintenance of obsolete or poorly documented code, imposing costs on the ongoing development of a product. Parallels may be drawn between this and the work of building and maintaining metadata in a library catalog, as resources for technical services are tightened and metadata services units consider adopting agile software development methodologies.
This paper investigates the concept of technical debt as it relates to the management of metadata, particularly in the library environment. Definitions of the concept of technical debt will be presented and analyzed within the context of library technical services. Metadata can be thought of as the "codebase" necessary for properly functioning library discovery systems, institutional repositories, and collection management systems; the labor, or lack thereof, required to ensure sufficient metadata for a properly functioning system can be thought of as a down payment toward the relief of that technical debt. It is hoped that, through this analysis, functional approaches to resource allocation and advocacy for cataloging and metadata units in the library may be developed.
2 Literature review
2.1 Technical debt
The technical debt metaphor was first developed by Cunningham (1992):
Shipping first-time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite ... The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt...
Over time the metaphor was broadened and refined to include specific aspects of technical debt. McConnell (2007) identified two types of technical debt. Unintentional debt refers to code of low quality which does not function or functions inefficiently; this debt is repaid later via refactoring or other improvements. Intentional debt refers to code known at the time by its author to be sub-standard, but written into a project to meet a deadline or to fit within particular resource allocation parameters. Fowler (2009) expanded McConnell's distinctions further, additionally identifying prudent debt (debt incurred with self-awareness about either the risks of shipping sub-standard code or the un-applied knowledge gained with experience) and reckless debt (debt incurred due to ignorance of, or carelessness about, factors or variables going into a project).
More recent technical debt literature has attempted to apply the technical debt metaphor in practice, building metrics for its ongoing measurement in both academic and practical contexts. One measure of technical debt (Nugroho, Visser and Kuipers, 2011) focuses on what its authors term "repair effort," that is, a function of the lines of code and the amount of time required to rebuild an application using a particular technology or framework. This metric quantifies the quality of the code underlying an application, and assists developers in determining whether reworking existing code or re-writing it entirely is the best strategy given an existing level of technical debt within that application.
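The repair-effort idea can be made concrete with a small calculation. The sketch below is illustrative only: the rework fractions, quality ratings, and productivity rate are invented assumptions for demonstration, not the empirically derived figures of the Nugroho et al. model.

```python
# A simplified, illustrative sketch of a "repair effort" calculation in the
# spirit of Nugroho et al. (2011). The rework fractions and the rebuild rate
# below are invented for illustration; the published model derives them
# empirically from large code-quality benchmarks.

# Hypothetical fraction of a codebase needing rework at each quality rating
# (1 = lowest quality, 5 = highest).
REWORK_FRACTION = {1: 0.15, 2: 0.10, 3: 0.06, 4: 0.03, 5: 0.01}

LINES_PER_PERSON_MONTH = 3000  # assumed rebuild productivity


def rebuild_effort(lines_of_code: int) -> float:
    """Person-months to rebuild the system from scratch (assumed rate)."""
    return lines_of_code / LINES_PER_PERSON_MONTH


def repair_effort(lines_of_code: int, quality_rating: int) -> float:
    """Person-months to bring an existing codebase up to the top rating."""
    return rebuild_effort(lines_of_code) * REWORK_FRACTION[quality_rating]


if __name__ == "__main__":
    loc = 300_000
    for rating in (1, 3, 5):
        print(f"rating {rating}: {repair_effort(loc, rating):.1f} person-months")
```

Comparing the repair effort against the full rebuild effort is what lets a team decide, as the authors describe, whether reworking existing code or rewriting it entirely is the better strategy.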
Two recent comprehensive literature reviews on technical debt shed light on attributes and types of technical debt emerging from the literature on the subject. Tom, Aurum and Vidgen (2013) identify and define five primary types of technical debt; these types of debt and their definitions are shown in Table 1. They further identify underlying reasons driving the growth of technical debt in an organization; these include intentional reasons such as prioritization of one task or feature over others, as well as unintentional reasons such as inefficient workflows and processes, and oversight or ignorance of important dimensions of code or its environment, that lead to technical debt in the long-term.
Table 1: Types of technical debt with their definitions, as identified by Tom et al. (2013).
In a later literature review, Li, Avgeriou and Liang (2015) expand on the types of debt identified by Tom et al., and additionally identify five "notions" of technical debt, to which they ascribe various properties of the concept; the Li et al. typology of technical debt is shown in Table 2.
The study further identifies eight activities related to the management of technical debt in an organization, as well as surveying tools in use in the software development community to manage these activities. They are, in descending order of level of tool support: identification, measurement, communication, monitoring, prioritization, repayment, representation, and prevention. Identification is by far the most supported activity. The study also surveys these tools in the context of the types of technical debt for which they support management; by this measure, code and design debts are the most well-supported types of technical debt.
Table 2: Types of technical debt and their definitions, as identified by Li et al. (2015).
Both studies identify the need for further research into a taxonomy of technical debt items and their relationship to one another, as a way to better identify and address it in an organization. They also acknowledge the need to take an expansive (though not too expansive, as Li et al. note) view of the technical debt phenomenon; though it originated as a specific metaphor of code quality (Cunningham's "not-quite-right code"), it may also include infrastructure issues and issues relating to people, development processes, and documentation; these issues, when viewed in relation to code quality, are under-represented in the existing software development literature on technical debt.
2.2 Cataloging backlogs and metadata quality
Two measures of technical debt are recognizable in the literature on library metadata management. The first is a library's cataloging backlog, generally constituted by the amount of uncataloged library materials as a percentage of the total number of materials in the library collection. Also termed "arrearages," the concept first appeared in the literature in 1951, but it was Piternick (1969) who first diagnosed a cause of backlogs as increases in expenditure on library materials without a corresponding increase in resources allocated for processing, and identified two types of backlog: materials delayed in processing for unknown reasons and those "deliberately segregated for deferred treatment." In a survey of Association of Research Libraries (ARL) members regarding attitudes toward backlogs, Agnew et al. defined an arrearage as "a backlog of uncataloged monographs/books" (Agnew, Landram and Richards, 1984); this survey is also one of the earliest mentions of the role of technology in significantly reducing or even eliminating cataloging backlogs. Camden and Cooper (1994) further distinguished between "active" backlogs (the result of materials receipt rates outpacing the ability of cataloging units to describe them, and from which materials are regularly withdrawn for cataloging) and "inactive" backlogs, "from which no items are removed ... that grows excessively ... [and] should be a concern of cataloging managers."
Recent studies have indicated that advances in technology have, in general, reduced the reported volume of cataloging backlogs in academic and research libraries. However, the quality of that metadata, particularly as more library metadata production is outsourced to vendors over whom the library has little control, is another measure of technical debt. Metadata quality has a number of definitions. A discussion of these definitions begins with Bruce and Hillmann (2004), who identified the "continuum" of library metadata quality. Guy, Powell and Day (2004) define metadata quality as "fitness for purpose," that is, "support[ing] the functional requirements of the system it is designed to support." Recognizing the increasing abundance of metadata in library systems not created by catalogers (vendor-supplied batch imports, metadata entered directly by content creators), Guy et al. call for an increased role for cataloging and metadata librarians in evaluating metadata quality and undertaking enhancement and maintenance activities based on these metrics. Bruce and Hillmann (2004), in a contemporary effort at defining metadata quality, measure it against seven characteristics and at varying "tiers" of quality assurance, beginning with simple validation against a schema and scaling upward to consistent use of controlled vocabularies, adherence to agreed-upon community standards, and indicators of provenance.
Stvilia et al. (2007) developed an Information Quality (IQ) assessment for library metadata based on 32 representative items, organized into 22 dimensions categorized into three high-level types of information quality: intrinsic (quality inherent to the metadata), relational (quality reliant on the situational context in which the metadata is used), and reputational (quality reliant on the source of the metadata). This framework was later expanded to quantitatively measure the value of metadata quality itself, linking changes in the quality of metadata with changes in its value to different user communities (Stvilia and Gasser, 2008). Park (2009) provides a summary of the concurrency of the Bruce-Hillmann and Gasser-Stvilia frameworks. Park further identifies completeness, accuracy, and consistency as foundational criteria in these and other studies of metadata quality in the library and information science literature, and highlights clear metadata guidelines (both in the form of data dictionaries and instructions for metadata creators) and processes for automated metadata creation as means to improve metadata quality in digital repositories.
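Criteria such as completeness and consistency lend themselves to simple automated checks. The sketch below is a minimal illustration, assuming a flat dictionary record structure, an invented set of required fields, and an invented controlled vocabulary; none of these come from the frameworks cited above.

```python
# A minimal sketch of two of the foundational quality criteria Park (2009)
# identifies -- completeness and consistency -- computed over simple
# dictionary-based records. Field names and vocabulary are illustrative.

REQUIRED_FIELDS = {"title", "creator", "date", "type"}
TYPE_VOCABULARY = {"text", "image", "sound", "dataset"}  # assumed controlled list


def completeness(record: dict) -> float:
    """Fraction of required fields present with non-empty values."""
    present = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return present / len(REQUIRED_FIELDS)


def consistency(record: dict) -> float:
    """1.0 if the 'type' value comes from the controlled vocabulary, else 0.0."""
    return 1.0 if record.get("type") in TYPE_VOCABULARY else 0.0


records = [
    {"title": "Annual report", "creator": "Acme Co.", "date": "1999", "type": "text"},
    {"title": "Campus photo", "date": "2003", "type": "Photograph"},  # vendor-supplied
]

for r in records:
    print(completeness(r), consistency(r))
```

Scores like these, tracked over time across a repository, are one way the "enhancement and maintenance activities based on these metrics" that Guy et al. call for could be prioritized.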
3 Applying a technical debt analysis to library metadata management
Before building a portfolio of technical debt to be managed in library metadata services, it is necessary to identify properties of this debt, how they may be incurred (both intentionally and unintentionally), and how they may be measured. This analysis uses the Tom et al. literature review as a starting point, with the additional inclusion of the requirements debt typology from the Li et al. literature review. These are considerations to take into account when identifying technical debts in a metadata management environment, as well as strategies for limiting their effects and avoiding unintentionally incurring more debt. Examples of debts taken on intentionally and unintentionally will be provided for each of the debt types.
3.1 Code debt
Tom et al. define indebted code as poorly-written, duplicated elsewhere in the application, illogical, or otherwise written as a workaround of underlying architectural limitations. Li et al. extend the definition to include code written in violation of existing rules or best practices. Code debt in the metadata context may include cataloging done outside of the confines of our cataloging rules (RDA, DACS, CCO, etc.), or in violation of established metadata specifications such as MARC21. These practices may be due to limitations imposed by the library cataloging systems in place; a practical example includes genre headings outside of the Library of Congress Genre/Form Terms which do not index properly when cataloged with the source indicated in subfield $2 of the 655 field.
Intentional metadata code debt may be taken on with the intention of correcting it later. Cataloging around known issues in a metadata management application, in which out-of-spec metadata decisions are made in order to put resources in circulation, may be noted and corrected later when the vendor pushes a fix for the issue into production. Unintentional metadata code debt includes errors made unknowingly in the course of cataloging; this could be as simple as a typo found in a text note field, or as complex as a vendor record load in which the same mistake appears hundreds or thousands of times throughout the batch. Often these debts are not noticed until well after the records are pushed into production. These debts may evolve over time, due to changes in the cataloging policy of the vendors supplying the records; more or less metadata may be created and supplied to libraries, or certain data properties may be included in different MARC data fields or subfields.
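A batch-level error of the kind described above, where the same mistake appears hundreds or thousands of times, can often be caught before records reach production by looking for values that repeat across a suspicious share of a load. The sketch below is a hypothetical illustration; the field names, records, and threshold are invented for the example.

```python
# A sketch of surfacing a recurring vendor-record error before a batch
# reaches production: flag any field value that repeats across more than a
# given share of the load. Field names and the threshold are illustrative.
from collections import Counter


def repeated_values(records, field, threshold=0.5):
    """Return values of `field` appearing in more than `threshold` of records.

    A value duplicated across most of a batch (e.g. the same malformed date
    in every record) is a likely systematic vendor error worth review.
    """
    counts = Counter(r.get(field) for r in records if r.get(field))
    cutoff = len(records) * threshold
    return [value for value, n in counts.items() if n > cutoff]


# Simulated vendor load: 95 records share the same malformed date.
batch = [{"date": "19uu"} for _ in range(95)] + [{"date": "1999"} for _ in range(5)]
print(repeated_values(batch, "date"))
```

Running a check like this at load time converts an unintentional debt, discovered well after records are in production, into one that is identified and scheduled for repayment up front.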
3.2 Design and architectural debt
Tom et al. define this type of debt to include code written to fit the public display of an application, without thought to long-term sustainability of the code base. Li et al. distinguish between these two types of debt, defining "design debt" as shortcuts in underlying application design, and "architectural debt" as decisions which adversely affect maintainability of the application or sustainability of the codebase. Such debts, in a metadata management context, often have little to do with day-to-day work; they instead manifest at the level of the systems and applications used to manage metadata for library resources, and/or at the level of standards development and data modeling for the cultural heritage domain (including libraries, archives, and museums). Because these debts are frequently incurred at such a high level, their causes are complex, and frequently out of the control of the day-to-day work of metadata specialists.
A library may choose to take on intentional metadata design and architectural debt when they make metadata decisions based on the discovery system in use at the institution, rather than designing the discovery system to fit the underlying data. The reasons for this are similar to those for taking on "code" debt. Unintentional design and architectural debt is often the result of system migrations or changes to the underlying content standard in use in a metadata application. The underlying data model in the new system to which a library is moving its cataloging activity may not exactly match the underlying data model of the old system, causing metadata specialists to spend time determining how to match data properties across systems when they are not represented the same way. Alternately, changes in content standards may require expensive manual maintenance and remediation, such as moving from general material designators in Anglo-American Cataloguing Rules (AACR2) to Resource Description and Access (RDA) content, media, and carrier type fields.
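The field-matching problem that arises in system migrations can be thought of as a crosswalk: a mapping from the old system's fields to the new data model, with anything the mapping does not cover set aside for a metadata specialist's judgment. The sketch below is hypothetical; the field tags and property names are invented for illustration.

```python
# A sketch of the crosswalk problem raised by system migrations: map an old
# system's fields to a new data model and flag unmapped fields for manual
# review. All field tags and property names here are hypothetical.

CROSSWALK = {
    "245a": "title",        # old field -> new property
    "100a": "creator",
    "260c": "date_issued",
}


def migrate(record: dict) -> tuple:
    """Map a record across the crosswalk; return (new record, unmapped fields)."""
    new_record, unmapped = {}, []
    for field, value in record.items():
        if field in CROSSWALK:
            new_record[CROSSWALK[field]] = value
        else:
            unmapped.append(field)  # needs a metadata specialist's judgment
    return new_record, unmapped


old = {"245a": "Annual report", "100a": "Acme Co.", "655a": "Periodicals"}
new, todo = migrate(old)
print(new)   # mapped properties
print(todo)  # fields the crosswalk does not cover
```

The size of the "unmapped" list across a full export is one rough gauge of how much design and architectural debt a migration is about to incur.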
3.3 Environmental debt
Tom et al. define environmental debt as that which appears "in the application environment, and/or in the processes and applications leading to code development." They specifically cite workflows, accountability (or its absence), and postponement of software upgrades as factors contributing to environmental technical debt. Li et al. separate environmental out into build, infrastructure, versioning, and defect debt, all contributing in one way or another to a sub-standard environment in which software development happens (rather than the software development itself).
In a library metadata management setting, environmental debt is typically due to breakdowns in communication between the technical services unit and its collaborators throughout the institution. Between technical services and systems, breakdowns in communication of cataloging user needs may occur, or a failure to communicate how indexing, search, and display features in the discovery application interact with the underlying metadata in the catalog, resulting in sub-standard search and discovery experiences for both catalogers and end users of the system. Between technical and public services, there may be misunderstandings of the role of one another's work in developing a discovery and resource management infrastructure, leading to over-emphasis of one at the expense of the other in the metadata management system. This type of debt can happen unintentionally through time constraints preventing regular communication among library teams (particularly if significant responsibility for system management is in the hands of an offsite vendor), or intentionally through insufficient allocation of staff time and resources to systems, technical services, and/or user experience research.
3.4 Documentation debt
Tom et al. define documentation debt as "the ability to pass knowledge of the code down as developers leave or move on to other projects." Documented code and workflows keep this debt low; their absence allows it to accumulate. Documentation of metadata practices in libraries is a two-fold issue: the standards themselves, and additionally the local practices and rules which complement these standards, may each be documented to varying degrees.
3.5 Requirements debt
Li et al. define requirements debt as "[the] distance between the optimal requirements specification and the actual system implementation, under domain assumptions and constraints" (the term is absent from the Tom et al. literature review). This type of debt is perhaps the most complicated to identify, measure, and manage in a library metadata setting, due not only to the number of use cases a library metadata system must address, but also to the number of different user classes that may be expected in a system. Libraries encounter users with varying levels of familiarity with our catalogs and databases; with a wide variety of backgrounds, from tenured professors to grade school students and everything in between; and with a broad spectrum of desired outcomes for their resource discovery behaviors. Metadata specialists themselves are a user class in their own systems, from the standpoint of having metadata creation, maintenance, and management tasks they need to complete in the course of their daily work. A comprehensive metadata technical debt management approach must take all of these factors into account in order to meet the "optimal requirements specification" for the metadata management system.
3.6 Further discussion
Several factors come into play when considering how a technical debt analysis may be made in a code production environment.
Analysis of technical debt in any domain requires constant monitoring of the existing codebase in order to ensure that it continues to meet the needs of all stakeholders (both end users and code managers). To this end, libraries wishing to limit their technical debts in metadata management should consider an ongoing program of metadata reviews for quality, ability to meet user needs, and suitability for various maintenance tasks (migrations, conversions to new data models and/or standards, etc.).
4 Limitations of the technical debt metaphor
Li et al. noted a handful of contributions to the software development literature pointing out limitations of the technical debt metaphor as a useful analytical tool. Allman (2012) notes that because technical debt is incurred in a group or organizational setting, often the person making a decision to incur the debt is not the person who will ultimately pay it back; technical debt payments are frequently paid by later developers, other units within an organization, or the end users of a service or application. Consequently, there are incentives to take on technical debt that are missing from financial debt. Furthermore, Schmid (2013) notes that there is not a standard unit of measurement for technical debt, comparable to the principal or balance of a financial debt, a problem exacerbated by the variety of facets of technical debt and the indirect methods of measuring them (e.g. when making a short-term decision to incur technical debt, it is impossible to know how much debt is being incurred without knowledge of future conditions of the project).
One potential limitation of a technical debt approach to managing metadata in libraries and archives is the difference in ultimate goals between the activities of software development and metadata management. Much of what has been written on the subject of technical debt in the computer science and software development literature is predicated on the software release as the single, discrete event around which debt accumulates. A high principal on that debt, or excessive "interest payments" made on that debt through code re-factoring or re-writes, can postpone the release of a software product, or even prevent it entirely. Metadata management is not built around such events; the metadata management workflow is commonly understood as a continuum of activities meant to gradually increase the size of the graph of bibliographic data managed by an institution. A "technical debt" in this case may be thought of as a cataloging backlog, or an excess of sub-standard metadata requiring maintenance on the part of a cataloger.
5 Future directions for research
The most prominent short-term trend in technical services will be the move to BIBFRAME as the dominant cataloging standard, and to RDF as the underlying data model for library resource description. Conversion of library metadata to BIBFRAME and RDF represents an opportunity to clear existing technical debt (through developing cataloging efficiencies by adopting a linked data approach to metadata management); however, it can also be a potential source of more technical debt, if processing workflows are not well-documented and if knowledge of the new models and standards is not evenly distributed across the library. Future research regarding BIBFRAME/RDF migrations and conversions in libraries will present interesting and valuable case studies in how large-scale metadata conversion and remediation projects can take a technical debt approach to limit backlogs, maintain and improve library quality metrics, and promote greater efficiencies in metadata creation and management.
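At its simplest, conversion to an RDF data model means re-expressing flat record fields as statements in a graph. The sketch below uses plain tuples rather than an RDF library so it stays self-contained; the base URI and predicate names are invented for illustration and are not BIBFRAME terms.

```python
# A sketch of a flat record converted into RDF-style triples, using plain
# (subject, predicate, object) tuples so the example is self-contained.
# The URIs and predicate names are invented, not drawn from BIBFRAME.

BASE = "http://example.org/resource/"
TERMS = "http://example.org/terms/"


def to_triples(record_id: str, record: dict) -> list:
    """Express each field/value pair as a (subject, predicate, object) triple."""
    subject = BASE + record_id
    return [(subject, TERMS + field, value) for field, value in record.items()]


triples = to_triples("b1234", {"title": "Annual report", "date": "1999"})
for t in triples:
    print(t)
```

Even in this toy form, the conversion illustrates where new debt can creep in: every mapping decision (which predicate, which URI pattern) is a design choice that must be documented if the resulting graph is to remain maintainable.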
An emerging trend in both library technology and technical services is the embrace of project management strategies and philosophies, including Agile, for the roll-out of bibliographic cataloging production (Thompson, 2015). As case studies and literature emerge from this trend, its measured impact on cataloging backlogs and metadata quality will be of interest to metadata managers, and will shed light on the relevance of the technical debt metaphor as a metric for these variables. Study and analysis of workflows, staff training, and resource allocation in technical services units in libraries may also identify risk areas for incurring various types of metadata technical debt, such as metadata management and resource discovery goals out of alignment, or communication breakdowns between stakeholder groups.
Finally, there is a lack of tools and applications able to holistically identify, measure, and address technical debt in any domain. Within computer science and software development, there are strategies and applications to address specific aspects of technical debt (code QA, pair programming, etc.), but nothing can yet do so in a systematic way. A technical debt approach to managing metadata in libraries (identifying types of technical debt and examples of these types in practice, and developing metrics for their measurement and strategies to pay them down) can provide insights, in the form of use cases and variables to take into consideration, toward the development of a metadata technical debt application to make the process of managing such debt easier for a wide variety of technical services units.
The author would like to thank Jason Clark, Mark Matienzo, and Anna Neatrour for comments and revisions which have improved the quality of this paper.
About the Author
Kevin Clair is the Digital Initiatives and Metadata Librarian at the University of Denver. He manages technical services for Special Collections and Archives, including metadata management, digitization, and digital asset management. He has nine years of experience in metadata management for digital repositories in academic libraries, first at Penn State University and now at Denver. His research interests include building linked data infrastructure for small cultural heritage institutions. He holds an MSLS from the University of North Carolina at Chapel Hill and a BA from Carleton College.