Volume 22, Number 7/8
Preservation Challenges in the Digital Age
Deakin University, Geelong, Australia
The digital preservation field is evolving rapidly. Focal areas are changing and best practices are still under debate. Preservation efforts must address not just preservation of the technologies of the past, but also those of the future. The rapidly increasing volume of data requiring preservation makes other digital preservation challenges inherently larger and more complex. The shorter lifespan of digital materials also makes the need for timely and effective preservation action more urgent. This article describes what the author sees as the current major challenges in digital preservation, and covers a range of technical, administrative, legal and logistical aspects.
The growing pervasiveness of digital media brings with it a number of preservation challenges. Digital materials are more at risk than analogue due to their shorter lifespan. Preserving digital materials is not just a matter of preserving files, but also providing access to the material and ensuring that the infrastructure that renders the file is preserved or replicated in some way. There's also a growing recognition that for some kinds of material, such as games and digital artwork, preservation should attempt to replicate not just the material itself, but the original user experience (Rieger, 2015).
The shorter lifespan of digital materials makes the need for timely and effective preservation more urgent. One of the major differences between digital and analogue preservation is that digital requires more active intervention throughout a material's lifecycle, and at a much earlier stage. You can't just "set and forget" with digital media, as you largely can with analogue. This applies to both the files themselves and the infrastructure to render the files.
The digital preservation field continues to evolve rapidly, with focal areas changing and best practices still under debate. While relying on currently accepted best practices and established standards is indeed a good practice, it is not an infallible preservation strategy (MIT Libraries, 2012a). Five years ago, migration was possibly the most commonly used preservation strategy; but more recently, emulation has been gaining more traction, particularly with the Internet Archive's browser emulator for computer games (Scott, 2016). Normalisation is also becoming more common, although it is not recognised as a permanent solution (MIT Libraries, 2012b). We can only guess what the future might hold.
The optimal preservation strategy for individual organisations will differ according to their requirements, resources and data type. Each strategy comes with its own set of challenges, many of which are dependent on, or impacted in some way by, other challenges. This article will cover what the author sees as the major challenges for digital preservation at this point in time, covering a range of technical, administrative, logistical and legal aspects.
2.1 Data volumes
The world is becoming increasingly digitised, with more and more sources of data needing preservation. The sheer volume of data is a challenge that impacts most of the other challenges mentioned in this article, thus making them inherently larger and more complex. The costs of storage, downloading and ongoing maintenance, for example, are much more than they'd be with smaller volumes of data.
While storage is becoming cheaper, this doesn't mean every file and every version of it can and should be stored. Deciding what should be preserved and when to take preservation action becomes more complex with a larger volume of data and a wider range of storage media. This in turn increases the risk of failing to preserve items that will one day turn out to be of historical value. There is also a higher risk of data becoming undiscoverable because of poor metadata.
Large volumes of data cannot be handled on a one-by-one basis, as can analogue materials. While the extent to which digital material can be handled at an individual level will vary according to purpose, funding and so on, in general, the preservation methodologies applied to large amounts of data will need to be scaled to fit the needs and objectives of the preserving institution. More data will inevitably mean increased reliance on automation and the development of new workflows to handle the data. Along with that comes the need for increased computer power and storage.
Large volumes of raw data require tools to scale or present it in more manageable forms. This is likely to lead to expectations that the preserving institution will also provide the infrastructure to mine or manipulate the data, much as they already do with existing digital materials, such as newspapers. Provision of such infrastructure may or may not be a viable scenario for the preserving institution, but its provision will contribute to increased costs and resourcing requirements.
One of the current realities of increasing data volumes and storage expectations is that preservation software, front-end software or the storage system itself may be unable to cope with large files, multiple versions of the same file, or just with large volumes. This may require new infrastructure and software.
One of the most fundamental challenges in archiving is determining what should be preserved. It is not feasible to preserve everything. Nor is it even possible to preserve some materials at all, due to their poor condition, security classification or culturally sensitive nature (Driscoll, 2006). How long should items which are considered worthy of preservation be stored? The future is largely unknown: material thought trivial and useless today may come to be considered valuable in time, by which point the material, and possibly any trace that it ever existed, may be long gone. Conversely, materials may be preserved that in the course of time turn out to be irrelevant, in which case all the time, effort and resources expended to preserve them are wasted (Lavoie, 2004). So, archivists can be damned if they do and damned if they don't!
A related challenge is the extent of preservation. Should just the intellectual content of a work be preserved? The look and feel as well? And when it comes to providing access to the preserved material, should the aim be to replicate purely the original user experience? Or the original experience enhanced with current technology advances? The answers to these questions will vary according to the institution, but decisions in this regard need to be made with due regard given to possible future requirements and expectations.
We won't know if selectors have 'guessed right' for years; and may never know all that has been lost.
Materials born digital today are likely to have multiple copies in multiple versions stored in multiple locations, possibly under multiple filenames and in multiple file formats. This is especially so for images and videos. Photos taken on mobile devices, for example, may be automatically stored in iCloud, copied to Facebook, perhaps Flickr or Instagram; and later transferred to a desktop computer from where they are automatically copied over to Google Photos. These are just a small sample of the possibilities. While best practice would be for the original file to be considered the primary version and preservation master, in some situations it may not be clear-cut which file is the original. Making the wrong choice may result in having archival images of poor quality.
Uploading photos and videos to external sites often results in some loss of quality if the uploaded files are saved in a compressed, lossy format. If those compromised files then subsequently get loaded to other sites, quality may be compromised even further. Again, this increases the risk that the files chosen as archival files may be the poorer quality files. While these photo and video examples are more likely to happen in personal settings, they could also occur in institutional settings.
A similar challenge faces institutional repositories. Research papers written by multiple authors may be stored in different repositories in different versions (e.g. pre-print, published version, postprint). Versions may be labelled inconsistently in different repositories, and there is also a risk that changes made to one version of a file will not be copied across to all repositories. Again, this increases the risk of data loss and undermines the authenticity of data held in repositories. The larger volume of digital materials and the existence of diverse copies also have resource ramifications.
Multiplicities do have some advantages; there's less chance of an item being lost forever when other copies exist, albeit of a lesser quality or of dubious authenticity.
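One modest way to get a handle on multiplicities is to group files that are byte-for-byte identical, whatever they are called and wherever they sit. The following is a minimal Python sketch of that idea, using SHA-256 checksums. The function names (`sha256_of`, `group_identical_copies`) are hypothetical, and the approach only finds exact copies: a photo that has been recompressed by an external site will have a different checksum and will not be grouped with its original.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical_copies(root: Path) -> dict:
    """Map checksum -> list of paths, keeping only checksums seen more than once."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    # A checksum with a single path is not a duplicate, so drop it.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `group_identical_copies(Path("some/folder"))` returns one entry per set of identical files. Deciding which member of a set is the original, and what to do with near-duplicates of lesser quality, remains a curatorial judgement that a checksum cannot make.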
2.4 Hardware & storage
Like software, hardware is prone to obsolescence, but also to mechanical failure. Hardware may be damaged by carelessness, neglect, overuse, or inappropriate storage. Batteries may be left in place during storage and cause unintended damage, not just to the hardware itself but also to any media which may have been left in the hardware.
Digital media, such as floppy disks, USBs and hard disks, are more vulnerable to deterioration and obsolescence than analogue media. Research into the average life of digital and analogue media indicates that digital media has an average lifespan of 3-50 years, while analogue has a range of 10-2200 years (Atos, 2014). This means that preservation action needs to happen much earlier in the lifecycle for digital media than it does for analogue.
The care of digital media provides a further challenge. Obviously the files themselves should be copied onto sustainable media, in line with good practice; however, if the original media itself is retained (as best practice also dictates it should), it will need ongoing maintenance in terms of storage, cleaning, protection from magnetic fields, etc. This is the case whether or not the media will be in active use.
An increasingly pervasive means of data storage today is the cloud. While the cloud was not designed for archiving, the reality is that it is used for this purpose by some smaller institutions. Preserving items via third parties does pose a higher risk of loss should something go wrong. The third party may go out of business (Nirvanix and Megacloud are two such examples), so an exit strategy needs to be set up for such scenarios (Beagrie, 2014). There are additional copyright, licensing and security issues with cloud storage, and privacy of personal data may also be more at risk.
Rapid developments in application software and their underlying operating systems pose a challenge for digital preservation on several levels. Files may not render correctly on versions of software other than those they were designed to work on. While backwards compatibility is usually built into new software versions (at least for the most recent versions), this isn't a given. Older and deprecated software is also more prone to security vulnerabilities.
While software may still be in use for some years after its deprecation, files relying on that software may become increasingly inaccessible. This is less of an issue with open formats, which tend to be more frequently supported by multiple software applications.
Open file formats do make it easier to create software to render those formats, but the reality is that files don't always comply exactly with their file format specifications (De Vorsey, 2010). Software may also not necessarily support every feature of a particular file; for example, a word processing document may rely on a font that hasn't been embedded within the document and is not available with the rendering software. An online exhibition may rely on a plugin that hasn't been updated to work on the underlying software. For some types of content, such as interactive art installations, the execution environment needs to be preserved to render the content as intended (Atos, 2014).
Software needed to render digital materials may itself require preservation. A particular challenge here is that the underlying code may be unavailable. This makes it more difficult to port the software to current hardware or operating systems.
It may also not be possible to provide online access to some digitised works, depending on the supporting hardware and software requirements, which means that the work may be accessible only on-site at the preserving institution.
2.6 File formats
File formats have long been considered one of the biggest risks in digital preservation. However, this has not proven to be the overwhelming danger that it was initially perceived to be (Digital Preservation Coalition, 2015). In large part, this is due to the availability of open file formats, resulting in the formats being supported by more software applications.
Proprietary file formats continue to pose a challenge, as their specifications are less likely to be openly available. Making software compatible with such formats, or converting the files themselves into a more open format can only be done with permission from the patent holder. This complicates long term preservation of such files, as the files may not be able to be migrated or normalised to a more accessible format. To keep such files accessible, the software that renders the file may also need to be preserved; which in turn brings its own set of issues.
Many files deviate in some way from their official specification (De Vorsey, 2010), so even if the specification is available, it may not necessarily be possible to convert the file to an open format. Additionally, not all file formats are suitable for long term preservation, even if they have an open specification. Some lossy and compressed file formats pose a higher risk of total loss if even a single bit is lost.
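The single-bit point can be illustrated with a small experiment. The Python sketch below flips one bit in an uncompressed byte string, then flips each bit in turn in a zlib-compressed copy of the same data and counts how often the original can no longer be recovered. zlib is used here only as a convenient stand-in for compressed formats in general; it is an assumption made for the demonstration, as are the helper names `flip_bit` and `recovers` and the sample data.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def recovers(stream: bytes, expected: bytes) -> bool:
    """True if the compressed stream still decompresses to the expected bytes."""
    try:
        return zlib.decompress(stream) == expected
    except zlib.error:
        return False


# Hypothetical sample data: 500 short text records.
original = b"".join(b"record %d: preserved item\n" % i for i in range(500))
compressed = zlib.compress(original)

# Uncompressed copy: one flipped bit damages one byte; the rest stays readable.
damaged_plain = flip_bit(original, 4000)
bytes_changed = sum(a != b for a, b in zip(original, damaged_plain))

# Compressed copy: try every possible single-bit flip and count the failures.
total_bits = len(compressed) * 8
unrecoverable = sum(
    not recovers(flip_bit(compressed, i), original) for i in range(total_bits)
)

print(f"uncompressed: {bytes_changed} byte differs after one bit flip")
print(f"compressed: {unrecoverable} of {total_bits} single-bit flips lose the data")
```

In the uncompressed case the damage stays local. In the compressed case, a damaged bit typically prevents the whole stream from being decoded or verified, which is the kind of risk the paragraph above describes.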
Some types of digital media have a generally agreed archival format; TIFF is the accepted format for images, for example. However, not all media types have an archival format, including videos (Library of Congress, 2015). While this issue will likely resolve over time, preserving institutions must in the meantime use their best judgement about what preservation file formats to use in such cases.
Metadata is probably the most important aspect of digital preservation. Materials with poor metadata may be undiscoverable, their authenticity unverifiable and their context unclear. Thus, they may not be as usable as they otherwise would be. Inadequate or missing structural metadata will also impact on rendering. Preserving materials without good metadata is pretty much the same as throwing them away; along with all the resources expended in 'preserving' them.
It is important to gather metadata at time of creation if at all possible, as some context may be lost over time. The importance of this has been recognised within the research field, with more effort now being put into encouraging researchers to create data management plans at the start of projects (Australian National Data Service, [n.d.]).
Some file formats support inbuilt metadata, e.g. TIFF and PDF. However, often only a very limited set of descriptive fields is supported, and sometimes the metadata is incorrect or absent altogether. PDFs, in particular, are at risk of containing poor or inaccurate inbuilt metadata. A very common issue, in my experience, is the 'author' name being that of the person who converted the file to PDF rather than the actual author of the content. Bad metadata like this may have a detrimental impact on the long-term findability of the file. Having said that, my experience has been that Google is still able to retrieve such files, as long as the PDF itself is searchable.
The act of preserving a file may alter it in some way that ultimately impacts its rendering or authenticity. It is important that any changes of any kind made to a file or its derivatives are well documented in the metadata, e.g. via a PREMIS datastream (Digital Preservation Coalition, 2015).
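As a rough sketch of what documenting a change might look like in practice, the Python example below records a single preservation action together with checksums of the content before and after it. The record structure and field names are hypothetical and deliberately simplified; they are not PREMIS element names, and a real implementation would follow the PREMIS data dictionary or whatever schema the institution has adopted.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def record_change(before: bytes, after: bytes, action: str, agent: str) -> dict:
    """Build a simple, hypothetical event record describing one change to a file."""
    return {
        "action": action,
        "agent": agent,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256_before": sha256_hex(before),
        "sha256_after": sha256_hex(after),
        "content_changed": before != after,
    }


# Example: normalising Windows line endings in a text file alters its bytes.
before = b"line one\r\nline two\r\n"
after = before.replace(b"\r\n", b"\n")
event = record_change(before, after, "normalise line endings", "hypothetical-ingest-script")
print(json.dumps(event, indent=2))
```

Keeping both checksums means a later reader can confirm which version of the file they hold and see that the difference between versions was intentional rather than the result of corruption.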
Digital preservation presents some complex legal issues, well beyond those that apply to analogue material. Generally, preservation of analogue material doesn't involve the exercise of the copyright owner's right of reproduction. That's not the case with digital material (Driscoll, 2006). Additionally, most analogue materials are "owned" by institutions, whereas for digital materials, institutions may have access rights only, and only for the period for which they subscribe. Such is the case with external databases, for example.
Laws have not always kept pace with advances in technology, and may vary according to country. In most cases, digital materials may be subject to several levels of restrictions. Laws may apply restrictions on copying, storage, access, modification of content, and its use or re-use. Donors and funders may also place restrictions on the management of, preservation of, or access to the materials (Digital Preservation Coalition, 2015). Also, different laws may apply to different types of media. Web sites, for instance, are not always covered by the same laws as other types of electronic content (Grotke, 2012).
The very act of preserving some digital items may require violation of some of the copyright owner's rights. For example, migration may violate the copyright owner's right to create a derivative work. Making a digitised file widely available may impinge on the copyright owner's distribution, performance and display rights (MIT Libraries, 2012b). Rights in content and any associated software may belong to different individuals or organisations (Driscoll, 2006), making it a long and complex task to obtain the required permissions for preservation, or to port software to other operating systems or hardware.
Digital material may be password-protected or include some form of Digital Rights Management protection. Not only does this pose problems for ongoing access, but that access may require some kind of validation by third-parties that may be unavailable in the future (Atos, 2014).
Some countries provide exceptions to their copyright laws for certain cases. For example, in Australia there are exceptions that allow libraries and archives to copy damaged materials in some circumstances, as well as materials not otherwise commercially available (Driscoll, 2006). Copying for preservation is also allowable under some circumstances.
Because of the complexity and extent of potential legal issues, digital preservation coordinators need to work in close cooperation with their institution's legal advisors to minimise the risk. These risks can arise in the most unexpected of places. In my own institution, we've encountered issues with books that would otherwise be out of copyright, except for a couple of drawings by third parties.
Material chosen for preservation may contain private and confidential information, and its unauthorised release may lead to legal action. Consequently, it is important for preservers to anonymise information prior to making it available. This could involve blacking out names and identifying information, or replacing identifiers with generic names, such as 'Person1', depending on the type of media and data. Alternatively, access restrictions can be applied and legal agreements required, depending on the situation.
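The 'Person1' approach mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete anonymisation method: the function name `pseudonymise` is hypothetical, the names to be replaced must already be known, and indirect identifiers (addresses, job titles, dates) are not touched at all.

```python
import re


def pseudonymise(text: str, known_names: list) -> tuple:
    """Replace each known name with a generic label such as 'Person1'.

    Returns the redacted text and the name-to-label mapping.
    """
    if not known_names:
        return text, {}
    mapping = {name: f"Person{i}" for i, name in enumerate(known_names, start=1)}
    # Try longer names first so a full name is matched before any shorter
    # name that happens to be contained within it.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in alternatives) + r")\b"
    )
    redacted = pattern.sub(lambda match: mapping[match.group(1)], text)
    return redacted, mapping


letter = "Jane Citizen wrote to John Smith. John Smith replied to Jane Citizen."
redacted, mapping = pseudonymise(letter, ["Jane Citizen", "John Smith"])
print(redacted)
```

Using the same label for every occurrence of a name keeps the material readable as a record of who said what to whom. The mapping itself is then sensitive information and would need to be stored under the same access restrictions as the original material, or destroyed.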
Social media and online communications pose a particular challenge in terms of maintaining privacy. It is a fair assumption that many people who communicate online are unaware of just how pervasive and public their communications may be. Many online discussions, for example, are replicated in full on other sites. Tweets get saved and archived by third parties, blogs get harvested by the Internet Archive, and Facebook posts may be unintentionally public. How privacy may be maintained for such data is a challenge yet to be fully explored.
Preservation costs involve not just the actual digitisation, but also storage, infrastructure, staff resourcing and training, ongoing maintenance and auditing of the digitised materials. There are also costs associated with providing access to digitised materials.
Most institutions have limited resources to spend on preservation efforts, so the challenge is to expend these resources on preserving the most worthy materials, using the most cost-effective and efficient strategies. Some preservation strategies are more resource-intensive than others. Emulators, for example, require preservation of not just the media files themselves, but also the emulators. As technology changes, the emulators must themselves be adapted to work with the new technology; and there may also be more emulators to maintain. In effect, this means that emulation may consume more resources over time. Of course, there is the possibility that as technology advances, the creation of emulators may itself be automated at some stage.
Choosing not to preserve materials involves costs as well; not just in the potential loss of valuable data or cultural heritage, but the possibly higher costs associated with preserving items once they are categorised as high risk (Ranger, 2014). Preservation efforts that turn out to be unnecessary can also be considered to be wasteful (Lavoie, 2004). Unfortunately, the future is not always predictable, so this is a difficult issue to avoid. Additionally, the beneficiaries of preservation programs are mostly future generations, making it more difficult to justify expenditure on digital preservation, as the benefits may not be immediately apparent (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2010).
The challenges in digital preservation involve dealing with not just the technologies of the past, but also those to come. The field is developing swiftly, and custodians of digital materials need to keep abreast of changes. One of the biggest challenges is to avoid being pulled onto a preservation path that turns out to have been a waste of time, energy and money. File format obsolescence has not turned out to be the overwhelming danger it was initially perceived to be (Digital Preservation Coalition, 2015), and similar miscalculations may apply to other current and future technologies. As De Vorsey says, "the best that the preservation community can do with digital material is to make educated guesses based on a few decades of mostly anecdotal experience" (De Vorsey, 2010).
Atos, 2014, Digital preservation in the age of cloud and big data. Atos SE.
Australian National Data Service, [n.d.], File formats.
Beagrie, Neil, Andrew Charlesworth, Paul Miller, 2014, How cloud storage can address the needs of public archives in the UK. National Archives, Kew, England.
Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2010, Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. Final Report, La Jolla, Calif.
Butler, Brandon, 2014, "Cloud's worst-case scenario: what to do if your provider goes belly up". Network World, Weblog Post, 2 June 2014.
Clark, Karin, et al., 2015, Guidelines for the ethical use of digital data in human research. University of Melbourne.
Dappert, Angela, Markus Enders, 2010, "Digital preservation metadata standards". Information Standards Quarterly, vol. 22, no. 2, pp. 4-13.
De Vorsey, Kevin, Peter McKinney, 2010, "Digital preservation in capable hands: taking control of risk assessment at the National Library of New Zealand". Information Standards Quarterly, vol. 22, no. 2, pp. 41-44.
Digital Preservation Coalition, 2015, Digital preservation handbook. 2nd ed., Glasgow.
Driscoll, Erin, 2006, Copyright and legal risks in digital preservation.
Grotke, Abbie, 2012, "Legal issues in web archiving". The Signal: digital preservation, Weblog Post, 30 May 2012.
Lavoie, Brian, Richard Gartner, 2013, Preservation metadata. 2nd ed., Digital Preservation Coalition, Great Britain.
Lavoie, Brian, Lorcan Dempsey, 2004, "Thirteen ways of looking at ... digital preservation". D-Lib Magazine, vol. 10, no. 7/8. http://doi.org/10.1045/july2004-lavoie
Library of Congress, [n.d.], Recommended formats statement.
Library of Congress, 2015, Sustainability of digital formats: planning for Library of Congress collections.
McNealy, Jasmine, 2010, "The privacy implications of digital preservation: social media archives and the social networks theory of privacy". Elon University Law Review, vol. 3, no. 2.
MIT Libraries, 2012a, Digital preservation management workshops and tutorial: digital preservation strategies.
MIT Libraries, 2012b, Digital preservation management workshops and tutorial: legal issues.
Murray, Kate, 2016, "O email! My email! Our fearful trip is just beginning: further collaborations with archiving email". The Signal, Weblog Post, 10 May 2016.
National Archives (U.S.), [n.d.], What do you want to preserve?
National Archives of Australia, 2011, Digital preservation policy: preserving archival digital records transferred from Commonwealth agencies.
National Archives of Australia, [n.d.], Preservation file formats.
National Library of Australia, 2013, Digital preservation policy. 4th ed.
National Museum Australia, 2012, Digital preservation and digitisation policy. Version 2.2, Canberra.
O'Keefe, Christine, 2015, "Big Data is useful, but we need to protect your privacy too". The Conversation, Weblog Post, 8 May 2015.
Phillips, Megan, 2013, The NDSA levels of digital preservation: an explanation and uses. Library of Congress.
Ranger, Joshua, 2014, "Three views of digital preservation". AVP, Weblog Post, 5 February 2014.
Rashid, Fahmida Y., 2016, "The dirty dozen: 12 cloud security threats". InfoWorld, Weblog Post, 11 March 2016.
Rieger, Oya Y., et al., 2015, Preserving and emulating digital art objects. Cornell University Library.
Scott, Jason, 2016, "Saving 500 Apple II programs from oblivion". Internet Archive Blogs, Weblog Post, 4 March 2016.
About the Author
Bernadette Houghton is the Digitisation and Preservation Librarian at Deakin University in Geelong, Australia. She has a strong background in systems librarianship and cataloguing, as well as four years as an internal auditor.