Report on the First iPres Workshop on Practical Emulation Tools and Strategies
Dirk von Suchodoletz
The International Conference on Digital Preservation (iPRES) is an established annual event bringing together practitioners and researchers in the digital preservation community, engaging individuals, organisations and institutions across all disciplines and domains involved in digital libraries, archiving and preservation of digital artefacts. The 9th conference, the most recent iPRES, took place from October 1-5, 2012 at the University of Toronto's Chestnut Convention Centre. Before the main conference track, two days of workshops were scheduled. The emulation workshop, "Towards Practical Emulation Tools and Strategies", was one of the full-day events held on October 1st.
The workshop on emulation "Towards Practical Emulation Tools and Strategies", a full-day event held on October 1, 2012 in conjunction with the 9th International Conference on Digital Preservation (iPRES), looked at emulation as a digital preservation strategy from different angles, and discussed new directions and challenges for the preservation of new types of complex digital artifacts. It brought together nearly 50 practitioners and researchers from a broad range of institutions active in the domain of digital preservation (DP) and long-term access. The participants' backgrounds ranged from the museum world to several national libraries and various archiving institutions, including people from consulting businesses, data curation and DP services. Many of them are not using emulation yet, but do see a rising demand for new strategies triggered by the influx of a wide range of new types of digital artefacts which are not covered by the standard preservation and access strategies. The workshop's large audience underpinned the rising importance of emulation as an access strategy in digital preservation and identified open challenges which need collaborative and interdisciplinary effort.
Initially, the workshop organizers collected use cases and examples from practitioners, and grouped researchers from various past and ongoing emulation-related projects around these examples. Productive discussions followed in four workgroups, dealing with metadata and context, digital art and computer games, web archiving and big data, and sustainable emulation and community involvement.
This workshop report presents a short overview of the current issues and challenging use cases at various memory institutions as identified at the workshop as well as key outcomes of the working groups. These outcomes are further structured to identify major upcoming challenges to be tackled by the research and development community in cooperation with practitioners.
The introductory keynote, "Beyond the controversy: a quick overview of emulation as an approach to digital preservation", was given by Titia van der Werf (OCLC), who presented a short summary of emulation as a technological option in the context of digital preservation. The presentation covered the development from the very beginning, with Jeff Rothenberg's now famous paper from 1995, up to the recent initiatives within the EU-sponsored PLANETS and KEEP projects.
Figure 1: Titia van der Werf presenting her keynote speech at the workshop.
The interactive part of the workshop started with impulse talks by practitioners from different institutions, who presented their current challenges with different kinds of digital materials. The talks covered both material that is already being preserved, but with shortcomings under traditional migration, and material that is not yet handled by the institutions' digital preservation efforts and for which direct migration does not seem to be a viable solution.
Due to legal deposit requirements, national libraries in many countries are faced with a wide range of new and diverse types of digital artifacts such as multimedia encyclopedias, computer games or software in general. For instance, the Act No. 1439 on Danish Legal Deposit requires the inclusion of digital content on physical media. Currently, the Danish National Library is focusing on the extension of its computer game collection in two ways: first, by developing coordinated video game documentation practices in order to ensure the relevance of its collection of game-related materials, and second, by defining and clarifying the technical issues involved in the collection and preservation of so-called "apps" for portable media, primarily iOS and Android.
Similarly, the French National Library is required to preserve nationally made software. The German National Library as a large research library has to provide and ensure appropriate access to multimedia and other non-standard objects. While the National Library of Australia is not pursuing a dedicated emulation strategy yet, it is one of a few libraries worldwide preserving physical objects for digital archeology like (removable) media drives, special cables and controllers plus computer software such as applications and operating systems of every kind. These objects will not only be useful for the reproduction of original environments but will also help to deal with non-standard material received by memory institutions in the future.
Furthermore, the digital revolution changes workflows in government departments and thus directly affects the kind of material received by the mandated national archives. For instance, the Austrian State Archive receives data created and managed by online/web applications such as digital tax reporting systems.
A rather new challenge faced by many memory institutions is providing access to archived content of the World Wide Web. Such repositories contain a wide range of non-standard object types that depend on different rendering applications (e.g. proprietary plugins). To handle this challenge, a working group at the National Library of Australia has been working on a project which preserves generic but typical access environments for each year, starting in 1996. Virtualizing these environments could aid emulation of web page content for the respective generation of access environments.
Another challenging domain for memory institutions is digital art. As the media and platforms on which digital artworks were made are decaying, new ways to preserve access are required. Digital art usually cannot be migrated since the user experience may depend on a certain software and hardware setup, e.g. being rendered on a CRT screen in a certain resolution. Additionally, real-time playback and non-standard types of human-device interaction (e.g., audio, camera) distinguish digital art from digital objects of the office or scientific domain. Karlsruhe University of Arts and Design preserves access to CD-ROM-based artworks from the mid-1990s to the mid-2000s. Furthermore, the preservation of digital artworks created by master students in their final year at the university is another primary goal of the institution.
After the practitioners' presentations, an overview of recent research projects covering emulation and related topics followed. First, the issue of conceptually describing original environments and the lack of proper metadata schemas was raised. The technical registry TOTEM, a result of the KEEP project, provides the necessary base layer capturing individual hardware and software components as well as the relations between them. Furthermore, a TOTEM RDF-based XML schema was presented and an overview of the relevant discussions of the upcoming PREMIS 3.0 metadata standard was given. The EU-funded TIMBUS project develops strategies to preserve complex business processes and therefore partly makes use of emulation for the technical and conceptual capturing of processes and their environments, focussing also on the evaluation of the captured and redeployed processes. The challenges of preserving research data were presented using the example of the bwFLA project, a project of the state of Baden-Württemberg, Germany, focusing on scalable emulation strategies and the definition of emulation-based preservation workflows. The group of Geoffrey Brown at Indiana University is working on emulator and reproduction challenges of MS-Windows-based original environments.
To further focus the discussion on specific topics, the participants split into four working groups. Each group concentrated on a specific topic, first trying to identify common challenges, and then considering how to tackle them. The four topics agreed upon in the plenary session and then given out to the discussion groups were "Metadata and Context" (what kind of metadata do we need for emulation; how do we capture the context information of digital objects in a specific environment), "Digital Art and Computer Games" (specific technical, legal and organisational challenges of dynamic and interactive objects), "Webarchiving and Big Data" (dealing with the challenge of preserving constantly changing huge amounts of online data) as well as "Sustainable Emulation by Community Involvement" (problems and solutions for making sure that emulation tools are taken up by the community and are actively supported).
Metadata and Context
Independent of a chosen digital preservation strategy, additional metadata is required to properly deal with complex digital artefacts. Especially if the original environment has to be used to provide a proper rendering environment for a digital object, all necessary software and hardware components need to be described in detail to allow for a reconstruction of the environment. As original environments should be re-usable for groups of similar artefacts, they should be characterized in an unambiguous way allowing a proper mapping of artefact requirements to environment capabilities. Additionally, information such as software licenses and documentation needs to be stored and tacit knowledge of operation has to be preserved. Traditional, existing registries are so far incomplete, as they do not describe standard software items and machines and do not honor object and software component dependencies. Thus, they are currently unsuitable to compute object view paths for original environment setups.
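To make the notion of a view path more concrete, the following minimal Python sketch treats a registry as a simple dependency graph and enumerates every chain from a digital object's format down to an emulator. It is an illustration only: the registry contents, the component names and the function view_paths are hypothetical and are not taken from TOTEM or any other existing registry, and the graph is assumed to be free of cycles.

```python
from typing import Dict, List

# Hypothetical registry: each component maps to the components it can run on.
# All identifiers are invented for illustration; a real registry would also
# carry versions, licenses and configuration details.
RUNS_ON: Dict[str, List[str]] = {
    "format:doc-v1": ["app:word-processor-1.0"],
    "app:word-processor-1.0": ["os:desktop-os-3"],
    "os:desktop-os-3": ["hw:x86-pc"],
    "hw:x86-pc": ["emu:emulator-a", "emu:emulator-b"],
}


def view_paths(node: str, registry: Dict[str, List[str]]) -> List[List[str]]:
    """Return every dependency chain from `node` down to a component
    that has no further requirements (assumes the graph has no cycles)."""
    successors = registry.get(node, [])
    if not successors:
        # A component without recorded dependencies ends the chain.
        return [[node]]
    return [
        [node] + tail
        for successor in successors
        for tail in view_paths(successor, registry)
    ]


if __name__ == "__main__":
    for path in view_paths("format:doc-v1", RUNS_ON):
        print(" -> ".join(path))
```

A registry that omits one of these layers, or that does not record the dependencies between components, cannot produce such chains, which is the gap described above.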
Traditional preservation strategies often lose the object's context, as they extract objects from their original environments, such as the original filesystem and software environment. They might miss components linked to the object, like other necessary artefacts, or requirements such as font sets or codecs for multimedia objects. Business processes, research environments and software development systems usually consist of specially configured, networked software and hardware environments (e.g., web services, sensors), which need to be described and preserved for an authentic reproduction of the object's rendering. System and object configuration becomes a relevant topic for complex objects which are not simply composed of standard software components, but need certain configurations, like databases, research environments or business workflows.
Web Archiving and Big Data
Browser environments and rendering characteristics change over time, which becomes obvious when pages archived in the Wayback Machine are accessed in a modern browser: the modern browser does not necessarily reproduce the original look and feel of the website. Emulation seems to be inevitable in this domain, as it allows the preservation of certain standard rendering environments representing different eras of the web. Instead of programming special-purpose browsers, emulation can be used to preserve old rendering environments including virtual network access. Access to deprecated formats like old documents, or e.g. RealMedia audio streams, would be provided through original software of that time. To mitigate rendering problems and evaluate the validity of accessed pages in future environments, more metadata might need to be gathered (e.g., screenshots of pages in popular browsers).
The other major field discussed in this group is Big Data, a term which is not well defined yet. Some memory institutions interpret the term as data warehousing and linked data; others define scientific primary research data as Big Data. Pioneering memory institutions have started to look into large-scale (research) data management, e.g. using storage strategies like Apache Hadoop technology. Besides the ambiguities in the definition of the term, all discussion participants agreed on the necessity of proper data description and metadata strategies for preservation, access and re-use.
The discussion revolved around the fairly early stage of dealing with Big Data; the main discussion point was what migration or emulation strategies could look like. Scientific and research data is often meaningless without a context such as the source application or associated processes. Emulation can help to keep the source environments complete and re-run original tools, thus replicating the process that created the data. This can be used as a building block for provable science and peer-reviewed data as well as for data published along with the corresponding papers.
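As a rough illustration of how era-specific rendering environments could be matched to archived pages, the Python sketch below picks, for a given capture year, the most recent preserved environment that is not newer than the capture. The list of preserved years and the function name environment_year_for are assumptions made for this example; the selection rule itself is only one plausible policy and was not prescribed at the workshop.

```python
from bisect import bisect_right
from typing import Sequence


def environment_year_for(capture_year: int, preserved_years: Sequence[int]) -> int:
    """Pick the latest preserved environment year that is not later than
    the year in which the page was captured."""
    years = sorted(preserved_years)
    # Number of preserved years that are <= capture_year.
    index = bisect_right(years, capture_year)
    if index == 0:
        raise ValueError(f"no preserved environment as old as {capture_year}")
    return years[index - 1]


# Hypothetical set of years for which an access environment has been preserved.
PRESERVED_YEARS = [1996, 1997, 1998, 2001, 2005]

if __name__ == "__main__":
    for year in (1996, 1999, 2012):
        print(year, "->", environment_year_for(year, PRESERVED_YEARS))
```

Choosing an environment older than the capture, rather than newer, reflects the assumption that a page was authored for browsers available at or before the time it was archived.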
Digital Art and Computer Games
Digital art and games are usually more challenging than plain office documents or traditional applications due to their real-time rendering requirements, interactive elements and dynamic content. They generate domain specific challenges as the artefacts usually try to make the most of the hardware and/or use special effects which result from unique hardware. This includes special screen or visual effects achieved e.g. because of slow CRT refresh rates. Some applications made non-standard use of floppy drives e.g. for copy protection of the media. Of course, digital art and games used former cutting edge features like 3D functionality.
Another issue is introduced by the advances in computer screen technology: screen type, resolution and color depth are often crucial as a picture may look quite different on an old 4:3 (black and white) PAL TV set, 15" CRT in 1024x768, or on a modern 24" widescreen TFT offering 1920x1080 pixels resolution. Digital artworks might require non-standard types of input such as audio or video, and proper gameplay may depend on different types of joysticks. Some of the art works and games are very time critical in user interaction. Time lags between input and onscreen response, or too high execution speed of artworks or games, might render artefacts unauthentic or even entirely unusable. Unlike for traditional digital objects, additional technical metadata is required to describe the aforementioned setups.
Museums all over the world spend millions on digital art for their collections. These are often dynamic, interactive artefacts running on a wide range of different hardware and software platforms. Direct migration is seldom an option as the objects and the original environments are proprietary, or the effort of migration to a current platform would be too high. Museums and art galleries require versatile strategies to preserve and display digital assets for longer time periods. Further, many memory institutions receive the digital legacy of famous authors, politicians, researchers, or wealthy donors. This is a phenomenon found especially in the U.S., resulting in a much wider range of artefacts received in personal archives.
Besides preserving the artefacts themselves, the institutional challenges also include a mismatch in institutional funding for preservation which still focuses on traditional objects. Additionally, memory institutions complain that a gap between actual research and their specific needs for digital art still exists. Besides the technical issues, memory institutions miss guidance on what acceptable loss in digital art and game preservation is, and what future users expect to experience when accessing an artwork or a video game.
Another non-technical, domain-specific challenge is to preserve the "intent" of an artist. Does the artist intend his/her artwork to be "eternal" or "ephemeral" with respect to technological obsolescence? Depending on the intent, the artist may or may not "allow" emulation (or preservation in general). Because of this, the acquiring institution might want to obtain the artist's consent to modifications for long-term preservation. Questions about what kind of modifications are or are not acceptable will arise. This information might require additional metadata and documentation. To support later preservation, special technical equipment might be needed to migrate the original artefact. This can include adaptors to original input and output devices, as well as readers for the original medium.
Legal aspects are especially challenging for video games: rights to a game can be handed over by the publisher, but those might not include rights to the music it contains. Often it is difficult to identify all the current rights holders for a certain game, as the video game industry is volatile: companies cease to exist or are taken over, and individual copyrights get sold.
Sustainable Emulation by Community Involvement
Emulation-based preservation and access strategies do not depend only on the availability of suitable emulators or virtual machines. Archived versions of original environments composed of a wide range of software components are also required to maintain matching hardware driver sets, which is a significant challenge e.g. for the preservation of Microsoft operating systems. Additionally, certain hardware toolsets like cables and adaptors, e.g. for creating disk images, are needed if digital objects on original media become available at a later point in time.
Two categories of emulation toolsets exist. A broad range of open source hardware emulators in different states of quality and usability forms one category. They are often driven by games communities. As they are usually hobbyist approaches, they are considered to be only "hacks" by some memory institutions. The other category consists of commercially distributed emulators and virtual machines, which companies usually prefer to use. In general, their scope does not include long-term preservation. Memory institutions are barely visible yet as stakeholders in this domain. At the moment, the application of emulation is best described by a lack of institutional interest. There are only a few memory institutions, such as the National Library and the National Archives in the Netherlands, as well as the Royal Danish Library, that are investing in this approach. Currently three different groups exist among them without much interaction or linkage between them. A much better interaction between the communities is required, not only to agree on a clear set of characteristics and designated communities for quality assurance, but also to concentrate development effort for emulation tools.
Nevertheless, everyone is aware that emulation is a valid approach, but it cannot be a 100% replication of the original system and environment. The emulators available, especially from hobbyists, are the result of programmers' personal interests and passions and are not necessarily digital preservation aware, as they may not provide suitable APIs for integration in preservation frameworks, or lack long term stability. Many emulators rely on a very small number of people supporting them and are at risk if they lose interest in further maintaining them. Furthermore, the passion for certain computer platforms is not evenly distributed. There are many different emulators available for popular machines like the Commodore 64 but none or very few for rare and unpopular systems.
The discussions within the workgroups showed that the current, quite specific challenges are only a first taste of preservation challenges to come. Both the technical complexity of games and digital art and the structural and organisational challenges of big data will manifest in future "serious" objects like business processes, research data and governmental procedures. Solving the preservation of digital art and games using emulation should thus also solve most other emulation cases, or as a museum representative put it: "Digital art and games are the e-books of tomorrow".
Direct object migration as a strategy has been pushed to its limits. Emulation is able to extend them to a certain extent. At the very least emulation can act as a suitable fall back solution for objects which fail to be migrated directly. The antagonism of migration and emulation is an artificial one: emulation is just a very effective form of migration. Instead of migrating lots of single artefacts directly, the process is done layers below the objects of interest like application, operating system, or to the most extreme, on the machine layer. However, to enable emulation as a generic preservation strategy, today's complex legal issues with regard to copyrights, fair-use exemption, etc. have to be solved, ideally on a supra-national level. The discussion on applicable strategies to preserve objects' authenticity, i.e. allow for an authentic reproduction of the object, emphasized the need for standardized and comprehensive metadata schemata.
The attending practitioners saw the need for an emulation agenda and for good use cases that demonstrate the necessity and usefulness of emulation. The last decade has already seen a few research projects on emulation in the domain, but stable, usable tools and services are still missing. The quality of available tools and workflows in emulation still needs to be improved technically as well as in terms of usability and economics, as the sustainability of tools and services is crucial for their use in digital preservation and access.
 Jeff Rothenberg, Ensuring the Longevity of Digital Information, Scientific American, Volume 272, Number 1, 1995, pages 42-47.
 Janet Delve and David Anderson (Eds.), The Trustworthy Online Technical Environment Metadata Database TOTEM, Kölner Beiträge zu einer geisteswissenschaftlichen Fachinformatik, Band 4, Verlag Dr. Kovač, 2012.
 Janet Delve, Leo Konstantelos, Antonio Ciuffreda and David Anderson, Documenting Technical Environments for Posterity: The TOTEM Registry and Metadata Schema, PIK Volume 35, Number 4, 2012, pages 227-233.