
Conference Report


D-Lib Magazine
November 2004

Volume 10 Number 11

ISSN 1082-9873

Report on the 4th International Web Archiving Workshop (IWAW)

16 September 2004, Bath, United Kingdom

 

Julien Masanès
(IWAW organizer)
Bibliothèque nationale de France
<julien.masanes@netpreserve.org>

Andreas Rauber
(IWAW organizer)
Vienna University of Technology, Austria
<rauber@ifs.tuwien.ac.at>


The International Web Archiving Workshop (IWAW) is the only regular international event devoted to web archiving. The workshop attracts participants from numerous countries, and this year more than 50 people attended. As was the case last year, papers for IWAW 2004 were selected through a formal reviewing process, and they are now available on the workshop web site [1]. The diversity of ongoing web archiving projects throughout the world is reflected in the variety of issues addressed this year.

In his introduction to the workshop, Julien Masanès presented the first results and work plan of the newly formed International Internet Preservation Consortium (IIPC) [2], which he coordinates. Led by the Bibliothèque nationale de France (BnF), the Consortium also includes the national libraries of Australia, Canada, Denmark, Finland, Iceland, Italy, Norway, and Sweden; The British Library (UK); The Library of Congress (USA); and the Internet Archive. The IIPC's goals are to define and recommend standards, and to develop interoperable tools and techniques to acquire, archive and provide access to web sites. The Consortium's first achievements and current projects (storage format and metadata standards, acquisition tools, etc.) were presented.

The rest of the morning session was dedicated to presentations and discussion focusing on tools and methodological issues.

Gordon Mohr, director of the Internet Archive development team, presented Heritrix [3], developed jointly by the Internet Archive and the Nordic national libraries [4]. Heritrix is an open-source, extensible, web-scale, archival-quality web crawler. The Internet Archive began developing Heritrix in early 2003, with the intention of building a crawler for the specific purpose of archiving websites and of supporting multiple use cases, including both focused and broad crawling. The software is open source to encourage collaboration and joint development across institutions with similar needs, and its pluggable, extensible architecture facilitates customization and outside contribution. The Internet Archive and other institutions are already using Heritrix to perform focused and increasingly broad crawls.
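
Heritrix itself is written in Java, and its processor chains are documented on the project site; purely as an illustration of the pluggable-pipeline idea described above, the following Python sketch shows a toy crawler whose per-URI processing steps can be swapped in and out (all names here are illustrative and are not part of Heritrix's API).

    # Minimal, illustrative sketch of a pluggable crawl pipeline (not Heritrix's
    # actual API): each processor is a small component that can be swapped in or
    # out, which is the architectural idea behind extensible crawlers.
    from collections import deque
    from urllib.parse import urljoin, urlparse
    import re
    import urllib.request

    def fetch(uri, state):
        # Download the page body; a real crawler would also respect robots.txt,
        # politeness delays, and archive the full HTTP transaction.
        with urllib.request.urlopen(uri, timeout=10) as resp:
            state["body"] = resp.read().decode("utf-8", errors="replace")

    def extract_links(uri, state):
        # Naive link extraction; production crawlers use far more robust extractors.
        hrefs = re.findall(r'href="([^"#]+)"', state.get("body", ""))
        state["outlinks"] = [urljoin(uri, h) for h in hrefs]

    def scope_filter(uri, state):
        # A "focused crawl" scope: keep only links on the same host as the seed.
        host = urlparse(uri).netloc
        state["outlinks"] = [u for u in state.get("outlinks", [])
                             if urlparse(u).netloc == host]

    def crawl(seed, processors, limit=50):
        frontier, seen = deque([seed]), {seed}
        while frontier and len(seen) <= limit:
            uri, state = frontier.popleft(), {}
            try:
                for proc in processors:      # the pluggable processor chain
                    proc(uri, state)
            except Exception:
                continue                     # skip URIs that fail
            for link in state.get("outlinks", []):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen

    # Example: crawl("http://example.org/", [fetch, extract_links, scope_filter])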

Younès Hafri presented a research crawler named Dominos [5], developed in the Erlang programming language during his Ph.D. work at the Université des Sciences et Technologies de Lille. Since crawlers interact with thousands of web servers over periods ranging from weeks to years, the author chose to focus his development on robustness, flexibility and maintainability in order to achieve a real-time distributed system that runs on a cluster of machines and is able to crawl several thousand pages per second. Dominos includes a high-performance fault manager, is platform-independent, and adapts transparently to a wide range of configurations without incurring additional hardware expenditure.

The methodological aspects of web crawling were then addressed by Lars Clausen, from the State and University Library, Denmark, in a study on reliably predicting whether web site content has changed without having to download it [6]. Such prediction can save significant resources in an incremental crawl by avoiding the wasteful downloading and archiving of multiple copies of the same content. Working from the two indicators of content change available in HTTP headers (the datestamp and the ETag field), Clausen demonstrated on a sample of more than 5 million downloads that simple, sometimes unexpected heuristics prove both reliable and useful for this purpose.
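
As an illustration of the general technique (not of Clausen's own code or of the specific heuristics evaluated in the paper), the following Python sketch shows how a stored ETag and Last-Modified datestamp from a previous crawl can be used in a conditional HTTP request to decide whether re-downloading a resource is likely to be necessary.

    # Illustrative sketch: use the Last-Modified datestamp and ETag recorded at
    # the previous crawl to issue a conditional request. A 304 Not Modified
    # response lets the crawler skip both download and storage.
    import urllib.request
    import urllib.error

    def probably_unchanged(url, prev_etag=None, prev_last_modified=None):
        req = urllib.request.Request(url, method="HEAD")
        if prev_etag:
            req.add_header("If-None-Match", prev_etag)
        if prev_last_modified:
            req.add_header("If-Modified-Since", prev_last_modified)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                # Server answered 200: compare the new headers with the stored ones.
                same_etag = prev_etag and resp.headers.get("ETag") == prev_etag
                same_date = (prev_last_modified and
                             resp.headers.get("Last-Modified") == prev_last_modified)
                return bool(same_etag or same_date)
        except urllib.error.HTTPError as e:
            return e.code == 304   # 304 Not Modified: content very likely unchanged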

José Coch from Lingway, France, then presented WATSON, a joint project with BnF that applies several language-engineering techniques to web archiving [7]. The project tests the assumption underlying BnF's experiments since 2000 that advanced information processing can enable the automatic location and selection of content on the Web, and thus dramatically improve the accuracy and efficiency of building a large-scale web archive. Examples of pre-filtering and categorization of sites were presented, as well as a workstation prototype that aggregates useful information for professionals. Language-engineering techniques have also been applied to mining the collections for researchers; for example, Coch discussed an analysis by BnF (with an emphasis on political discourse) of the 2002 and 2004 French election web sites.
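
The language-engineering techniques used in WATSON are considerably more sophisticated than simple term matching; the following Python fragment is only a rough, hypothetical illustration of what pre-filtering candidate sites by topical relevance can look like (the term list and threshold are invented for the example and are not taken from the paper).

    # Hypothetical illustration of site pre-filtering by term density; the
    # WATSON project applies far more advanced language-engineering techniques.
    ELECTION_TERMS = {"élection", "candidat", "programme", "campagne", "vote"}

    def relevance_score(page_text, terms=ELECTION_TERMS):
        words = page_text.lower().split()
        if not words:
            return 0.0
        hits = sum(1 for w in words if w.strip(".,;:!?\"'()") in terms)
        return hits / len(words)

    def prefilter(pages, threshold=0.01):
        # Keep only (url, text) pairs whose term density exceeds the threshold.
        return [url for url, text in pages if relevance_score(text) >= threshold]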

The morning session closed with a presentation by Niels H. Christensen from the Royal Library of Denmark, who addressed preservation issues, specifically the digital "format challenge" for web archives, and proposed a number of web-archiving-specific requirements for format repositories [8]. The presentation was followed by a discussion with the audience of two ongoing projects related to digital format repositories (PRONOM [9] and GDFR [10]) and their potential use for web archiving.

The afternoon session was dedicated to Web archiving use cases.

The first use case, the Chinese Web InfoMall [11], was presented by Hongfei Yan from Peking University [12]. He described the data storage and service model of Web InfoMall 2.0, which is designed to collect pages, store them perennially, and answer retrieval requests efficiently. The Web InfoMall currently holds 0.7 billion pages (10.6 terabytes) together with 5 terabytes of digital web resources other than web pages. It is able to collect more than 1 million pages per day, has the storage capacity to hold more than 10 billion pages (about 150 terabytes), and includes a scheme for managing such large numbers of pages.
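
These figures are internally consistent: the current holdings imply an average stored size of roughly 15 kilobytes per page, which scales to about 150 terabytes for 10 billion pages, as the following back-of-the-envelope check (our own, not from the paper) shows.

    # Back-of-the-envelope check of the reported Web InfoMall 2.0 figures,
    # using decimal units (1 TB = 1e9 KB).
    current_pages, current_tb = 0.7e9, 10.6
    avg_kb_per_page = current_tb * 1e9 / current_pages    # roughly 15 KB per page
    projected_tb = 10e9 * avg_kb_per_page / 1e9           # roughly 151 TB for 10 billion pages
    print(round(avg_kb_per_page, 1), round(projected_tb))  # prints: 15.1 151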

Paul Koerbin from the National Library of Australia (NLA) then presented PANDORA [13], one of the first web archives to be launched, from a workflow and process viewpoint [14]. He showed how these processes are supported by the functionality of the web archiving management system developed by the NLA, the PANDORA Digital Archiving System (PANDAS).

Alenka Kavcic-Colic from the National and University Library of Slovenia then presented a joint project with the Jozef Stefan Institute aimed at developing a national repository for the long-term preservation of Slovenian web and electronic resources. Her talk covered the project's first results and experiences [15].

Unfortunately, due to a last-minute impediment, the presentation on the Greek Web Archiving project could not be made at the workshop, but the paper [16] is available on the workshop website.

The final workshop presentation, by Jared A. Lyle of the University of Michigan, USA, offered an archivist's perspective on the sampling issue [17]. Drawing on the purposive, systematic and random sampling methodologies applied in the pre-web archival world, this case study tested their application to the web archiving of an entire domain, Umich.edu. The study asked whether sampling could improve the objectivity and overall merit of the sites captured for present and future retrieval in an institutional archiving context.
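
The paper describes the actual sampling procedure in detail; purely for illustration, the following Python sketch shows what systematic and simple random sampling over a list of captured URLs can look like (purposive selection, the third method discussed, relies on human judgement and is not mechanical).

    # Illustrative sketch of two of the sampling methods mentioned, applied to
    # a list of captured URLs; not the procedure used in the study itself.
    import random

    def systematic_sample(urls, k):
        # Take every n-th item after a random start, for a sample of size k.
        if k <= 0 or not urls:
            return []
        step = max(len(urls) // k, 1)
        start = random.randrange(step)
        return urls[start::step][:k]

    def random_sample(urls, k):
        return random.sample(urls, min(k, len(urls)))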

Participants in IWAW 2004 agreed that the workshop was as successful as in years past, and plans are already underway for next year's web archiving workshop, to be held in conjunction with ECDL 2005 in Vienna, Austria [18].

References

[1] 4th International Web Archiving Workshop (IWAW04), <http://www.iwaw.net/>.

[2] International Internet Preservation Consortium, <http://netpreserve.org/>.

[3] Heritrix, <http://crawler.archive.org/>.

[4] Mohr, G., et al. "Introduction to Heritrix, an archival quality web crawler." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[5] Hafri, Y. and C. Djeraba. "Dominos: A New Web Crawler's Design." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[6] Clausen, L. "Concerning Etags and Datestamps." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[7] Coch, J. and J. Masanès. "Language engineering techniques for web archiving." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[8] Christensen, N.H. "Towards format repositories for web archives." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[9] The National Archives, PRONOM, <http://www.nationalarchives.gov.uk/pronom/>.

[10] Global Digital Format Registry (GDFR) News, <http://hul.harvard.edu/gdfr/news.html>.

[11] Web InfoMall, <http://www.infomall.cn/>.

[12] Yan, H., et al. "A New Data Storage and Service Model of China Web InfoMall." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[13] PANDORA, Australia's Web Archive, <http://pandora.nla.gov.au/>.

[14] Koerbin, P. "The PANDORA Digital Archiving System (PANDAS) and Managing Web Archiving." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[15] Kavcic, A. and M. Grobelnik. "Archiving the Slovenian web: recent experiences." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[16] Lampos, C., et al. "Archiving the Greek Web." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[17] Lyle, J.A. "Sampling the Umich.edu Domain." In 4th International Web Archiving Workshop (IWAW'04). 2004. Bath (UK).

[18] 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), <http://www.ecdl2005.org/>.

 

Copyright © 2004 Julien Masanès and Andreas Rauber




doi:10.1045/november2004-masanes