D-Lib Magazine, December 2000
Volume 6 Number 12 ISSN 1082-9873
PhysDoc - A Distributed Network of Physics Institutions Documents
Collecting, Indexing, and Searching High Quality Documents by using Harvest
Thomas Severiens, Michael Hohlfeld, Kerstin Zimmermann, Eberhard R. Hilf
PhysNet offers online services that enable a physicist to keep in touch with the worldwide physics community and to receive all information he or she may need. In addition to being of great value to physicists, these services are practical examples of the use of modern methods of digital libraries, in particular the use of metadata harvesting.
One service is PhysDoc. It consists of a Harvest-based online network of information brokers and gatherers, which harvests information from the local web-servers of professional physics institutions worldwide (mostly in Europe and the USA so far). PhysDoc focuses on scientific information posted by individual scientists on their local servers, such as documents, publications, reports, publication lists, and lists of links to documents. All rights remain with the authors, who are responsible for the content and quality of their documents.
PhysDis is an analogous service specifically for university theses, with their dual requirements of examination work and publication.
The strategy is to select high quality sites containing metadata. We report here on the present status of PhysNet, our experience in operating it, and the development of its usage. The continuous involvement of authors, research groups, and national societies is considered crucial for a stable future service.
Anyone can feel drowned by the flood of irrelevant information when searching the web with search engines for a small but precious bit of relevant information. Everyone dreams of a 'portal' somewhere on the web providing specific information: certified, up-to-date, correct, complete, just in time, and ready to use. Several events have now prepared the ground for a future realization of these dreams. This paper describes our work in support of the field of physics and its relevant scientific documents.
This year the American Physical Society (APS), one of the major national physics societies in the world and publisher of renowned scientific journals (Physical Review Letters, Physical Review), has allowed the authors of articles accepted for publication in APS journals to put a copy (even a scanned copy of the printed journal article) on their own local institute's web-server. Netiquette requires that the author set a link to the publisher's server and cite the refereed document. Why is APS apparently giving away the 'ownership' of its documents? Because the publisher can now offer a full range of sophisticated add-on services, such as crosslinking, search engines, lists of society members, and other professional services. Hence, the publisher has moved from the ownership of documents to professional add-on services.
This article describes PhysDoc, a physics document portal, which allows documents to be searched across physics institutions' web-servers worldwide. PhysDoc is part of PhysNet, the worldwide Physics Departments and Documents Network, which operates under the auspices of the European Physical Society (EPS) and several national societies and is controlled by the EPS Action Committee on Publication and Scientific Communication (ACPuC). The technical development and standards are coordinated by the Institute for Science Networking at the Physics Department of Oldenburg University. PhysDis is a special subset of PhysDoc, a service for Ph.D. theses of universities. Theses need a specific set of metadata because of their dual role as examination works at a department and as publications by the authors.
PhysDoc is based on the concept of metadata harvesting, introduced by the Harvest project and being developed further by the Open Archives Initiative (OAI) of 1999. The increasing availability of long-term stable archiving formats (XML, MathML, SGML) and universally readable browser formats (HTML), together with the Harvest technique for distributed gathering and brokerage, forms the technical foundation of a fully distributed worldwide information system. The Open Archives Initiative has proposed a seamless superstructure for the management of scientific documents and called for a worldwide standard for the gateway between information repositories and search engines. PhysDoc uses the international metadata standards defined by the Dublin Core initiative for object description, so that search engines can make full use of the information given by the local repositories' documents, provided metadata have been added. In that case, there is no need to write individual wrappers for each local data source.
The European Physical Society (EPS) has opened its PhysNet services to develop into a worldwide service network supported by national learned societies, physics institutions, and university departments.
PhysDis is part of the German project 'Dissertationen Online', a project of the IuK Initiative Information and Communication of the Learned Societies in Germany, supported by the German Science Foundation (DFG). Within Germany, the German National Library will serve as an archive, the university libraries as local repositories for the posting on the web, the departments for the exam-relevant part of metadata, and the learned societies for the search engines and training of candidates. The author's rights are reserved for the candidate.
We present the history, the concept, the realization, and our experiences for the field-specific document portals PhysDoc and PhysDis. We evaluate the coverage as well as the usage throughout the first six years of operation.
The report also documents the present diplomatic activities to assure international control and freedom, namely that any professional author can contribute to the seamless system without bias, while no single party can dominate the decisions.
The authors feel that a professional field-specific document portal in science should be restricted to documents that have been refereed or have been posted on the web by scientists themselves. Thus, the scientists of professional institutes and physics departments at universities worldwide take responsibility for the quality, integrity, timeliness, and completeness of the information they decide to put into the services via their local posting, and in particular for their copyright. The PhysDoc service, on the other hand, has to ensure the visibility of information on the net through its search engines.
In 1994, a series of meetings, visits, and contacts clarified the vision of future possibilities of using the web for professional services in physics. The most famous event was a one-day conference at the Los Alamos Physics ePrint Archive, where the German Physical Society (DPG) presented its vision and offered its international cooperation. In subsequent meetings that year with R. Kelly, technical director of APS, it became clear that the publishers' future marketing concept would move from ownership of information to professional add-on services.
The EPS committee on publication, under the chairmanship of Franck Laloë, organized a series of early meetings and sessions, where publishers and physicists outlined their visions. At a certain point it was realized that the committee needed a new concept, as well as a new mandate, in order to better fit the general purposes of a learned society such as EPS. One concern was an appropriate balance of the contribution from private publishing houses to the committee, which assumed a new name: 'Action Committee on Scientific Publications and Communication'. One of the activities of this new committee was to set up new professional information services for physicists, such as PhysDep and PhysDoc, and to coordinate the development of several European mirrors of the Los Alamos preprint server in order to improve both safety (in case of fire, etc., destroying the computers) and the speed of the connections.
The present committee (chaired by C. Montonen), together with the Secretary General of the EPS (D. Lee), set up the procedures used to make decisions: proposition to the committee, eventual approval, annual report, and a decision whether the service (still) meets the standards and requirements of a society's service. The committee also provided a small seed funding for operation. This made PhysNet possible, gave an ideal supportive framework, and boosted the acceptance by national societies as well as users.
For the German Physical Society (DPG) we offered training courses for nominated local experts from the physics departments at German universities, to teach them how to set up web-servers of their own. Thus, virtually all physics departments in Germany had installed a web-server by the end of 1994. The list of links to these servers and the email list of their operators formed the nucleus of PhysNet, a collection of services now provided by the EPS together with its international partners.
In the next few years, these lists of links were steadily expanded, adding more physics institutions from more countries. Further physics-related subservices were integrated as well.
The development of PhysDoc
In 1995, a Harvest search engine, an open software product from the University of Colorado, was successfully installed by our collaborator H. Stamerjohanns. The software allows gatherers to be installed, which harvest information from given sources taken from link lists. An index file is created and forwarded to a broker, which then answers incoming queries. Harvest allows networks of distributed gatherers and brokers to be set up.
Harvest is an integrated set of tools to gather, extract, search, cache, and replicate relevant information across the Internet. Harvest was a research project led by M. Schwartz at the University of Colorado. It was developed by members of the Internet Research Task Force Research Group on Resource Discovery (IRTF-RD), consisting of M. Bowman (Transarc Corporation), P. Danzig (University of Southern California), U. Manber (University of Arizona), and M. Schwartz (IRTF-RD chair, University of Colorado). Harvest has been used in some scientific document systems in Europe, and is used in Australia to connect governmental authorities at all levels. Unfortunately, Harvest is no longer supported by the Department of Computer Science at the University of Colorado at Boulder, and has become part of the free software development process. Presently, support and development (e.g., to make use of RDF) are led and guaranteed by several groups around the world.
In 1997, R. Schwänzl and J. Plümer of the Technical Advisory Committee of the IMU, in conjunction with us, created MyMetaMaker (MMM), a web-form for enriching documents with metadata. In its latest version MMM produces HTML metadata that conform to Dublin Core.
Also in 1997, we started to set up the PhysDoc service as part of PhysNet. Using the PhysDep database of physics institutions, we browsed across the institutions' web-servers and gathered links with relevant information on science documents such as reports, publication lists, preprints, or even scanned copies of published papers. We started with European servers and later added servers from locations in the US. The list of URLs for publications on local web-servers is still incomplete.
In the spring of 1998, the 'Dissertationen Online' project started, within the scope of which we set up PhysDis, a collection of online information and full texts of Ph.D. theses at physics institutions in Europe.
By the time of the international CRISP (Cooperative Research Information Systems in Physics) workshops in Oldenburg in 1997 and 1999, the way was paved for worldwide distribution and for sharing work and services on an equal footing among the participating persons, groups, and national societies.
In May 2000, a new era of internationalization was initiated by the action committee of EPS and the Committee on Electronic Information and Communication (CEIC) of the IMU (International Mathematical Union) through the negotiation of an official cooperative agreement between PhysNet and MathNet, and especially its distributed document service MPRESS.
In parallel, since PhysNet will comply with the Open Archives Initiative standards and requirements, it will be embedded and integrated into the upcoming seamless distributed network of all disciplines.
A document portal for a scientific field such as physics has to comply with the following requirements:
As a non-commercial service we restricted PhysDoc and PhysDis to those databases which are free to use, online, non-proprietary, and leave the rights with the authors.
The service is based on a distributed, heterogeneous, large set of document and data sources. It is complemented by an analogously distributed workforce: the local experts responsible for organizing the online archive of documents of local authors, and finally the operators of the distributed brokers and gatherers. Thus, in principle, there is no strong central workforce needed, apart from some administration and network maintenance offered by the EPS.
Although the whole system may appear difficult to manage, it has been operating very smoothly for more than four years. We consider it the only structure that remains stable as the number of documents and sites, the workforce needed, and the net-load scale up.
The aim of PhysDoc is to enhance the public accessibility of scientific documents stored on local institutions' servers. The core of PhysDoc is a list of links to document sources of physics institutions worldwide. They are ordered by continent, country, and town. Such document sources are, for example, preprints, research reports, annual reports, and lists of publications of local research groups and individual scientists.
It is especially advisable to store documents or document information (such as publication lists) locally if the authors or their institutions want to control integrity, accessibility, and archiving. Local storage allows for easy and frequent updating directly by authors. Thus, PhysDoc complements central preprint servers such as the arXiv.org e-Print archive and the publishers' scientific journals.
Figure 1: Homepage of the PhysDoc service
The PhysDoc service is combined with a Harvest-based search engine. A 'simple search' takes a few keywords, optionally combined with Boolean operators. Alternatively, a query form lets the user restrict the search to author, title, or 'fulltext', which is useful for documents that carry the respective metadata.
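The difference between the two query styles can be illustrated with a small sketch. This is not Harvest's query engine: the records, the field names (`author`, `title`, `fulltext`), and both function names are hypothetical and chosen only to mirror the query form described above.

```python
from typing import Dict, List

# Hypothetical example records; the field names mirror the query form.
RECORDS: List[Dict[str, str]] = [
    {"author": "A. Example", "title": "Cluster fission",
     "fulltext": "metal clusters and fission barriers"},
    {"author": "B. Sample", "title": "Harvest in physics",
     "fulltext": "distributed gatherers and brokers"},
    {"author": "A. Example", "title": "Digital libraries",
     "fulltext": "metadata for physics documents"},
]


def _words(text: str) -> set:
    """Lower-cased, whitespace-separated words of a text."""
    return set(text.lower().split())


def simple_search(records, terms, operator="AND"):
    """Keyword search over all fields; terms are combined with AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    hits = []
    for record in records:
        words = _words(" ".join(record.values()))
        if combine(term.lower() in words for term in terms):
            hits.append(record)
    return hits


def fielded_search(records, **criteria):
    """Every criterion must match a word in the named field."""
    hits = []
    for record in records:
        if all(value.lower() in _words(record.get(field, ""))
               for field, value in criteria.items()):
            hits.append(record)
    return hits
```

A fielded query such as `fielded_search(RECORDS, author="example")` only works when the author has been recorded separately from the text, which is why the query form depends on documents carrying metadata.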
Harvest allows gatherers to be configured by feeding in a list of web-links and their click-successors. The gatherer extracts the retrieval-relevant information from the gathered files by using file-type-specific summarizers. The extracted and summarized file descriptions are stored in SOIF-format (Standard Object Interchange Format) in the gatherer's database. Harvest brokers respond to queries using local index files. These index files can be composed of the indexed information from different distributed gatherers and brokers. The advantage of Harvest is the separation of gatherer and broker components, which allows for the seamless cooperation of many distributed gatherers as well as brokers and thus is scalable.
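The separation of gatherers and brokers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Harvest's implementation: the class and method names are hypothetical, and `to_soif` renders only a simplified SOIF-like layout (a template type and URL, then one `Name{size}:<TAB>value` line per attribute), so the exact syntax should be checked against the Harvest manual.

```python
from typing import Dict, List, Set


def to_soif(url: str, attributes: Dict[str, str], template: str = "FILE") -> str:
    """Render one summary in a simplified, SOIF-like layout (assumed syntax)."""
    lines = [f"@{template} {{ {url}"]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))  # size of the value in bytes
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)


class Gatherer:
    """Hypothetical gatherer: stores one summary per harvested URL."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.database: Dict[str, Dict[str, str]] = {}

    def add_summary(self, url: str, attributes: Dict[str, str]) -> None:
        self.database[url] = dict(attributes)


class Broker:
    """Hypothetical broker: merges summaries and answers queries locally."""

    def __init__(self) -> None:
        self.database: Dict[str, Dict[str, str]] = {}
        self.index: Dict[str, Set[str]] = {}

    def collect(self, source) -> None:
        """Import summaries from a gatherer or from another broker."""
        for url, attributes in source.database.items():
            self.database[url] = dict(attributes)
            for value in attributes.values():
                for word in value.lower().split():
                    self.index.setdefault(word, set()).add(url)

    def query(self, word: str) -> List[str]:
        """Return the URLs whose summaries contain the word."""
        return sorted(self.index.get(word.lower(), set()))
```

Because a broker only needs the `database` of its sources, one broker can draw on several gatherers or on other brokers, which is the property that makes the arrangement scale.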
To date, we have collected information from 1,113 document lists of local physics institutions and departments: 439 in Germany, 92 in France, 174 in the UK, 120 in Italy, 189 in other European countries, and 99 in the USA and other countries worldwide. The number of linked publication lists in PhysDoc seems to be stable at about 1,100 links, but is certainly still incomplete.
The number of documents reached by these links cannot be measured exactly, since distributed authors may change their input without notice. Our estimate of the number of reached documents thus far is well above 70,000 (62,000 in April 1999 and 50,000 in October 1998). To keep the links for PhysDoc up to date is time consuming, because links are often moved around without notice by the authors at their local sites.
It is therefore recommended that institutions set up one central web-page at their server providing the links to all the document lists of their subinstitutions, research groups, and individual scientists. Recently, more institutions have been setting up a local sub-homepage leading to all physics documents information pages of their local research groups and scientists. This increases the stability of our service.
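A central page of this kind gives a gatherer a single stable entry point. As a rough sketch of why that helps, the following Python extracts the document-list links from one such page using only the standard library. The page content and URLs in any usage are invented for illustration, and this is not how Harvest itself follows links.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of all <a href="..."> links on a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the central page's URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html_text: str, base_url: str) -> List[str]:
    """Return absolute URLs of all links found in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links
```

If research groups move their publication lists, only the central page has to be updated; the address registered with the service stays the same.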
Authors should additionally qualify their documents by adding metadata according to the Dublin Core standard. For this purpose we provide the web-form MyMetaMaker (MMM), which enables authors to create and integrate the machine-readable metadata without needing to know the complicated Dublin Core specifications, and thus to improve the web-visibility of their documents.
PhysDis is a subset of PhysDoc and focuses on Ph.D. theses and dissertations as a special kind of publication. In the past few years, the prestige of these documents in physics has changed from 'grey literature' to an important source dealing with the latest research results. An agreement was reached with the German National Library concerning the set of metadata and the archiving of the electronic full texts.
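To give an idea of what such machine-readable metadata can look like, the sketch below generates Dublin Core `<meta>` elements for the head of an HTML page. The exact output of MMM is not reproduced here; the code follows the common convention of `DC.`-prefixed element names with a schema link, and the function name and sample values are hypothetical.

```python
from html import escape
from typing import Dict, List

# Namespace of the Dublin Core element set, version 1.1.
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def dublin_core_meta(fields: Dict[str, List[str]]) -> str:
    """Build HTML <meta> elements for Dublin Core fields.

    `fields` maps a Dublin Core element name (e.g. "title", "creator")
    to a list of values; repeated elements yield repeated <meta> tags.
    """
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, values in fields.items():
        for value in values:
            content = escape(value, quote=True)  # keep attribute values well-formed
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)
```

For example, `dublin_core_meta({"title": ["A sample thesis"], "creator": ["A. Example"]})` yields a schema link followed by one `<meta>` line per value, ready to be pasted into the document's `<head>`.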
PhysDis currently offers 229 links to collections of Ph.D. theses and dissertations in 18 European countries, plus the collections of MIT and Fermilab in the US. We list 85 links in Germany, 48 in Spain, 21 in Sweden, 20 in Switzerland, and 55 links in other European countries. In total, 1,818 datasets including 250 full texts have been collected so far.
The PhysDis service also uses a Harvest-based broker to allow for retrieval across all of the listed links.
An upload interface is offered for both services to allow local institutions or the institutions' authors to register documents, information on documents, or lists of theses.
Experience and Usage
For an innovative new service the only way to analyze the market is to put the service online, monitor its usage in detail, and perform live experiments. Here we give some examples of our experiences in taking actions to publicize the service and using an internal tool to monitor usage of each file.
In response to publicity, most usage statistics showed a temporary peak, followed by a gradual, less dramatic long-term increase, which was the desired result.
Figure 2: File-requests plus usage of the search engines of the PhysDoc sites per day
Figure 2 shows that the PhysDoc service is mostly used to search for documents. Browsing the lists of links accounts for only a small share of usage, though the lists are essential for configuring the search engine. The gap in the year 2000 resulted from technical problems with the machine, which have now been solved.
Some positive examples:
In May 1999, we changed the layout of PhysNet to a frame-free minimal design of the web pages, allowing for easy linking, fast downloading for the customer, and access by any type of browser and operating system. A lot of positive responses were received from users, especially regarding the decision to replace the frames with tables.
Communication with users and system operators is one of the crucial aspects of operating a public service. We try to respond individually to any email contact, and we also use email to disseminate information about improvements to the service.
For example, any operator can help the users of his server and of PhysNet (and other portal services) by providing a detailed and stable linking structure. One element of this structure should be a central linking site for all the publication lists on the server, which should have a stable address. In PhysDoc we observed that addresses of publication lists are not stable over longer time periods on most servers, but stable linking is a central requirement for every referencing system.
The largest long-term increase in usage was achieved by increasing the quality of the service:
Some drawbacks: For a month the search engine for documents had an unresolved breakdown; it was operating but returning few results. This resulted in a lasting decrease in usage, from which the service has not yet recovered even after resuming full operation.
Vision of Future Developments
For the future long-term stability and professional usefulness of a field-specific document portal, some lines should be followed:
To harvest all relevant information on physics documents worldwide is a huge task, but by distributing the work among national societies and through international coordination, it can be done, and it is worth the long-term engagement because it should provide the means to make research processes more effective.
PhysDoc is operated under the auspices of the European Physical Society (EPS) and under the control of its Publication Committee. Funding by the EPS has enabled us to set up and maintain the service.
PhysDis has been initiated by the IuK Initiative. It is supported by the German Science Foundation (DFG) as part of the grant Dissertationen Online, a joint project of several learned disciplines (Mathematics, Chemistry, Education, Physics), university libraries, and computer centres in Germany.
Many thanks are due to F. Laloë, Laboratoire de Physique de l'ENS - FR, for detailed information about the early EPS meetings.
Hints on 'what to do and how to join?'
Institute Web-Server Operators:
Head of Institutes:
National Physical Societies:
References and link list
(On January 24, 2011 the heading of this article was changed to correctly identify the Institute for Science Networking Oldenburg, located at the Carl von Ossietzky-Universität, and to update the contact email address. Additionally, the organization names and URLs in the References and link list section were updated by author Eberhard R. Hilf. The links were last viewed on January 9, 2011.)
Copyright© 2000 Thomas Severiens, Michael Hohlfeld, Kerstin Zimmermann, Eberhard R. Hilf