Interfacing libraries to the net - Achievements and standards

Ole Husby

BIBSYS, Norway

Introduction: The hypertext paradigm

Hypertext was "invented" 30 years ago by Ted Nelson:
"[It occurred to me] that the future of humanity is at the interactive computer screen, that the new writing and movies will be interactive and interlinked. It will be united by bridges of transclusion and we need a world-wide network to deliver it with royalty. . . ."
The World Wide Web was proposed in 1989 by Tim Berners-Lee:
"HyperText is a way to link and access information of various kinds as a web of nodes in which the user can browse at will. It provides a single user-interface to large classes of information (reports, notes, data-bases, computer documentation and on-line help). We propose a simple scheme incorporating servers already available at CERN. . . ."
The World Wide Web project succeeded in offering the necessary platform-independent transport mechanism to support a distributed hypertext system: The Internet.

In 1984, William Gibson invented cyberspace, and perhaps also the virtual library, in "Neuromancer":

"The Sense/Net research library was a dead storage area; the materials stored here had to be physically removed before they could be interfaced. . . ."
The success of the "web" is well known. Many quite striking web implementations are in fact to be found in library applications. I believe one of the main reasons for this is the abundance of "hypertextual structure" that is inherent in the material the libraries are treating: Within documents, between documents, between document and metadata, within thesauruses and so on.

It should not surprise us that library applications have been surfing at the top of the wave, so that today it is hard to distinguish between the "virtual library" and "cyberspace" itself.

Web achievements

Some library systems have successfully ported their "public services" to web applications: Searching the library database and ordering documents from the collections.

HTML forms are today supported by most web browsers, and offer a very easy way to construct a user interface that the untrained user finds easy to use.

A much harder task has been to interface the HTTP server to the library database. Three different approaches have been used:

  1. Duplicate the database in a special purpose format.
  2. Run a web server and a database engine as two different processes that use some interprocess communication.
  3. Tailor a server to perform TCP/IP socket calls and database queries within the same process.
Most systems have solved the problem by applying (very) special purpose pieces of software and quite proprietary intermediate protocols. Commercial software suppliers are however aware that there is a big market for tools to "connect anything to the Internet", and some are delivering.
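
The third approach can be illustrated with a minimal sketch in Python. It is not taken from any of the systems mentioned above: the table name `records`, the query parameter `q` and the port number are assumptions made for the example, and an in-memory SQLite database stands in for the library database. The only point is that one and the same process accepts the HTTP request and runs the database query.

```python
import sqlite3
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical catalogue: a tiny in-memory table standing in for the library database.
DB = sqlite3.connect(":memory:")
DB.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)")
DB.executemany(
    "INSERT INTO records (title) VALUES (?)",
    [("Hypertext handbook",), ("Library networks",), ("Network protocols",)],
)


def search_titles(term):
    """Return (id, title) rows whose title contains the search term."""
    cursor = DB.execute(
        "SELECT id, title FROM records WHERE title LIKE ? ORDER BY id",
        ("%" + term + "%",),
    )
    return cursor.fetchall()


def render_results(rows):
    """Format the hits as a very plain HTML page."""
    items = "".join("<li>%d: %s</li>" % (rid, escape(title)) for rid, title in rows)
    return "<html><body><ul>%s</ul></body></html>" % items


class CatalogueHandler(BaseHTTPRequestHandler):
    """Answers GET /?q=term by querying the database in the same process."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        term = query.get("q", [""])[0]
        body = render_results(search_titles(term)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the combined web and database server until interrupted."""
    HTTPServer(("localhost", port), CatalogueHandler).serve_forever()
```

Calling `serve()` and requesting `http://localhost:8080/?q=net` from a browser would list the matching titles. A production system would of course talk to the real database engine instead of SQLite.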

A large number of library databases are now accessible through a web interface, in both public and research libraries. Systems offering free access have been in the lead, but it seems that some of the big commercial hosts are now joining, with the help of some sort of authentication method.

As all URL's are visible to the user, and even the HTML source files, anyone may make his own variety of search pages or write a special purpose client program to request records (maybe all of them?) from a server. This shows that the web server manager must act according to basic client-server principles: Concentrate on delivering data efficiently and in a suitable format, but do not care too much about how the client behaves. And of course: guard against unfriendly clients.

Z39.50 enters the scene

Now, does the web access to bibliographic databases solve all problems? Perhaps not. I can see a couple of problems of a technical nature:
  1. The HTTP protocol is stateless, and does not support the notion of a search session as part of the protocol. This means you cannot perform interactive (iterative) searching, browse to the next screen and such things. However, in many systems the server simulates sessions by storing session ID's and global sets (at least their names), and by transferring this vital data between the client and server as part of the URL's.
  2. There are no standards for query syntaxes or record syntaxes. This problem is to some extent lessened by the fact that it is quite easy to offer user guidance.
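
The session simulation mentioned in the first point can be sketched as follows. This is only an illustration of the principle: the parameter names `session`, `set`, `start` and `count`, and the host name, are assumptions and are not taken from any particular system. The server writes the state into the URL of the "next screen" link, and reads it back when the user follows that link.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def next_page_url(base_url, session_id, result_set, start, page_size=10):
    """Build the link to the next screen, carrying all session state in the URL."""
    params = {
        "session": session_id,       # identifies the simulated session
        "set": result_set,           # name of the result set kept on the server
        "start": start + page_size,  # first record of the next screen
        "count": page_size,
    }
    return base_url + "?" + urlencode(params)


def read_state(url):
    """Recover the session state when the client follows such a link."""
    query = parse_qs(urlsplit(url).query)
    return {
        "session": query["session"][0],
        "set": query["set"][0],
        "start": int(query["start"][0]),
        "count": int(query["count"][0]),
    }


url = next_page_url("http://opac.example.org/search", "abc123", "S1", start=1)
state = read_state(url)
```

Since the protocol itself remembers nothing, every link on every result page has to be generated this way.
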
The ISO SR protocol (Search and Retrieve) and the NISO Z39.50 protocol (Information Retrieval) have been underway for the last ten years, but only recently have those protocols gained momentum. A number of the big suppliers of bibliographic records (like OCLC, LC and others) are now operating services based on Z39.50 (not just small test databases). There is furthermore the hope that the two protocols will merge into one, for the good of global interoperability.

Much could be said about the history and development principles of the two protocols, but I will skip most of this. I do however think it is fair to mention that one reason for the recent success of Z39.50 is that most hosts are applying Z39.50 on top of a pure TCP/IP layer, and thereby sharing some of the common Internet success.

Most of the Z39.50 activity has up to now happened in North America, but European systems have recently begun joining forces to connect, particularly in the Nordic countries. Project ONE (OPAC Network in Europe) is spending 2.4 M ECU's to establish services, infrastructure and software tools in 8 European countries.

One main reason for the enormous success of the WWW is the free access (for most of us) to high-quality client software. It would be unfair not to mention by name the two most important tools: Mosaic and Netscape. It is not more than a year ago that Mosaic was considered to be a synonym for WWW, and today most resources offered by web servers are tailored after the Netscape capabilities.

The situation for the Z39.50 protocol is quite different, but improving. Some quite good commercial clients are available, and for the nonpaying world there is an assortment of clients and toolkits.

There is another obstacle to widespread Z39.50 access, namely the need for configuration at the client side. This is quite unlike the web, where all you need is a URL. As there are a number of optional aspects to Z39.50, there are some parameters the client has to know about each host database. And even if the principle of negotiation is well taken care of, there is still a need for more metadata to be sent from the server. This will hopefully be taken care of in the new EXPLAIN service of Z39.50-1995 (Version 3).

Web or Z39.50 or both?

Those two approaches are different in many aspects. They could be considered competing methods, and will probably attract different user groups.

But they could also complement each other and even to some degree interoperate. There are several possibilities for interoperability:

WWW-to-Z39.50 gateways: There are a number of such gateways in operation, whereby a traditional web search form is used to enter a search request. The request is handled by a gateway that makes a "one-pass" access to a Z39.50 server, and delivers a response that is HTML formatted. This might be inferior to a pure web approach, but if the Z39.50 server is in place, you get the web interface for free. Surely this is inferior to a pure Z39.50 approach, but most users do not have a Z39.50 client.
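
The "one-pass" nature of such a gateway can be shown in a few lines of Python. This is a sketch under stated assumptions, not a description of any gateway in operation: the form field names are invented, and the Z39.50 access is represented by a stand-in function (`fake_search`) where a real gateway would call a Z39.50 client library.

```python
from html import escape


def gateway(form, search):
    """One pass: form fields in, one Z39.50 request, one HTML page out.

    `search` is any callable (host, database, term) -> list of title strings.
    """
    titles = search(form["host"], form["database"], form["term"])
    items = "\n".join("<li>%s</li>" % escape(title) for title in titles)
    return "<html><body><p>Hits: %d</p>\n<ul>\n%s\n</ul></body></html>" % (
        len(titles),
        items,
    )


def fake_search(host, database, term):
    """Hypothetical stand-in for the Z39.50 client; ignores host and database."""
    catalogue = ["Hypertext & libraries", "Network protocols"]
    return [title for title in catalogue if term.lower() in title.lower()]


page = gateway(
    {"host": "z3950.example.org", "database": "books", "term": "hyper"},
    fake_search,
)
```

Nothing is remembered between two requests, which is exactly why the gateway cannot offer the iterative searching that a Z39.50 session allows.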

Z39.50 as a MIME type: Some attempts have been made to define a specific MIME type (like "Search"), so that the web browser can be configured to launch an external Z39.50 client and pass it the parameters needed to connect to the Z39.50 host. This will help with the configuration mess.

Z39.50 URL's: Instead of a MIME type, the URL itself could carry all necessary information to launch a Z39.50 client. A syntax for such a URL has been proposed to the IETF (Internet Engineering Task Force).

A step further is a Z39.50 "retrieve" URL, which carries additional information to perform a search for a unique Document ID and retrieve it. In this case it is supposed that the Z39.50 capabilities are built into the web browser itself, and it could be very useful in combination with a gatewayed search. Such a URL is also proposed to IETF. But there are no browsers that support such a URL today, and I do not know what Netscape thinks of it (which seems to be the final test ...).
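
To make the idea more concrete, here is a small Python sketch of how a client might take such a URL apart. The layout assumed here (a scheme of `z39.50s` for a session or `z39.50r` for a retrieval, then host, optional port, one or more database names joined by `+`, and an optional document ID after `?`) is an illustration of the kind of information the URL has to carry, and should not be read as the text of the proposal. Port 210 is used as the default because it is the port registered for Z39.50.

```python
import re

# Assumed layout: scheme://host[:port][/database[+database...][?docid]]
ZURL = re.compile(
    r"^(?P<scheme>z39\.50[rs])://(?P<host>[^:/]+)(?::(?P<port>\d+))?"
    r"(?:/(?P<databases>[^?;]*)(?:\?(?P<docid>[^;]*))?)?$"
)


def parse_zurl(url):
    """Split a Z39.50 URL into the parameters a client needs to connect."""
    match = ZURL.match(url)
    if match is None:
        raise ValueError("not a Z39.50 URL: %r" % url)
    return {
        "retrieve": match.group("scheme") == "z39.50r",
        "host": match.group("host"),
        "port": int(match.group("port") or 210),
        "databases": [d for d in (match.group("databases") or "").split("+") if d],
        "docid": match.group("docid"),
    }


session = parse_zurl("z39.50s://z3950.example.org/books")
retrieval = parse_zurl("z39.50r://z3950.example.org:2100/books?doc42")
```

With a "session" URL the client would open a connection and let the user search; with a "retrieve" URL it would fetch the single document named by `docid`.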

Cataloguing the Internet

As the "collections" of the virtual library are everywhere, the task of cataloguing is enormous. The rules for traditional cataloguing may not apply. Furthermore, it is no longer clear what the responsibilities of each cataloguer are, and what possibilities there are for reuse of cataloguing ("shared cataloguing").

The Internet is not hierarchic (and neither is the information universe). There are however a lot of efforts to build subject-based hierarchical views of the network resources. Some are based on traditional classification systems (like UDC in the Swedish EELS project), while others are using private subject trees. There are even projects that are distributed in the sense that anyone is invited to add links or even expand the tree itself.

Cataloguing - both in the traditional and the new meaning - deals with metadata. The main method for storing metadata has for the last 25 years been to use some MARC format to keep separate "bibliographic records" in a separate database. A recent extension to the MARC format is the USMARC (and others) 856 tag, for describing the location of an electronic document on the network. The main use of this is to store the document URL in subfield 856 $u.
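
As a small illustration of how the 856 tag can be put to use, the following Python sketch turns subfield 856 $u into an HTML hyperlink. The in-memory representation of the record (a list of tag and subfield pairs) and the example URL are assumptions made for the sketch; a real system would read the fields from its own MARC format. The title is taken from subfield 245 $a.

```python
from html import escape

# Hypothetical record: (tag, {subfield code: value}) pairs, not a real MARC encoding.
record = [
    ("245", {"a": "Interfacing libraries to the net"}),
    ("856", {"u": "http://www.example.org/paper.html"}),
]


def links_from_record(fields):
    """Return one HTML hyperlink for every 856 field that has a $u subfield."""
    # Use the title (245 $a) as link text when there is one, else the URL itself.
    title = next((sub.get("a", "") for tag, sub in fields if tag == "245"), "")
    return [
        '<a href="%s">%s</a>' % (escape(sub["u"], quote=True), escape(title or sub["u"]))
        for tag, sub in fields
        if tag == "856" and "u" in sub
    ]


links = links_from_record(record)
```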

A number of projects for cataloguing electronic documents are based on the use of this 856 tag, of which I will mention two:

The "Internet Cataloging Project" of OCLC, where libraries are asked to join the project by contributing MARC records for electronic documents. OCLC in turn keeps a separate web-searchable database for those records, where the 856 tags are turned into hyperlinks. The project runs until March, 1996.

The "WebDoc" project of PICA. This project is more "controlled", in the sense that a number of electronic journals are selected, and that both the publishers and libraries are involved. There is also a mechanism for payment and access control. This project runs until July, 1997.

All efforts to build structure into networked resources, and especially projects like these, encounter one of the most well-known problems regarding the Internet: the question of authenticity and integrity of the available resources. This problem is just stated here, with no effort to discuss it!

More about metadata

What are the alternatives to using MARC records?

The MARC approach may be outdated, and requires an effort that may be too complicated. Many Internet "locating" services (like Lycos) are doing quite the opposite: automatically cataloguing the documents for indexing, using (of necessity) a much simpler metadata structure.

One aspect of cataloguing rules is to define a set of metadata elements that should go into a bibliographic record ("surrogate record"). An OCLC/NCSA Metadata Workshop in March 1995 proposed a new set of metadata elements: The Dublin Core. This set consists of 13 elements (Subject, Title, Author, ObjectType, Identifier, Source ...). There are no subfields. This set is defined with the intention that different encoding methods could be used, like:

MARC format
A special SGML DTD
HTML META-tag
Furthermore, the SGML and HTML encodings suggest of course that the metadata is inserted into the primary document itself, so there is no longer a need for separate surrogate records. An automatic indexing engine could produce metadata descriptions of a higher quality than today, and a much more sophisticated search is possible. If there is still a need for a separate catalogue, a minimal MARC record could be extracted from the document itself.
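
The HTML encoding and the automatic extraction can be sketched in Python as follows. The exact form of the META tags (element name in NAME, value in CONTENT) and the example values are assumptions made for this illustration; the workshop report should be consulted for the actual element definitions. The first function embeds the elements in a document, and the small parser plays the role of an indexing engine that reads them back.

```python
from html import escape
from html.parser import HTMLParser


def meta_tags(elements):
    """Encode (element, value) pairs as HTML META tags."""
    return "\n".join(
        '<META NAME="%s" CONTENT="%s">' % (escape(name, quote=True), escape(value, quote=True))
        for name, value in elements
    )


class MetaExtractor(HTMLParser):
    """Collects the NAME/CONTENT pairs of all META tags in a document."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.found.append((attributes["name"], attributes["content"]))


elements = [
    ("Title", "Interfacing libraries to the net"),
    ("Author", "Ole Husby"),
    ("ObjectType", "Text"),
]
document = "<HTML><HEAD>\n" + meta_tags(elements) + "\n</HEAD><BODY>...</BODY></HTML>"

extractor = MetaExtractor()
extractor.feed(document)
```

After the call to `feed`, `extractor.found` holds the same pairs that were embedded, which is what a minimal surrogate record could be built from.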

Another initiative concerning metadata is the URC development within IETF. URC means Uniform Resource Characteristics, and has several aims. The main goal is to offer a resolution of one URN (Uniform Resource Name) into one or more URL's. It is also intended to add "extrinsic" metadata to the URL's (like price and availability), and even "intrinsic" metadata (bibliographic). The URC service would then supply all metadata necessary for the user to decide if the electronic document is worth downloading (or perhaps for the browser to decide automatically). The URC service could also serve as a search engine, offering all metadata elements as access points.
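
Since the URC standard is not finished, the following Python sketch can only illustrate the intended behaviour. Everything in it is hypothetical: the URN strings, the record layout and the function names are invented for the example. It shows the two roles described above: resolving one URN into one or more URL's, and using any metadata element as an access point.

```python
# Hypothetical URC records, keyed by URN. All names and values are invented.
URC_RECORDS = {
    "urn:x-example:report-17": {
        "urls": [
            "http://www.example.org/report17.html",
            "ftp://ftp.example.net/pub/report17.ps",
        ],
        "Title": "Annual report",
        "Author": "Example Library",
        "price": "free",
    },
    "urn:x-example:journal-3": {
        "urls": ["http://journals.example.com/3/"],
        "Title": "Journal of examples",
        "Author": "Example Press",
        "price": "subscription",
    },
}


def resolve(urn):
    """Resolve one URN into its list of URLs (empty if the URN is unknown)."""
    record = URC_RECORDS.get(urn)
    return list(record["urls"]) if record else []


def find(element, value):
    """Use any metadata element as an access point; return the matching URNs."""
    return sorted(
        urn
        for urn, record in URC_RECORDS.items()
        if value.lower() in str(record.get(element, "")).lower()
    )
```

A browser could, for instance, call `find("price", "free")` and then `resolve` each URN it gets back, choosing among the returned locations.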

The URC standard is not yet in real use, but could offer a solution to many problems concerning networked documents.

References

[Library of Congress] Z39.50 Maintenance Agency Homepage
http://lcweb.loc.gov/z3950/agency/agency.html

[Internic] Z39.50 resources - a pointer page
http://ds.internic.net/z3950/z3950.html

[Juha Hakala] Z39.50-1995: Information retrieval protocol
http://renki.helsinki.fi/z3950/z3950pr.html

[ONE] ONE Homepage
http://www.bibsys.no/one.html

[EELS] EELS Homepage
http://www.ub2.lu.se/eel/eelhome.html

[OCLC] Internet Cataloging Project
http://www.oclc.org/oclc/catproj/overview.htm

[PICA] WebDoc
http://www.pica.nl/cgi-bin/wwwdoc.en

[OCLC/NCSA] OCLC/NCSA Metadata Workshop Report
http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html

[IETF] URC Scenarios and Requirements
http://www.acl.lanl.gov/URI/Scenarios/index.html
ole.husby@bibsys.no