Workshop A: Web-wide Indexing/Semantic Header or Cover Page

Chair: Bipin C. Desai, Brian Pinkerton

Choosing an Indexing Strategy in an Enterprise Environment

Christian Kuhnert, February 27th 1995
(kuhnert@welfa5.elektro.uni-wuppertal.de)
http://welfad.uni-wuppertal.de/people/kuhnert.e.html

Abstract

This paper describes the requirements for setting up an indexing system for WWW services within a company that operates world-wide. Six systems for description based and full text based information retrieval, namely Aliweb, Harvest, DIENST, WAIS, FFW and GlimpseHTTP, are discussed and compared. After an appropriate solution has been selected, some remarks on desirable future development are made.

1 Situation

Siemens Nixdorf Informationssysteme AG (SNI) is the largest European manufacturer of midrange systems (*NIX systems). Other important business areas are mainframes (BS2000) and POS (point of sale) equipment, with a total of roughly 40,000 employees world-wide. Currently the majority of customers reside within Europe.

Data exchange between plants and establishments is carried out through a world-wide corporate network. For communication with subsidiaries, partner companies and customers, Internet paths are becoming more and more important. Most of the current data paths are charged on a per-volume basis.

2 Targets

The possible internal application areas for WWW are currently being evaluated. One option under consideration is to provide Internet connectivity to employees primarily through a WWW interface that integrates most of the services in a user-friendly manner.

Today there are many internal database and information systems, each with its own user interface. Gateways to these systems are being built. In one special case, migration from a proprietary system to HTTP is being considered. This system mainly provides product information to sales executives and contains a full text index. Representing this index and its query interface is the main problem here. Another field of application is querying internal library catalogues.

In parallel, WWW services for corporate presence are being built.
Here indexing might be handled in much the same way as on other public WWW servers.

3 Priorities

Currently the limiting factors for any implementation are, in order of importance:

1. Network cost
2. Storage cost
3. Computation cost

This ordering is probably not specific to the SNI environment, since over the past years cost reductions have generally taken place in the reverse order.

From the user's perspective, the main goals for each service are:

1. Quality of Service (QOS: availability, responsiveness)
2. Ease of use (for the client as well as for system administration)
3. Timeliness (treated here separately from QOS)

There is an inherent conflict between the top items of the two lists above: QOS is limited by network quality, which is almost proportional to cost. Reduced to indexing and retrieval, the main questions are:

Q1: Index distribution cost versus query transfer cost
Q2: Local computation versus WAN access

4 Review of Common Methods for WWW Server Index Generation

Besides robot based gathering from information sources (as done by Lycos, WWWW, RBSE Spider and WebCrawler), the most frequently used system with server side support is ALIWEB[1]. A newer and more general approach is followed by the HARVEST[2] system. In short, HARVEST trades network load for storage cost - a reasonable choice given the priorities above. The DIENST[3] protocol focuses on the distribution of academic papers. It relies on bibliographic descriptions in RFC-1357 format and distributes queries to multiple locally maintained databases.

None of these systems support full text indexing (Harvest could, but would lose its advantages; DIENST actually does, but relies on WAIS[4] for the implementation), therefore WAIS must be considered. Finally, there are specialised full text indexers that support HTML: FFW[5] and GlimpseHTTP[6].

4.1 Description Based Indexers

ALIWEB

Aliweb is based on description files that are collected at regular intervals and then combined into a searchable index.
It is the information provider's responsibility to compile and update this description file, which follows a common standard. This can be done manually or with an information extraction tool. Aliweb currently depends on one master server for gathering data. The index is then mirrored.

HARVEST

Harvest can essentially be broken up into two main components called "Gatherer" and "Broker". The Gatherer extracts object descriptions from files of known type (besides HTML this includes binary files, some graphics formats, etc.) and exports them via its TCP port. A proprietary, structured format called SOIF (Summary Object Interchange Format) has been defined to exchange these descriptions. The Broker connects to one or more Gatherers (or other Brokers) to collect this data (optionally compressed with gzip) and build an index. It then accepts query requests by listening on its own TCP port. Structured queries using Boolean expressions and fielded search are provided.

DIENST

DIENST (which stands for Distributed Interactive Extensible Network Server for Techreports) provides an HTTP based protocol for structured search in distributed databases and object oriented document retrieval. It creates an index from bibliographic description files in RFC-1357 format that can be searched on each server. Similar to WAIS, there is a master index of servers that can be used to forward queries to the appropriate sites. Documents can then be retrieved in a variety of formats. Recently, full text search has also been supported, using either the SMART search engine (which provides a look and feel similar to WAIS) or WAIS itself.

4.2 Full Text Indexers

WAIS

The WAIS (Wide Area Information Servers) system allows full text search in a variety of databases distributed on the network. A single directory of servers lists the available WAIS indexes. Users can select appropriate servers and pose a query to them.
Found items are presented with a relevance rating that depends on the number of occurrences of the query keywords in the document. A major drawback of WAIS concerning data distribution is that, for a query to be answered, not only the index but also the underlying database must be accessed. Therefore data and index information are kept in the same location. In addition, WAIS indexes are about the size of the indexed data or even larger, in order to provide fast search. There are different implementations of WAIS available, some of them supporting Boolean expressions and date search.

FFW

FFW (Freetext Search for the Web) is a full text indexing system that focuses on HTML documents. One of its advantages over WAIS is that it generates a "self-contained" index: only the index data is needed to answer a query. It provides a means of merging existing smaller indexes into larger ones and of distributing queries among indexes. Index size is around 30% of dataset size. Only simple queries (but including expression grammar, word truncation and date search) are supported.

GlimpseHTTP

Glimpse is usually the underlying indexing mechanism for Harvest object descriptions in a Broker. It can also be used standalone, with some small extensions, to provide full text search on HTML documents. As with WAIS, Glimpse needs to access the indexed data to satisfy a query, but it allows index size to be reduced to around 7% of data size at the expense of access speed. Glimpse supports a wider set of queries, including matching that tolerates spelling errors and regular expression matching.

5 Evaluation

The description based indexers presented here are very different in scope and implementation. While Aliweb has the appearance of an ad hoc solution to the resource location problem, the others are more thoroughly designed and allow for hierarchical index arrangement (Harvest) or query distribution (DIENST), easy expansion (both) and abstraction from files (DIENST).
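
Question Q1 can be made a little more tangible with the index-size ratios quoted in Section 4.2 (WAIS about the size of the data, FFW around 30%, Glimpse around 7%). The following is only a back-of-the-envelope sketch, not part of any of the tools discussed: the dataset size, the traffic per query and all function names are assumptions chosen for illustration. It estimates the index size per system and, for a self-contained index, the number of remote queries whose traffic equals one transfer of the index.

```python
# Rough model for Q1: index distribution cost versus query transfer cost.
# The ratios are the approximate figures quoted in the text; the dataset
# size and bytes per query used below are assumed, illustrative values.

INDEX_RATIO = {
    "WAIS": 1.00,     # index "around the size of indexed data or even larger"
    "FFW": 0.30,      # "around 30% of dataset size"
    "Glimpse": 0.07,  # "around 7% of data size"
}

# Only a self-contained index can answer a query without the original data.
SELF_CONTAINED = {"WAIS": False, "FFW": True, "Glimpse": False}


def index_size(data_bytes: int, system: str) -> int:
    """Estimated index size in bytes for the given system."""
    return round(data_bytes * INDEX_RATIO[system])


def break_even_queries(data_bytes: int, system: str, bytes_per_query: int) -> int:
    """Number of remote queries whose traffic equals one index transfer.

    Index updates and compression are ignored, so this is a rough estimate.
    """
    if not SELF_CONTAINED[system]:
        raise ValueError(f"a {system} index cannot answer queries without the data")
    # Ceiling division: smallest query count that reaches the index size.
    return -(-index_size(data_bytes, system) // bytes_per_query)


if __name__ == "__main__":
    data = 500 * 1024 * 1024   # assumed: 500 MB of documents
    per_query = 20 * 1024      # assumed: 20 KB per request plus result page
    for name in INDEX_RATIO:
        print(f"{name:8s} index ~ {index_size(data, name) / 2**20:6.1f} MB")
    print("FFW break-even:", break_even_queries(data, "FFW", per_query), "queries")
```

With these assumed inputs the FFW index comes to 150 MB and the break-even point to 7680 queries. Real figures depend on update frequency and on the per-volume tariff, which is why the choice between mirroring and remote querying is best left open for as long as possible.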

Concerning full text retrieval, WAIS is the most common and general system. FFW and GlimpseHTTP are more lightweight solutions which focus on WWW servers. They both have their individual advantages (e.g. FFW handles the full ECMA Latin-1 character set and provides a self-contained index; GlimpseHTTP is very undemanding of disk space). They lack a mechanism to build larger indexes from existing ones as WAIS (virtually) does.

--------------------------------------------------------------------------------
        explicit data  indexed objects  phys. location  levels of  query
        description                     of index        hierarchy  execution
        needed
================================================================================
Aliweb  yes            sites, files,    centralised     1          on master
                       services                                    or mirror
--------------------------------------------------------------------------------
Harvest no (generated  files            arbitrary       n          on master
        by essence)                                                or replica
--------------------------------------------------------------------------------
DIENST  yes            documents        distributed     2          distributed
                                        on sites
--------------------------------------------------------------------------------
WAIS    no             files            distributed     2          distributed
                                        on data sites
--------------------------------------------------------------------------------
FFW     no             HTML files       arbitrary       1          on index loc.
--------------------------------------------------------------------------------
Glimpse no             HTML files       on data site    1          local HTTP
--------------------------------------------------------------------------------

The table shows some key characteristics of the six indexing tools discussed in this paper. Since the scope is HTTP retrieval within one enterprise, Aliweb must be dropped. WAIS would be the preferable choice if it were already in use as a retrieval system within the company. Since it is not, the other, more WWW oriented packages can be considered.
DIENST contains some very good ideas about handling different data formats of the same document, but is currently limited to documents for which bibliographic descriptions are available. It is a very promising approach for on-line library services and presents a friendly user interface.

For the final decision, a look at the priority list might help: as stated above, network load is the major concern. It is difficult to estimate that parameter for the different solutions. Statements like "can reduce [...] network traffic by a factor of 59" (from [2]) must be treated with care. A system that is flexible enough to allow the distribution policy to be decided while it is in use would be preferable. This makes Harvest the most promising solution. Once a gatherer is running locally for every resource, brokers can be set up at any location.

For full text indexes, Harvest produces too much overhead: in its present implementation a (compressed) SOIF object containing the full document must be stored by the gatherer, transferred to a broker and indexed. The index could then be replicated. The same functionality could be achieved with FFW, using standard mirroring for the index. GlimpseHTTP does not meet this requirement, as its index is not self-contained. Only Harvest and FFW make it possible to give a "flexible response" to questions Q1 and Q2 for the desired application.

6 Application

Harvest is used for indexing the contents of the internal and external WWW servers. Several major bugs showed up with Version 1.0; they disappeared with the recent upgrade to V1.1. Administration of the Harvest system, which was very complicated with V1.0, is also more straightforward in the new version. For full text index generation on the product database, FFW will be used.

7 The Future

For further development, it would be desirable to integrate index generation with revision control. At present, revision control is the missing element in providing WWW documents.
It should be integrated within the server - as proposed by the HTTP protocol - as a handler for the PUT and DELETE methods. Once this step has been taken, forming a database from an ugly heap of files, some of the former indexing problems will have disappeared. When the versioning system detects a change, it could initiate an incremental index update, thus removing the need to process the whole database at regular intervals. The changes in the index can then be propagated to registered sites as delta information, minimising network load.

This is comparable to the transition from procedural to event driven programming. Instead of running batch jobs in the night (when is "night" in a world-wide web anyway?) to update an index, recently developed on-line index construction algorithms must be used[7]. This form of version tracking and change propagation will also help to solve the notification problem as described in [8].

References

[1] Koster, Martijn: Welcome to ALIWEB. On-line document (http://web.nexor.co.uk/public/aliweb/aliweb.html) Nexor, UK, February 1995.
[2] Hardy, Darren R. and Michael F. Schwartz: Harvest User's Manual, Version 1.1. Technical Report CU-CS-743-94, University of Colorado at Boulder, February 1995.
[3] Davis, James R. and Carl Lagoze: A protocol and server for a distributed digital technical report library. (http://cs-tr.cs.cornell.edu/Server/TR/CORNELLCS:TR94-1418) Cornell University, April 1994.
[4] Marshall, Peter: WAIS: The Wide Area Information Server or Anonymous What???. (ftp://ftp.wais.com/pub/wais-doc/UWO-wais-paper.ps) University of Western Ontario, June 1992.
[5] Hfjeld, Brd: FFW - Freetext search for the Web. On-line document (http://www.nta.no/produkter/ffw/ffw.html) Telenor Research, Norway, February 1995.
[6] Klark, Paul: GLIMPSE, A tool to search entire file systems. On-line document (http://glimpse.cs.arizona.edu:1994/) University of Arizona, February 1995.
[7] Srinivasan, V and Michael J. Carey: On-line Index Construction Algorithms. (http://www.cs.wisc.edu/TR/UWMADISONCS:CS-TR-91-1008) University of Wisconsin, February 1991.
[8] Notification of new material. On-line document (http://info.cern.ch/hypertext/WWW/DesignIssues/Notification.html) CERN, October 1993.

----------------------------------