Workshop A: Web-wide Indexing/Semantic Header or Cover Page
Chair: Bipin C. Desai, Brian Pinkerton
Choosing an Indexing Strategy in an Enterprise Environment
Christian Kuhnert, February 27th 1995
(kuhnert@welfa5.elektro.uni-wuppertal.de)
http://welfad.uni-wuppertal.de/people/kuhnert.e.html
Abstract
This paper describes the requirements for setting up an indexing system for WWW
services within a company that operates world-wide. Six systems for description
based and full text based information retrieval, namely Aliweb, Harvest, DIENST,
WAIS, FFW and GlimpseHTTP, are discussed and compared. After an appropriate
solution has been selected, some remarks on desirable future development are
made.
1 Situation
Siemens Nixdorf Informationssysteme AG (SNI) is the largest European
manufacturer of Midrange Systems (*NIX Systems). Other important business areas
are Mainframes (BS2000) and POS (Point Of Sale) equipment. The company has
roughly 40,000 employees world-wide. Currently the majority of customers reside
within Europe. Data exchange between plants and establishments is carried out
through a world-wide corporate network. For communication with subsidiaries,
partner companies and customers, Internet connections are becoming increasingly
important. Most of the current data paths are charged by volume.
2 Targets
Possible internal application areas for WWW are currently being evaluated. One
option under consideration is to provide Internet connectivity to employees
primarily through a WWW interface that integrates most services in a
user-friendly manner. Today there are many internal database and information
systems, each with its own user interface. Gateways to these systems are being
built. In one special case, migration from a proprietary system to HTTP is being
considered. This system mainly provides product information to sales executives
and contains a full text index. Representing this index and its query interface
is the main problem here. Another field of application is querying internal
library catalogues. In parallel, WWW services for the corporate presence are
being built. Here indexing might be handled in much the same way as on other
public WWW servers.
3 Priorities
Currently the limiting factors for any implementation are, in order of
importance:
1. Network cost
2. Storage cost
3. Computation cost
This may not be specific to the SNI environment, since in recent years cost
reductions have generally taken place in the reverse order. From the user's
perspective, the main goals for each service are:
1. Quality of Service (QOS: Availability, Responsiveness)
2. Ease of use (for the client as well as for system administration)
3. Currency of information (treated here separately from QOS)
There is an inherent conflict between the top items of the two lists above: QOS
is limited by network quality, which is almost proportional to cost. Reduced to
indexing and retrieval, the main questions are:
Q1: Index distribution cost versus query transfer cost
Q2: Local computation versus WAN access
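Question Q1 can be made concrete with a back-of-the-envelope model. The Python
sketch below is illustrative only: the function names and every figure passed
to them are assumptions, not measurements from the SNI network. It compares the
volume moved over the WAN when an index is replicated to a remote site with the
volume moved when that site sends all its queries to a central index.

```python
def replication_volume(index_bytes, updates_per_month):
    """Bytes shipped per month to keep one remote replica of the index current.

    Assumes (pessimistically) that every update ships the whole index.
    """
    return index_bytes * updates_per_month


def remote_query_volume(queries_per_month, bytes_per_query):
    """Bytes shipped per month when a site sends all queries to a central index."""
    return queries_per_month * bytes_per_query


def cheaper_policy(index_bytes, updates_per_month, queries_per_month, bytes_per_query):
    """Return 'replicate' or 'remote', whichever moves fewer bytes over the WAN."""
    replicate = replication_volume(index_bytes, updates_per_month)
    remote = remote_query_volume(queries_per_month, bytes_per_query)
    return "replicate" if replicate < remote else "remote"
```

With volume-based charging, the break-even point depends on how often the index
changes and how heavily a site queries it, so the answer may differ from site
to site.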
4 Review of Common Methods for WWW Server Index Generation
Besides robot based gathering from information sources (as done by Lycos, WWWW,
RBSE Spider and WebCrawler), the most frequently used system with server side
support is ALIWEB[1]. A newer and more general approach is followed by the
HARVEST[2] system. In short, HARVEST trades network load for storage cost, a
reasonable choice given the priorities above. The DIENST[3] protocol focuses on
the distribution of academic papers. It relies on bibliographic descriptions in
RFC-1357 format and distributes queries to multiple locally maintained
databases. None of these systems supports full text indexing on its own (Harvest
could, but would lose its advantages; DIENST does, but relies on WAIS[4] for the
implementation); therefore WAIS must be considered as well. Finally, there are
specialised full text indexers that support HTML: FFW[5] and GlimpseHTTP[6].
4.1 Description Based Indexers
ALIWEB
Aliweb is based on description files that are collected at regular intervals and
then combined into a searchable index. It is the information provider's
responsibility to compile and update this description file which follows a
common standard. This can be done manually or using some information extraction
tool. Aliweb currently depends on one master server for gathering data. The
index is then mirrored.
HARVEST
Harvest essentially can be broken up into two main components called "Gatherer"
and "Broker". The Gatherer extracts object descriptions from files of known type
(besides HTML this includes binary, some graphics formats, etc.) and exports
them via its TCP port. A proprietary, structured format called SOIF (Summary
Object Interchange Format) has been defined to exchange these descriptions.
The Broker connects to one or more Gatherers (or other Brokers) to collect
this data (optionally compressed with gzip) and build an index. It then accepts
query requests by listening on its own TCP port. Structured queries using
Boolean expressions and fielded search are provided.
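To give an impression of what a Gatherer exports, the following Python sketch
serialises one object description in a SOIF-like layout. It is based on a
simplified reading of the format described in the Harvest manual [2] (a
template type, a URL, and attributes whose value size in bytes is given in
braces); the function name to_soif and the example attribute are assumptions
made for illustration, and details of the real format may differ.

```python
def to_soif(template_type, url, attributes):
    """Serialise one object description in a simplified SOIF-like layout.

    attributes maps attribute names to string values. Each value is
    prefixed with its size in bytes so a reader can skip over it.
    """
    lines = [f"@{template_type} {{ {url}"]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

For example, to_soif("FILE", "http://example.com/a.html", {"Title": "Hello"})
yields a three-line record whose middle line carries the attribute name, the
size 5 and the value.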
DIENST
DIENST (which stands for Distributed Interactive Extensible Network Server for
Techreports) provides an HTTP based protocol for structured search in
distributed databases and object oriented document retrieval. It creates an
index from bibliographic description files in RFC-1357 format that can be
searched on each server. Similar to WAIS there is a master index of servers that
can be used to forward queries to the appropriate sites. Documents can then be
retrieved in a variety of formats. Recently, full text search has also become
available, using either the SMART search engine (which provides a look and feel
similar to WAIS) or WAIS itself.
4.2 Full Text Indexers
WAIS
The WAIS (Wide Area Information Servers) system allows for full text search in a
variety of databases, distributed on the network. A single directory of servers
lists available WAIS indexes. Users can select appropriate servers and pose a
query to them. Matching items are presented with a relevance rating that depends
on the number of occurrences of the query keywords in the document. A major
drawback of WAIS with respect to data distribution is that answering a query
requires access not only to the index but also to the underlying database.
Therefore data and index information are kept in the same location. In addition,
WAIS indexes are about the same size as the indexed data, or even larger, in
order to provide fast search. Different implementations of WAIS are available,
some of them supporting Boolean expressions and date search.
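The relevance rating described above can be illustrated by plain occurrence
counting. The Python sketch below is not the WAIS algorithm itself (real
implementations may weight terms differently); the function name and the
tokenisation rule are assumptions chosen for brevity.

```python
import re
from collections import Counter


def rank_by_occurrences(query, documents):
    """Rank documents by how often the query keywords occur in them.

    documents maps a document id to its text. Returns a list of
    (doc_id, score) pairs, highest score first; documents without
    any keyword occurrence are omitted.
    """
    keywords = re.findall(r"\w+", query.lower())
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"\w+", text.lower()))
        score = sum(counts[keyword] for keyword in keywords)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, then by id to keep the order stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```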
FFW
FFW (Freetext Search for the Web) is a fulltext indexing system that focuses on
HTML documents. One of its advantages over WAIS is that it generates a
"self-contained" index: only the index data is needed to answer a query. It
provides a means of merging existing smaller indexes into larger ones and of
distributing queries among indexes. Index size is around 30% of the size of the
data set. Only simple queries (which nevertheless include an expression grammar,
word truncation and date search) are supported.
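Why a self-contained index can be merged and mirrored freely is easy to see
with a toy inverted index. The Python sketch below does not reproduce FFW's
actual index format; the names build_index, merge_indexes and search are
hypothetical. Because each index maps words directly to document identifiers,
two indexes can be combined without touching the documents again.

```python
def build_index(documents):
    """Build a word -> set of document ids mapping from id -> text."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def merge_indexes(*indexes):
    """Merge several indexes into a new one; the inputs are left unchanged."""
    merged = {}
    for index in indexes:
        for word, doc_ids in index.items():
            merged.setdefault(word, set()).update(doc_ids)
    return merged


def search(index, word):
    """Answer a single-word query from the index alone."""
    return sorted(index.get(word.lower(), set()))
```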
GlimpseHTTP
Glimpse is usually the underlying indexing mechanism for Harvest object
descriptions in a Broker. It can also be used standalone with some small
extensions to provide full text search on HTML documents. As with WAIS, Glimpse
needs to access the indexed data to satisfy a query, but allows index size to be
reduced to around 7% of data size at the expense of access speed. Glimpse
supports a wider set of queries, including approximate matching that tolerates
spelling errors and regular expression matching.
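The trade between index size and access speed can be illustrated by a coarse
index that records only in which block of files a word occurs; the files of
the candidate blocks are then scanned to answer the query. The Python sketch
below is a simplified in-memory illustration of that general idea, not
Glimpse's actual data structures, and all names in it are assumptions.

```python
def build_block_index(files, block_size=2):
    """Group files into blocks and map each word to the blocks containing it.

    files maps a file name to its text. Returns (blocks, index) where
    blocks is a list of lists of file names.
    """
    names = sorted(files)
    blocks = [names[i:i + block_size] for i in range(0, len(names), block_size)]
    index = {}
    for block_no, block in enumerate(blocks):
        for name in block:
            for word in files[name].lower().split():
                index.setdefault(word, set()).add(block_no)
    return blocks, index


def search(word, files, blocks, index):
    """Look up candidate blocks in the index, then scan their files."""
    word = word.lower()
    hits = []
    for block_no in sorted(index.get(word, ())):
        for name in blocks[block_no]:
            # The data itself must be read here: the index is not self-contained.
            if word in files[name].lower().split():
                hits.append(name)
    return hits
```

Larger blocks make the index smaller but force more data to be scanned per
query, which is why such an index has to reside on the data site.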
5 Evaluation
The presented description based indexers are very different in scope and
implementation. While Aliweb has the appearance of an ad hoc solution to the
resource location problem, the others are more thoroughly designed and allow
for hierarchical index arrangement (Harvest) or query distribution (DIENST),
easy expansion (both) and abstraction from files (DIENST).
Concerning full text retrieval, WAIS is the most common and general system. FFW
and GlimpseHTTP are more lightweight solutions which focus on WWW servers. They
both have their individual advantages (e.g. FFW handles the full ECMA Latin-1
character set and provides a self-contained index; GlimpseHTTP is very
undemanding in terms of disk space). They lack a mechanism to build larger
indexes from existing ones as WAIS (virtually) does.
--------------------------------------------------------------------------------
         explicit data  indexed objects  phys. location levels of  query
         description                     of index       hierarchy  execution
         needed
================================================================================
Aliweb   yes            sites, files,    centralised    1          on master
                        services                                   or mirror
--------------------------------------------------------------------------------
Harvest  no (generated  files            arbitrary      n          on master
         by Essence)                                               or replica
--------------------------------------------------------------------------------
DIENST   yes            documents        distributed    2          distributed
                                         on sites
--------------------------------------------------------------------------------
WAIS     no             files            distributed    2          distributed
                                         on data sites
--------------------------------------------------------------------------------
FFW      no             HTML files       arbitrary      1          on index loc.
--------------------------------------------------------------------------------
Glimpse  no             HTML files       on data site   1          local
HTTP
--------------------------------------------------------------------------------
The table shows some key characteristics of the six indexing tools discussed in
this paper.
Since the scope is HTTP retrieval within one enterprise, Aliweb must be dropped.
WAIS would be the preferable choice if it were already in use as a retrieval
system within the company. Since it is not, the other, more WWW oriented
packages can be considered. DIENST contains some very good ideas about handling
different data formats of the same document, but is currently limited to
documents for which bibliographic descriptions are available. It is a very
promising approach for on-line library services and presents a friendly user
interface.
For the final decision, a look at the priority list might help: as stated above,
network load is the major concern. It is difficult to estimate that parameter
for the different solutions. Statements like "can reduce [...] network traffic
by a factor of 59" (from [2]) should be treated with care. A system that is
flexible enough to allow the distribution policy to be decided while it is in
use would be preferable. This makes Harvest the most promising solution. Once a
gatherer is running locally for every resource, brokers can be set up at any
location.
For full text indexes, Harvest produces too much overhead: in its present
implementation a (compressed) SOIF object containing the full document must be
stored by the gatherer, transferred to a broker and indexed there. The index
could then be replicated. The same functionality can be achieved with FFW,
using standard mirroring for the index. GlimpseHTTP does not meet this
requirement, as its index is not self-contained. Only Harvest and FFW make it
possible to give a "flexible response" to questions Q1 and Q2 for the desired
application.
6 Application
Harvest is used for indexing the contents of the internal and external WWW
servers. Version 1.0 showed several major bugs, which disappeared with the
recent upgrade to V1.1. Administration of the Harvest system, which was very
complicated with V1.0, is also more straightforward in the new version. For full
text index generation on the product database, FFW will be used.
7 The Future
For further development, it would be desirable to integrate index generation
with revision control. At present, revision control is the missing element in
providing WWW documents. It should be integrated into the server, as proposed
by the HTTP protocol, as a handler for the PUT and DELETE methods. Once this
step has been taken, turning an ugly heap of files into a database, some of the
former indexing problems will have disappeared.
When the versioning system detects a change, it could initiate an incremental
index update, thus removing the need to process the whole database at regular
intervals. The changes in the index then can be propagated to registered sites
as delta information, minimising network load. This is comparable to the
transition from procedural to event driven programming.
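The idea of propagating index changes as delta information can be sketched as
follows. This Python fragment is only an illustration under simple assumptions
(a word to document-id index held in memory, whitespace tokenisation); the
names index_delta and apply_delta are hypothetical and do not refer to any
existing system.

```python
def index_delta(old_text, new_text, doc_id):
    """Compute which index entries a changed document adds and removes."""
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    return {
        "doc": doc_id,
        "add": sorted(new_words - old_words),
        "remove": sorted(old_words - new_words),
    }


def apply_delta(index, delta):
    """Update a word -> set of document ids index in place from one delta."""
    for word in delta["add"]:
        index.setdefault(word, set()).add(delta["doc"])
    for word in delta["remove"]:
        postings = index.get(word)
        if postings is not None:
            postings.discard(delta["doc"])
            if not postings:
                del index[word]  # drop words that no document contains any more
```

Only the delta has to cross the network to each registered site; its size
depends on the size of the change, not on the size of the index.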
Instead of running batch jobs at night (when is "night" in a world-wide web
anyway?) to update an index, recently developed on-line index construction
algorithms must be used[7].
This form of version tracking and change propagation will also help to solve the
notification problem as described in [8].
References
[1] Koster, Martijn: Welcome to ALIWEB. On-line document
(http://web.nexor.co.uk/public/aliweb/aliweb.html), Nexor, UK, February
1995.
[2] Hardy, Darren R. and Michael F. Schwartz: Harvest User's Manual, Version
1.1. Technical Report CU-CS-743-94, University of Colorado at Boulder,
February 1995.
[3] Davis, James R. and Carl Lagoze: A protocol and server for a distributed
digital technical report library.
(http://cs-tr.cs.cornell.edu/Server/TR/CORNELLCS:TR94-1418) Cornell
University, April 1994.
[4] Marshall, Peter: WAIS: The Wide Area Information Server or Anonymous
What???. (ftp://ftp.wais.com/pub/wais-doc/UWO-wais-paper.ps) University
of Western Ontario, June 1992.
[5] Hfjeld, Brd: FFW - Freetext search for the Web. On-line document
(http://www.nta.no/produkter/ffw/ffw.html), Telenor Research, Norway,
February 1995.
[6] Klark, Paul: GLIMPSE, A tool to search entire file systems. On-line
document (http://glimpse.cs.arizona.edu:1994/) University of Arizona,
February 1995.
[7] Srinivasan, V and Michael J. Carey: On-line Index Construction
Algorithms. (http://www.cs.wisc.edu/TR/UWMADISONCS:CS-TR-91-1008)
University of Wisconsin, February 1991.
[8] Notification of new material. On-line document
(http://info.cern.ch/hypertext/WWW/DesignIssues/Notification.html) CERN,
October 1993.