Third International World-Wide Web Conference

Workshop A: Web-wide Indexing/Semantic Header or Cover Page


Chairs: Bipin C. Desai, Brian Pinkerton

Berkeley Search Engine (BSE)


Paul Gauthier
gauthier@cs.berkeley.edu
http://http.cs.berkeley.edu/~gauthier/

Introduction

A number of Internet/WWW search facilities are currently available, and all of them suffer under tremendous load. The popularity of and demand for such services are driving up response times and failure rates. The Berkeley Search Engine is an attempt to cope with these problems by implementing a parallel, fault-tolerant, and scalable server. This new service will run on a network of workstations (NOW), utilizing the CPUs, memory, and disks of a large collection of workstations. The ultimate goal of this research is to identify and build the tools and framework needed to construct generally useful parallel servers, with fault tolerance and incremental scalability as the primary requirements.

Details

An Internet search engine is a particularly good application for exploring the needs of parallel server development. Existing search services offer evidence of very high demand, which a parallel server could potentially satisfy. It is important that NOW servers be capable of operating with existing protocols and software through standard communication mechanisms. Building a server that conforms to HTTP will provide a diverse collection of potential clients from many platforms. HTTP in particular is fairly well suited as an initial test project for NOW server development: the protocol is simple and stateless, and it allows for some interesting options for dynamic load balancing.

The BSE server will be a collection of workstations joined by a high-speed ATM interconnect. The server as a whole must have a single external contact point, so that a URL can be published for one machine that does not change over time. The purpose of this front-end machine will be to redirect incoming query requests among the workstations that actually conduct the searches. The front end will monitor the status of the workstations and redirect queries to those with the lightest loads. It should have little trouble attaining very high connection throughput, because its task is simple and each connection is very short.
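
The front-end behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not the BSE implementation: the Worker record, the numeric load value, the host names, and the use of an HTTP 302 response with a Location header as the redirection mechanism are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    host: str    # hypothetical worker hostname
    load: float  # most recently reported load; lower means lighter


def pick_lightest(workers):
    """Return the worker with the smallest reported load."""
    if not workers:
        raise ValueError("no workers available")
    return min(workers, key=lambda w: w.load)


def redirect_response(workers, path_and_query):
    """Build a raw HTTP redirect sending the client to the lightest worker.

    Assumption: redirection is done with a 302 status and a Location
    header; the original text does not specify the mechanism.
    """
    target = pick_lightest(workers)
    location = f"http://{target.host}{path_and_query}"
    return (
        "HTTP/1.0 302 Moved Temporarily\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )
```

Because the front end only chooses a target and emits a few header lines, each connection can be closed almost immediately, which is consistent with the short connection durations expected above.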

The workstations will maintain collective fault-tolerant data structures to aid in query processing, and will stripe the database across their disks. In response to varying traffic load, the size of the workstation pool can be changed by releasing workstations or by acquiring idle workstations from the NOW. Aggressive cooperative caching techniques and fault-tolerant distributed data structures will together produce a highly efficient database query system.
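
As a rough illustration of striping and of what a change in pool size implies, the sketch below assigns database blocks to workstations round-robin and counts how many blocks change owner when the pool is resized. The block numbering, the node names, and the round-robin rule itself are assumptions made for the example; the text does not state which striping scheme BSE uses, and the sketch omits fault tolerance and caching entirely.

```python
def owner(block, nodes):
    """Return the node holding a block under round-robin striping."""
    if not nodes:
        raise ValueError("need at least one node")
    return nodes[block % len(nodes)]


def stripe(num_blocks, nodes):
    """Map each node to the list of block numbers stored on its disk."""
    layout = {node: [] for node in nodes}
    for block in range(num_blocks):
        layout[owner(block, nodes)].append(block)
    return layout


def blocks_moved(num_blocks, old_nodes, new_nodes):
    """Count blocks whose owner changes when the pool is resized."""
    return sum(
        1
        for block in range(num_blocks)
        if owner(block, old_nodes) != owner(block, new_nodes)
    )
```

For example, stripe(6, ["n0", "n1", "n2"]) places blocks 0 and 3 on n0, 1 and 4 on n1, and 2 and 5 on n2. Releasing n2 moves four of the six blocks under this naive rule, which suggests why a scheme that limits data movement would matter for a pool that grows and shrinks with traffic.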

Implementation Status

At this point the software to build and maintain the database structure is complete, as is a highly optimized query kernel. A parallel implementation on a NOW is currently underway, using the Split-C language and non-fault-tolerant distributed data structures. Implementation of fault-tolerant distributed data structures is also underway, building directly on top of the Global Unix (GLUnix) layer of the NOW project.

Database Content

A successful Internet search service depends on two parts: a fast search engine and a rich database. As well as the ability to perform efficient searches, the service requires a database that is rich and has wide coverage of the information resources of the Internet. Our research is concerned with the former of these two needs: producing a search solution that scales to high user loads and large databases.

Population of the database with documents from the WWW, FTP sites and other network sources has not yet begun. It is our hope that projects such as the Harvest system and other established search systems will begin to make their database content available for exchange. The task of crawling the web for content is best done by a small number of parties who can coordinate their activities and reduce impact on HTTP servers and network load.

Acquiring shared data would permit our efforts to be more closely focused on the task of building scalable servers. A division of effort between indexing research and data collection research would be beneficial to both groups.

About this document ...



The translation was initiated by Paul_A Gauthier on Thu Mar 23 16:07:11 PST 1995

