Workshop A: Web-wide Indexing/Semantic Header or Cover Page
Chair: Bipin C. Desai, Brian Pinkerton
Michael L. Mauldin 1 March 95
Carnegie Mellon University
Pittsburgh, PA 15213-3890
fuzzy@cmu.edu
http://fuzine.mt.cs.cmu.edu/mlm/
The Lycos (tm) Catalog of the Internet is a collection of rich
abstracts of texts available on the World Wide Web. These
abstracts are automatically generated by the Lycos robot, which
continually scans the Web looking for new documents and checking
older documents for changes. As each document is read, a full
abstract is produced. Further, any document referenced by a
hyperlink is given at least a short description.
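
To make the two levels of coverage concrete, here is a minimal sketch in Python. It is not the Lycos robot. It rests on illustrative assumptions: the "full abstract" is simply the first few words of the page text, the "short description" is the anchor text of the referring link, and the names LinkCollector and catalog_page are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect page text and (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # all character data seen on the page
        self.links = []        # (href, anchor text) for each <a href=...>
        self._href = None      # href of the <a> element currently open
        self._anchor = []      # text pieces inside the open <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, " ".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href:
            self._anchor.append(data.strip())

def catalog_page(catalog, url, html, abstract_words=20):
    """Record a full entry for `url` and short entries for pages it links to."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.text_parts).split()
    # Assumption: the "full abstract" is just the leading words of the page.
    catalog[url] = {"kind": "full", "abstract": " ".join(words[:abstract_words])}
    for href, anchor in parser.links:
        # A referenced page gets a short entry only if it has no entry yet.
        catalog.setdefault(href, {"kind": "short", "abstract": anchor})
    return catalog

catalog = catalog_page(
    {},
    "http://example.org/",
    '<p>Hello <a href="http://example.org/b">page B</a> world</p>',
)
```

In this sketch a page that has only been referenced carries a short entry until the robot reads it, at which point a call to catalog_page for that URL replaces it with a full one.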
Although the collection process is primarily automatic, we do provide a
URL registration service to allow users and authors to include documents
in the catalog; further, we provide mechanisms for authors to
remove their documents from the catalog or keep them out in the
first place.
Carnegie Mellon University also provides a query retrieval service against
this catalog, serving over half a million retrievals to over 140,000 users
a week. We are committed to providing ubiquitous global access to this
catalog retrieval service, and are working on licensing arrangements to
ensure its continued availability.
Carnegie Mellon is also committed to scaling up the collection process
to the entire World Wide Web and keeping this collection current
to within a month. We have downloaded 10% of the estimated
Web within the previous 3 months, and have identified two thirds
of the documents by URL. We are scaling up incrementally, continuing
to provide a worldwide retrieval service against the entire catalog.
Since November, we have added new computers to this effort at the
rate of one every two weeks.
We regard common or compatible data and index formats as critical issues
for this workshop. A minimum level of interoperability would cover the
data format for document and abstract representation. Defining a common
format for indexes (inverted files) is more difficult, since these
are often specialized and in some cases proprietary. For example,
Lycos/Pursuit uses word position information not stored in WAIS indexes.
Some WWW search engines match regular expressions against the whole
document collection and do not need inverted files.
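
The difference between these index styles can be illustrated with a small Python sketch. It is not the Lycos/Pursuit or WAIS format. The function names, the tokenization rule, and the sample documents are assumptions chosen for illustration. The positional index stores the word offsets needed for an adjacency (phrase) query, and the regular-expression scan answers the same query with no index at all.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Inverted file with positions: word -> {doc_id: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents containing the phrase words adjacently."""
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must occur at offset start + i.
            if all(start + i in index[w].get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

def regex_scan(docs, pattern):
    """Index-free alternative: match a regular expression against every document."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if rx.search(text))

docs = {
    "a": "The Lycos robot scans the Web",
    "b": "The Web robot exclusion rules",
}
index = build_positional_index(docs)
phrase_hits = phrase_search(index, "web robot")
scan_hits = regex_scan(docs, r"web\s+robot")
```

Both sample documents contain "web" and "robot", so an index that records only which documents contain each word cannot tell them apart for this query. The stored positions, or a scan of the full text, can. A common index format would therefore have to decide whether such positional data is required, optional, or out of scope.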