Workshop A: Web-wide Indexing/Semantic Header or Cover Page

Chairs: Bipin C. Desai, Brian Pinkerton

Michael L. Mauldin
1 March 95
Carnegie Mellon University
Pittsburgh, PA 15213-3890
fuzzy@cmu.edu
http://fuzine.mt.cs.cmu.edu/mlm/

The Lycos (tm) Catalog of the Internet is a collection of rich abstracts of texts available on the World Wide Web. These abstracts are generated automatically by the Lycos robot, which continually scans the Web looking for new documents and checking older documents for changes. As each document is read, a full abstract is produced. In addition, any document referenced by a hyperlink is given at least a short description.

Although the collection process is primarily automatic, we provide a URL registration service that lets users and authors include documents in the catalog. We also provide mechanisms for authors to remove their documents from the catalog or to keep them out in the first place.

Carnegie Mellon University also provides a query retrieval service against this catalog, serving over half a million retrievals to over 140,000 users a week. We are committed to providing ubiquitous global access to this catalog retrieval service, and are working on licensing arrangements to ensure its continued availability.

Carnegie Mellon is also committed to scaling up the collection process to the entire World Wide Web and keeping the collection current to within a month. Within the previous 3 months we have downloaded 10% of the estimated Web, and we have identified two thirds of the documents by URL. We are scaling up incrementally while continuing to provide a worldwide retrieval service against the entire catalog. Since November, we have added new computers to this effort at the rate of one every two weeks.

We believe that common or compatible data and index formats are critical issues for this workshop. A minimum level of interoperability would cover the data format for document and abstract representation. Defining a common format for indexes (inverted files) is more difficult, since these are often specialized and in some cases proprietary. For example, Lycos/Pursuit uses word position information that is not stored in WAIS indexes. Some WWW search engines match regular expressions against the whole document collection and do not need inverted files at all.
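To make the index-format issue concrete, the following is a minimal Python sketch of the three approaches mentioned above: an inverted file that records only which documents contain a word, an inverted file that also records word positions, and a regular-expression scan that uses no index at all. It is an illustration only. It does not reproduce the actual Lycos/Pursuit or WAIS formats, and the function names, the tokenizer, and the two sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Build an inverted file mapping word -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def and_search(index, query):
    """Position-free lookup: documents containing every query word, in any order."""
    words = tokenize(query)
    if not words or any(w not in index for w in words):
        return set()
    result = set(index[words[0]])
    for w in words[1:]:
        result &= set(index[w])
    return result


def phrase_search(index, phrase):
    """Positional lookup: documents where the query words occur consecutively."""
    words = tokenize(phrase)
    hits = set()
    for doc_id in and_search(index, phrase):
        for start in index[words[0]][doc_id]:
            # Word i of the phrase must appear at position start + i.
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


def regex_scan(docs, pattern):
    """No inverted file: match a regular expression against every document."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {doc_id for doc_id, text in docs.items() if rx.search(text)}


# Hypothetical sample collection.
DOCS = {
    "doc1": "The Lycos robot scans the Web for new documents.",
    "doc2": "New robot documents: the Web scans for Lycos.",
}

INDEX = build_positional_index(DOCS)
print(sorted(and_search(INDEX, "new documents")))     # both documents contain both words
print(sorted(phrase_search(INDEX, "new documents")))  # only doc1 has them adjacent
print(sorted(regex_scan(DOCS, r"robot\s+scans")))     # only doc1 matches
```

The phrase query can only be answered because positions are stored. An index that records document membership alone cannot tell the two sample documents apart, which is one reason a single interchange format for indexes is harder to agree on than a format for documents and abstracts.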