Third International World-Wide Web Conference

Workshop A: Web-wide Indexing/Semantic Header or Cover Page


Chairs: Bipin C. Desai, Brian Pinkerton

Kevin Hughes <kevinh@eit.COM>

Abstract
~~~~~~~~

Enterprise Integration Technologies (EIT) deals with a large number of Internet-savvy clients who desire a fast, simple indexing technology that can take advantage of the data they've provided on their World-Wide Web sites. Because of this need, I have developed SWISH - the Simple Web Indexing System for Humans.

SWISH is a program that is both an indexer and a searcher. It builds inverted indices of keywords and stores them in single index files to which HTML gateways can be made. Because SWISH can recognize HTML tags and entities, it can search internationalized text or search for specific words within particular tags.

SWISH is not meant to be fully featured - its main strength is that it is extremely easy to use and configure. Already EIT has saved money by giving SWISH to clients rather than having to configure, support, and license WAIS to satisfy the same requirements. Many Web sites do not require a complex, industrial-strength indexing solution - they do require something that is easy to use and HTML-aware, and by keeping with this philosophy, I hope SWISH can fulfill this need.

Outline
~~~~~~~

What is SWISH?

SWISH is a generic indexing and search engine that has as its core philosophy ease of use and HTML awareness. For more information, please see:

    http://www.eit.com/software/swish/
    http://www.eit.com/cgi-bin/wwwwais

SWISH is also being used at:

    http://www.xerox.com/
    http://www.city.net/

The Personal Indexer

SWISH is in a class of tools that I call "personal indexers" - these are utilities that allow people to find the information they're looking for at their own site. Glimpse, FFW, and htgrep could be considered to be in this class.

What makes all these tools "personal" is mainly the fact that one doesn't have to be a computer genius to set them up, so ease of use is a definite factor. "Personality" also comes from the ability to learn from one's preferences and the data itself.
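
As a rough illustration of the inverted index and tag-aware search described in the abstract, here is a minimal sketch in Python. It is not SWISH's code or file format: the class and function names (TagAwareIndexer, build_index, search), the in-memory index layout, and the sample documents are all assumptions made for this example. It relies only on the standard library's html.parser, which decodes entities such as &eacute; before the text is tokenized.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

WORD_RE = re.compile(r"\w+")  # Unicode-aware, so "café" is one word
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # tags with no closing tag


class TagAwareIndexer(HTMLParser):
    """Hypothetical helper: collects (word, enclosing tag) pairs from one document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &eacute;
        self.open_tags = []  # stack of currently open tags
        self.postings = []   # (word, tag) pairs seen so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.open_tags[-1] if self.open_tags else "body"
        for word in WORD_RE.findall(data.lower()):
            self.postings.append((word, tag))


def build_index(documents):
    """Map each word to {document name: set of tags the word appeared in}."""
    index = defaultdict(dict)
    for name, html in documents.items():
        parser = TagAwareIndexer()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        for word, tag in parser.postings:
            index[word].setdefault(name, set()).add(tag)
    return index


def search(index, word, tag=None):
    """Return sorted document names containing word, optionally only inside tag."""
    hits = index.get(word.lower(), {})
    return sorted(n for n, tags in hits.items() if tag is None or tag in tags)


# Made-up sample documents, for illustration only.
docs = {
    "a.html": "<html><head><title>Swish notes</title></head>"
              "<body><p>Indexing caf&eacute; menus</p></body></html>",
    "b.html": "<html><head><title>Menus</title></head>"
              "<body><h1>Swish</h1><p>notes</p></body></html>",
}
index = build_index(docs)
print(search(index, "swish"))               # ['a.html', 'b.html']
print(search(index, "swish", tag="title"))  # ['a.html']
print(search(index, "café"))                # ['a.html']
```

Keeping the set of enclosing tags next to each posting is one simple way to answer "find this word, but only in titles"; a real index file would need a compact on-disk encoding, which this sketch leaves out.
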
If indexing programs work more like signal processors and less like word-grepping beasts, it's possible to make them both language and topic independent. Look at Architext, for instance. One nice thing about WAIS is that it can narrow one's searches based on feedback. But Architext is proprietary and WAIS is overkill. We need an open, simple solution, and one doesn't exist yet.

In making such an indexer, one should realize that it would be used to index and search text in many different languages. Thanks to a good deal of international feedback I've been able to make SWISH more language independent - you can define what characters make up words, what certain characteristics of a word are, etc. This feature, it turns out, ends up culling a lot of "garbage" information from index files, shaving about 20% or more off the index file size (a very rough estimate). This simple filter even seems to work well in extracting "real" words from binary files.

More and more people want to index SGML-like (particularly HTML) structured data. Witness the number of people on comp.infosystems.wais complaining that they can't figure out how to index their Web site. One of the greatest promises of HTML was the idea that one would be able to search and find things more easily using structured markup rather than plain old text. So where are all the tools to do this?

I believe that the core code is so small that you could include an index/search program with every server, much in the same way that you find an imagemap program everywhere. After all, most Web sites are about half graphics and other media and only half text. And many Web sites are not large enough to require a full-strength indexer.

Such a well-distributed program would certainly need to communicate with other similar programs, so that meta-indexers (like Harvest or GLOSS) and other users could cull them for information. By putting it on the server side (or on a proxy server) it could be contacted via HTTP.
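
Returning to the configurable word definition mentioned above (which characters make up a word, and what a word must look like), the idea might be sketched as follows. The rules and values here (WORD_CHARS, MIN_LEN, MAX_LEN, VOWELS, MAX_RUN) are hypothetical and chosen only for illustration; they are not SWISH's actual configuration options or defaults.

```python
import re

# Hypothetical rule set; every name and value below is an assumption.
WORD_CHARS = "abcdefghijklmnopqrstuvwxyzåäöéèüñ"  # characters allowed in a word
MIN_LEN, MAX_LEN = 2, 30                          # accepted word lengths
VOWELS = set("aeiouyåäöéèü")                      # a word needs at least one
MAX_RUN = 3                                       # longest allowed run of one character

TOKEN_RE = re.compile("[" + re.escape(WORD_CHARS) + "]+")
RUN_RE = re.compile(r"(.)\1{%d,}" % MAX_RUN)      # same character MAX_RUN + 1 times or more


def extract_words(raw, encoding="latin-1"):
    """Yield candidate words from arbitrary bytes, dropping likely garbage."""
    text = raw.decode(encoding).lower()
    for token in TOKEN_RE.findall(text):
        if not MIN_LEN <= len(token) <= MAX_LEN:
            continue  # too short or too long to be a useful keyword
        if not VOWELS.intersection(token):
            continue  # no vowel: probably not a real word
        if RUN_RE.search(token):
            continue  # long run of one character: probably noise
        yield token


# A made-up byte string mixing binary junk with real words.
raw = b"\x00\x01zzzz index\x00\x00qx search engine xkcd\xff\xfeswish aaaaab"
print(list(extract_words(raw)))  # ['index', 'search', 'engine', 'swish']
```

Changing WORD_CHARS and VOWELS is all it takes to adapt this sketch to another alphabet, which is the sense in which such a filter can be language independent.
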
I very much intend to add any functionality to SWISH that is needed to make it communicative. I look forward to hearing ideas from the rest of the folks at the workshop about how we can all share a common language!

    -- Kevin Hughes

--
Kevin Hughes * kevinh@eit.com
Enterprise Integration Technologies Webmaster (http://www.eit.com/)
Hypermedia Industrial Designer * Duty now for the future!