Workshop A: Web-wide Indexing/Semantic Header or Cover Page
Chairs: Bipin C. Desai, Brian Pinkerton
Kevin Hughes
Abstract
~~~~~~~~
Enterprise Integration Technologies (EIT) deals with a large
number of Internet-savvy clients who desire a fast, simple indexing
technology that can take advantage of the data they've provided on
their World-Wide Web sites. Because of this need, I have developed
SWISH - the Simple Web Indexing System for Humans.
SWISH is a program that is both an indexer and a searcher.
It builds inverted indices of keywords and stores them in single
index files to which HTML gateways can be made. Because SWISH can
recognize HTML tags and entities, we can allow it to search
internationalized text or search for specific words within particular
tags.
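As a rough illustration of the idea (a minimal Python sketch, not
SWISH's actual code, which may work quite differently), the following
builds an inverted index that remembers which HTML tags each word
appeared in. The names TagAwareIndexer, build_index, and search are
made up for this example, and the index is kept in memory rather than
written to an index file.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accented ones included
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # tags with no content


class TagAwareIndexer(HTMLParser):
    """Collects (word, enclosing tag) pairs from one HTML document."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters
        super().__init__(convert_charrefs=True)
        self.open_tags = []
        self.postings = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # close the innermost matching tag and anything left open inside it
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        for word in WORD_RE.findall(data):
            # record the word under every tag that currently encloses it
            for tag in self.open_tags or [""]:
                self.postings.append((word.lower(), tag))


def build_index(documents):
    """Map each word to {document name: set of tags it appeared in}."""
    index = defaultdict(lambda: defaultdict(set))
    for name, html in documents.items():
        parser = TagAwareIndexer()
        parser.feed(html)
        parser.close()
        for word, tag in parser.postings:
            index[word][name].add(tag)
    return index


def search(index, word, tag=None):
    """Return sorted document names containing word, optionally inside tag."""
    hits = index.get(word.lower(), {})
    return sorted(n for n, tags in hits.items() if tag is None or tag in tags)
```

With an index built from a dictionary of file names and HTML strings,
search(index, "tea") would list every document containing the word,
while search(index, "tea", tag="h1") would list only those where it
appears inside a first-level heading.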
SWISH is not meant to be fully-featured - its main strength
is that it is extremely easy to use and configure. Already EIT has
saved money by giving SWISH to clients rather than having to configure,
support, and license WAIS to satisfy the same requirements. Many Web
sites do not require a complex, industrial-strength indexing solution -
they do require something that is easy to use and HTML-aware, and
by keeping with this philosophy, I hope SWISH can fulfill this need.
Outline
~~~~~~~
What is SWISH?
SWISH is a generic indexing and search engine that has as its
core philosophy ease of use and HTML awareness. For more information,
please see:
http://www.eit.com/software/swish/
http://www.eit.com/cgi-bin/wwwwais
SWISH is also being used at:
http://www.xerox.com/
http://www.city.net/
The Personal Indexer
SWISH is in a class of tools that I call "personal indexers" -
these are utilities that allow one to find information that they're
looking for at their own site. Glimpse, FFW, and htgrep could be
considered to be in this class. What makes all these tools "personal"
is mainly the fact that one doesn't necessarily have to be a computer
genius to set them up. So ease of use is a definite factor.
"Personality" also comes from the ability to learn from one's
preferences and the data itself. If indexing programs work more like
signal processors and less like word-grepping beasts, it's possible
to make indexing programs both language and topic independent. Look
at Architext, for instance. One nice thing about WAIS is that it can
narrow one's searches based on feedback. But Architext is proprietary
and WAIS is overkill. We need an open, simple solution that doesn't
exist yet.
In making such an indexer, one should realize that it
would be used to index and search text in many different languages.
Thanks to a good deal of international feedback I've been able to
make SWISH more language independent - you can define what characters
make up words, what certain characteristics of a word are, etc. This
feature, it turns out, ends up culling a lot of "garbage" information
from index files, shaving about 20% or more off the index file size
(a very rough estimate). This simple filter even seems to work well in
extracting "real" words from binary files.
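To make the idea of a configurable word filter concrete, here is a
small Python sketch. Everything in it is an assumption made for
illustration: the settings WORD_CHARS, MIN_LENGTH, MAX_LENGTH,
MAX_REPEAT, and VOWELS are hypothetical and are not SWISH's real
configuration options, and the particular rules (length limits, at
least one vowel, no long runs of one character) are just examples of
"characteristics of a word" that suit Western European text.

```python
import re

# Hypothetical settings for this sketch, not SWISH's real configuration.
WORD_CHARS = "abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüý"
VOWELS = "aeiouyàáâãäåæèéêëìíîïòóôõöøùúûüý"
MIN_LENGTH = 2    # shortest token accepted as a word
MAX_LENGTH = 30   # longest token accepted as a word
MAX_REPEAT = 3    # longest allowed run of one repeated character

TOKEN_RE = re.compile("[" + re.escape(WORD_CHARS) + "]+")
# matches any character that occurs more than MAX_REPEAT times in a row
REPEAT_RE = re.compile(r"(.)\1{%d,}" % MAX_REPEAT)


def is_word(token):
    """Apply the example rules that separate likely words from garbage."""
    if not MIN_LENGTH <= len(token) <= MAX_LENGTH:
        return False
    if not any(ch in VOWELS for ch in token):
        return False
    if REPEAT_RE.search(token):
        return False
    return True


def extract_words(raw, encoding="latin-1"):
    """Pull plausible words out of raw bytes, which may be binary data."""
    text = raw.decode(encoding).lower()
    return [tok for tok in TOKEN_RE.findall(text) if is_word(tok)]
```

Because the character set and the rules are plain data, supporting
another language would mostly mean changing the settings at the top
rather than the code. Running extract_words over a binary file keeps
only the byte runs that look like words under these rules.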
More and more people want to index SGML-like (particularly HTML)
structured data. Witness the number of people on comp.infosystems.wais
complaining that they can't figure out how to index their Web site.
One of the greatest promises of HTML was the idea that one would be
able to search and find more easily using structured markup rather
than plain old text. So where are all the tools to do this? I believe
that the core code is so small, you could include an index/search
program with every server, much in the same way that you find an
imagemap program everywhere. After all, most Web sites are comprised
of about half graphics and other media and only half text. And many
Web sites are not large enough to require a full-strength indexer.
Such a well-distributed program would certainly need to
communicate with other similar programs, so meta-indexers (like
Harvest or GLOSS) and other users could cull them for information.
By putting it on the server side (or on a proxy server) it could
be contacted via HTTP. I very much intend to add any functionality
to SWISH that is needed to make it communicative.
I look forward to hearing ideas from the rest of the folks
at the workshop about how we can all share a common language!
-- Kevin Hughes
--
Kevin Hughes * kevinh@eit.com
Enterprise Integration Technologies Webmaster (http://www.eit.com/)
Hypermedia Industrial Designer * Duty now for the future!