Workshop A: Web-wide Indexing/Semantic Header or Cover Page
Co-Chairs: Bipin C. Desai, Brian Pinkerton
Anders Ardo
Traugott Koch
NORDIC WAIS/WORLD WIDE WEB PROJECT
IMPROVING RESOURCE DISCOVERY AND RETRIEVAL ON THE INTERNET
Anders Ardo and Traugott Koch
The main goal of the project is to contribute to improving the
searching capability of the existing networked information retrieval
(NIR) tools and to take a step towards unifying the existing tools:
WWW, WAIS, etc.
Objectives of the project
* Contribute to the development of gateway services between two
of the popular protocols, WAIS and WWW.
* Contribute to the understanding of the problems involved in
integrating a typical state-of-the-art library system into
the World Wide Web.
* Contribute to the solution of the resource discovery problem by using
the WWW hypertext systems as a well organised frontend to the
information resources offered by hundreds of WAIS servers, and by
using WAIS to establish a searchable database of the WWW-based
information resources which can otherwise only be located through
extensive browsing.
As an answer to some of the weaknesses of the most important NIDR
(Networked Information Discovery and Retrieval) tools, the Nordic
WAIS/WWW Project started to explore the possibilities of improving
navigation and searching in the Net. The main approach has been to
combine, and in this way further develop, the strengths of two of the
most important tools, WWW (World Wide Web) and WAIS (Wide Area
Information Server). The project has accomplished the following
results:
* A model integration of a library system into WWW.
* An experimental system for automatic detection and classification
of WAIS databases, featuring a WWW frontend.
* An experimental system for automatic detection and indexing of
Nordic WWW pages.
* An improved gateway between WWW and WAIS, supporting multi-database
searching and relevance feedback.
* Several pilot services offering the possibility to try out these
project results.
This project was sponsored by NORDINFO and carried out by The National
Technological Library of Denmark and Lund University Library, UB2. The
complete final report is under publication as a NORDINFO Publication.
WAIS INDEX OF WWW RESOURCES
When the project started (summer 1993), it was not possible to search
information published on WWW-servers throughout the Internet. The
project has tried to build such a service by creating databases of WWW
resources. For WWW users, the usual hypertext browsing is supplemented
with a search option, based on WAIS indexes of WWW resources which can
be accessed through the improved WWW to WAIS gateway.
For practical purposes we decided to restrict the service to the
Nordic part of the Internet and to index only html-pages. The WWW pages
are indexed into one WAIS database for each Nordic country.
What to index in hypertext
In order to build the WAIS database it must be decided what index
words should be associated with a specific WWW-page. Four alternatives
were considered:
1) Use the context or "anchor" of links pointing to the page.
(Might be just one word.)
2) Use the title and headings of the page.
(Not all pages have a title and headings.)
3) Use the entire page. (Might create a prohibitively huge database.)
4) Use the filename of the page. (Might not be relevant/meaningful.)
As can be seen from the comments, there is no obvious choice. The best
solution will most probably be either a combination of two or more of
the above alternatives or a choice of different databases for the
different alternatives.
After some trials, alternative 2) was chosen for more extensive
experimentation since it was felt that it offered the best compromise
between size, relevance of content and usefulness of the database.
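As an illustration of alternative 2), the following is a minimal
sketch in Python of how index words might be taken from the title and
headings of a page. It is not the project's script; the class and
function names are hypothetical, and splitting the text into lowercase
words on whitespace is an assumption made for the example.

```python
from html.parser import HTMLParser


class TitleHeadingExtractor(HTMLParser):
    """Collect the text found inside <title> and <h1>..<h6> elements."""

    INDEXED_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # greater than 0 while inside an indexed element
        self.parts = []   # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.INDEXED_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.INDEXED_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.parts.append(data.strip())


def index_words(html_text):
    """Return the lowercase words of the title and headings of a page."""
    parser = TitleHeadingExtractor()
    parser.feed(html_text)
    parser.close()
    words = []
    for part in parser.parts:
        words.extend(part.lower().split())
    return words
```

A page without a title or headings yields an empty list, which is the
weakness noted for this alternative.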
INDEXING PROCESS
There are several ways to do the indexing, both centralized and
decentralized:
a) A program that recursively traverses the entire Web and extracts
the indexing information from all pages into a centralized database.
b) A program is provided to WWW-server maintainers to be run locally,
indexing only the local server. Only the relatively small local
indexing files are collected into a central WAIS database.
c) The polling and building of WAIS databases can be distributed to a
number of places, each collecting information from their part of the
net. The resulting WAIS databases can be searched separately or
together, using either a WAIS client or our WWW to WAIS gateway which
has this capability.
Our indexing is done with a script which traverses the Web. The
script takes one html-page as its start page and extracts index
information according to alternative 2 above, as well as all URLs in
the page. The URLs retained for further processing are those which use
the HTTP protocol and lead either to files with a .html extension or
to directories. Html-pages corresponding to these URLs are fetched
while keeping track of which pages have already been seen. Hostnames
are resolved to IP-numbers using the domain name service in order to
avoid problems with aliased hostnames. Not resolving hostnames might
cause the same WWW-page to be fetched several times.
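The URL selection and the handling of aliased hostnames described
above can be sketched as follows. This is an illustration in Python,
not the project's script. The function names are hypothetical, and
treating an empty path or a path ending in "/" as a directory is an
assumption. The resolver is passed in as a parameter so that the
sketch can be tried without network access.

```python
import socket
from urllib.parse import urlsplit


def is_candidate(url):
    """Keep HTTP URLs that point to a .html file or to a directory."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "http" or parts.hostname is None:
        return False
    path = parts.path
    # Assumption: an empty path or a trailing "/" denotes a directory.
    return path == "" or path.endswith("/") or path.lower().endswith(".html")


def canonical_key(url, resolve=socket.gethostbyname):
    """Return a key that is identical for aliases of the same server."""
    parts = urlsplit(url)
    host = parts.hostname
    try:
        address = resolve(host)
    except OSError:
        # Fall back to the hostname itself if resolution fails.
        address = host
    port = parts.port or 80
    return (address, port, parts.path or "/")
```

Storing canonical_key(url) rather than the URL itself in the record of
pages already seen means that two hostnames resolving to the same
IP-number lead to a single fetch of the page.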
It is important to select a start page that lists a relevant selection
of the WWW servers that are to be indexed, as it is only those
WWW-pages that are reachable from the start page that get indexed. We
use either resource pages for the different countries (preferably
centrally maintained) or construct a relevant start page ourselves.
There are also provisions in the script for indexing only part of the
Web, based on the domain name of the server. One variable
(allowed_hosts) determines which part of the Web is legal (e.g. setting
allowed_hosts to ".dk" would index all servers in Denmark). To allow
finer control over what is to be indexed, a second variable
(disallowed_hosts) can be used to exclude part of the allowed area
from indexing. Setting allowed_hosts to ".se" and disallowed_hosts to
".lu.se" would index all servers in Sweden except those at Lund
University (which is the domain "lu.se").
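A minimal sketch of such a host filter, in Python, is given below. The
names allowed_hosts and disallowed_hosts come from the script; the
rest is assumed for illustration: that the values are matched as
domain-name suffixes, that several suffixes may be given as a list,
and that a suffix such as ".lu.se" also covers the bare host "lu.se".

```python
def host_matches(host, suffix):
    """True if host lies in the domain named by suffix (e.g. ".lu.se")."""
    host = host.lower().rstrip(".")
    suffix = suffix.lower()
    return host.endswith(suffix) or host == suffix.lstrip(".")


def host_is_indexable(host, allowed_hosts, disallowed_hosts=()):
    """Apply allowed_hosts first, then remove the disallowed_hosts area."""
    if not any(host_matches(host, s) for s in allowed_hosts):
        return False
    return not any(host_matches(host, s) for s in disallowed_hosts)
```

With allowed_hosts set to [".se"] and disallowed_hosts set to
[".lu.se"], the host www.ub2.lu.se is rejected while other Swedish
servers are accepted.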
This process is repeated recursively until no more unseen html-pages
can be found. At that point all html-pages reachable from the start
page (given the restrictions made by allowed_hosts and
disallowed_hosts) have been fetched, and their indexing information
has been extracted and stored locally. The indexing information is
finally used to build a WAIS database.
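The traversal as a whole can be sketched as below. This Python
illustration uses an explicit queue instead of recursion, which visits
the same set of pages; the function and parameter names are
hypothetical. Fetching a page and extracting its links is left to a
caller-supplied function (fetch_links), and the host restrictions to a
second one (is_allowed), so the sketch itself needs no network access.

```python
from collections import deque


def traverse(start_url, fetch_links, is_allowed):
    """Visit every allowed page reachable from start_url exactly once."""
    seen = {start_url}            # pages already queued or visited
    queue = deque([start_url])
    visited = []                  # pages in the order they were fetched
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and is_allowed(link):
                seen.add(link)
                queue.append(link)
    return visited
```

The loop ends when the queue is empty, that is, when no unseen page
remains. Pages that link to each other in a cycle are fetched only
once because of the seen set.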
DISTRIBUTING THE INDEXING
The indexing script above has been modified so that it can be run as a
CGI-script which indexes only the local server it is running on. This
allows WWW-server maintainers to install the script on their own
server. One service site could poll all participating WWW servers
regularly, thus collecting all indexing information, which is then made
into one or more WAIS databases. All traversing of a server is done
locally and only the relatively small indexing file is transferred
over the net.
To distribute the load further, the polling and building of WAIS
databases can also be distributed to a number of places, e.g. each
service provider collects information from their part of the net. The
division of the Web could be based on server domain-names. This will
result in a number of WAIS databases which can be searched separately
or together, using either a WAIS client or our WWW to WAIS gateway,
which has this capability.
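A division based on server domain-names could, for instance, follow
the country codes of the five Nordic countries, giving one WAIS
database per country. The Python sketch below illustrates this. The
database names and function names are hypothetical and do not refer to
the actual pilot service.

```python
# Hypothetical database names, one per Nordic top-level domain.
COUNTRY_DATABASES = {
    "dk": "www-dk",
    "fi": "www-fi",
    "is": "www-is",
    "no": "www-no",
    "se": "www-se",
}


def database_for_host(host):
    """Return the database responsible for host, or None if there is none."""
    top_level = host.lower().rstrip(".").rsplit(".", 1)[-1]
    return COUNTRY_DATABASES.get(top_level)


def partition_hosts(hosts):
    """Group hosts by the database that should index them."""
    groups = {}
    for host in hosts:
        database = database_for_host(host)
        if database is not None:
            groups.setdefault(database, []).append(host)
    return groups
```

Each group can then be polled and indexed at a different site, and
hosts outside the listed domains are left out.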
DISCUSSION
There exist around twenty different search engines with accompanying
services for parts of the WWW. One third of them are based on listings
of resources and the others are robots which automatically scan parts
of the net. A few institutions collect and provide a combination of,
and entry point to, many services developed by others. Each search
service indexes a part (often only a very small part) of the World
Wide Web servers and documents worldwide. No single service seems to
cover a considerable and easily identifiable part of the WWW-based
information.
Each one uses its own selection of servers to index, a different
method of collecting the information, different update periods
etc. Even the content of the resulting databases varies
considerably: some services index just machine names, directory
and file names (the URLs), others index the titles and headers
of the HTML-pages, or the words, sentences or paragraphs at the
origin of a link. Another alternative is to index the pages
which include a link to a certain page (citation) or to index
the full text of the HTML-pages. The best services offer the
choice of several different indexes regarding their content.
Even using all the search services together, the combined search
result will be far from exhaustive regarding all available WWW-based
information.
There is an obvious need for a comprehensive index of the WWW-based
information, for the application of coherent and improved indexing,
collection and update methods, and for a reliable architecture of
distributed servers with stable performance for the service. This
will contribute to the task of improving the searching capability of
the existing networked information retrieval (NIR) tools.
REFERENCES:
Nordic WAIS/World Wide Web Project:
http://www.ub2.lu.se/W4.html
Pilot service, WAIS index of WWW resources in the Nordic countries:
http://www.ub2.lu.se/wwwindex.html