Workshop A: Web-wide Indexing/Semantic Header or Cover Page

Co-Chairs: Bipin C. Desai, Brian Pinkerton

NORDIC WAIS/WORLD WIDE WEB PROJECT
IMPROVING RESOURCE DISCOVERY AND RETRIEVAL ON THE INTERNET

Anders Ardo and Traugott Koch

The main goal of the project is to improve the searching capability of the existing networked information retrieval (NIR) tools and to take a step towards unifying the existing tools: WWW, WAIS, etc.

Objectives of the project

* Contribute to the development of gateway services between two of the popular protocols, WAIS and WWW.
* Contribute to the understanding of the problems involved in integrating a typical state-of-the-art library system into the World Wide Web.
* Contribute to the solution of the resource discovery problem by using the WWW hypertext system as a well organised frontend to the information resources offered by hundreds of WAIS servers, and by using WAIS to establish a searchable database of the WWW-based information resources which can otherwise only be located through extensive browsing.

As an answer to some of the weaknesses of the most important NIDR (Networked Information Discovery and Retrieval) tools, the Nordic WAIS/WWW Project started to explore the possibilities of improving navigation and searching in the Net. The main approach has been to combine, and in this way further develop, the strengths of two of the most important tools, WWW (World Wide Web) and WAIS (Wide Area Information Server).

The project has accomplished the following results:

* A model integration of a library system into WWW.
* An experimental system for automatic detection and classification of WAIS databases, featuring a WWW frontend.
* An experimental system for automatic detection and indexing of Nordic WWW pages.
* An improved gateway between WWW and WAIS, supporting multi-database searching and relevance feedback.
* Several pilot services offering the possibility to try out these project results.

This project was sponsored by NORDINFO and carried out by The National Technological Library of Denmark and Lund University Library, UB2. The complete final report is being published as a NORDINFO Publication.

WAIS INDEX OF WWW RESOURCES

When the project started (summer 1993), it was not possible to search information published on WWW servers throughout the Internet. The project has tried to build such a service by creating databases of WWW resources. For WWW users, the usual hypertext browsing is supplemented with a search option, based on WAIS indexes of WWW resources which can be accessed through the improved WWW to WAIS gateway. For practical purposes we decided to restrict the service to the Nordic part of the Internet and to index only HTML-pages. The WWW pages are indexed into one WAIS database for each Nordic country.

What to index in hypertext

In order to build the WAIS database it must be decided what index words should be associated with a specific WWW-page. Four alternatives were considered:

1) Use the context or "anchor" of links pointing to the page. (Might be just one word.)
2) Use the title and headings of the page. (Not all pages have a title and headings.)
3) Use the entire page. (Might create a prohibitively huge database.)
4) Use the filename of the page. (Might not be relevant/meaningful.)

As the comments show, there is no obvious choice. The best solution will most probably be a combination of two or more of the above alternatives, or a choice of different databases for the different alternatives. After some trials, alternative 2) was chosen for more extensive experimentation, since it was felt to offer the best compromise between size, relevance of content and usefulness of the database.

INDEXING PROCESS

There are several ways to do the indexing, centralized as well as decentralized:

a) A program that recursively traverses the entire Web and extracts the indexing information from all pages into a centralized database.
b) A program is provided to WWW-server maintainers to be run locally, which indexes only the local server. Only the relatively small local indexing files are collected into a central WAIS database.
c) The polling and building of WAIS databases can be distributed to a number of places, each collecting information from its part of the net. The resulting WAIS databases can be searched separately or together, using either a WAIS client or our WWW to WAIS gateway, which has this capability.

Our indexing is done with a script which traverses the Web. The script takes one HTML-page as its start page and extracts index information according to alternative 2) above, as well as all URLs in the page. The URLs retained for further processing are those that use the HTTP protocol and lead either to files with a .html extension or to directories. HTML-pages corresponding to these URLs are fetched while keeping track of which pages have already been seen. Hostnames are resolved to IP-numbers using the domain name service in order to avoid problems with aliased hostnames. Not resolving hostnames might cause the same WWW-page to be fetched several times.

It is important to select a start page that lists a relevant selection of the WWW servers to be indexed, as only those WWW-pages that are reachable from the start page get indexed. We either use resource pages for the different countries (preferably centrally maintained) or construct a relevant start page ourselves.

There are also provisions in the script for indexing only part of the Web, based on the domain name of the server. One variable (allowed_hosts) determines which part of the Web is legal (e.g. setting allowed_hosts to ".dk" would index all servers in Denmark). To allow finer control over what is indexed, a second variable (disallowed_hosts) can be used to exclude part of the allowed area from indexing.
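
The URL selection and traversal rules described above can be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not the project's script: the function names (host_in_domain, should_follow, traverse) and the get_links callback are hypothetical, suffix matching on the host name is an assumed reading of allowed_hosts and disallowed_hosts, a path without a file extension is assumed to denote a directory, and an explicit queue stands in for recursion. Resolving hostnames to IP-numbers is left out.

```python
from collections import deque
from posixpath import splitext
from urllib.parse import urljoin, urlparse


def host_in_domain(host, domain_suffix):
    """Return True if host lies in the domain named by a suffix such as ".dk"."""
    # Prefixing a dot lets ".lu.se" match both "lu.se" and "www.lu.se",
    # but not an unrelated host such as "blu.se". (Assumed matching rule.)
    return ("." + host.lower()).endswith(domain_suffix.lower())


def should_follow(url, allowed_hosts, disallowed_hosts=None):
    """Decide whether a URL is retained for further processing."""
    parts = urlparse(url)
    # Only URLs using the HTTP protocol are retained.
    if parts.scheme.lower() != "http" or not parts.hostname:
        return False
    # Keep .html files; a path with no extension is assumed to be a directory.
    extension = splitext(parts.path)[1].lower()
    if extension not in ("", ".html"):
        return False
    # The host must be inside the allowed area and outside the disallowed one.
    if not host_in_domain(parts.hostname, allowed_hosts):
        return False
    if disallowed_hosts and host_in_domain(parts.hostname, disallowed_hosts):
        return False
    return True


def traverse(start_url, get_links, allowed_hosts, disallowed_hosts=None):
    """Visit each retained page reachable from start_url exactly once.

    get_links is a caller-supplied (hypothetical) function that fetches a
    page and returns the URLs found in it.
    """
    seen = {start_url}          # pages already queued or visited
    queue = deque([start_url])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            target = urljoin(page, link)   # resolve relative links
            if target in seen:
                continue
            if should_follow(target, allowed_hosts, disallowed_hosts):
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    # A tiny in-memory "web" stands in for real fetches (made-up hosts).
    web = {
        "http://www.example.se/": ["a.html", "http://www.lu.se/", "pic.gif"],
        "http://www.example.se/a.html": ["/", "ftp://ftp.example.se/x.html"],
    }
    for url in traverse("http://www.example.se/",
                        lambda page: web.get(page, []), ".se", ".lu.se"):
        print(url)
```

In this sketch the start page itself is not filtered, since a start page may be a resource list hosted anywhere; only the links found while traversing are checked.
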
Setting allowed_hosts to ".se" and disallowed_hosts to ".lu.se" would index all servers in Sweden except those at Lund University (which is the domain "lu.se").

The fetching process is repeated recursively until no more unseen HTML-pages can be found. At that point all HTML-pages reachable from the start page (given the restrictions imposed by allowed_hosts and disallowed_hosts) have been fetched, and their indexing information has been extracted and stored locally. The indexing information is finally used to build a WAIS database.

DISTRIBUTING THE INDEXING

The indexing script above has been modified so that it can be run as a CGI-script and index only the local server it is running on. This allows WWW-server maintainers to install the script on their own servers. One service site could poll all participating WWW servers regularly, thus collecting all indexing information, which is then made into one or more WAIS databases. All traversing of a server is done locally and only the relatively small indexing file is transferred over the net.

To distribute the load further, the polling and building of WAIS databases can also be distributed to a number of places, e.g. each service provider collects information from its part of the net. The division of the Web could be based on server domain names. This will result in a number of WAIS databases which can be searched separately or together, using either a WAIS client or our WWW to WAIS gateway, which has this capability.

DISCUSSION

There exist around twenty different search engines with accompanying services for parts of the WWW. One third of them are based on listings of resources and the others are robots which automatically scan parts of the net. A few institutions collect many services developed by others and provide a combined entry point to them. Each search service indexes a part (often only a very small part) of the World Wide Web servers and documents worldwide.
No one service seems to cover a considerable and easily identifiable part of the WWW-based information. Each one uses its own selection of servers to index, a different method of collecting the information, different update periods etc. Even the content of the resulting databases varies considerably: some services index just machine names, directory and file names (the URLs), others index the titles and headers of the HTML-pages, or the words, sentences or paragraphs at the origin of a link. Another alternative is to index the pages which include a link to a certain page (citation) or to index the full text of the HTML-pages. The best services offer a choice of several indexes that differ in content. Even using all the search services together, the combined search result will be far from exhaustive regarding all available WWW-based information.

There is an obvious need for a comprehensive index of the WWW-based information, for the application of coherent and improved indexing, collection and update methods, and for a reliable architecture of distributed servers with stable performance. This will contribute to the task of improving the searching capability of the existing networked information retrieval (NIR) tools.

REFERENCES:

Nordic WAIS/World Wide Web Project: http://www.ub2.lu.se/W4.html
Pilot service, WAIS index of WWW resources in the Nordic countries: http://www.ub2.lu.se/wwwindex.html