A System for Seamless Search of Distributed Information Sources

Bipin C. Desai
Department of Computer Science
Concordia University
1455 De Maisonneuve Blvd. West
Montreal, Quebec, Canada H3G 1M8
Email: bcdesai@alcor.concordia.ca

Note: This is still under construction.
June 15, 1994

Summary

This article discusses the issues in developing a system that will provide users with desktop access to the world's digital information resources. It provides a more focused approach to searching, retrieving and perusing hypermedia documents. The documents are stored in heterogeneous distributed information systems representing virtual libraries. The system should allow users to search and obtain information from systems exhibiting a range of categorizing and organizing conflicts. Thus users, at a given workstation, perceive their local information system as having been augmented by further data rather than having to deal with a number of unfamiliar systems. For example, a user at a workstation could locate and peruse any document stored electronically anywhere across a wide network of virtual libraries.

Once a document is retrieved, the user can follow links in it to associated documents, either directly or via referred passages. The relevant referred passage can then be displayed and, if required, its source document as well. In addition, a graph is displayed that provides a visual illustration of the search history, enabling the user to move easily from one document to any other document.

1. Introduction

The digital network [Lynch and Preston 90], connecting information resources and computing facilities, is becoming the accepted method of sharing information in most research institutes, universities, and business organizations. These networks, in turn, have been interconnected, allowing information to be exchanged across networks using the common Internet Protocol (IP) for communications. The number of interconnected networks and their functionality continue to grow. This amalgamation is now referred to as the Internet.

Within Canada, CA*net links most of the colleges and universities. With the launching of SchoolNet in October 1993, many primary and secondary schools in Canada have access to the worldwide Internet. With the emergence of powerful workstations connected to these networks, it is possible to support the search and retrieval of information stored on any component of the interconnection. Today, the network connects information sources that are a mixture of publicly available information (with or without charge) and private information shared by collaborators. Because of the volume of multimedia information available and its rapid growth, it is essential that cutting edge technology work toward the development and implementation of systems that facilitate accurate and quick retrieval of information.

2. State of the Art

A number of facilities already exist on the Internet for easing user access to the vast amounts of information.

Libraries across the world continue to develop their electronic catalogues, which are becoming increasingly accessible through the Internet. Systems such as Hytelnet [Hytelnet] facilitate access to remote libraries using a menu-oriented interface. They provide a connection to the remote host; however, once logged on, the users must navigate the remote system using its own commands. In addition, the users are obliged to move from one system to another if the search for an item in one remote system fails.

Among the systems designed to facilitate information retrieval on the Internet, such as Archie and Gopher, the Wide Area Information Servers (WAIS) system distinguishes itself primarily in two areas. First, through its use of the Z39.50 protocol, WAIS automatically translates the user's query into the command language of the database(s) to be searched. Second, WAIS completely eliminates the necessity of using the ftp protocol, as it automatically downloads the document from the remote site for the user. The strength of WAIS appears to be primarily in its document retrieval capacities.

The problem with Archie is that a user must know the name of the file being sought, which is not always the case. Gopher systems contain much information that is of local interest only, and it is tedious to go from one system to another. Systems such as Veronica facilitate this aspect of searching in Gopherspace. The main deficiency of WAIS is its inability to deliver precisely targeted search results. While it is deemed an advantage that it performs a full text search of the document as opposed to only the title, the onus of finding the correct terms to represent the query is left entirely to the users. Once the users have entered their selected terms and the search has been processed, they are presented with a list of results ranked through a relevancy scoring technique based on the number of occurrences of the terms specified in the search request. Because the scores are meaningful only relative to one another within a single request, even the most highly ranked match can be a poor one. Such deficiencies can be eliminated through the use of a predefined vocabulary and a semantic header page containing keywords and key concepts. With a predefined vocabulary, user-supplied terms can be translated into the predefined terms, and users are no longer required to anticipate the terms used in the document.
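The two ideas in the preceding paragraph can be illustrated with a minimal sketch: ranking by raw term occurrences, and translating user terms into a predefined vocabulary before searching. This is not the WAIS scoring algorithm itself, only a simplified stand-in for occurrence-based ranking. The vocabulary entries and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: maps terms a user might type to the
# predefined term an indexer would have assigned. Entries are illustrative.
VOCABULARY = {
    "dbms": "database",
    "databases": "database",
    "db": "database",
    "lookup": "information retrieval",
    "search": "information retrieval",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def occurrence_score(query_terms, document_text):
    """Score a document by how often the query terms occur in its full text."""
    counts = Counter(tokenize(document_text))
    return sum(counts[term.lower()] for term in query_terms)


def rank(query_terms, documents):
    """Return (score, name) pairs sorted by score, highest first."""
    scored = [(occurrence_score(query_terms, text), name)
              for name, text in documents.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


def to_controlled_terms(user_terms, vocabulary):
    """Translate user terms to predefined terms; unknown terms pass through."""
    translated = []
    for term in user_terms:
        controlled = vocabulary.get(term.lower(), term.lower())
        if controlled not in translated:
            translated.append(controlled)
    return translated
```

Note that a document using the word "DBMS" scores zero for the query term "database" under occurrence counting alone, whereas both map to the same controlled term once the vocabulary is applied.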

Although hypermedia systems have many attractive features, as showcased in the World Wide Web (WWW) and the many graphical browsers such as Mosaic, Chimera, and Cello, their overuse of links has a drawback. A link in a document being perused leads the users from one document to another, which in turn contains more links to tertiary documents. The users eventually get lost and feel trapped in the web.

The obvious question is this: what do typical users want to know while reading a document, and how often are they willing to wait for a related document to be downloaded from a remote site? Downloading referred hypermedia documents involves an obvious delay, which not only breaks the chain of thought but is also an annoyance when the document turns out to be unavailable. Take, for example, the announcement for the first WWW conference. Normally one receives a few pages of conference details via the post or electronic mail. The current hypermedia format of the WWW announcement entails a number of connections to the remote site to retrieve the same details. Such a scheme would be more beneficial if the pertinent abstracts and, through them, the entire conference papers were linked into a single document containing the announcement for the conference. To improve response time, caching and pre-fetching may be an option for longer documents.

The situation is reminiscent of the global enterprise, which, as a result of piecemeal development, must manage a number of independently designed private and departmental information sub-systems. In such a distributed environment, the information required by users may be available in a remote computer system with which they are unfamiliar, or of which they are unaware. Many business organizations have discovered that cooperation among these existing distributed information systems is crucial to meet the challenge of the world information market. Even in this limited domain of integration, the lack of appropriate tools and techniques, and of an understanding of the concepts involved, creates multiple problems. These problems will be further amplified in any world-wide virtual digital library system. In such a system, different organizations will use different systems to index, catalogue, and store their information. Already, systems such as Archie, Gopher, Hytelnet, LIBS, WAIS, and WWW (World Wide Web) maintain information resource indices and an elementary query system to search various databases [Obraczka, Danzig and Li 93]. These have been developed independently and require appropriate interfaces to interconnect them.

The rationale for a future project should be to provide easy, seamless access to information distributed geographically over nodes interconnected by wide area networks. The information to be handled will be not only the cataloging information, but also the documents themselves in hypermedia form. The system will be responsible for navigating to the most convenient locations to access the required information. Here, we assume that the distributed system will eventually hold replicas of documents to improve availability, as is done in many current distributed database systems.

3. A Proposal

We believe that the global objective should be an improved user interface that provides an enhanced environment for information search and retrieval.

The specific objectives we perceive are:

--development of an expert system that will model a reference librarian's knowledge and behaviour;

--creation of an enhanced graphical user interface that will not only facilitate document search and subsequent retrieval, but will also allow a user to move between a document and the referred section of another document, and then, if required, to the actual referred document.

4. Description of Proposed Components

As the objectives above indicate, the proposed prototype (CUILT system) contains two major components: (1) an expert system and (2) a graphical user interface. Each of these components is described below.

4.1 Expert System-Based Search

When a user comes to a library, she is usually looking for information on a specific subject area. She may not have in mind any specific document, or she may know something about it: abstract, dissertation, place of publication, etc.

When the user presents a query to a reference librarian, the librarian uses his expertise to locate the document. The librarian has several resources at his disposal, such as a library catalogue, indexes and abstracts, serial and union lists, etc. With these, the librarian determines not only whether the document exists, but also its location. With electronic libraries becoming more likely in the future, it is desirable that this process be automated.

The librarian uses his knowledge to access the correct resource. This knowledge comes from what he learned while studying library science during his training, and from what he developed himself through experience over the years. The knowledge obtained during training often exists explicitly in books and other printed material. This knowledge needs to be stored in a knowledge base accessible by the search engine. The knowledge that the librarian developed with experience is often informal; it is based on heuristics, or rules of thumb. These heuristics also need to be formalized so that the librarian's expertise can be stored on-line in the knowledge base.

Artificial intelligence, or more specifically the field of expert systems [Shinghal 92], offers techniques for formalizing heuristics, usually as frames, semantic nets, and production rules. Once the heuristic knowledge has been stored on-line, software in the form of an inference engine is employed to use that knowledge. In essence, the knowledge and the inference engine replicate the mental workings of the librarian. The knowledge can be continually refined so that the expertise of the system grows over time. The system involves distributed computation and cooperation of expert systems at participating nodes.
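As a rough illustration of production rules driven by an inference engine, the sketch below uses simple forward chaining: a rule fires when all of its conditions are known facts, and its conclusion becomes a new fact. Forward chaining is only one possible inference strategy, and the rules and fact names are hypothetical examples of a librarian's heuristics, not rules taken from the proposed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A production rule: if every condition holds, assert the conclusion."""
    conditions: frozenset
    conclusion: str


# Hypothetical heuristics a reference librarian might apply.
RULES = [
    Rule(frozenset({"knows_author", "knows_title"}),
         "consult_library_catalogue"),
    Rule(frozenset({"wants_dissertation"}),
         "consult_abstracts_index"),
    Rule(frozenset({"consult_library_catalogue", "not_held_locally"}),
         "consult_union_list"),
]


def forward_chain(facts, rules):
    """Apply rules repeatedly until no rule can add a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conditions <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known
```

Starting from the facts `knows_author`, `knows_title`, and `not_held_locally`, the first rule fires, and its conclusion in turn enables the third rule. Refining the knowledge over time then amounts to adding or editing entries in the rule list.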

The system will work in coordination with existing systems and protocols such as WAIS, WWW, and Gopher. A user should be able to use it to locate what she is looking for in a seamless manner.

4.2 Graphical User Interface (GUI)

Our proposal for the GUI is an enhancement to the increasingly popular Mosaic system that allows the expert search engine to interact with both local and remote databases of information resources. The enhancement adds several user-friendly features. Not only will users be able to link to relevant documents, but they will also be able to display the relevant sections of referred documents. These hidden sections will be displayed in an independent window and manipulated independently.

The system uses modifications to HTML [World Wide Web] that allow the relevant sections of referred documents, suitably acknowledged and annotated, to be embedded in the source document but hidden until the user asks for them to be displayed. The author of the document is responsible for embedding such passages. In a graphical environment, such hidden passages could be displayed in an independent window and manipulated independently. The entire referred document becomes accessible only from the second window, thus avoiding the creation of a clone window as in Mosaic.

In addition, the system will have an intelligent caching mechanism that stores frequently used documents locally. An area of the screen shows the user, graphically, the set of documents visited during a session, and any of these documents can be selected directly. This prevents the user from becoming disoriented and trapped in the web.
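The caching mechanism and the session graph described above could be sketched as follows. The proposal does not specify a caching policy, so this sketch assumes a least-recently-used (LRU) policy as a simple stand-in for "intelligent" caching. The class names, the `fetch` callable, and the document identifiers are all hypothetical.

```python
from collections import OrderedDict


class DocumentCache:
    """Keeps the most recently used documents locally (assumed LRU policy)."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self._fetch = fetch          # callable that retrieves a remote document
        self._store = OrderedDict()  # ordered from least to most recently used

    def get(self, url):
        if url in self._store:
            self._store.move_to_end(url)     # mark as most recently used
            return self._store[url]
        document = self._fetch(url)          # cache miss: go to the remote site
        self._store[url] = document
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used
        return document

    def cached(self):
        return list(self._store)


class SessionHistory:
    """Graph of documents visited in a session; edges record who led to whom."""

    def __init__(self):
        self.edges = {}      # document id -> documents reached from it
        self.current = None

    def visit(self, doc_id):
        """Follow a link from the current document to doc_id."""
        self.edges.setdefault(doc_id, [])
        if self.current is not None and self.current != doc_id:
            if doc_id not in self.edges[self.current]:
                self.edges[self.current].append(doc_id)
        self.current = doc_id

    def jump(self, doc_id):
        """Move directly to any previously visited document."""
        if doc_id not in self.edges:
            raise KeyError(doc_id)
        self.current = doc_id

    def visited(self):
        return list(self.edges)
```

A graphical display would draw `edges` as the session graph, and a click on any node would correspond to a `jump`, which is what lets the user leave a deep chain of links in one step.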

5. Plan and Methodology

An open heterogeneous distributed database management system (HDDBMS) [Sheth and Larson 90] is characterized by its ability to homogenize and integrate diverse local DBMSs. The homogenization consists of providing appropriate transformations between global and local data models and insulating the end users from different operating systems and machines. The integration is concerned with resolving conflicts among different local schemata. Such conflicts arise when, for example, the same entities are described by differing sets of attributes, or when the same attribute is stored using different units of measure or granularity.
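The two kinds of schema conflict mentioned above, differing attribute names and differing units, can be illustrated with a minimal sketch in which each local schema is described by a mapping to a global schema. The two library schemas, the attribute names, and the choice of bytes as the global unit are assumptions made purely for illustration.

```python
def kb_to_bytes(kilobytes):
    """Convert a size in kilobytes to bytes (the assumed global unit)."""
    return int(kilobytes * 1024)


# Each mapping: local attribute -> (global attribute, conversion function).
# Hypothetical library A abbreviates names and stores sizes in kilobytes.
LIBRARY_A = {
    "ttl": ("title", str.strip),
    "size_kb": ("size_bytes", kb_to_bytes),
}

# Hypothetical library B uses full names and stores sizes in bytes as text.
LIBRARY_B = {
    "title": ("title", str.strip),
    "bytes": ("size_bytes", int),
}


def to_global(record, mapping):
    """Translate one local record into the global schema."""
    result = {}
    for local_name, (global_name, convert) in mapping.items():
        if local_name in record:
            result[global_name] = convert(record[local_name])
    return result
```

With these mappings, records from either library describing the same document translate into identical global records, so results can be compared and merged without the user seeing the local differences.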

As in the design of an HDDBMS, the intent of our proposed prototype is to provide the user with a system that represents an extension of the user's local library, albeit in a digital form and accessed via a workstation. However, the indices and the information are, in fact, distributed among multiple participating digital libraries. These libraries may be resident on separate nodes of the interconnected network, each having its own local cataloguing system. The user should be able to submit queries to all the libraries as though they were a part of the local library. The following functions must be allocated to such a system:

--convert a user's query to a plan of one or more queries on a number of distant libraries (as would be done by an expert human librarian) and manage the necessary filtering and data transfers.

--manage the execution of these operations by using agents and intelligent gatekeepers at remote sites.

--combine the results obtained from various systems into a uniform format conforming to the user's expectation.

--maintain composite indices[Desai 89, 90] to improve subsequent similar queries.
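The four functions listed above can be sketched together in a few lines: a user query is planned as one sub-query per library, each sub-query is executed, the results are merged into a uniform format, and the merged result is remembered for similar later queries. This is only a sketch under strong assumptions: the `translate` and `search` callables stand in for the remote agents and gatekeepers, and the plain dictionary stands in for the composite indices, whose actual structure is not described here.

```python
def plan_queries(query, libraries):
    """Produce one sub-query per library, phrased in that library's syntax."""
    return [(name, library["translate"](query))
            for name, library in libraries.items()]


def federated_search(query, libraries, index):
    """Run the plan, merge hits by normalized title, and remember the result."""
    key = query.strip().lower()
    if key in index:                      # a similar query was answered before
        return index[key]

    merged = {}
    for name, local_query in plan_queries(query, libraries):
        for title in libraries[name]["search"](local_query):
            clean = " ".join(title.split())          # collapse stray spacing
            entry = merged.setdefault(
                clean.lower(), {"title": clean, "sources": []})
            if name not in entry["sources"]:
                entry["sources"].append(name)

    results = sorted(merged.values(), key=lambda e: e["title"].lower())
    index[key] = results
    return results
```

Each result lists the libraries that hold the document, which is the information a later step would need in order to pick the most convenient location to retrieve it from.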

It is preferable to implement an open library system that allows the integration of data from multiple information sources. Such an approach avoids the upheaval that may be caused by cataloguing, software, or hardware changes. It also lets each local library continue to be used in its customary mode.

Initially, we are restricting our prototype to the area of Computer Science. Once built, the prototype can easily be extended to other subject areas. In developing the prototype, there is a need to carry out the following:

--develop a comprehensive thesaurus that reflects the terminology employed in the field of computing science. A cover page to be employed by the authors when inputting their documents will be created.

--modify HTML to allow inclusion of a semantic header page with each document. This will contain controlled key-terms and concepts as well as the usual cataloging entries used by librarians.

--extend and adapt the approach used in the Multiple Database Access System (MDAS) [Desai 92] for federated, heterogeneous, distributed, object-oriented data and information systems.

--design the expert system employing the following steps: (1) the acquisition of heuristic knowledge used by the reference librarian; (2) the development of a suitable representation for this knowledge; (3) the verification of the knowledge [Preece and Shinghal 1992]; (4) the incorporation of the knowledge into the system; and (5) its testing and validation.
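The semantic header page named in the list above could be modelled, very roughly, as a small record attached to each document and searched instead of the full text. The field names below are assumptions chosen for illustration; the actual header layout and its HTML encoding are not defined in this article.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticHeader:
    """Hypothetical header record: cataloguing entries plus controlled terms."""
    title: str
    authors: list
    key_terms: set = field(default_factory=set)  # from the controlled thesaurus
    concepts: set = field(default_factory=set)


def matches(header, wanted_terms):
    """True if every wanted term appears among the header's terms or concepts."""
    wanted = {term.lower() for term in wanted_terms}
    indexed = {term.lower() for term in header.key_terms | header.concepts}
    return wanted <= indexed


def search_headers(headers, wanted_terms):
    """Return the titles of documents whose headers match all wanted terms."""
    return [h.title for h in headers if matches(h, wanted_terms)]
```

Because the terms in the header come from a controlled thesaurus, a match here depends on the indexing chosen by the author rather than on how often a word happens to occur in the body of the document.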

6. Conclusions

The use of WWW is growing rapidly, and it is becoming a standard tool for the research and the business community to exchange up-to-date information. Browsers such as Mosaic are a convenient medium for such exchange. To facilitate searching, we have proposed that authors of documents include a semantic header page. Moreover, expert systems acting in conjunction with the browsers will help focus the search to the most relevant documents and nodes. A graphical representation of the document links will keep the users aware of where they are in their browsing.

REFERENCES

Desai, Bipin C., An Introduction to Database Systems, West Publishing, St. Paul, MN, 1990.

Desai, Bipin C., Pollack, R., "MDAS: A Heterogeneous Distributed Database Management System", Information and Software Technology, 34-1, January 1992, pp. 28-41.

Desai, Bipin C., "Performance of a Composite Attribute and Join Index", IEEE Trans. on Software Engineering, Vol. 15-2, February 1989, pp. 142-152.

Lynch, C. A., Preston, C. M., "Internet Access to Information Resources", Annual Review of Information Science and Technology, 1990.

Obraczka, K., Danzig, P. B., Li, Shih-Hao, "Internet Resource Discovery Services", IEEE Computer, September 1993, pp. 8-22.

Preece, A. D., Shinghal, R., "Foundation and Application of Knowledge-Base Verification", International Journal of Intelligent Systems, (in press).

Shinghal, R., "Formal Concept in Artificial Intelligence", Chapman & Hall, London, U.K., 1992.

Sheth, A. P., Larson, J. A., "Federated Database System for Distributed, Heterogeneous, and Autonomous Databases", ACM Computing Surveys, Vol. 22-3, September 1990, pp. 237-266.

Hytelnet: Available via anonymous ftp from ftp.usask.ca (128.233.3.11) in the /pub/hytelnet directory.

Mosaic: http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html

World Wide Web and other misc. stuff: http://www.cs.concordia.ca/Web-refs.html