The Semantic Header and Indexing and Searching on the Internet[1]

(c) Bipin C. Desai

Department of Computer Science

Concordia University

7141 Sherbrooke St. W

Montreal, H4B 1R6

bcdesai@cs.concordia.ca

http://www.cs.concordia.ca/~bcdesai/

Keywords: Bibliographic record, Content description, Indexing, Index aid, Database system, Expert system, Searching, URC, URL

Abstract

This paper describes an indexing system called semantic header for "document like" Internet resources. The semantic header contains the meta-information for each "publicly" accessible resource on the Internet. It also describes the registering system and the distributed database representing the union catalog of resources on the Internet. This distributed database would be used in a search system to facilitate search.

Introduction

Interconnecting computing facilities through a digital network has become the accepted method of sharing resources in most research institutes, universities and business organizations. Such networks, in turn, are interconnected, allowing information to be exchanged across networks using a common interchange protocol (viz. TCP/IP). The number of such interconnected networks (the Internet) continues to grow and, with the emergence of powerful workstation-based servers connected to these networks, it is possible to support local as well as remote search and retrieval of information stored on any component of the interconnection. At this time a number of information sources, both public (free) and private (available for a fee), are available on the Internet. They include text, computer programs, books, electronic journals, newspapers, organizational, local and national directories of various types, sound and voice recordings, images, video clips, scientific data, and private information services such as price lists and quotations, databases of products and services, and speciality newsletters.

There is a need for the development of a system which allows easy 'search for and access to' resources available on the Internet. It has been observed that distributed information systems, even when under the control of a single administrative unit, create multiple problems, typically caused by differences in semantics and representation and by incomplete and incorrect data dictionaries (cataloging) [DESA4]. These problems would be magnified manyfold in any distributed information system which tries to integrate the resources offered by information systems over the Internet. It is also important to avoid the problems encountered in library systems where, even though the same cataloging system[2] is used, the same item may be catalogued or classified differently in two different libraries.

Such problems could be avoided by starting with a standard index structure and building a bibliographic system using standardized control definitions. Such definitions could be built into the knowledge base of expert-system-based index entry and search interfaces. Furthermore, there must be a mechanism to revise index information as the resource changes over time. Finally, annotation of a resource by independent users should be allowed.

The bibliographic entry system should be distributed and accessible to providers as well as users of the Internet. In a distributed system such as the Internet, it is natural to have the providers of resources prepare and enter the bibliographic information about each resource using the standardized index scheme. The entry system should be a distributed system and the index should be recorded in a distributed database. Finally, a search system is required to help in locating and retrieving appropriate information from this database with ease.

Whereas the bibliographic entry and search systems (clients) could be located locally at the providers and users of information resources respectively, the bibliographic database system (server) should be distributed and replicated at a number of regional nodes for enhanced availability and response. The entry and search systems have to be supported by an easy-to-use graphical interface for entering the index information and accessing it. These systems should incorporate the expertise and knowledge of expert cataloguers and reference librarians, with a help system to guide the user at all steps. The search system should, in addition, provide appropriate feedback indicating the number of hits for each search, and help in providing access to the relevant resources. The navigation of database and resource nodes and the protocols and filters used would be selected by the system, thus facilitating the task of the user. The purpose is to provide uniform access to all resources, as is done in a centralized information system through the intermediary of an expert system analyst.

Source of Information and Meta-Information

Information sources can be classified into three categories[KATZ]: primary, secondary and tertiary. Primary information is the original material in the form of published or posted articles, monographs, reports, dissertations, programs, images, movies, etc. Other primary sources such as personal communications are not usually available. Secondary sources, sometimes called meta-information, are used as indices to these primary sources of information and are created after a delay which may be a few months to a few years. The meta-information is data about the primary source. A tertiary source of information is a combination of selected and distilled information from primary and secondary sources.

The purpose of indices and bibliographies (secondary information) is to inventory the primary information and allow easy access to it. Preparing a bibliography requires finding the primary source, identifying it as to its subject, etc., describing it for later identification by unknown future users and classifying it according to accepted norms.

Since an index is to be used by many users, it has to be accurate, easy to use (with access via author, title, subject, etc.), properly classified, up-to-date and complete for its area of coverage. In order for a bibliography to be useful, it must fill a real need. Archie succeeds as a bibliography system (for files available on the Internet via FTP) because it provides a simple interface to users who know the name of a program or file, or the general nature of the file likely to be distributed from one or more anonymous FTP sites. In the case of an on-line bibliography of Internet resources such as the Web, the system needs to be current within a short period (minutes or at most hours) of the posting of a new resource. Compare this with the bibliography system for printed publications, which requires weeks or months in the case of on-line databases, longer for the CD version and up to a year or more for the printed annual version. Even the on-line database needs a considerable amount of time before documents are indexed and incorporated[3].

The method of compiling a traditional bibliography varies. At one extreme, we have scholars spending years of their lives evaluating sources and compiling annotated and descriptive entries for each item. The accuracy of this bibliography is high but the coverage tends to be limited. At the other extreme, we have the semi-automatic mechanism which scans the published works from limited sources (by domain, language, or geographic region) and assigns each work to appropriate sub-subject(s). Access from multiple headings may be provided. This is desirable because an item may deal with more than one topic. Although a bibliography prepared by the former method may be more accurate, it tends to be retrospective rather than current.

The dependence on titles as a search criterion dictates that they must be indicative of the contents of the document. This is not always the case; hence, someone (the author or the cataloger) has to add annotations, keywords or key phrases to indicate the actual content. The accuracy or quality of a document can be indicated by including reviewers' opinions. However, such opinions are rarely accessible to the cataloger. Another feature of importance to the user of an index is the presence of an accurate abstract. An abstract provides a summary of the material and thus is more indicative of the contents than the title or the keywords, whether these are supplied by the author or bibliographer or selected by scanning the text. Reference librarians and library users tend to use such annotated bibliographies to help choose among competing sources.

Features such as division of the bibliography by subject and sub-subjects, though of concern in manual systems, should not be apparent in the electronic form. However, access through these criteria must be supported. The weeding of bibliography entries to remove Internet resources which are no longer accessible, though attractive, requires careful thought from the point of view of completeness. The archiving of resources in central libraries could mean that such weeding of the bibliography would not be necessary.

A Cataloging and Searching System

Library catalogs are prepared by specialists. Each entry records the author, title, publisher, place of publication, date of publication and other details; some of these details are not displayed to users of an on-line public catalogue.

The term union catalogue or union list, in library lexicons, is used to refer to the catalog which is the union, or combined listing, of the catalogs of a number of participating libraries. The union list forms a grouped list of items and their sites, indicating which item is located where. In this sense, the bibliography forms a union list of all sources of documents. Such catalogues, which could be regional, national or consortia based, are used by interlibrary loan departments in libraries to identify potential sites which hold items not available locally. Since the item in question is not in electronic form, it requires the intermediary of the interlibrary loan mechanism to borrow it (usually from the nearest location which permits the title to be borrowed or, if possible, sections of it to be photocopied).

Currently a large number of documents exist in addition to the files whose names could be searched via systems such as Archie or Xarchie. The popularity of the World Wide Web [BERN, BERN3] and browsers such as Mosaic [MOSA] has prompted many researchers to start publishing on-line. Attempts to provide easy searching of relevant documents have led to a number of systems including WAIS and, more recently, a number of Spiders, Worms and other creepy crawlers [DEBR, FLET, KOST, MCBR, META, THAU, SEAR, WEBC, WWWW].

However, the problem with many of these indices is that their selectivity of documents is often poor. The chances of retrieving inappropriate documents and missing relevant information because of a poor choice of search terms are high. In addition, the user is required to access the actual resource, based on just the title and author information, as is provided through a library catalog, and then decide whether the resource meets the need.

These problems are addressed in our proposed system by using an appropriate index entry called Semantic Header [BCD2] and providing a mechanism to register, manage and search the bibliography. The system is an active system requiring the provider of information to register the resource by entering an index entry for the resource. Since the provider is responsible for preparing the index entry, there is the potential for its accuracy to be high.

The overall system uses knowledge bases and expert sub-systems to help the user in the registering and search processes. One need for an expert system is to avoid the chaos introduced by differences in perception among different indexers. Hence, some form of standardization of the terms used has to be enforced. We envisage this through the intermediary of an expert-system-based engine. The index generation and maintenance sub-system uses the knowledge and expertise of the expert cataloguer to help the provider of the resource select correct terms for items such as subject, sub-subject and keywords. Similarly, another expert system is used in the search sub-system to help the user in the search for appropriate information resources. The third component of the system is a distributed and replicated database of the bibliography of resources available on-line. The database is in the background and the users are not aware of its presence, much less of its distributed and replicated nature. These components are described below.

Semantic Header

The heart of any bibliography or indexing system is the record that is kept for each item that is being indexed. Standardization of a bibliographic entry allows libraries to exchange information about their collections. A number of projects in the Library domain have addressed the problem of cataloging and in particular cataloging of information in electronic and multi-media format. CORE[CROM], MARC system[BRYN, CRAW, MARC, PETE], MLC[HORN, ROSS, RHEE] and TEI[GAYN, GIOR] are examples of some of these initiatives. These existing and proposed indexing systems range from a minimum to full level of bibliographic information. However, such systems are designed for professional catalogers and many of the elements included in them, though useful, are beyond the comprehension of most providers or users of information.

We have proposed a simple index structure called Semantic Header [DESA2] for resources accessible directly on the Internet. The structure of the index is similar to the ones used for most library indices and includes other information deemed useful for on-line systems. The syntax of the semantic header is that of HTML [BERN2], which is based on SGML. However, the user working with the index entry system does not need to write this markup directly: an expert system guides the user through the process, including the choice of standardized terms, by means of an easy-to-use graphical interface.

Figure 1 below indicates the structure of the Semantic Header. An example of the use of the semantic header is given in Figure 2. The intent of the semantic header is to include those items that are most often used in the search for an information resource. Since the majority of searches begin with a title, name of one of the authors (70%), subject and sub-subject (50%) [KATZ], we have made the entry of these elements mandatory in the semantic header. The abstract and annotations are relevant in deciding whether the resource would be useful; these items are also included. Logically, the entries in the semantic header are not positionally sensitive. However, for ease of use, we have arranged the fields in Figure 1 using the traditional library catalog layout.

The first field of the semantic header is the title of the resource. It is a required field and is given between the tags <title> and </title>. The title could include the sub-title, as is done in many cataloging systems. The next field is the alt-title, which is used to indicate an "official" secondary title or an alternate title of the resource. This field is optional. The subject and the sub-subject of the resource are indicated in the next field, which is a repeating group (a multi-part field with one or more occurrences of items in the group). All resources must have at least one occurrence of this field.

The language of the resource and the character set used are given in the next two optional fields, marked by the tags <language> ... </language> and <char-set> ... </char-set>, respectively.

The details about the author(s) and/or other agent(s) responsible for the resource are given in the next repeating group. The sub-fields are the role of the agent (typical values could be author, co-author, designer, editor, programmer, creator, artist, corporate entity, publisher, etc.), name, organization, address, phone and fax numbers, and e-mail address. All sub-fields except the name are optional; for corporate entities, however, the organization must be given.

The list of keywords is included by a field marked by the tags <Keyword> ... </Keyword>. Each resource must have at least one keyword.

The next element is a repeating group for recording the identifiers for the resource. Each occurrence of this group consists of two sub-fields: one for the domain and the other for the corresponding value. The domain could be an accepted or standardized coding scheme issued by an appropriate authority, such as ISBN, ISSN, URN [RFC1737] or URL [BERN1], and the value contains the corresponding coded identifier. Since a resource in electronic form may be accessible from one or more sites, there could be one or more entries for the same domain, such as URL. The URN [RFC1737] field gives the unique name of the item, if any. This name may be used instead of a location (URL) if the item is likely to move or may be accessible from multiple locations[4].

In the absence of an accepted scheme for URN, we use an alternate unique name, called the Semantic Header Name (SHN). The SHN is derived by concatenating the initial location of the resource with the title, name of the first author (or name of the organization, if the resource is corporate), first subject, creation date and version number.
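The concatenation described above can be sketched as follows. This is a minimal illustration only: the function name make_shn, the "|" separator and the field order are assumptions modelled on the SHN entry shown in Figure 2, not a normative definition of the SHN.

```python
def make_shn(location, title, subject_levels, first_author, created, version):
    """Build a Semantic Header Name by joining fields with '|'.

    The separator and the field order mirror the SHN entry of Figure 2;
    both are assumptions of this sketch.
    """
    parts = [location, title, *subject_levels, first_author, created, version]
    # Strip surrounding whitespace so stray spaces typed into the entry
    # form do not yield different names for the same resource.
    return "|".join(str(p).strip() for p in parts)


shn = make_shn(
    "132.205.50.24",
    "Semantic Header and Indexing and Searching on the Internet",
    ["Computer Science", "Information Storage and Retrieval", "Indexing"],
    "DESAI, Bipin C.",
    "1994-07-11",
    "1.1",
)
```

Because the name embeds the version number, a new version of a resource receives a new SHN, while corrections to other fields of the header leave the name unchanged.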

The identifier entry in the semantic header may also contain an entry for an archive site. The domain value UAS (Universal Archive Site) is used to indicate the archive site for the resource. It is expected that the resource will exist at this site beyond the expiry date of the resource, if any. Of course, the site itself is assumed to exist beyond the life of any resource. It is envisaged that the archive site could be an independent resource provider. One example of such a traditional resource provider is the national library in most countries. One possibility is for national libraries, such as the Library of Congress in the U.S., the British Library, and the National Library and CISTI in Canada, to archive Internet resources. However, private, for-profit corporations could be alternate sites for archiving resources. Archiving would provide an anchor for the otherwise ephemeral nature of some resources on the network. Since the archive site may not be known when the semantic header is first registered, the system will support update operations in which existing entries can be modified. Other update operations, such as modification of addresses, URLs, etc., would also be supported.

The dates of creation (required) and expiry, if any, are given next, followed by the version number and the version number being superseded, if any. The intended classification (nature of resource, security or distribution restriction, copyright status) and the coverage (target audience, coverage in spatial and/or temporal terms, etc.) are indicated in the next two repeating groups.

The abstract and annotations are given in the next fields. The abstract is provided by the author of the resource; the annotations are made by independent users of the resource and include the identity of the user along with a digital signature. An annotation cannot be modified once it has been made.

A list of system requirements, such as the hardware and software required, is included in the semantic header as a repeating group within the tags <SysReq> ... </SysReq>. This is followed by the size of the resource and the cost of accessing it[5].

The last set of items in the semantic header consists of the control items, such as the account to which charges for accessing the resource are to be credited, and the encoded password or digital signature of the provider of the resource. Any change to the updatable part of the semantic header requires the password or digital signature. Another control item is the digital signature of the resource itself. This may be used to authenticate the resource when it is retrieved through a semantic header. It is assumed that there is a mechanism to access the resource's digital signature.

<semhdr>

<title> required </title>

<alt-title> OPTIONAL </alt-title>

<Subject> required: a list each of which includes fields for subject and up to two levels of sub-subject: at least one entry is required </Subject>

<language> OPTIONAL: of the information resource </language>

<char-set> OPTIONAL: character set used </char-set>

<author> required: a list each of which includes role, name, organization, address, etc. of each person/institute responsible for the information resource: at least the name or the organization and address is required </author>

<Keyword> required: a list of keywords </Keyword>

<Dates>

<Created> required: </Created>

<Expiry> OPTIONAL: </Expiry>

<Updated> system generated </Updated>

</Dates>

<Version> OPTIONAL: version of the resource </Version>

<Supersedes> OPTIONAL: which version is being replaced </Supersedes>

<Coverage> OPTIONAL: audience, spatial, temporal </Coverage>

<Classification> OPTIONAL: nature (legal, security level etc.) of the resource </Classification>

<Identifier> A list of domains for identifiers and the corresponding values: typical identifiers could be one or more Uniform Resource Locators (URL), Call No. for the resource, unique name of the resource (URN), site where the item is to be archived: at least one required

</Identifier>

<Abstract> OPTIONAL but recommended </Abstract>

<Annotation> OPTIONAL: </Annotation>

<SysReq> OPTIONAL: list of system requirements for example hardware and software: the component and the corresponding requirements are given

</SysReq>

<Source> OPTIONAL: gives the source or related list of resources for each such resource it indicates a relationship and gives an identifier which includes the domain and the corresponding value

</Source>

<size> size of the resource in appropriate units (e.g., bytes) </size>

<Cost> OPTIONAL: cost of accessing the resource </Cost>

<control>

<Ac> account number </Ac>

<password> required: encoded password or digital signature of provider of resource for initial entry and subsequent update </password>

<signature> digital signature of the resource for authentication </signature>

</control>

</semhdr>

Figure 1. Structure of the Semantic Header
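The required/optional distinction of Figure 1 lends itself to a simple mechanical check at entry time. The sketch below is one possible reading of it: the class SemanticHeader, the function missing_required and the Python field names are hypothetical, and the author rule is simplified to "a name or an organization".

```python
from dataclasses import dataclass, field


@dataclass
class SemanticHeader:
    """Subset of the fields of Figure 1; the Python names are illustrative."""
    title: str = ""
    subjects: list = field(default_factory=list)     # (general, sub1, sub2) tuples
    authors: list = field(default_factory=list)      # dicts: role, name, organization
    keywords: list = field(default_factory=list)
    created: str = ""                                # e.g. "1994-07-11"
    identifiers: list = field(default_factory=list)  # (domain, value) pairs
    password: str = ""                               # encoded password or signature
    abstract: str = ""                               # optional but recommended


def missing_required(header):
    """Return the Figure 1 names of required fields that are still empty."""
    missing = []
    if not header.title.strip():
        missing.append("title")
    if not header.subjects:
        missing.append("Subject")
    # Simplification: accept an agent that has a name or an organization.
    if not any(a.get("name") or a.get("organization") for a in header.authors):
        missing.append("author")
    if not header.keywords:
        missing.append("Keyword")
    if not header.created.strip():
        missing.append("Created")
    if not header.identifiers:
        missing.append("Identifier")
    if not header.password:
        missing.append("password")
    return missing


draft = SemanticHeader(title="Semantic Header and Indexing and Searching on the Internet")
complete = SemanticHeader(
    title="Semantic Header and Indexing and Searching on the Internet",
    subjects=[("Computer Science", "Information Storage and Retrieval", "indexing")],
    authors=[{"role": "Author", "name": "DESAI, Bipin C."}],
    keywords=["Expert system"],
    created="1994-07-11",
    identifiers=[("URL", "http://www.cs.concordia.ca/~bcdesai/cindi-system-1.1.html")],
    password="encoded-password-placeholder",
)
```

An entry interface could call such a check before allowing the provider to register the header, listing the fields that still need to be filled in.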

<semhdr>

<title>Semantic Header and Indexing and Searching on the Internet</title>

<alt-title>Sailing the Internet with a navigational System</alt-title>

<Subject>

<ul>

<li>

<General>Computer Science </General>

<Sublevel1>Information Storage and Retrieval</Sublevel1>

<Sublevel2>indexing</Sublevel2>

</li>

<li>

<General>Library Studies</General>

<Sublevel1>cataloging</Sublevel1>

<Sublevel2>semantic header</Sublevel2>

</li>

<li>

<General>Computer Science </General>

<Sublevel1>Artificial Intelligence</Sublevel1>

<Sublevel2>expert systems</Sublevel2>

</li>

<li>

<General>Computer Science </General>

<Sublevel1>Database Management</Sublevel1>

<Sublevel2>distributed databases</Sublevel2>

</li>

</ul>

</Subject>

<Language> English </Language>

<char-set> ISO-8879 </char-set>

<author>

<ul>

<li>

<arole> Author </arole>

<aname>DESAI, Bipin C.</aname>

<aorg>Concordia University, Department of Computer Science</aorg>

<aAddress>7141 Sherbrooke Street West, Montreal, QC, CANADA, H4B 1R6 </aAddress>

<aphone>(514) 848 3025</aphone>

<aFax>(514) 848 8652</aFax>

<aemail>bcdesai@cs.concordia.ca</aemail>

</li>

</ul>

</author>

<Keyword>

<ul>

<li>Bibliographic record</li>

<li>Content description</li>

<li>Database system</li>

<li>Expert system</li>

<li>Index aid</li>

<li>Search/index aid</li>

<li>URC</li>

</ul>

</Keyword>

<Dates>

<Created> 1994-07-11</Created>

<Expiry>1996-01-11</Expiry>

<Updated>1995-05-11</Updated>

</Dates>

<Version>

<ul>

<li>Current: 1.1 </li>

<li>Supersede: 1.0 </li>

</ul>

</Version>

<Coverage>

<ul>

<li>Audience: Computer Science </li>

<li>Audience: Library Science </li>

<li>Audience: Internet types</li>

</ul>

</Coverage>

<Classification><ul>

<li> Legal: Copyright</li>

<li> Security: Public </li>

<li> Nature: Electronic Paper </li>

</ul>

</Classification>

<Identifier>

<ul>

<li>URL: http://www.cs.concordia.ca/~bcdesai/cindi-system-1.1.html</li>

<li>URN:<comment>Unique Universal Resource Name for this resource. No such service exists to date. In the absence of one, we use the concatenation of Location, Title, first author, first subject, creation date and version number. Do we really need another level of complexity, especially if we have a good index and catalogue system? Is the current system of using a domain name followed by other names not good enough? It is the most distributed version possible. Here domain names signify not only Internet domains but also other domains such as ISBN, UPC, etc. </comment></li>

<li>SHN:132.205.50.24|Semantic Header and Indexing and Searching on the Internet|Computer Science|Information Storage and Retrieval|Indexing|DESAI, Bipin C.|1994-07-11|1.1 </li>

<li>UAS: <comment>Universal Archive Site where this document is archived</comment> ftp://ftp.cs.concordia.ca/bcd/cindi-system-1.1.html</li>

</ul>

</Identifier>

<abstract>This paper describes an indexing system called semantic header for "document like" Internet resources. The semantic header contains the meta-information for each "publicly" accessible resource on the Internet. It also describes the registering system and the distributed database representing the union catalog of resources on the Internet. This database would be used in a search system to facilitate search.

</abstract>

<Annotation></Annotation>

<size> 44000 </size>

<Cost><comment>Currency, Cost</comment> Can$: 0.31</Cost>

<control>

<Ac> BCD's Swiss number a/c </Ac>

<password> thequickbrownfoxjumpsoverthelazydog </password>

<signature> 01001010101110101101010110011101 </signature>

</control>

</semhdr>

Figure 2. An example of a Semantic Header entry
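Because the semantic header uses HTML-style tags, an entry of this kind can be read with an ordinary HTML parser. The sketch below uses the html.parser module of the Python standard library on a shortened entry; the class name HeaderFieldParser and the choice of fields are assumptions of this illustration and are not part of the proposed system. Note that this parser lowercases tag names, so the content of a Keyword element is reported under the key keyword.

```python
from html.parser import HTMLParser


class HeaderFieldParser(HTMLParser):
    """Collect the text found inside selected semantic header tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = {w.lower() for w in wanted}  # parser reports lowercase tags
        self.stack = []                            # currently open tags
        self.fields = {}                           # tag -> list of text values

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the nearest enclosing tag we were asked for.
        for tag in reversed(self.stack):
            if tag in self.wanted:
                self.fields.setdefault(tag, []).append(text)
                break


sample = """<semhdr>
<title>Semantic Header and Indexing and Searching on the Internet</title>
<Keyword><ul><li>Expert system</li><li>URC</li></ul></Keyword>
<Dates><Created> 1994-07-11</Created></Dates>
</semhdr>"""

parser = HeaderFieldParser(["title", "Keyword", "Created"])
parser.feed(sample)
parser.close()
fields = parser.fields
```

Repeating groups such as the keyword list come back as lists with one value per list item, which matches the way they are entered in the header.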

Index Registering Sub-system

The index entry and registering sub-system provides a graphical interface (Figure 3) that helps the provider (author/creator) of a resource register the bibliographic information about the resource. The interface allows the provider to enter the information, and it offers help by means of pop-up selection windows and an expert engine (not shown in Figure 3) that suggests controlled terms. Once the information is correctly entered, the author can decide to register the Semantic Header entry in the Semantic Header database. When the header information is accepted by the database, the author/creator is notified. A password or a digital signature is to be provided when the semantic header is first registered and for all changes made to it. Since the encoded password or digital signature is not accessible to anyone other than the original registrar of the index entry, the entry can only be updated by person(s) who are cognizant of it. Changes may be required because the resource itself has changed or because it has migrated from one system to another. A copy of the semantic header is stored at the site of the resource. It is desirable that the semantic header be attached to the actual resource. However, this cannot be done until all hardware and/or software systems can handle such a header (viz. ignore it).
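The rule that an entry can be changed only by someone holding the registration password can be sketched as follows. The class HeaderRegistry and its methods are hypothetical, and the "encoded password" is modelled here as an unsalted SHA-256 digest purely for illustration; a deployed system would need salted password hashing or genuine digital signatures.

```python
import hashlib
import hmac


def encode_password(password):
    # Assumption of this sketch: the encoded password is a SHA-256 hex digest.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


class HeaderRegistry:
    """In-memory stand-in for the semantic header database."""

    def __init__(self):
        self._entries = {}  # unique name -> {"fields": dict, "password": str}

    def register(self, name, fields, password):
        if name in self._entries:
            raise KeyError(f"{name!r} is already registered")
        self._entries[name] = {
            "fields": dict(fields),
            "password": encode_password(password),
        }

    def update(self, name, changes, password):
        """Apply changes only if the password matches the one registered."""
        entry = self._entries[name]
        supplied = encode_password(password)
        if not hmac.compare_digest(entry["password"], supplied):
            return False
        entry["fields"].update(changes)
        return True

    def get(self, name):
        return dict(self._entries[name]["fields"])


registry = HeaderRegistry()
registry.register("shn-example", {"URL": "http://old.example/paper.html"}, "secret")
rejected = registry.update("shn-example", {"URL": "http://evil.example/"}, "guess")
accepted = registry.update("shn-example", {"URL": "http://new.example/paper.html"}, "secret")
```

The second update models the migration of a resource from one system to another: the provider supplies the password and only the location changes.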

The system verifies the accessibility of the electronic resource being added. Also the digital signature of the resource is retrieved and added to the semantic header. The purpose of this last piece of information is to establish the veracity of the resource when it is retrieved through a semantic header. If the resource is corrupted, this veracity validation would fail and the user would be notified; no charges, if there are any, would be made.
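The check on retrieval can be illustrated with a message digest. The sketch assumes that the "digital signature of the resource" is a SHA-256 digest recorded at registration time; the function names resource_digest and deliver are hypothetical, and a true signature scheme would also bind the digest to the provider's key.

```python
import hashlib


def resource_digest(data: bytes) -> str:
    """Stand-in for the resource's digital signature (assumption: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def deliver(data: bytes, recorded_digest: str, cost: float):
    """Return (accepted, charge); a corrupted resource is never charged."""
    if resource_digest(data) != recorded_digest:
        return False, 0.0
    return True, cost


original = b"contents of the registered resource"
recorded = resource_digest(original)           # stored in the semantic header
ok = deliver(original, recorded, 0.31)         # intact copy is accepted
bad = deliver(original + b"?", recorded, 0.31) # altered copy is rejected
```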

If each resource is given a unique name (URN or SHN), the semantic header database can be used for mapping from such a unique name to a location (URL). Since only one semantic header can be associated with a given URN, a search with a given URN will retrieve at most one semantic header. One of the URLs in it can be used to access the resource in question. This form of search can be implemented at a low level without the need for a graphical interface.
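Such a name-to-location lookup amounts to a keyed retrieval followed by a filter on the identifier domain. In the sketch below the function resolve, the dictionary layout and the example names and hosts are all hypothetical.

```python
def resolve(name, headers_by_name):
    """Return the URLs recorded for a unique name, or [] if it is unknown.

    headers_by_name maps a URN or SHN to at most one semantic header.
    """
    header = headers_by_name.get(name)
    if header is None:
        return []
    return [value for domain, value in header["identifiers"] if domain == "URL"]


headers_by_name = {
    "shn-example": {
        "identifiers": [
            ("URL", "http://site-a.example/paper.html"),
            ("URL", "http://site-b.example/paper.html"),
            ("UAS", "ftp://archive.example/paper.html"),
        ]
    }
}
locations = resolve("shn-example", headers_by_name)
unknown = resolve("no-such-name", headers_by_name)
```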

The index entry that is registered is communicated to a database described below.

The Semantic Header Database System

The index entries registered by the providers of resources are stored in a distributed database system (SHDDB). From the point of view of the users of the system, the underlying database may be considered to be a monolithic system. In reality, it would be distributed and replicated, allowing for reliable and failure-tolerant operations. The interface hides the distributed and replicated nature of the database. The distribution is based on subject areas and as such the database is considered to be horizontally partitioned [DESA5].

It is envisaged that the database on different subjects will be maintained at different nodes of the Internet. The locations of such nodes need only be known by the intrinsic interface. A database catalog would be used to distribute this information. However, this catalog itself could be distributed and replicated as is done for distributed database systems.

The Semantic Header information entered by the provider of the resource using a graphical interface is relayed from the user's workstation by a client process to the database server process at one of the nodes of the SHDDB. The node is chosen based on its proximity to the workstation or on the subject of the index record. On receipt of the information, the server verifies the correctness and authenticity of the information and on finding everything in order, sends an acknowledgment to the client.

The server node is responsible for locating the partitions of the SHDDB where the entry should be stored and forwards the replicated information to appropriate nodes. For example, the semantic header entry of Figure 2 would be part of the SHDDB for subjects Computer Science and Library Studies.
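The routing step can be sketched as a lookup of each general subject of the header in the database catalog. The function partitions_for, the catalog layout and the node names are hypothetical; the subjects are those of Figure 2.

```python
def partitions_for(subjects, catalog):
    """Map each distinct general subject of a header to its partition nodes.

    subjects: (general, sublevel1, sublevel2) tuples from the header.
    catalog:  general subject -> nodes holding replicas of that partition.
    """
    routes = {}
    for general, *_sublevels in subjects:
        general = general.strip()
        if general not in routes:  # several entries may share a subject
            routes[general] = sorted(catalog.get(general, []))
    return routes


catalog = {
    "Computer Science": ["cs-node-2.example", "cs-node-1.example"],
    "Library Studies": ["lib-node-1.example"],
    "Physics": ["phys-node-1.example"],
}
figure2_subjects = [
    ("Computer Science ", "Information Storage and Retrieval", "indexing"),
    ("Library Studies", "cataloging", "semantic header"),
    ("Computer Science ", "Artificial Intelligence", "expert systems"),
    ("Computer Science ", "Database Management", "distributed databases"),
]
routes = partitions_for(figure2_subjects, catalog)
```

Although the header lists Computer Science three times, the entry is forwarded to that partition once, and to each of its replicas.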

Similarly the database server process is responsible for providing the catalogue information for the search system. In this way the various sites of the database work in a cooperating mode to maintain consistency of the replicated portion. The replicated nature of the database also ensures distribution of load and ensures continued access to the bibliography when one or more sites are temporarily nonfunctional. The performance of search with the growing size of the SHDDB database could be improved by using techniques used in databases[DESA6].
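Continued access in the presence of failed sites can be illustrated by a client that tries the replicas of a partition in turn. This is only a sketch: the function query_with_failover and the node names are hypothetical, and each replica is represented by a plain Python callable rather than a network connection.

```python
def query_with_failover(replicas, request):
    """Try each (name, send) replica in order; return the first answer."""
    errors = []
    for name, send in replicas:
        try:
            return name, send(request)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")  # remember why this site failed
    raise ConnectionError("all replicas failed: " + "; ".join(errors))


def down(_request):
    # Simulates a site that is temporarily nonfunctional.
    raise ConnectionError("node unreachable")


def up(request):
    # Simulates a working site returning matching headers.
    return [f"header matching {request}"]


served_by, result = query_with_failover(
    [("node-a", down), ("node-b", up)], "expert systems"
)
```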

The Search System

The design of the search system is modelled on a human reference librarian, who is called on to help in identifying the best sources of information for a given purpose and to aid in the selection of materials to meet a particular interest or need. The reference librarian seeks to respond to these queries by using information derived from a bibliographic search, aided by the librarian's own expertise and knowledge of the relevant subject. In addition, users of a library have access to the same bibliographic indices and many of the information databases, from which they are called on to select relevant titles or weed out irrelevant ones.

A typical query to a reference librarian can be divided into two categories: known and unknown[KATZ]. In the former, a user asks for an item identified by author, title, or publication source. In the latter, the need of the user is fuzzy; s/he has no idea of any of the identifiers of the needed item. Even in the case of the known queries, there is the possibility that the user may have the wrong author, right author but the wrong title, wrong dates or incorrect volume number or issue number for a serial. It may also happen that even when these are correct, the item is not the one that meets the need of the user.

A specific search and research type query may require the user to peruse a number of titles and select from among them. This type of query involves users who have fuzzy notions of their needs and their questions are vague. They involve a certain amount of trial and error retrieval of documents and their browsing.

One problem that human librarians deal with is the inability of users to ask the relevant questions. The reference librarian, through a dialog with the user, tries to narrow down the user's needs in terms of what and how much information is required. In many cases the librarian is called upon to match the user's needs with the sources of information. For example, an article from the popular press may be more appropriate for a lay person than one appearing in a prestigious journal dedicated to the subject.

In the search component of the proposed system we plan to incorporate the expertise used by a reference librarian. This expertise will guide the user in entering the various search items in a graphical interface similar to the one used by the index entry system (Figure 4). The expert search sub-system requires the expertise of a reference librarian to be built into it to help users formulate and launch queries. As in the case of the index generation sub-system, the expert system provides help in choosing appropriate search terms for index entries such as subject, sub-subject, keywords etc. The expert system uses various statistics, derived from past interactions, to optimize the search.

The search system also uses a graphical interface and a client process. Once the user has entered a search request, the client process communicates with the nearest SHDDB catalogue to determine the appropriate site of the SHDDB database. Subsequently, the client process communicates with this database and retrieves one or more semantic headers. The results of the query are then collected and sent to the user's workstation. The contents of these headers are displayed, on demand, to the user, who may decide to access one or more of the actual resources using a graphical window as in Figure 5. The item in question may be available from a number of sources. In such a case the best source is chosen on the basis of lowest cost. The client process would attempt to use appropriate hardware/software to retrieve the selected resources.
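The two-step lookup and the choice of the cheapest source described above can be sketched as follows. Everything named here is hypothetical: the Source and SemanticHeader classes, the site name, the example.org URLs and the single numeric cost are assumptions made for illustration, and in-memory dictionaries stand in for the catalogue and the database sites.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Source:
    url: str     # where a copy of the resource can be fetched
    cost: float  # assumed combined cost (fee, transfer time, ...)


@dataclass
class SemanticHeader:
    title: str
    subject: str
    sources: List[Source]


# Step 1: the nearest catalogue maps a subject to the site holding that
# part of the SHDDB database (the site name is made up).
CATALOGUE: Dict[str, str] = {"databases": "site-b"}

# Step 2: each site holds the semantic headers for its subjects.
SITES: Dict[str, List[SemanticHeader]] = {
    "site-b": [
        SemanticHeader(
            title="Sample resource",
            subject="databases",
            sources=[Source("ftp://mirror-1.example.org/sample.ps", 3.0),
                     Source("ftp://mirror-2.example.org/sample.ps", 1.5)],
        )
    ]
}


def search(subject: str) -> List[SemanticHeader]:
    """Ask the catalogue for the site, then the site for matching headers."""
    site = CATALOGUE.get(subject)
    if site is None:
        return []
    return [h for h in SITES.get(site, []) if h.subject == subject]


def best_source(header: SemanticHeader) -> Optional[Source]:
    """Pick the source with the lowest cost, or None if there is no source."""
    return min(header.sources, key=lambda s: s.cost, default=None)


headers = search("databases")
chosen = best_source(headers[0])
```

In this sketch the second mirror is chosen because its assumed cost is lower; how cost is actually measured is left open by the text.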

Annotations and Reviewing

The scientific world depends on peer review of documents submitted for publication. The annotations made in such reviews tend not to be published. However, comments to the editor made by readers of the serials are usually published and are accessible to the community. Since many of the resources on the Internet tend to be non-reviewed, it would be useful for a user to have access to annotations made by other users for a given resource. The proposed system allows users to add annotations to an existing resource. These annotations are stored along with the index in the SHDDB.

The annotation sub-system is similar to the indexing sub-system. However, only the few indexing entries needed to uniquely identify the resource in question are required (Figure 6). An annotation made by any user can be entered and would be registered with the identity and digital signature of the user. Each annotation could then be incorporated in the index entry (at least logically) and could be retrieved with the index. Such annotations, when made by recognized persons, would be a valuable guide for future users.
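A minimal sketch of registering an annotation with the annotator's identity and a signature, and attaching it logically to the index entry, is given below. It rests on loud assumptions: a keyed hash (HMAC from the Python standard library) stands in for a true public-key digital signature, and the class and function names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    resource_id: str  # the entry that uniquely identifies the resource
    annotator: str    # identity of the user making the annotation
    text: str
    signature: str    # stand-in for the user's digital signature


@dataclass
class IndexEntry:
    resource_id: str
    title: str
    annotations: List[Annotation] = field(default_factory=list)


def sign(key: bytes, resource_id: str, annotator: str, text: str) -> str:
    """Keyed hash over the annotation; NOT a real public-key signature."""
    message = "\n".join([resource_id, annotator, text]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def add_annotation(entry: IndexEntry, annotator: str, text: str,
                   key: bytes) -> Annotation:
    """Register a signed annotation and attach it to the index entry."""
    note = Annotation(entry.resource_id, annotator, text,
                      sign(key, entry.resource_id, annotator, text))
    entry.annotations.append(note)
    return note


def verify(note: Annotation, key: bytes) -> bool:
    """Check that the annotation was not altered after it was signed."""
    expected = sign(key, note.resource_id, note.annotator, note.text)
    return hmac.compare_digest(expected, note.signature)


entry = IndexEntry("res-001", "Sample resource")
note = add_annotation(entry, "reader@example.org", "Clear overview.", b"secret")
```

Retrieving the entry then also yields its annotations, and verify(note, key) fails if the text or the identity has been changed.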

The peer reviews of electronically submitted papers could be implemented using such annotations. Authentication of reviews has to be done by an appropriate editorial board.

Conclusions: Advantages of the approach

Current index systems are based on harvesting the network for new documents; such documents are retrieved and their contents used to provide terms for the index. The big disadvantage of this scheme is the unreliability of the index entries produced and the lack of an authentic abstract for the item. Currently, such schemes are relevant for Web text documents and are not applicable to other resources. Another problem with this approach is the unnecessary traffic on the network and the lack of cooperation and sharing among different systems. Finally, this approach becomes infeasible as more and more providers of information require payment. Furthermore, users, without having a better idea of their contents, would not be inclined to retrieve resources which, from their titles, seem irrelevant.

In the proposed system, the provider of the resource is the one who prepares the index information. Consequently, such an index entry would be more reliable than one derived by a third party or by simply scanning a document. The abstract field gives the provider of the resource the opportunity to supply a pertinent abstract or summary. Such a summary in the index allows users to make better informed decisions regarding the relevance of the source resource.

The system provides an expert system-driven graphical interface for the provider of the resource to produce an index entry and have this entry entered in the index database. The expert system provides help in choosing appropriate terms for index entries such as subject, sub-subject, keywords etc. It is also responsible for verifying the consistency of the index entry and the accessibility of the resource, and then posting the index entry to the index database.
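The verification step before posting can be illustrated with a small check. This is a sketch under assumptions: the list of required fields is hypothetical, and the accessibility test is passed in as a function so that the example needs no network access.

```python
from typing import Callable, Dict, List

# Assumed set of mandatory fields; the actual semantic header may differ.
REQUIRED_FIELDS = ("title", "subject", "keywords", "abstract", "identifier")


def check_entry(entry: Dict[str, str],
                is_accessible: Callable[[str], bool]) -> List[str]:
    """Return a list of problems; an empty list means the entry may be posted."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS
                if not entry.get(name, "").strip()]
    identifier = entry.get("identifier", "").strip()
    if identifier and not is_accessible(identifier):
        problems.append(f"resource not accessible: {identifier}")
    return problems


good = {"title": "Sample resource", "subject": "databases",
        "keywords": "index, search", "abstract": "A short summary.",
        "identifier": "http://www.example.org/sample.html"}
bad = dict(good, abstract="")

ok_report = check_entry(good, lambda url: True)
bad_report = check_entry(bad, lambda url: False)
```

An entry would be posted only when the returned list is empty; otherwise the problems can be shown to the provider for correction.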

In addition, the index database contains a number of control entries for the resource. Control entries are items such as the size of the resource, the password for authenticating subsequent updates of the index entry, and a list of annotations made about the resource by independent users.
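The control entries could be represented as a small record kept with each index entry. The sketch below is an assumption in several respects: the field names are hypothetical, and a salted hash of the password (PBKDF2 from the Python standard library) is stored instead of the password itself, which is a substitution made here and not something the text specifies.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field
from typing import List


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash so the password itself need not be stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


@dataclass
class ControlEntries:
    size_bytes: int       # size of the resource
    salt: bytes
    password_hash: bytes  # authenticates later updates of the index entry
    annotation_ids: List[str] = field(default_factory=list)


def new_control_entries(size_bytes: int, password: str) -> ControlEntries:
    """Create the control entries when the index entry is first registered."""
    salt = os.urandom(16)
    return ControlEntries(size_bytes, salt, hash_password(password, salt))


def may_update(control: ControlEntries, password: str) -> bool:
    """Allow an update only if the supplied password matches."""
    candidate = hash_password(password, control.salt)
    return hmac.compare_digest(candidate, control.password_hash)


control = new_control_entries(20480, "provider-secret")
control.annotation_ids.append("note-1")
```

A subsequent update of the index entry would be accepted only when may_update returns True for the password supplied by the provider.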

Acknowledgment

The author wishes to gratefully acknowledge the many thought-provoking comments by colleagues Carol Caughlin, Lee Harris and Rajjan Shinghal. This work was supported in part by a grant from the Seagram Funds for Academic Innovation.

References

[BERN] Berners-Lee, T., & Cailliau, R., "WorldWideWeb: Proposal for a HyperText Project" http://info.cern.ch/hypertext/WWW/Proposal.html

[BERN1] Berners-Lee, T., "UR* and The Names and Addresses of WWW objects", http://info.cern.ch/hypertext/WWW/Addressing/Addressing.html (see also RFC 1738)

[BERN2] Berners-Lee, Tim, Connolly, "Hypertext Markup Language, Internet working draft", http://info.cern.ch/hypertext/WWW/MarkUp/HTML.html

[BERN3] Berners-Lee, T. "Wide Web Initiative: The Project", http://info.cern.ch/hypertext/WWW/TheProject

[BYRN] Byrne, Deborah J., "MARC manual: understanding and using MARC record", Libraries Unlimited, Englewood, Colo. 1991.

[CRAW] Crawford, Walt, "MARC for Library Use: Understanding USMARC", G. K. Hall, Boston, MA, 1989.

[CROM] Cromwell, Willy, "The Core Record: A New Bibliographic Standard", Library Resources and Technical Services, Vol. 38-4, pp. 415-424, 1994.

[DEBR] De Bra, P., Houben, G-J., & Kornatzky, Y., "Search in the World-Wide Web", http://www.win.tue.nl/help/doc/demo.ps

[DESA1] Desai, Bipin C., "WebJournal: Visualization of Web Journey", August 1994, http://www.cs.concordia.ca/WebJournal.html

[DESA2] Desai, Bipin C., "Cover page aka Semantic Header", July 1994, http://www.cs.concordia.ca/semantic-header.html, revised version, August 1994, http://www.cs.concordia.ca/~bcdesai/semantic-header.html

[DESA3] Desai, Bipin C., Shinghal, Rajjan, "A System for Seamless Search of Distributed Information Sources", May 1994, http://www.cs.concordia.ca/w3-paper.html

[DESA4] Desai, Bipin C., Pollock, Richard, "MDAS: A Heterogeneous Distributed Database Management System", Information and Software Technology, January 1992, Vol. 34-1, pp. 28-41.

[DESA5] Desai, Bipin C., "An Introduction to Database Systems", West, St. Paul, MN 1990.

[DESA6] Desai, Bipin C., "Performance of a Composite Attribute and Join Index", IEEE Trans. On Software Engineering, Vol. 15-2, pp. 142-152, 1989.

[FLET] Fletcher, J., "Jumpstation", 1993, http://www.stir.ac.uk/jsbin/js

[GAYN] Gaynor, Edward, "Cataloging Electronic Texts: The University of Virginia Library Experience", Library Resources and Technical Services, Vol. 38-4, pp. 403-413, 1994.

[GIOR] Giordano, Richard, "The Documentation of Electronic Texts Using Text Encoding Initiative Headers: An Introduction", Library Resources and Technical Services, Vol. 38-4, pp. 389-401, 1994.

[GNAM] Global Network Academy Meta-Library, http://uu-gna.mit.edu:8001/cgi-bin/meta

[HORN] Horny, Karen L., "Minimal-level cataloging: A look at the issues - A symposium", Journal of Academic Librarianship, Vol. 11, pp. 332-334.

[KATZ] Katz, William A., "Introduction to Reference Work", Vol. 1-2, McGraw-Hill, New York, 1987.

[KOST] Koster, M., "ALIWEB (Archie Like Indexing the WEB)", http://web.nexor.co.uk/aliweb/doc/aliweb.html

[KOST1] Koster, M., "Simple Unified Search Interface (SUSI)", http://web.nexor.co.uk/susi/susi.html

[KOST2] Koster, M., "Configurable Unified Search Interface", http://web.nexor.co.uk/public/cusi/cusi.html

[MARC] Library of Congress, "MARC manuals used by the Library of Congress", American Library Association, Chicago, 1969.

[MCBR] McBryan, Oliver A., "World Wide Web Worm", http://www.cs.colorado.edu/home/mcbryan/WWWW.html

[MCBR1] McBryan, Oliver A., "GENVL", http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html

[META] Experimental Search Engine Meta-Index, http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Demo/metaindex.html

[MOSA] NCSA Mosaic http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html

[PETE] Petersen, Toni, Molholt, Pat (ed), "Beyond the book: extending MARC for subject access", G.K. Hall, Boston, MA, 1990.

[POST] Post, R., "Lagoon: a WWW cache", http://www.win.tue.nl/lagoon

[RFC1357] "A Format for E-mailing Bibliographic Records", D. Cohen: can be obtained via anonymous FTP from any one of: ds.internic.net, nis.nsf.net, src.doc.ic.ac.uk, munnari.oz.au and a number of other sites.

[RFC1737] "Functional Requirements for Uniform Resource Name", K. Sollins, L. Masinter: please see RFC1357 above.

[RFC1738] "Uniform Resource Locators (URL)", T. Berners-Lee, L. Masinter, M. McCahill: please see RFC1357 above.

[ROSS] Ross, Rayburn M., West, Linda, "MLC: A contrary viewpoint", Journal of Academic Librarianship, Vol. 11, pp. 334-336.

[RHEE] Rhee, Sue, "Minimal-level cataloging: Is it the best local solution to a national problem?", Journal of Academic Librarianship, Vol. 11, pp. 336-337, 1986.

[SEAR] Search WWW document full text, http://rbse.jsc.nasa.gov/eichmann/urlsearch.html

[TAYL] Taylor, Arlene G., "The information universe: Will we have chaos or control?", American Libraries, Vol. 25-7, pp. 629-632, 1994.

[THAU] Thau, R., "SiteIndex Transducer", http://www.ai.mit.edu/tools/site-index.html

[WEBC] WebCrawler, http://www.biotech.washington.edu/WebCrawler/WebQuery.html

[WWWC] World Wide Web Catalog, http://cui_www.unige.ch/cgi-bin/w3catalog


(c)Bipin C. Desai
Feb 1995