Semantic Header

Professional catalogers have found the need for elements similar to those in DMEL in most indexing applications, which means that such elements must be included in most indexes for information resources. Because titles are the most commonly used search criterion, they need to be indicative of the contents of the document. This is not always the case; hence someone (the author or the cataloger) has to add annotations, keywords or key phrases to indicate the actual content.

The accuracy or quality of a document can be indicated by including reviewers' opinions. However, such opinions are rarely accessible to the traditional cataloger. Another feature of importance to the user of an index is the presence of an accurate abstract. An abstract provides a summary of the material and thus is more indicative of the contents than the title or the keywords, whether these are supplied by the author or bibliographer or selected by scanning the text. Reference librarians and library users tend to use such annotated bibliographies to help choose among competing sources. Fortunately, for an on-line index system as proposed in CINDI [DESA2], it is possible to include not only the author-supplied abstract but also annotations made by independent users in the index entry for the information resource.

The Semantic Header [DESA1] was conceived as a required component of all HTML documents for the Web. It was originally presented at the First International World Wide Web Conference in Geneva (April 1994). Since then, it has been extended to other resources accessible directly on the Internet.

The structure of the index is similar to the ones used for most library indices and includes other information deemed useful for on-line systems. The semantic header may be considered an application of SGML [GOLD]. However, the user working with the index entry system does not need to write SGML directly: an expert system guides the user through the process and through the choice of standardized terms by means of an easy-to-use graphical interface. Figures 1 through 3 below give the DTD for the Semantic Header.

The intent of the semantic header is to include those elements that are most often used in the search for an information resource. Since the majority of searches begin with a title, name of one of the authors (70%), subject and sub-subject (50%) [Katz], we have made the entry of these elements mandatory in the semantic header. The abstract and annotations are also relevant in deciding whether or not the resource would be useful, so these items are included as well. The elements of the semantic header are described briefly below:

<!ENTITY % SE_SBJCT '(General,(SubLevel1, SubLevel2?)?)+'>

<!ENTITY % SE_RA '(Role, Name, Organization?, Address?, Phone?, Fax?, EMail?)+'>

<!ENTITY % SE_KW 'Kw+'>

<!ENTITY % SE_ID '(IdDomain, IdValue)+'>

<!ENTITY % SE_DT '(DSchema, Created, Expiry?, Updated?)'>

<!ENTITY % DATE_SCHEMA "YYYY | YYYY-MM-DD | Other" >

<!ENTITY % SE_VR '(Current, Supersede?)?'>

<!ENTITY % SE_CLASS '(ClassDomain, ClassValue)*'>

<!ENTITY % DOM_CLASS "Legal | Security | Nature | Other">

<!ENTITY % SE_CVRG '(CovDomain, CovValue)*'>

<!ENTITY % DOM_CVRG "Audience | Geographical Coverage | Spatial Coverage | Epoch | Other">

<!ENTITY % SE_SYSRQ '(SysDomain, SysValue+)*'>

<!ENTITY % DOM_SYSRQ "Hardware | Network | Software | Other">

<!ENTITY % SE_GNR '(Form, Size)*'>

<!ENTITY % SE_SRC '(Relationship, IdDomain, IdValue)*'>

<!ENTITY % RELATIONS "Contains | ContainedIn | ContinuedFrom | ContinuedTo | DerivedFrom | IndexOf | IndexedIn | PartOf | PrecededBy | FollowedBy | Other">

<!ENTITY % SE_COST '(Currency, Amount)*'>

<!ENTITY % SE_ANN '(Annotation, Signature)*'>

<!ENTITY % SE_CNTRL '(Account, Password)'>

Figure 1 DTD for Semantic Header: Entities

Title, Alt-title

The first field of the semantic header is the title[5] of the resource. It is a name given to the resource by its creator(s) and is a required field. In the formal definition it is enclosed within the tags beginning with <title> and terminated by </title>. The title could include the sub-title, as is done in many cataloging systems. The alternate title field is enclosed by the tags <alt-title>, </alt-title> and used to indicate an "official" secondary title or an alternate title of the resource. Whereas the element title is a required element, the alternate title is optional.

<!-- Element Minimization Value Default -->

<!ELEMENT SemHdr - - (Title, AltTitle?, Subject, Language?, CharSet?, RespAgent,
                      Keywords, Identifier, Dates, Version, Classification, Coverage, Sysreq,
                      Genre, Source, Cost, Abstract?, Annotation, Control) >

<!ELEMENT Title - - CDATA #REQUIRED >

<!ELEMENT AltTitle - - CDATA #IMPLIED >

<!ELEMENT Subject - - (% SE_SBJCT;) >

<!ELEMENT Language - - CDATA #IMPLIED >

<!ELEMENT CharSet - - CDATA #IMPLIED >

<!ELEMENT RespAgent - - (% SE_RA;) >

<!ELEMENT Keywords - - (% SE_KW;) >

<!ELEMENT Identifier - - (% SE_ID;) >

<!ELEMENT Dates - - (% SE_DT;) >

<!ELEMENT Version - - (% SE_VR;) >

<!ELEMENT Classification - - (% SE_CLASS;) >

<!ELEMENT Coverage - - (% SE_CVRG;) >

<!ELEMENT Sysreq - - (% SE_SYSRQ;) >

<!ELEMENT Genre - - (% SE_GNR;) >

<!ELEMENT Source - - (% SE_SRC;) >

<!ELEMENT Cost - - (% SE_COST;) >

<!ELEMENT Abstract - - CDATA #IMPLIED >

<!ELEMENT Annotation - - (% SE_ANN;) >

<!ELEMENT Control - - (% SE_CNTRL;) >

Figure 2 DTD for Semantic Header: Elements

Subject

The subject and sub-subjects of the resource are indicated in the next field which is a repeating group (a multi-part field with one or more occurrences of items in the group). All resources must have at least one occurrence for this field.
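The content model for the subject group in Figure 1, '(General,(SubLevel1, SubLevel2?)?)+', can be read as three rules: at least one subject entry, a general subject in every entry, and a second sub-level only where a first sub-level is present. The following Python fragment is a minimal sketch of those rules, not part of the Semantic Header system itself; the names Subject and subject_problems are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Optional


class Subject(NamedTuple):
    """One occurrence of the subject repeating group (hypothetical type)."""
    general: str
    sublevel1: Optional[str] = None
    sublevel2: Optional[str] = None


def subject_problems(subjects: List[Subject]) -> List[str]:
    """Return the rule violations found; an empty list means the group is acceptable."""
    problems = []
    if not subjects:
        problems.append("at least one subject is required")
    for i, s in enumerate(subjects):
        if not s.general:
            problems.append(f"subject {i}: general subject is required")
        if s.sublevel2 and not s.sublevel1:
            problems.append(f"subject {i}: SubLevel2 requires SubLevel1")
    return problems
```

An entry such as Subject("Computer Science", "Information Retrieval") passes these checks, whereas an entry that gives a second sub-level without a first is reported.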

Language, Character set

The language of the resource and the character set used are given in the next two optional fields.

Author and other responsible agents

The details about the author(s) and/or other agent(s) responsible for the resource are given in the next repeating group[6]. The sub-fields are: role[7] of the agent, name, organization, address, phone and fax numbers, and e-mail address. All sub-fields save the name are optional, except in the case of corporate entities, where the organization must also be given. By using the role sub-field and giving it an appropriate value, semantics for agents such as editor or publisher are incorporated in this repeating group.
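The rules just stated for a responsible agent (name always required, organization required for corporate entities) could be checked as in the following Python sketch. It is an illustration under assumptions rather than the system's own code: the class name ResponsibleAgent, the is_corporate flag and the default role value "author" are all hypothetical, since the text does not say how a corporate entity is marked or which role values are allowed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponsibleAgent:
    """One occurrence of the responsible-agent repeating group (hypothetical type)."""
    name: str
    role: str = "author"              # assumed default; allowed role values are not listed in the text
    organization: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    fax: Optional[str] = None
    email: Optional[str] = None
    is_corporate: bool = False        # hypothetical marker for corporate entities

    def problems(self) -> List[str]:
        """Return the rule violations found; an empty list means the entry is acceptable."""
        found = []
        if not self.name.strip():
            found.append("name is required")
        if self.is_corporate and not (self.organization or "").strip():
            found.append("organization is required for corporate entities")
        return found
```

Roles such as editor or publisher would simply be different values of the role field, which is how the repeating group carries their semantics.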

<!-- Element Minimization Value Default -->

<!ELEMENT General - O CDATA #REQUIRED >

<!ELEMENT Sublevel1 - O CDATA #IMPLIED >

<!ELEMENT Sublevel2 - O CDATA #IMPLIED >

<!ELEMENT Role - O (%ROLE;) #REQUIRED >

<!ELEMENT Name - O CDATA #REQUIRED >

<!ELEMENT Organization - O CDATA #IMPLIED >

<!ELEMENT Address - O CDATA #IMPLIED >

<!ELEMENT Phone - O CDATA #IMPLIED >

<!ELEMENT Fax - O CDATA #IMPLIED >

<!ELEMENT EMail - O CDATA #IMPLIED >

<!ELEMENT Kw - O CDATA #IMPLIED >

<!ELEMENT IdDomain - O (%DOM_ID;) #REQUIRED >

<!ELEMENT IdValue - O #PCDATA #REQUIRED >

<!ELEMENT DSchema - O (%DATE_SCHEMA;) #REQUIRED >

<!ELEMENT Created - O #PCDATA #IMPLIED >

<!ELEMENT Expiry - O #PCDATA #IMPLIED >

<!ELEMENT Updated - O #PCDATA #IMPLIED >

<!ELEMENT Current - O CDATA #IMPLIED >

<!ELEMENT Supersede - O CDATA #IMPLIED >

<!ELEMENT ClassDomain - O (%DOM_CLASS;) #REQUIRED >

<!ELEMENT ClassValue - O #PCDATA #REQUIRED >

<!ELEMENT CovDomain - O (%DOM_CVRG;) #REQUIRED >

<!ELEMENT CovValue - O #PCDATA #REQUIRED >

<!ELEMENT SysDomain - O (%DOM_SYSRQ;) #REQUIRED >

<!ELEMENT SysValue - O CDATA #REQUIRED >

<!ELEMENT Form - O CDATA #IMPLIED >

<!ELEMENT Size - O CDATA #IMPLIED >

<!ELEMENT Relationship - O (%RELATIONS;) #REQUIRED >

<!ELEMENT IdDomain - O (%DOM_ID;) #REQUIRED >

<!ELEMENT IdValue - O #PCDATA #REQUIRED >

<!ELEMENT Currency - O CDATA #IMPLIED >

<!ELEMENT Amount - O #PCDATA #IMPLIED >

<!ELEMENT Annotation - O CDATA #IMPLIED >

<!ELEMENT Signature - O CDATA #IMPLIED >

<!ELEMENT Account - O CDATA #IMPLIED >

<!ELEMENT Password - O SECRET #IMPLIED >

Figure 3 DTD for Semantic Header: Sub-elements

Keyword

The list of keywords is included in this field.

Identifier

The next element is a repeating group for recording the identifiers of the resource. Each occurrence of this group consists of two sub-fields: one for the domain and the other for the corresponding value.

The domain could be an accepted or standardized coding scheme issued by an appropriate authority, such as ISBN, ISSN, URL (FTP, GOPHER, HTTP) [BERN1], or URN [RFC1737], and the value contains the corresponding coded identifier. Since a resource in electronic form may be accessible from more than one site, there could be several entries for the same domain, such as URL. The URN field gives the unique name of the resource, if any. This name may be used instead of a location (URL) if the item is likely to move or is accessible from multiple locations[8]. The identifier(s) can be used to locate the resource.
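Because the same domain may occur several times, the identifier group behaves like a list of (domain, value) pairs rather than a one-to-one mapping. The Python sketch below illustrates this under assumptions: the function name group_by_domain is hypothetical, and the sample values are placeholders, not identifiers of any real resource.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each occurrence of the identifier group is one (domain, value) pair.
# The values are placeholders for illustration only.
SAMPLE_IDENTIFIERS: List[Tuple[str, str]] = [
    ("URL", "http://example.org/papers/header.html"),
    ("URL", "ftp://mirror.example.net/pub/header.ps"),
    ("ISBN", "0-000-00000-0"),
]


def group_by_domain(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect all values recorded for each domain, keeping their original order."""
    grouped = defaultdict(list)
    for domain, value in pairs:
        grouped[domain.upper()].append(value)
    return dict(grouped)
```

A client that wants to locate the resource could then try each value listed under URL in turn.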

In the absence of an accepted standard for URN, we use an alternate name, called the Semantic Header Name (SHN). The SHN is derived by concatenating the following required elements in the semantic header: the title, name of first author (or name of organization, if the resource is attributable to a corporate or organizational entity), first subject, and creation date. The string generated is prefixed by the initial location of the resource and suffixed with an optional system-generated integer for possible disambiguation. With this scheme, the user-supplied elements in the SHN may, with a very small probability, map to more than one resource. If multiple hits are encountered during a search based on the user-supplied elements of the SHN, the system would inform the user of the "collision". The user could then select the appropriate resource index entry by perusing the other elements recorded in the semantic header.
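The derivation of the SHN can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text does not specify a separator or any normalization, so the "|" separator, the function names make_shn and matching_entries, and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional


def make_shn(location: str, title: str, first_author: str, first_subject: str,
             created: str, serial: Optional[int] = None, sep: str = "|") -> str:
    """Build an SHN: location prefix, the four required elements, optional integer suffix.

    The separator is an assumption; the source only says the elements are concatenated.
    """
    parts = [location, title, first_author, first_subject, created]
    if any(not p.strip() for p in parts):
        raise ValueError("location and all four required elements must be non-empty")
    shn = sep.join(p.strip() for p in parts)
    if serial is not None:
        shn += sep + str(serial)   # system-generated number used for disambiguation
    return shn


def matching_entries(entries: List[Dict[str, str]], title: str, first_author: str,
                     first_subject: str, created: str) -> List[Dict[str, str]]:
    """Return every entry whose user-supplied SHN elements match; more than one is a collision."""
    key = (title, first_author, first_subject, created)
    return [e for e in entries
            if (e["title"], e["first_author"], e["first_subject"], e["created"]) == key]
```

If matching_entries returns more than one entry, the system would report the collision and let the user choose among the candidates by inspecting the rest of each semantic header.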

The identifier entry in the semantic header may also contain an entry for an archive site. The domain value UAS (universal archive site) is used to indicate the archive site for the resource. It is expected that the resource will exist at this site beyond its expiry date, if any. Of course, the site itself is guaranteed to exist beyond the life of any resource. It is envisaged that the archive site could be an independent resource provider. Examples of traditional resource providers that would be feasible archive sites are national libraries such as the Library of Congress in the U.S., the British Library, and the National Library and CISTI in Canada. However, private, for-profit corporations could be alternate sites for archiving resources. Archiving would provide an anchor for the otherwise ephemeral nature of some resources on the network. Since the archive site may not be known when the semantic header is first registered, the system would support update operations in which existing entries could be modified. Other update operations, such as modification of addresses and URLs, would also be supported.

Dates

The dates of creation (required), expiry and update, if any, are given next. Any updates made are indicated by a system-generated date.
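Figure 1 lists the date schemas as "YYYY | YYYY-MM-DD | Other". A check of a date value against its declared schema might look like the Python sketch below. It is an illustration only: the function names are hypothetical, and treating any value as acceptable under "Other" is an assumption, since the text does not say how such dates are interpreted.

```python
from datetime import datetime
from typing import List, Optional


def date_fits_schema(schema: str, value: str) -> bool:
    """Check a date string against one of the schemas named in Figure 1."""
    if schema == "YYYY":
        return len(value) == 4 and value.isascii() and value.isdigit()
    if schema == "YYYY-MM-DD":
        if len(value) != 10:       # insist on zero-padded month and day
            return False
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            return False
        return True
    # Assumption: "Other" accepts any value; unknown schemas are rejected.
    return schema == "Other"


def date_problems(schema: str, created: Optional[str],
                  expiry: Optional[str] = None,
                  updated: Optional[str] = None) -> List[str]:
    """Return rule violations: creation date is required, the others are optional."""
    problems = []
    if not created:
        problems.append("creation date is required")
    for label, value in (("created", created), ("expiry", expiry), ("updated", updated)):
        if value and not date_fits_schema(schema, value):
            problems.append(f"{label} does not match schema {schema}")
    return problems
```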

Version

The version number, and the version number being superseded if any, are given in these optional elements.

Classification

The intended classification is indicated in the next optional repeating group. It consists of a domain (nature of resource, security or distribution restriction, copyright status, etc.) and the corresponding value.

Coverage

The coverage is indicated in the next optional repeating group. It consists of a domain (target audience, coverage in a spatial and/or temporal term, etc.) and the corresponding value.

System Requirements

A list of system requirements, such as the hardware and software required to access, use, display or operate the resource, is included in the semantic header as an optional repeating group. It consists of a domain of the system requirements (possible values are: hardware, software, network, protocols, etc.) and the corresponding requirement.

Genre

This optional element is used to describe the physical or electronic format of the resource. It consists of a form (the type of representation, which in the case of a file could be its format, such as ASCII, PostScript, TeX, GIF, etc.) and the corresponding size of the resource.

Source/Reference

The relationship of the resource to other resources may be indicated by the next optional repeating group. It contains the relationships, domains and identifiers of related resources. A related object may have been used in deriving the resource being described, or it may be one of its sub- or super-components. Such information is usually found in the body of a document-like resource. However, this optional group makes it possible to record the information explicitly for document-like resources, and provides a place to register it for resources in other formats.

Cost

In the case of a resource accessible for a fee, the cost of accessing it[9] is given next. It consists of a currency and the cost for accessing the resource.

Abstract and Annotations

The abstract and annotations are given in the next fields. The abstract is provided by the author of the resource; the annotations are made by the author and/or independent users of the resource and include their identities along with their digital signatures. Once registered, the annotations cannot be modified.

Control

The last set of items in the semantic header consists of control items, such as the account to which charges for accessing the resource are to be credited, and encoded passwords or the digital signature of the provider of the resource. Any change to the updatable part of the semantic header requires the password or digital signature. Another piece of control information is the digital signature of the resource itself. This may be used to authenticate the resource when it is retrieved through a semantic header. It is assumed that there is a mechanism to access the resource's digital signature.
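The two control checks described above (authorizing an update and authenticating a retrieved resource) can be illustrated with the Python sketch below. It deliberately swaps in a simpler technique: a plain SHA-256 digest stands in for both the encoded password and the digital signature, which is an assumption made only to keep the example self-contained. A true digital signature would require a public-key scheme, and the function names are hypothetical.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Hex SHA-256 digest; a stand-in here for an encoded password or a signature."""
    return hashlib.sha256(data).hexdigest()


def may_update(supplied_password: str, stored_digest: str) -> bool:
    """Allow a change to the updatable part only if the password matches the stored digest."""
    supplied = digest(supplied_password.encode("utf-8"))
    return hmac.compare_digest(supplied, stored_digest)


def resource_is_authentic(resource: bytes, recorded_digest: str) -> bool:
    """Compare the retrieved resource against the digest recorded in the semantic header."""
    return hmac.compare_digest(digest(resource), recorded_digest)
```

In this sketch a resource whose content has changed since registration no longer matches the recorded digest, which is the property the control element relies on when a resource is retrieved through its semantic header.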
