Metadata is information that records the characteristics and relationships of source data. It provides succinct information about the source data that may not be recorded in the source itself, whether because of the nature of the source or through oversight.
In this white paper, we limit our discussion to the importance of metadata for indexing in support of subsequent search and discovery by future users. At present, users are able to search for and obtain the required information after a number of trials with various indexing services. However, unless the challenges outlined below are met, this may no longer be the case in the year 2010.
Due to the sheer volume of data in the emerging information infrastructure, search and discovery will become difficult without a well-thought-out discovery mechanism built around adequate metadata. Consider what would happen if one had to search for a specific volume from the Library of Congress (LC) if its entire collection were piled together, helter-skelter, in a darkened hangar. The task becomes even more daunting if we are looking not for a specific volume but for any volume that deals with a given topic. The problem with current automatically generated index databases is that they carry inadequate semantic information. Yet it is evident that professional cataloging of the ever-growing body of information resources would be prohibitively expensive. Thus, the design of adequate metadata to describe the semantic contents of resources and to establish their semantic dependencies on other resources is of utmost importance. This, along with a registering system, would establish a basis for later search and discovery.
Metadata would provide an instrument to describe the semantic content of a resource. Such metadata is better suited to supporting discovery than the resource itself. In many cases the resources themselves may not be able to provide the semantic dependencies, or it would be computationally too expensive for them to do so. (For example, how does one conclude that a given piece of program code computes consumer loan payments without analyzing the program?) Metadata also facilitates the cataloging of resources such as audio, computer programs, services, images and videos. This becomes important when the resource itself is not as easily accessible as the index.
Another reason for using metadata and extracting salient features of a resource is to support retrieval by content. Automatic processing of the contents of a source by extractors has been done on an ad hoc basis but has been found to be unreliable. A case in point is the promise of natural language processing (NLP), which has not quite been realized. Approximations such as WAIS have been useful but have also shown that relevancy measures derived from term frequencies, proximity and similar statistics may not always be meaningful.
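The weakness of purely frequency-based relevancy can be illustrated with a minimal sketch. The scoring function below is a deliberately naive term count and is not the ranking algorithm of WAIS or of any other system; the function name tf_score and both sample documents are assumptions made for illustration only.

```python
from collections import Counter


def tf_score(query_terms, text):
    """Score a text by how many times the query terms occur in it (naive term count)."""
    words = Counter(text.lower().split())
    return sum(words[term.lower()] for term in query_terms)


# Hypothetical documents. doc_a merely repeats the query words;
# doc_b is actually about the topic but uses different wording.
doc_a = "loan loan loan payment payment this page lists words like loan and payment"
doc_b = "how to compute monthly installments when borrowing money for a car"

query = ["loan", "payment"]
scores = {"doc_a": tf_score(query, doc_a), "doc_b": tf_score(query, doc_b)}
```

Under this measure the document that only repeats the query words outranks the one that actually addresses the topic, which is the kind of result that a descriptive metadata field such as a subject entry is intended to correct.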
Metadata could also be used to express semantic dependencies which are inherent in a collection of objects. This means that the structure of the objects could be expressed using metadata as their surrogates and the actual sources could be separated from their metadata. This simplifies the storage of the resources and allows for the recognition of redundancies. Extracting such semantic dependencies in metadata allows for search based on the contents of multimedia resources.
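As a rough sketch of this idea, metadata records can stand in for the resources and carry both dependency links and a content fingerprint. The record layout, the field names (resource_id, checksum, depends_on) and the sample entries are all hypothetical; a real metadata structure would be richer and would have to be agreed upon by the community.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Surrogate for a resource; the resource itself is stored elsewhere."""
    resource_id: str
    title: str
    checksum: str                                    # hypothetical content fingerprint
    depends_on: list = field(default_factory=list)   # ids of resources this one relies on


# Hypothetical catalog entries.
records = [
    Metadata("r1", "Loan calculator source", "abc"),
    Metadata("r2", "Loan calculator manual", "def", depends_on=["r1"]),
    Metadata("r3", "Loan calculator source (mirror)", "abc"),
]


def find_redundant(records):
    """Return groups of resource ids whose fingerprints are identical."""
    groups = {}
    for record in records:
        groups.setdefault(record.checksum, []).append(record.resource_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def dependents_of(resource_id, records):
    """Return ids of resources whose metadata declares a dependency on resource_id."""
    return [r.resource_id for r in records if resource_id in r.depends_on]
```

Both questions, which resources are duplicates and which resources depend on a given one, are answered here from the surrogates alone, without opening any of the resources.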
Initial query processing could be done on the metadata, thus avoiding access to most of the resources and the computationally expensive interpretation they may require. This becomes more advantageous when there are costs (time, money, network bandwidth and overloading) involved in accessing resources. The cost of accessing metadata would be much smaller than the cost of accessing the resource. Query processing would be supported by statistics and by an expert system that helps formulate queries, much as a research librarian does.
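The two-stage approach can be sketched as follows: the query is first evaluated against the metadata, and only the matching resources are then accessed. The Catalog class, its method names and the sample fields are assumptions for illustration; the fetch method is a stand-in for whatever expensive access (network transfer, decoding, program analysis) a real system would perform.

```python
class Catalog:
    """Toy catalog that answers queries from metadata before touching any resource."""

    def __init__(self):
        self.metadata = {}     # resource_id -> dict of metadata fields
        self.fetch_count = 0   # counts how many expensive accesses were made

    def register(self, resource_id, **fields):
        """Record the metadata for a resource."""
        self.metadata[resource_id] = fields

    def fetch(self, resource_id):
        """Stand-in for an expensive access to the resource itself."""
        self.fetch_count += 1
        return f"<contents of {resource_id}>"

    def search(self, **criteria):
        """Filter on metadata first, then fetch only the matching resources."""
        hits = [
            rid for rid, fields in self.metadata.items()
            if all(fields.get(key) == value for key, value in criteria.items())
        ]
        return {rid: self.fetch(rid) for rid in hits}


# Hypothetical entries, including a non-text resource.
catalog = Catalog()
catalog.register("prog-17", kind="program", subject="consumer loans")
catalog.register("img-02", kind="image", subject="architecture")
catalog.register("vid-09", kind="video", subject="architecture")

results = catalog.search(subject="consumer loans")
```

In this example only one of the three registered resources is ever fetched; the other two are ruled out by their metadata alone.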
Appropriately constructed metadata could support queries based on content as well as traditional queries based on items such as title, author and subject.
The challenges of the coming information age in the area of metadata can be summarized as: defining an extensible metadata structure; extracting metadata from resources automatically and semi-automatically (with human assistance); designing a distributed indexing system; designing a query language that supports discovery and provides location transparency; designing expert-based systems for registering and searching resources; and designing an intuitive graphical user interface for interacting with the discovery process.