Workshop A: Web-wide Indexing/Semantic Header or Cover Page

Chairs: Bipin C. Desai, Brian Pinkerton

Leon Shklar

Over the last few years, indexing technologies have proliferated, as have applications and information management systems that handle specific types of data (text, images, structured data, etc.). We believe it is unrealistic to expect that the massive amounts of existing heterogeneous data will ever be converted to a single format, that everyone will use a single indexing technology, or even that all retrieval engines will use the same indexing information:

1. It would be prohibitively expensive to convert all existing data and all existing indexing information into single representations. It would be almost as bad to have to maintain backward compatibility for existing tools and applications.

2. Given the diversity of both the existing information and the retrieval objectives, any single representation of indexing information would most likely be redundant to the point of being impractical.

3. New data formats and representations, as well as new indexing technologies, will continue to emerge.

We believe that the same approach should be adopted in dealing with both legacy data and legacy indexing structures. We are developing a declarative language to support object encapsulation of both data and indices [1]. We intend to treat each indexing technology as a black box and to find meaningful ways of combining the results of querying different heterogeneous indices. Of course, the weights assigned to selections when a query is run against a particular index generally make sense only in the context of that index. We see the major challenge as using statistical and machine learning methods to scale the weights computed for different (possibly heterogeneous) indices against each other.
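To make the scaling problem concrete, the following is a minimal sketch of combining results from black-box indices whose raw weights are not comparable. It is not the method proposed here: simple min-max rescaling stands in for the statistical and learned scaling discussed in this statement, and the names `Index`, `rescale`, `merge`, `text_index`, and `image_index` are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A black-box index: takes a query string and returns (document id, raw weight)
# pairs. Raw weights are only meaningful within the index that produced them.
Index = Callable[[str], List[Tuple[str, float]]]


def rescale(hits: List[Tuple[str, float]]) -> Dict[str, float]:
    """Min-max scale one index's raw weights into [0, 1] (an assumed baseline)."""
    if not hits:
        return {}
    weights = [w for _, w in hits]
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights equal: treat every hit as equally strong.
        return {doc: 1.0 for doc, _ in hits}
    return {doc: (w - lo) / (hi - lo) for doc, w in hits}


def merge(query: str, indices: Dict[str, Index]) -> List[Tuple[str, float]]:
    """Query every index, rescale each result list, and sum scaled weights per document."""
    combined: Dict[str, float] = {}
    for index in indices.values():
        for doc, weight in rescale(index(query)).items():
            combined[doc] = combined.get(doc, 0.0) + weight
    # Highest combined weight first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical indices whose weights live on different scales.
def text_index(query: str) -> List[Tuple[str, float]]:
    return [("doc_a", 12.0), ("doc_b", 7.0), ("doc_c", 2.0)]


def image_index(query: str) -> List[Tuple[str, float]]:
    return [("doc_b", 0.75), ("doc_d", 0.5), ("doc_c", 0.25)]


ranked = merge("semantic header", {"text": text_index, "image": image_index})
for doc, weight in ranked:
    print(doc, weight)
```

A fixed rescaling like this ignores how reliable each index is for a given user or query type, which is exactly the gap that learned scaling is meant to close.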
Our statistics-based approach is to view a query against several different indices as the problem of predicting an unknown outcome from the observed values of multiple heterogeneous predictors. Machine learning methods make it possible to perform similar extrapolation from training sets of queries. The general goal of inductive learning is to generalize from labeled data and form rules for accurately labeling future, unlabeled data. In this case, the results of learning will be procedures for user-specific filtering of results, based on the user's history of data access (that is, on examples of appropriate and inappropriate retrievals).

[1] L. Shklar, K. Shah, and C. Basu, "Putting Legacy Data on the Web: A Repository Definition Language", to be presented at WWW'95, Darmstadt, Germany.
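The statistical view described above, predicting whether a retrieval is appropriate from the weights reported by several heterogeneous indices, can be sketched as follows. This is an illustration under stated assumptions rather than the approach's actual implementation: logistic regression trained by stochastic gradient ascent is one possible choice of learner, the `history` data is invented, and the function names `train` and `predict` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One training example: the weights a document received from each index,
# plus a label (1 = appropriate retrieval for this user, 0 = inappropriate).
Example = Tuple[Sequence[float], int]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: Sequence[float], bias: float, scores: Sequence[float]) -> float:
    """Estimated probability that a retrieval with these index weights is appropriate."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))


def train(examples: List[Example], epochs: int = 2000, rate: float = 0.05):
    """Fit a logistic regression model by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for scores, label in examples:
            error = label - predict(weights, bias, scores)
            bias += rate * error
            weights = [w + rate * error * s for w, s in zip(weights, scores)]
    return weights, bias


# Invented access history for one user: (text-index weight, image-index weight).
# This user's judgments happen to track the text index and ignore the image index.
history: List[Example] = [
    ((9.0, 0.2), 1),
    ((8.0, 0.9), 1),
    ((7.5, 0.1), 1),
    ((2.0, 0.8), 0),
    ((1.0, 0.3), 0),
    ((3.0, 0.9), 0),
]

weights, bias = train(history)

# Filter new candidate retrievals, keeping those predicted to be appropriate.
candidates = {"doc_x": (8.5, 0.5), "doc_y": (1.5, 0.5)}
kept = [doc for doc, scores in candidates.items() if predict(weights, bias, scores) > 0.5]
print(kept)
```

Because the model learns one coefficient per index, it also yields a user-specific scaling of the indices against each other, without needing to know how any index computes its weights.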