Workshop A: Web-wide Indexing/Semantic Header or Cover Page
Chairs: Bipin C. Desai, Brian Pinkerton
Leon Shklar
Over the last few years, there has been a proliferation of different
indexing technologies. There has also been a proliferation of applications
and information management systems that handle specific types of data (text,
images, structured data, etc.). We believe that it is unrealistic to expect that
the massive amounts of existing heterogeneous data will ever be converted
to a single format, that everyone will use a single indexing technology, or
even that all different retrieval engines will use the same indexing
information:
1. It would be prohibitively expensive to convert all the existing data
and all the existing indexing information into single representations.
It would be almost as costly to have to maintain backward compatibility
with existing tools and applications.
2. Given the diversity of both the existing information and the retrieval
objectives, any single representation of indexing information would
most likely be redundant to the point of being impractical.
3. New data formats and representations, as well as new indexing technologies,
will continue to emerge.
We believe that the same approach should be adopted in dealing with both
the legacy data and the legacy indexing structures. We are developing
a declarative language to support object encapsulation of both data and
indices [1]. We intend to treat each indexing technology as a black box
and to find meaningful ways of combining results of querying different
heterogeneous indices.
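
As a minimal sketch of the black-box view, assume each index is wrapped
as a callable that takes a query string and returns (document id, weight)
pairs. The names Index, query_all, keyword_index and image_index are
hypothetical and do not come from the repository definition language in
[1]; the two toy indices return fixed, made-up results.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a black-box index is any callable from a query string
# to a list of (doc_id, raw_weight) pairs. Its internals are never inspected.
Index = Callable[[str], List[Tuple[str, float]]]


def query_all(indices: Dict[str, Index], query: str) -> Dict[str, Dict[str, float]]:
    """Run one query against every index.

    Returns doc_id -> {index_name: raw weight}. Weights are kept separate
    per index because they are not yet comparable with each other.
    """
    hits: Dict[str, Dict[str, float]] = {}
    for name, search in indices.items():
        for doc_id, weight in search(query):
            hits.setdefault(doc_id, {})[name] = weight
    return hits


# Two toy stand-ins for real indexing technologies (illustrative data only).
def keyword_index(query: str) -> List[Tuple[str, float]]:
    return [("doc1", 12.0), ("doc2", 3.0)]


def image_index(query: str) -> List[Tuple[str, float]]:
    return [("doc2", 0.9), ("doc3", 0.4)]


hits = query_all({"keyword": keyword_index, "image": image_index}, "sunset")
```

The sketch deliberately stops short of merging the weights into a single
ranking; it only gathers what each black box reports for each document.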
Of course, the weights of selections computed when running a query against
a particular index generally make sense only in the context of that index.
We see the major challenge as using statistical and machine learning methods
to scale the weights computed for different (possibly heterogeneous) indices
against each other.
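
To illustrate why raw weights cannot be compared directly, the following
sketch rescales each index's weights into [0, 1] before summing them.
Min-max rescaling is a simple baseline chosen here for illustration only,
not the statistical or learning method proposed above; the function names
and the sample weights are hypothetical.

```python
def min_max_scale(results):
    """Rescale one index's (doc_id, weight) pairs into the range [0, 1]."""
    if not results:
        return {}
    weights = [w for _, w in results]
    lo, hi = min(weights), max(weights)
    span = hi - lo
    # If all weights are equal, treat every hit as a full-strength match.
    return {doc: ((w - lo) / span if span else 1.0) for doc, w in results}


def merge(per_index_results):
    """Sum the rescaled weights per document and rank by the total."""
    combined = {}
    for results in per_index_results:
        for doc, score in min_max_scale(results).items():
            combined[doc] = combined.get(doc, 0.0) + score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# One index reports weights in the tens, the other in fractions of one.
text_hits = [("a", 40.0), ("b", 10.0), ("c", 25.0)]
image_hits = [("b", 0.75), ("c", 0.25), ("a", 0.5)]

ranked = merge([text_hits, image_hits])
```

Without the rescaling step, the text index would dominate the sum simply
because its weights are numerically larger.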
Our statistics-based approach is to view a query against several different
indices as the problem of predicting an unknown outcome based on the observed
values of multiple heterogeneous predictors. The methods of machine learning
provide the ability to perform similar extrapolation from training sets of
queries. The general goal of inductive learning is to generalize from labeled
data and form rules for accurately labeling future, unlabeled data. In this
case, learning will yield procedures for user-specific filtering of
retrieval results, derived from the user's history of data access (examples
of appropriate and inappropriate retrievals).
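
One way to picture learning from a user's history is a small logistic
regression in which each past retrieval is described by the weights that
two indices assigned to it and labeled 1 (appropriate) or 0
(inappropriate). Logistic regression is used here only as a familiar
stand-in; the text above does not commit to a particular learning method,
and the history, labels and function names below are invented for
illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by stochastic gradient descent."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Estimated probability that a retrieval with index weights x is appropriate."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical history: [weight from index A, weight from index B] per retrieval.
history = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
labels = [1, 1, 0, 0]  # 1 = user found it appropriate, 0 = inappropriate

w, b = train(history, labels)
likely_good = predict(w, b, [0.85, 0.2])
likely_bad = predict(w, b, [0.15, 0.8])
```

For this invented user the learned weights favor index A over index B, so
a filter built on predict would keep retrievals like the first and drop
those like the second.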
[1] L. Shklar, K. Shah, and C. Basu, "Putting Legacy Data on the Web:
A Repository Definition Language," to be presented at WWW'95,
Darmstadt, Germany.