Workshop A: Web-wide Indexing/Semantic Header or Cover Page
Chairs: Bipin C. Desai, Brian Pinkerton
Ted Hardie
NASA provides an enormous amount of information via the
World Wide Web: astronomical imagery, current program
announcements, archival data, technical reports, educational and
technical resources, and, yes, even pictures of the space shuttle.
This wealth of data means that walking the web at NASA can
produce many unexpected epiphanies.
Unfortunately, a user may have to rely on epiphany; finding
any specific piece of information can prove to be a daunting task.
Each NASA center maintains its own web servers, and many
divisions, branches, and projects choose to publish directly to the
web. Subdivisions may or may not be linked hierarchically, and
those hierarchies are, in any case, none too clear to those not
embedded within them. Subject linkages often recursively cross-
link, which can mean that a search leads the user endlessly from
pages containing primarily pointers to other pages containing
primarily pointers.
One of the NAIC's efforts to enhance access to NASA
resources has been a project to examine how users traverse the
webspace at the NASA Ames Research Center. In order to
conduct this research, modifications were made to the NCSA
httpd's logging functions to create a more session-oriented view of
web accesses; the access patterns shown by the logs were then
analyzed [1]. This (ongoing) research attempts to understand the
basic patterns of use by those who walk the NASA or Ames web
from a recognized homepage; how arriving in NASA webspace at a
point not perceived as an entry point influences usage patterns;
the different usage patterns associated with graphical and non-
graphical web browsers; and the search strategies employed by
users of indexed reference sources. What follows is a collection of
observations drawn from our research which may be of interest to
this group; the analysis and data collection are still going on,
however, so it should be understood that these observations are
subject to later revision.
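
The session-oriented view of the logs described above can be
illustrated with a minimal sketch. It is not the scripts used in
the study: the function name, the tuple layout, the sample hosts,
and the fixed 30-minute gap are all assumptions made for
illustration (the researchers judged ambiguous time gaps case by
case rather than by a fixed rule).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed threshold: a pause longer than this starts a new session.
SESSION_GAP = timedelta(minutes=30)

def split_sessions(entries, gap=SESSION_GAP):
    """Group (host, user_agent, timestamp, url, referer) tuples into sessions.

    Entries sharing a host and user agent are treated as one visitor;
    a pause longer than `gap` starts a new session for that visitor.
    """
    by_visitor = defaultdict(list)
    for host, agent, when, url, referer in entries:
        by_visitor[(host, agent)].append((when, url, referer))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort(key=lambda hit: hit[0])  # order each visitor's hits by time
        current = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - current[-1][0] > gap:
                sessions.append((visitor, current))
                current = [hit]
            else:
                current.append(hit)
        sessions.append((visitor, current))
    return sessions

# Hypothetical log entries, invented for illustration.
t0 = datetime(1995, 4, 1, 9, 0)
sample = [
    ("a.example.edu", "Lynx", t0, "/", None),
    ("a.example.edu", "Lynx", t0 + timedelta(minutes=2), "/space/", "/"),
    ("a.example.edu", "Lynx", t0 + timedelta(hours=3), "/", None),
    ("b.example.com", "Mosaic", t0 + timedelta(minutes=1), "/", None),
]
sessions = split_sessions(sample)
for (host, agent), hits in sessions:
    print(host, agent, [url for _, url, _ in hits])
```

A fixed gap is only a first pass; entries that fall close to the
threshold would still need the kind of individual judgment noted
in the footnote.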
Preliminary data indicate that the Ames web best serves
users with graphical browsers who are interested in the work at a
center or division level; the graphical links at the Ames homepage
and other major entry points give ready access to these resources.
Users of non-graphical browsers seem to find the same links more
difficult to follow, apparently because many of the cues for which
links are appropriate are contained in in-line graphics or
imagemaps [2]. (The non-graphical alternatives often simply say
"image" or give some equally unhelpful phrase, making it difficult
for users of non-graphical browsers to follow the links.)
Most pages at Ames have appropriate backlinks, and all
pages linked from the Ames homepage are required to have a
backlink to it; relatively few pages, however, have cross-links for
related pages at Ames. For many users, this means that the
experience of walking the web at Ames often follows the pattern
home-out-back-out-back. Metaphorically, most entry points are
like the center point of an asterisk, with one additional line
connecting it to a main entry point. What cross-linking does occur
relates mainly to resources outside of Ames; it seems to be an
assumption of web designers that other resources at Ames are
known to the user or are best reached in the home-out-back
pattern described earlier.
This backlink pattern is reasonable, if somewhat slow, for
users who are following a web path rooted at Ames; for users who
enter Ames' space via a non-entry point page linked from outside
Ames, the pattern is much more difficult. If, for example, a user
follows a link from a page at Stanford on cryogenics to a page
describing Pulse-Tube Refrigeration studies at Ames, the only link
on the new page will take the user to the divisional homepage for
Space Sciences. Programs related to cryogenics are several layers
below this page and not immediately visible; the search engines
available, in contrast, are several layers above. If the Stanford
entry point does not contain pointers to the cryogenics research at
Ames, the user will likely not find it; even if the Stanford entry
point does contain the information, the user is forced back into the
home-out-back pattern, with the Stanford page as "home".
Some of those arriving in Ames webspace outside of a main
entry point are those using a search engine like Lycos,
JumpStation, or WebCrawler, rather than a subject-linked page.
From what we can tell so far, most of these users end up near, but
not at, the resource they desire. Two basic problems seem to cause
this "offset landing" phenomenon. The first is that many of the
search engines apparently weight by the number of occurrences of
the target word or set of words within a document. This tends to
favor subject link pages (pages which attempt to draw together
resources grouped around a particular topic) over content pages
which directly relate to the topic, unless the target word is
repeated frequently in the content page. The second form of offset
landing occurs because of users' tendency to describe the type of
resource they want as well as the content; for example, many
users type search strings like "pictures of the space shuttle",
rather than simply using "space shuttle". Since the word "picture"
is much more likely to occur in a page describing the photo set or
providing a front-end to mission data, the user will end up there
rather than at the graphical data itself [3].
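
Both forms of offset landing can be reproduced with a toy scorer.
The sketch below assumes the simplest possible weighting, a raw
count of query-word occurrences; the actual ranking used by any
particular search engine is not known to us, and the page texts
and the function name are invented for illustration.

```python
import re
from collections import Counter

def occurrence_score(query, text):
    """Score a document by how often the query's words occur in it."""
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return sum(words[term] for term in terms)

# Hypothetical pages: a subject link page and a content page.
pages = {
    "subject-links": "Cryogenics links: cryogenics at Stanford, cryogenics "
                     "labs, cryogenics journals, cryogenics conferences.",
    "content": "Pulse-tube refrigeration studies: a cryogenics program "
               "examining regenerator losses at low temperatures.",
}

# First form: the link page repeats the target word and outranks the content.
ranked = sorted(
    pages,
    key=lambda name: occurrence_score("cryogenics", pages[name]),
    reverse=True,
)
print(ranked)

# Hypothetical pages: a front-end page and the data it leads to.
shuttle_pages = {
    "front-end": "Pictures of the space shuttle: this page indexes "
                 "pictures from each mission.",
    "image-caption": "Space shuttle launch.",
}

# Second form: naming the resource type in the query favors the front end.
for query in ("space shuttle", "pictures of the space shuttle"):
    scores = {name: occurrence_score(query, text)
              for name, text in shuttle_pages.items()}
    print(query, scores)
```

With the short query the two shuttle pages tie; adding "pictures
of the" tips the ranking toward the front-end page, which is the
behavior described above.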
Offset landing can be a bug or a feature. In many cases,
NASA would prefer that the user arrive at a front end page rather
than at an image or other data page; it makes it possible to
provide background information once rather than as a wrapper to
each page. It can be frustrating for the user, however, and it is
only a benefit if the user lands at an entry point that the providers
have foreseen.
As the above material shows, one of the main conclusions of
our research is likely to be that reliable indexing and searching
are possible only when the web design supports a reasonable method
of traversal; if it's not browsable, it's probably not searchable
either. We also see a few things on a wish list:
1) A way of declaring to robots and spiders that record a page
that users should instead be directed to a different page (an
"index page" or a "root page").
2) A way of indicating the type of data which might be
returned by a form (for forms which are front-ends to data sets).
3) A way of optionally weighting local resources higher than
off-site resources in searches which encompass multiple sites.
This would be of especial benefit to those creating documents, as
they would be better able to locate local resources for inclusion as
cross-links; it might also be of benefit to other users if the aim of a
search is to find resources which are geographically or
institutionally bound.
4) Dual-method sorting of search results. For example, if a
user chose "scoring, host", the search results would be scored and
the URLs from a particular host grouped together in output, with
the highest-scoring host followed by the second-highest, and so on.
5) Under the dream, rather than wish, category comes the
vision of a search engine that is easily scaled, shares data
seamlessly with its peers, stores data in a compact format, caches
well, and hogs neither bandwidth nor CPU time.
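
The dual-method sorting in item 4 might look something like the
sketch below. It assumes that a host's rank is set by its single
best-scoring URL, which is only one possible reading of "highest
scoring host"; the function name, URLs, and scores are invented
for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def sort_by_score_then_host(results):
    """Group (url, score) results by host, best-scoring host first.

    Within each host, results are listed from highest to lowest score.
    """
    by_host = defaultdict(list)
    for url, score in results:
        by_host[urlsplit(url).hostname].append((url, score))
    for hits in by_host.values():
        hits.sort(key=lambda hit: hit[1], reverse=True)
    # Assumption: a host is ranked by its single best hit.
    hosts = sorted(by_host, key=lambda host: by_host[host][0][1], reverse=True)
    return [hit for host in hosts for hit in by_host[host]]

# Hypothetical search results, invented for illustration.
results = [
    ("http://a.example.gov/shuttle.html", 0.6),
    ("http://b.example.edu/launch.html", 0.9),
    ("http://a.example.gov/index.html", 0.8),
    ("http://b.example.edu/links.html", 0.2),
]
ordered = sort_by_score_then_host(results)
for url, score in ordered:
    print(score, url)
```

Ranking hosts by an average or a total score instead would change
the order of the groups but not the grouping itself.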
Dr. Edward Hardie
Network Applications and Information Center
NASA Ames Research Center (Sterling)
Mtn. View, CA
hardie@nasa.gov
1.415.604.0134
Disclaimer: As a consultant, rather than a NASA employee, I do
not speak officially for NASA; no part of this document should be
taken as official NASA policy.
[1] The primary additions checked http-referrer and user-agent variables;
Perl scripts were then used to separate the log entries by host,
user-agent, and time. In most cases this produced a pattern which
clearly related to a single user's session; in certain cases time
factors were ambiguous and the researchers made individual judgments
about assigning session boundaries. Only a tiny portion of each day's
accesses could be analyzed, so semi-random assignment methods were used
to select analysis targets.
[2] This determination was made primarily from the patterns of
revisitation used by those with non-graphical browsers. Other
explanations are, of course, possible.
[3] This problem is made much worse at NASA by the sheer volume of
graphical data. Many photo sets or movie archives are stored on
multi-player CD-ROM drives ("jukeboxes"); the CD-ROMs involved were
originally configured for local access with a specialized player.
While access via web browser is possible, the titles of the actual
files tend to be short and unhelpful (STS63-L2.gif, for example, is the
second GIF of a space-shuttle launch which took place in October of
1994, but unless you know the mission number and the coding, the title
is completely unhelpful).