Workshop A: Web-wide Indexing/Semantic Header or Cover Page

Chair: Bipin C. Desai, Brian Pinkerton

Ted Hardie

NASA provides an enormous amount of information via the World Wide Web: astronomical imagery, current program announcements, archival data, technical reports, educational and technical resources, and, yes, even pictures of the space shuttle. This wealth of data means that walking the web at NASA can produce many unexpected epiphanies. Unfortunately, a user may have to rely on epiphany; finding any specific piece of information can prove to be a daunting task. Each NASA center maintains its own web servers, and many divisions, branches, and projects choose to publish directly to the web. Subdivisions may or may not be linked hierarchically, and those hierarchies are, in any case, none too clear to those not embedded within them. Subject linkages often cross-link recursively, which can mean that a search leads the user endlessly from pages containing primarily pointers to other pages containing primarily pointers.

One of the efforts of the Network Applications and Information Center (NAIC) to enhance access to NASA resources has been a project to examine how users traverse the webspace at the NASA Ames Research Center. To conduct this research, modifications were made to the NCSA httpd's logging functions to create a more session-oriented view of web accesses; the access patterns shown by the logs were then analyzed [1]. This (ongoing) research attempts to understand the basic patterns of use by those who walk the NASA or Ames web from a recognized homepage; how arriving in NASA webspace at a point not perceived as an entry point influences usage patterns; the different usage patterns associated with graphical and non-graphical web browsers; and the search strategies employed by users of indexed reference sources. What follows is a collection of observations drawn from our research which may be of interest to this group; the analysis and data collection are still going on, however, so these observations are subject to later revision.
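As a rough illustration of the session-oriented view described above, the sketch below groups log entries by host and user-agent and then splits each group wherever the time between accesses grows too large. The tuple layout, the function name `split_sessions`, the sample entries, and the 30-minute idle gap are all assumptions made for this example; the original analysis used perl scripts and, in ambiguous cases, individual judgment about session boundaries.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Assumed idle gap; the paper does not state a fixed threshold.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(entries, gap=SESSION_GAP):
    """Group (host, user_agent, time, url) tuples into per-user sessions."""
    ordered = sorted(entries, key=lambda e: (e[0], e[1], e[2]))
    sessions = []
    # Entries sharing a host and user-agent are treated as one visitor.
    for _, group in groupby(ordered, key=lambda e: (e[0], e[1])):
        current = []
        for entry in group:
            # A long pause between accesses starts a new session.
            if current and entry[2] - current[-1][2] > gap:
                sessions.append(current)
                current = []
            current.append(entry)
        sessions.append(current)
    return sessions


# Hypothetical log entries, already parsed from the server log.
log = [
    ("a.example", "Mosaic", datetime(1995, 3, 1, 10, 0), "/"),
    ("b.example", "Lynx", datetime(1995, 3, 1, 10, 1), "/"),
    ("a.example", "Mosaic", datetime(1995, 3, 1, 10, 5), "/space/"),
    ("a.example", "Mosaic", datetime(1995, 3, 1, 11, 30), "/"),
]
sessions = split_sessions(log)
```

A fixed gap is only a convenient approximation: shared hosts and proxies can merge several users into one apparent session, which is one reason manual judgment was still needed.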

Preliminary data indicate that the Ames web best serves users with graphical browsers who are interested in the work at a center or division level; the graphical links at the Ames homepage and other major entry points give ready access to these resources. Users of non-graphical browsers seem to find the same links more difficult to follow, apparently because many of the cues indicating which links are appropriate are contained in in-line graphics or imagemaps [2]. (The non-graphical alternatives often simply say "image" or give some equally unhelpful phrase, making it difficult for users of non-graphical browsers to follow the links.)

Most pages at Ames have appropriate backlinks, and all pages linked from the Ames homepage are required to have a backlink to it; relatively few pages, however, have cross-links to related pages at Ames. For many users, this means that the experience of walking the web at Ames often follows the pattern home-out-back-out-back. Metaphorically, most entry points are like the center point of an asterisk, with one additional line connecting it to a main entry point. What cross-linking does occur relates mainly to resources outside of Ames; it seems to be an assumption of web designers that other resources at Ames are known to the user or are best reached in the home-out-back pattern described earlier.

This backlink pattern is reasonable, if somewhat slow, for users who are following a web path rooted at Ames; for users who enter Ames' space via a non-entry-point page linked from outside Ames, the pattern is much more difficult. If, for example, a user follows a link from a page at Stanford on cryogenics to a page describing Pulse-Tube Refrigeration studies at Ames, the only link on the new page will take the user to the divisional homepage for Space Sciences. Programs related to cryogenics are several layers below this page and not immediately visible; the search engines available, in contrast, are several layers above.
If the Stanford entry point does not contain pointers to the cryogenics research at Ames, the user will likely not find it; even if the Stanford entry point does contain the information, the user is forced back into the home-out-back pattern, with the Stanford page as "home".

Some of those arriving in Ames webspace outside of a main entry point are using a search engine like Lycos, JumpStation, or WebCrawler, rather than a subject-linked page. From what we can tell so far, most of these users end up near, but not at, the resource they desire. Two basic problems seem to cause this "offset landing" phenomenon. The first is that many of the search engines apparently weight by the number of occurrences of the target word or set of words within a document. This tends to favor subject link pages (pages which attempt to draw together resources grouped around a particular topic) over content pages which directly relate to the topic, unless the target word is repeated frequently in the content page. The second form of offset landing occurs because of users' tendency to describe the type of resource they want as well as the content; for example, many users type search strings like "pictures of the space shuttle", rather than simply using "space shuttle". Since the word "picture" is much more likely to occur in a page describing the photo set or providing a front-end to mission data, the user will end up there rather than at the graphical data itself [3].

Offset landing can be a bug or a feature. In many cases, NASA would prefer that the user arrive at a front-end page rather than at an image or other data page; this makes it possible to provide background information once rather than as a wrapper to each page. It can be frustrating for the user, however, and it is only a benefit if the user lands at an entry point that the providers have foreseen.
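The first cause of offset landing can be seen in a toy example. The sketch below scores pages purely by how often the query words occur, which is the weighting the search engines appear to use; it is not the actual algorithm of any particular engine, and the function name `occurrence_score` and both page texts are made up for illustration.

```python
import re


def occurrence_score(query, text):
    """Count how many words in the text match any word of the query."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for word in words if word in terms)


# Hypothetical page texts: a subject link page and a content page.
pages = {
    "subject-links": (
        "Cryogenics links: cryogenics at Stanford, "
        "cryogenics at Ames, cryogenics journals"
    ),
    "content": "Pulse-tube refrigeration studies: a cryogenics program report",
}

# Rank pages from highest to lowest occurrence count.
ranked = sorted(
    pages,
    key=lambda name: occurrence_score("cryogenics", pages[name]),
    reverse=True,
)
```

Under this weighting the page of pointers outranks the page that actually reports the research, simply because it repeats the target word more often.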

As the above material shows, one of the main conclusions of our research is likely to be that reliable indexing and searching are possible only when the web design supports a reasonable method of traversal; if it's not browsable, it's probably not searchable either. We also see a few things on a wish list:

1) A way for a page to declare to the robots and spiders noting it that users should be directed to a different page (an "index page" or a "root page").
2) A way of indicating the type of data which might be returned by a form (for forms which are front-ends to data sets).
3) A way of optionally weighting local resources higher than off-site resources in searches which encompass multiple sites. This would be of especial benefit to those creating documents, as they would be better able to locate local resources for inclusion as cross-links; it might also be of benefit to other users if the aim of a search is to find resources which are geographically or institutionally bound.
4) Dual-method sorting of search results. For example, if a user chose "scoring, host", the search results would be scored and the URLs from a particular host grouped together in output, with the highest-scoring host followed by the second-highest-scoring host, and so on.
5) Under the dream, rather than wish, category comes the vision of a search engine that is easily scaled, shares data seamlessly with its peers, stores data in a compact format, caches well, and hogs neither bandwidth nor CPU time.

Dr. Edward Hardie
Network Applications and Information Center
NASA Ames Research Center (Sterling)
Mtn. View, CA
hardie@nasa.gov
1.415.604.0134

Disclaimer: As a consultant, rather than a NASA employee, I do not speak officially for NASA; no part of this document should be taken as official NASA policy.

[1] The primary additions checked the http-referrer and user-agent variables; Perl scripts were then used to separate the log entries by host, user-agent, and time.
In most cases this produced a pattern which clearly related to a single user's session; in certain cases time factors were ambiguous and the researchers made individual judgments about assigning session boundaries. Only a tiny portion of each day's accesses could be analyzed, so semi-random assignment methods were used to select analysis targets.

[2] This determination was made primarily from the patterns of revisitation used by those with non-graphical browsers. Other explanations are, of course, possible.

[3] This problem is made much worse at NASA by the sheer volume of graphical data. Many photo sets or movie archives are stored on multi-player CD-ROM drives ("jukeboxes"); the CD-ROMs involved were originally configured for local access with a specialized player. While access via web browser is possible, the titles of the actual files tend to be short and unhelpful (STS63-L2.gif, for example, is the second gif of a space-shuttle launch which took place in October of 1994, but unless you know the mission number and the coding, the title is completely unhelpful).
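
Wish-list item 4, dual-method sorting by "scoring, host", could be sketched roughly as follows. This is one possible reading rather than a specification: it assumes that a host's rank is set by its single best-scoring URL, and the function name `sort_by_score_then_host`, the URLs, and the scores are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def sort_by_score_then_host(results):
    """Order (url, score) pairs by host group, best-scoring host first."""
    by_host = defaultdict(list)
    for url, score in results:
        by_host[urlsplit(url).hostname].append((url, score))
    # Assumption: a host is ranked by the best score among its URLs.
    groups = sorted(
        by_host.values(),
        key=lambda hits: max(score for _, score in hits),
        reverse=True,
    )
    # Within each host, list the URLs from highest to lowest score.
    return [
        hit
        for hits in groups
        for hit in sorted(hits, key=lambda h: h[1], reverse=True)
    ]


# Hypothetical search results as (url, score) pairs.
results = [
    ("http://b.example/shuttle.html", 0.8),
    ("http://a.example/images/", 0.3),
    ("http://a.example/shuttle/", 0.9),
    ("http://b.example/missions.html", 0.7),
]
grouped = sort_by_score_then_host(results)
```

Ranking a host by its average or total score would be an equally plausible choice and would sometimes give a different order.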