[Paleopsych] Sci Am: Seeking Better Web Searches
Premise Checker
checker at panix.com
Wed Jan 26 15:48:39 UTC 2005
Seeking Better Web Searches
http://www.sciam.com/print_version.cfm?articleID=0006304A-37F4-11E8-B7F483414B7F0000
January 24, 2005
Deluged with superfluous responses to online queries, users will soon benefit
from improved search engines that deliver customized results
By Javed Mostafa
In less than a decade, Internet search engines have completely changed how
people gather information. No longer must we run to a library to look up
something; rather, we can pull up relevant documents with just a few
keystrokes. Now that "Googling" has become synonymous with doing research, online
search engines are poised for a series of upgrades that promise to further
enhance how we find what we need.
New search engines are improving the quality of results by delving deeper into
the storehouse of materials available online, by sorting and presenting those
results better, and by tracking your long-term interests so that they can
refine their handling of new information requests. In the future, search
engines will broaden content horizons as well, doing more than simply
processing keyword queries typed into a text box. They will be able to
automatically take into account your location--letting your wireless PDA, for
instance, pinpoint the nearest restaurant when you are traveling. New systems
will also find just the right picture faster by matching your sketches to
similar shapes. They will even be able to name that half-remembered tune if you
hum a few bars.
Today's search engines have their roots in a research field called information
retrieval, a computing topic tracing back nearly 50 years. In a September 1966
Scientific American article, "Information Storage and Retrieval," Ben-Ami
Lipetz described how the most advanced information technologies of the day
could handle only routine or clerical tasks. He then concluded perceptively
that breakthroughs in information retrieval would come when researchers gained
a deeper understanding of how humans process information and then endowed
machines with analogous capabilities. Clearly, computers have not yet reached
that level of sophistication, but they are certainly taking users' personal
interests, habits and needs into greater account when completing tasks.
Prescreened Pages
Before discussing new developments in this field, it helps to review how
current search engines operate. What happens when a computer user reads on a
screen that Google has sifted through billions of documents in, say, 0.32
second? Because matching a user's keyword query with a single Web page at a
time would take too long, the systems carry out several key steps long before a
user conducts a search.
First, prospective content is identified and collected on an ongoing basis.
Special software code called a crawler is used to probe pages published on the
Web, retrieve these and linked pages, and aggregate pages in a single location.
In the second step, the system counts relevant words and establishes their
importance using various statistical techniques. Third, a highly efficient data
structure, or tree, is generated from the relevant terms, which associates
those terms with specific Web pages. When a user submits a query, it is the
completed tree, also known as an index, that is searched and not individual Web
pages. The search starts at the root of the index tree, and at every step a
branch of the tree (representing many terms and related Web pages) is either
followed or eliminated from consideration, reducing the time to search in an
exponential fashion.
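To make the idea concrete, here is a minimal Python sketch of that prebuilt
index over a handful of invented pages. (The article describes a
tree-structured index; a plain dictionary plays that role here for brevity,
and nothing below reflects any particular engine's code.)

    from collections import defaultdict

    # Toy stand-in for pages gathered by a crawler (invented data).
    pages = {
        "page1.html": "restaurants near the museum district",
        "page2.html": "museum floor plan and nearby displays",
        "page3.html": "search engines rank web pages",
    }

    # Build the index once, long before any query arrives.
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)

    def search(query):
        """Return pages containing every query term, consulting only the index."""
        results = set(pages)
        for term in query.lower().split():
            results &= index.get(term, set())
        return sorted(results)

    print(search("museum displays"))  # -> ['page2.html']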
To place relevant records (or links) at or near the top of the retrieved list,
the search algorithm applies various ranking strategies. A common ranking
method--term frequency/inverse document frequency--considers the distribution
of words and their frequencies, then generates numerical weights for words that
signify their importance in individual documents. Words that are frequent (such
as " or," " to" or " with") or that appear in many documents are given
substantially less weight than words that are more relevant semantically or
appear in comparatively few documents.
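A small Python sketch of this weighting scheme, using made-up documents (the
exact normalization varies from engine to engine; this is just the textbook
form of term frequency/inverse document frequency):

    import math
    from collections import Counter

    # Invented documents for illustration.
    docs = {
        "d1": "the cat sat on the mat",
        "d2": "the dog chased the cat",
        "d3": "quantum computing with trapped ions",
    }
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    def tf_idf(term, doc_id):
        """Term frequency in the document times log(N / document frequency)."""
        terms = tokenized[doc_id]
        tf = terms.count(term) / len(terms)
        idf = math.log(n_docs / df[term])
        return tf * idf

    print(round(tf_idf("the", "d1"), 3))      # common word -> low weight
    print(round(tf_idf("quantum", "d3"), 3))  # rare word -> higher weight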
In addition to term weighting, Web pages can be ranked using other strategies.
Link analysis, for example, considers the nature of each page in terms of its
association with other pages--namely, if it is an authority (by the number of
other pages that point to it) or a hub (by the number of pages it points to).
Google uses link analysis to improve the ranking of its search results.
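The hub/authority idea can be sketched in a few lines of Python over an
invented link graph. (Google's own link-analysis method, PageRank, differs in
its details; this only illustrates the general principle of scoring pages by
how they are linked.)

    # Hypothetical link graph: each page lists the pages it points to.
    links = {
        "a.com": ["b.com", "c.com"],
        "b.com": ["c.com"],
        "c.com": [],
        "d.com": ["c.com", "b.com"],
    }

    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}

    for _ in range(20):  # iterate until the scores stabilize
        # A page's authority grows with the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
        # A page's hub score grows with the authority of the pages it links to.
        hub = {p: sum(auth[t] for t in links[p]) for p in links}
        # Normalize so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    print(max(auth, key=auth.get))  # c.com: the most linked-to page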
Superior Engines
During the six years in which Google rose to dominance, it offered two critical
advantages over competitors. One, it could handle extremely large-scale Web
crawling tasks. Two, its indexing and weighting methods produced superior
ranking results. Recently, however, search engine builders have developed
several new, similarly capable schemes, some of which are even better in
certain ways.
Much of the digital content today remains inaccessible because many systems
hosting (holding and handling) that material do not store Web pages as users
normally view them. These resources generate Web pages on demand as users
interact with them. Typical crawlers are stumped by these resources and fail to
retrieve any content. This keeps a huge amount of information--approximately
500 times the size of the conventional Web, according to some
estimates--concealed from users. Efforts are under way to make it as easy to
search the " hidden Web" as the visible one.
To this end, programmers have developed a class of software, referred to as
wrappers, that takes advantage of the fact that online information tends to be
presented using standardized "grammatical" structures. Wrappers accomplish
their task in various ways. Some exploit the customary syntax of search queries
and the standard formats of online resources to gain access to hidden content.
Other systems take advantage of application programming interfaces, which
enable software to interact via a standard set of operations and commands. An
example of a program that provides access to the hidden Web is Deep Query
Manager from BrightPlanet. This wrapper-based query manager can provide
customized portals and search interfaces to more than 70,000 hidden Web
resources.
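In spirit, a wrapper can be as simple as the Python sketch below: rather than
crawling rendered pages, it calls a database-backed site's query interface
directly and reads back structured records. The endpoint, parameters and
field names are hypothetical, not those of any real service.

    import json
    import urllib.parse
    import urllib.request

    def query_hidden_resource(term, endpoint="https://archive.example.org/api/search"):
        """Send a search term to a (hypothetical) hidden-Web resource and
        return its records as Python dictionaries."""
        url = endpoint + "?" + urllib.parse.urlencode({"q": term, "format": "json"})
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))

    # records = query_hidden_resource("coral reefs")
    # for record in records:
    #     print(record.get("title"), record.get("url"))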
Relying solely on links or words to establish ranking, without placing any
constraint on the types of pages that are being compared, opens up
possibilities for spoofing or gaming the ranking system to misdirect queries.
When the query " miserable failure," for example, is executed on the three top
search engines--Google, Yahoo and MSN--a page from the whitehouse.gov site
appears as the top item in the resulting set of retrieved links.
Rather than providing the user with a list of ranked items (which can be
spoofed relatively easily), certain search engines attempt to identify patterns
among those pages that most closely match the query and group the results into
smaller sets. These patterns may include common words, synonyms, related words
or even high-level conceptual themes that are identified using special rules.
These systems label each set of links with its relevant term. A user can then
refine a search further by selecting a particular set of results. Northern
Light (which pioneered this technique) and Clusty are search engines that
present clustered results.
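A toy Python illustration of the clustering idea: group retrieved results by
a salient term they share and use that term as the cluster label. Real
engines rely on far more sophisticated statistical and conceptual grouping;
the query, snippets and stopword list below are invented.

    from collections import defaultdict

    # Hypothetical results for the ambiguous query "jaguar".
    results = {
        "jaguar-cars.com": "jaguar luxury car dealership and models",
        "bigcats.org": "jaguar habitat and big cat conservation",
        "zoo.example.org": "see the jaguar cat exhibit at the zoo",
        "auto-reviews.example.com": "jaguar car road test and review",
    }

    stopwords = {"and", "the", "at", "see"}
    clusters = defaultdict(list)
    for url, snippet in results.items():
        for term in snippet.split():
            if term not in stopwords and term != "jaguar":  # skip the query term itself
                clusters[term].append(url)

    # Keep only terms shared by more than one result as cluster labels.
    labeled = {term: urls for term, urls in clusters.items() if len(urls) > 1}
    for label, urls in labeled.items():
        print(label, "->", urls)   # e.g. "car" vs. "cat" clusters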
Mooter, an innovative search engine that also employs clustering techniques,
provides researchers with several additional advantages by presenting its
clusters visually. It arrays the subcategory buttons around a central button
representing all the results, like the spokes of a wheel. Clicking on a cluster
button retrieves lists of relevant links and new, associated clusters. Mooter
remembers the chosen clusters. By clicking on the "refine" option, which
combines previously retrieved search clusters with the current query, a user
can obtain even more precise results.
A similar search engine that also employs visualization is Kartoo. It is a
so-called metasearch engine that submits the user's query to other search
engines and provides aggregated results in a visual form. Along with a list of
key terms associated with various sites, Kartoo displays a "map" that depicts
important sites as icons and relations among the sites as labeled paths. Each
label can be used to further refine the search.
Another way computer tools will simplify searches is by looking through your
hard drive as well as the Web. Currently searches for a file on a computer
user's desktop require a separate software application. Google, for example,
recently announced Desktop Search, which combines the two functions, allowing a
user to specify a hard disk or the Web, or both, for a given search. The next
release of Microsoft's operating system, code-named Longhorn, is expected to
supply similar capabilities. Using techniques developed in another Microsoft
project called Stuff I've Seen, Longhorn may offer "implicit search"
capabilities that can retrieve relevant information without the user having to
specify queries. The implicit search feature reportedly harvests keywords from
textual information recently manipulated by the user, such as e-mail or Word
documents, to locate and present related content from files stored on a user's
hard drive. Microsoft may extend the search function to Web content and enable
users to transform any text content displayed on screens into queries more
conveniently.
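The reported approach might be approximated, very roughly, by the Python
sketch below: pull the most salient words out of recently edited text and
treat them as a background query. Nothing here reflects Microsoft's actual
design; the sample text and the keyword heuristic are invented.

    import re
    from collections import Counter

    recent_texts = [
        "Draft budget for the Q3 marketing campaign in Toronto",
        "Meeting notes: Toronto campaign launch and vendor contracts",
    ]

    stopwords = {"the", "for", "and", "in", "a", "of"}

    def top_keywords(texts, k=3):
        """Return the k most frequent non-stopword terms across recent documents."""
        counts = Counter()
        for text in texts:
            counts.update(w for w in re.findall(r"[a-z]+", text.lower())
                          if w not in stopwords)
        return [word for word, _ in counts.most_common(k)]

    implicit_query = " ".join(top_keywords(recent_texts))
    print(implicit_query)  # a few salient words from recent work, fed to a search engine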
Search Me
Recently Amazon, Ask Jeeves and Google announced initiatives that attempt to
improve search results by allowing users to personalize their searches. The
Amazon search engine, A9.com, and the Ask Jeeves search engine,
MyJeeves.ask.com, can track both queries and retrieved pages as well as allow
users to save them permanently in bookmark fashion. In MyJeeves, saved searches
can be reviewed and reexecuted, providing a way to develop a personally
organized subset of the Web. Amazon's A9 can support similar functions and also
employs personal search histories to suggest additional pages. This advisory
function resembles Amazon's well-known book recommendation feature, which takes
advantage of search and purchasing patterns of communities of users--a process
sometimes called collaborative filtering.
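A bare-bones Python sketch of collaborative filtering, with invented users
and histories: recommend pages seen by the most similar other user but not
yet by this one. Commercial systems use much richer similarity measures.

    histories = {
        "alice": {"gardening.com", "seeds.example.org", "compost.example.net"},
        "bob":   {"gardening.com", "seeds.example.org", "beekeeping.example.org"},
        "carol": {"python.org", "docs.python.org"},
    }

    def recommend(user):
        """Suggest pages from the most similar other user's history."""
        mine = histories[user]
        others = {u: h for u, h in histories.items() if u != user}
        # Similarity here is simply the size of the overlap between histories.
        most_similar = max(others, key=lambda u: len(mine & others[u]))
        return sorted(others[most_similar] - mine)

    print(recommend("alice"))  # -> ['beekeeping.example.org']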
The search histories in both A9 and MyJeeves are saved not on users' machines
but on search engine servers so that they can be secured and later retrieved on
any machine that is used for subsequent searches.
In personalized Google, users can specify subjects that are of particular
interest to them by selecting from a pregenerated hierarchy of topics. It also
lets users specify the degree to which they are interested in various themes or
fields. The system then employs the chosen topics, the indicated level of
interest, and the original query to retrieve and rank results.
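One plausible way to combine these signals is sketched in Python below, with
invented topics, scores and blending factor; Google has not published the
formula it actually uses.

    # Interest levels the user has declared for topics in the hierarchy.
    interest = {"astronomy": 0.9, "cooking": 0.2, "sports": 0.1}

    results = [
        {"url": "stars.example.org", "topic": "astronomy", "relevance": 0.6},
        {"url": "recipes.example.com", "topic": "cooking", "relevance": 0.8},
        {"url": "scores.example.net", "topic": "sports", "relevance": 0.7},
    ]

    ALPHA = 0.5  # how strongly the profile influences the final ranking

    for r in results:
        # Blend the query's base relevance with the user's interest in the topic.
        r["score"] = (1 - ALPHA) * r["relevance"] + ALPHA * interest.get(r["topic"], 0.0)

    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        print(round(r["score"], 2), r["url"])  # astronomy page rises to the top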
Although these search systems offer significant new features, they represent
only incremental enhancements. If search engines could take the broader task
context of a person's query into account--that is, a user's recent search
subjects, personal behavior, work topics, and so forth--their utility would be
greatly augmented. Determining user context will require software designers to
surmount serious engineering hurdles, however. Developers must first build
systems that monitor a user's interests and habits automatically so that search
engines can ascertain the context in which a person is conducting a search for
information, the type of computing platform a user is running, and his or her
general pattern of use. With these points established beforehand and placed in
what is called a user profile, the software could then deliver appropriately
customized information. Acquiring and maintaining accurate information about
users may prove difficult. After all, most people are unlikely to put up with
the bother of entering personal data other than that required for their
standard search activities.
Good sources of information on personal interests are the records of a user's
Web browsing behavior and other interactions with common applications in their
systems. As a person opens, reads, plays, views, prints or shares documents,
engines could track his or her activities and employ them to guide searches of
particular subjects. This process resembles the implicit search function
developed by Microsoft. PowerScout and Watson were the first systems introduced
that could integrate searches with user-interest profiles generated from
indirect sources. PowerScout has remained an unreleased laboratory system, but
Watson seems to be nearing commercialization. Programmers are now developing
more sophisticated software that will collect interaction data over time and
then generate and maintain a user profile to predict future interests.
The user-profile-based techniques in these systems have not been widely
adopted, however. Various factors may be responsible: one issue may be the
problems associated with maintaining profile accuracy across different tasks
and over extended periods. Repeated evaluation is necessary to establish robust
profiles. A user's focus can change in unpredictable and subtle ways, which can
affect retrieval results dramatically.
Another factor is privacy protection. Trails of Web navigation, saved searches
and patterns of interactions with applications can reveal a significant amount
of private personal information (even to the point of revealing a user's
identity). A handful of available software systems permit a user to obtain
content from Web sites anonymously. The primary means used by these tools are
intermediate or proxy servers through which a user's transactions are
transmitted and processed so that the site hosting the data or service is only
aware of the proxy systems and cannot trace a request back to an individual
user. One instance of this technology is the anonymizer.com site, which permits
a user to browse the Web incognito. An additional example is the Freedom
WebSecure software, which employs multiple proxies and many layers of
encryption. Although these tools offer reasonable security, search services do
not yet exist that enable both user personalization and strong privacy
protection. Balancing the maintenance of privacy with the benefits of profiles
remains a crucial challenge.
On the Road
Another class of context-aware search systems would take into account a
person's location. If a vacationer, for example, is carrying a PDA that can
receive and interpret signals from the Global Positioning System (GPS) or that
uses a radio-frequency technique to establish and continuously update its position,
systems could take advantage of that capability. One example of such a
technology is being developed by researchers at the University of Maryland.
Called Rover, it is a system that makes use of text, audio or video services
across a wide geographic area. Rover can present maps of the region in a user's
vicinity that highlight appropriate points of interest. It is able to identify
these spots automatically by applying various subject-specific "filters" to
the map.
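The core filtering step of such a location-aware service might look like the
Python sketch below: keep only points of interest within some radius of the
user's reported position. The places and coordinates are invented, and this
is not Rover's code; distances use the standard haversine formula.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two lat/lon points, in kilometers."""
        r = 6371.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    points_of_interest = [
        ("Art Museum", 38.9870, -76.9426),
        ("Riverside Cafe", 38.9890, -76.9400),
        ("Airport", 39.1754, -76.6683),
    ]

    user_lat, user_lon = 38.9869, -76.9430  # e.g. reported by GPS

    nearby = [(name, haversine_km(user_lat, user_lon, lat, lon))
              for name, lat, lon in points_of_interest]
    for name, dist in sorted(nearby, key=lambda x: x[1]):
        if dist < 2.0:  # show only results within 2 km
            print(f"{name}: {dist:.2f} km away")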
The system can provide additional information as well. If a Rover client were
visiting a museum, for example, the handheld device would show the
institution's floor plan and nearby displays. If the user stepped outside, the
PDA display would change to an area map marking locations of potential
interest. Rover would also permit an operator to enter his or her position
directly and retrieve customized information from the networked database. In
2003 the group that created Rover and KoolSpan, a private network company,
received funding from the Maryland state government to develop jointly
applications for secure wireless data delivery and user authentication. This
collaboration should result in a more secure and commercially acceptable
version of Rover.
Unfortunately, the positional error of GPS-based systems (from three to four
meters) is still rather large. Even though this resolution can be enhanced by
indoor sensor and outdoor beacon systems, these technologies are relatively
expensive to implement. Further, the distribution of nontext information,
especially images, audio and video, would require higher bandwidth capacities
than those currently available from handheld devices or provided by today's
wireless networks. The IEEE 802.11b wireless local-area network protocol, which
offers bandwidths of up to 11 megabits per second, has been tested successfully
in providing location-aware search services but is not yet widely available.
Picture This
Context can mean more than just a user's personal interests or location. Search
engines are also going beyond text queries to find graphical material. Many
three-dimensional images are now available on the Web, but artists,
illustrators and designers cannot effectively search through these drawings or
shapes using keywords. The Princeton Shape Retrieval and Analysis Group's 3-D
Model Search Engine supports three methods to generate such a query. The first
approach uses a sketchpad utility called Teddy, which allows a person to draw
basic two-dimensional shapes. The software then produces a virtual solid
extrusion (by dragging 2-D images through space) from those shapes. The second
lets a user draw multiple two-dimensional shapes (approximating different
projections of an image), and the search engine then matches the flat sketches
to 13 precomputed projections of each three-dimensional object in its database.
Theoretically, this function can be generalized to support retrieval from any
2-D image data set. The third way a person can find an image is to upload a
file containing a three-dimensional model.
The system, still in development, matches queries to shapes by first describing
each shape in terms of a series of mathematical functions--harmonic functions
for three-dimensional images and trigonometric ones for two-dimensional
representations. The system then produces certain "fingerprinting" values from
each function that are characteristic for each associated shape. These
fingerprints are called spherical or circular signatures. Two benefits arise
from using these descriptors: they can be matched no matter how the original
and search shapes are oriented, and the descriptors may be computed and matched
rapidly.
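A greatly simplified stand-in for such a fingerprint, in Python: describe a
2-D point set by the distribution of point distances from its centroid, which
stays the same when the shape is rotated. The Princeton system's
spherical-harmonic signatures are far more powerful; this only illustrates
the rotation-invariance property, on invented shapes.

    import math

    def signature(points, bins=8, radius=1.0):
        """Histogram of point distances from the centroid (rotation-invariant)."""
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
        hist = [0] * bins
        for x, y in points:
            d = math.hypot(x - cx, y - cy)
            hist[min(int(d / radius * bins), bins - 1)] += 1
        total = sum(hist)
        return [h / total for h in hist]

    def distance(sig_a, sig_b):
        """Euclidean distance between two signatures."""
        return sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)) ** 0.5

    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    # The same square rotated 45 degrees about its center.
    rotated = [(0.5, 0.5 - math.sqrt(0.5)), (0.5 + math.sqrt(0.5), 0.5),
               (0.5, 0.5 + math.sqrt(0.5)), (0.5 - math.sqrt(0.5), 0.5)]

    print(distance(signature(square), signature(rotated)))  # ~0.0: same fingerprint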
What's That Song?
Music has also entered the search engine landscape. A key problem in finding a
specific tune is how to best formulate the search query. One type of solution
is to use musical notation or a musical transcription-based query language that
permits a user to specify a tune by keying in alphanumeric characters to
represent musical notes. Most users, however, find it difficult to transform
the song they have in mind to musical notation.
The Meldex system, designed by the New Zealand Digital Library Project, solves
the problem by offering a couple of ways to find music. First, a user can
record a query by playing notes on the system's virtual keyboard. Or he or she
can hum the song into a computer microphone. Last, users can specify song
lyrics as a text query or combine a lyrics search with a tune-based search.
To make the Meldex system work, the New Zealand researchers had to overcome
several obstacles: how to convert the musical query to a form that could be
readily computed; how to store and search song scores digitally; and how to
match those queries with the stored musical data. In the system, a process
called quantization identifies the notes and pitches in a query. Meldex then
detects the pitches as a function of time automatically by analyzing the
structure of the waveforms and maps them to digital notes. The system stores
both notes and complete works in a database of musical scores. Using data
string-matching algorithms, Meldex finds musical queries converted into notes
that correspond with notes from the scores database. Because the queries may
contain errors, the string-matching function must accommodate a certain amount
of " noise."
Searching the Future
Future search services will not be restricted to
conventional computing platforms. Engineers have already integrated them into
some automotive mobile data communications (telematics) systems, and it is
likely they will also embed search capabilities into entertainment equipment
such as game stations, televisions and high-end stereo systems. Thus, search
technologies will play unseen ancillary roles, often via intelligent Web
services, in activities such as driving vehicles, listening to music and
designing products.
Another big change in Web searching will revolve around new business deals that
greatly expand the online coverage of the huge amount of published materials,
including text, video and audio, that computer users cannot currently access.
Ironically, next-generation search technologies will become both more and less
visible as they perform their increasingly sophisticated jobs. The visible role
will be represented by more powerful tools that combine search functions with
data-mining operations--specialized systems that look for trends or anomalies
in databases without actually knowing the meaning of the data. The unseen role
will involve developing myriad intelligent search operations as back-end
services for diverse applications and platforms. Advances in both data-mining
and user-interface technologies will make it possible for a single system to
provide a continuum of sophisticated search services automatically that are
integrated seamlessly with interactive visual functions.
By leveraging advances in machine learning and classification techniques that
will be able to better understand and categorize Web content, programmers will
develop easy-to-use visual mining functions that will add a highly visible and
interactive dimension to the search function. Industry analysts expect that a
variety of mining capabilities will be available, each tuned to search content
from a specialized domain or format (say, music or biological data). Software
engineers will design these functions to respond to users' needs quickly and
conveniently despite the fact they will manipulate vast quantities of
information. Web searchers will steer through voluminous data repositories
using visually rich interfaces that focus on establishing broad patterns in
information rather than picking out individual records. Eventually it will be
difficult for computer users to determine where searching starts and
understanding begins.