[Paleopsych] Gale Group: Google Beta review

Sat Jan 29 16:43:34 UTC 2005

Google Beta review
http://www.galegroup.com/free_resources/reference/peter/dec.htm#googlescholar
[I got this on December 15 last year.]

Title: Google Scholar Beta
Publisher: Google
URL: http://scholar.google.com
Cost: Free
Tested: November 18-27, 2004

Google Scholar has enormous gaps in its coverage of publishers' archives, and
implicitly in the direct links to the full-text documents therein. The
citedness scores of documents displayed in the results lists have great
potential for choosing the most promising articles and books on a subject,
but they often are inflated. The prominent display of the citedness scores
could help the scholars and practitioners whose libraries don't have access
to the best citation-based systems, such as Web of Science and Scopus, or to
the smartest implementations of citation-enhanced abstracting/indexing
databases, like some on CSA and EBSCO. Google should take a page from the
best open-access services and repositories, such as CiteBase, Research Index
and RePEc/LogEc, which handle citing and cited references and citedness
scores much better than Google Scholar.

Google's crawlers, which many scholarly publishers and preprint servers let
into their archives for this project, picked up information for many
redundant and irrelevant pages and ignored a few million full-text scholarly
papers and/or their citation/abstract records.

With the exception of the authors' name field, Google treated the items in
the huge archives as any of the zillions of unstructured pages on the Web.
Google Scholar needs much refinement in collecting, filtering, processing and
presenting this valuable data.

The Context

In the universal, ritualistic adulation, it was no surprise that Google's
latest service received publicity that was as wide as it was shallow. The
blogorrhea and avalanche of e-mail were as if a free, magical cure for cancer
had been announced by the National Institutes of Health. I like and use
Google a lot, but not with the "nothing-but-Google" zealotry of its fans.

Google Scholar is a follow-up to the CrossRef Search Pilot project that was
launched in April 2004 with the help of CrossRef, the Digital Object
Identifier (DOI) registration agency that handles scholarly and professional
publishers.
CrossRef was the matchmaker between Google and the nine original
participating publishers. My review of the first version of that project
praised the agency for its work, and criticized Google for the careless
implementation. At the annual conference of the Society for Scholarly
Publishing in mid-2004, I moderated a session about "Searching Proprietary
Content" that was graced by learned panelists, including systems developers
from Google and Elsevier. I presented some of my disheartening findings about
the massive omissions of documents from the nine publishers' archives in
Google CrossRef Search. My testing of Google Scholar eerily reminded me of
the same symptoms.

Amid the many myths, one is that Google could penetrate the invisible Web. It
couldn't and it didn't. The publishers who cooperated in the Google Scholar
project opened the doors of their document stores (normally invisible to
Web-wide search engines) to allow Google's special crawlers to collect data
and to show some of it free to anyone. This would then steer users to their
libraries' subscription-based digital journal archives.

Undoubtedly, Google significantly enlarged the scope of Google Scholar by
crawling and gathering data from the sites of many additional publishers
and/or their digital facilitators, as well as from open access
abstracting/indexing databases and from the largest archives of preprint and
reprint servers.

Google Scholar offers free access for anyone to the bibliographic records and
often to the abstracts of millions of articles. It may also lead users to
full-text documents that can be displayed (if they qualify for free access)
or (if they don't) to a document-delivery company.

Elsevier has been doing this with the Scirus service for years (although on a
smaller scale), offering far better search and results display options. Yes,
I did criticize Scirus at its launch for including in its database zillions
of not merely worthless, but sometimes inane and vulgar Web pages (mostly
created by undergraduate students with an .edu account). However, some time
ago this subset was significantly reduced. By the way, Google Scholar also
includes a large number of pages that are not scholarly by any stretch of the
imagination.

The Content

Content is the most obscure part of Google Scholar. Apart from the generic
statement on the About page that "Google Scholar enables you to
search specifically for scholarly literature, including peer-reviewed papers,
theses, books, preprints, abstracts and technical reports from all broad
areas of research. Use Google Scholar to find articles from a wide variety of
academic publishers, professional societies, preprint repositories and
universities, as well as scholarly articles available across the web" there
is no specific information about the publisher archives or the (p)reprint
servers covered, nor about the type of documents processed (such as major
articles versus all the content, including reviews, letters to the editor) or
the time span covered. Exploring the dimensions of the content base of this
service is as difficult as deciphering the real meaning and implications of
the credit card agreements penned by the lawyers in the banking industry, so
consider this a beta review.

Just because a service is free doesn't mean that the producer is not expected
to disclose substantial information about the content. Scirus, HighWire
Press, Research Index and RePEc show the best examples of the professional
attitude of enlightening users about their free information services. One
implementation of the open access RePEc archive goes the farthest by
providing substantial and very informative content details. The content
disclosure of Google Scholar is not at all informative.

Furthermore, Google Scholar's FAQ page does not address the most
substantial content issues. The questions included seem unlikely to be the
really frequent type. They sound more like the scripted questions in
infomercials that let the inventor impress the carefully selected audience
with the invention's capability to meet all the needs that the average
customer will never have.

Scope and Size of the Database

Sample searches may shed light on the size of the content, sometimes. You
will find hits from the archives of ACM, Blackwell, the Institute of Physics,
the Nature Publishing Group, Wiley Interscience, Springer, IEEE and many
others. But there is no list of publishers; preprint and reprint servers; or
open access abstracting/indexing databases, like the largest e-print
collection of the NASA Astrophysics Data System (ADS), the outstanding
digital preprint and reprint collection of the RePEc repository or PubMed,
among others.

Breadth of Archives' Coverage

More importantly, users would not have the faintest idea that only a small
subset of the articles in many of these digital archives are known to Google
Scholar. This is particularly painful in such cases where open-access,
full-text scholarly articles are ignored by Google Scholar. The RePEc
archive, for example, has 292,416 items, of which 196,025 are full-text.
Google Scholar has information about and links to merely 43,800 items.

To get some sense of the breadth of coverage, the journal and the source base
of Google Scholar, a somewhat experienced searcher may make test searches to
find out if a given publisher's archive is covered and to what extent, but
the process is not exactly intuitive. Scholars may be good in their subject
territories, but not necessarily in the syntax of Google's advanced search.

Even if they know that the search can be restricted to a domain with the
"site:" parameter (though it is not documented), would they know that the
correct site name for, say, Blackwell is blackwell-synergy and that it must
be followed by .com, as in "site:blackwell-synergy.com"? Would they really
know that there must be no space before and after the colon? There is not
even an advanced mode in Google Scholar, which could make the syntax somewhat
more transparent.

After making some simple searches, users would see various domain names in
the results list and could figure out that if they want articles from, say,
one of the 753 Blackwell journals to which the library has full access in
digital format, the subject query must be limited like this:
"site:blackwell-synergy.com dengue fever hawaii". Not the most user-friendly
solution, but the software gets more unintuitive at other tasks.
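
The domain-restricted query described above can be assembled mechanically.
What follows is only a minimal Python sketch: the function name
build_site_query is hypothetical, and the code merely builds the query string
quoted in this review. It does not contact Google Scholar.

```python
def build_site_query(domain, terms):
    """Return a query of the form 'site:<domain> <terms>'.

    The operator and the domain are joined with no space around the colon,
    which the review notes is required.
    """
    domain = domain.strip()
    if not domain or " " in domain:
        raise ValueError("domain must be a single host name without spaces")
    words = " ".join(terms.split())  # collapse stray whitespace in the topic
    return f"site:{domain} {words}".strip()


print(build_site_query("blackwell-synergy.com", "dengue fever hawaii"))
```

Passing an empty topic yields the bare domain query (for example
"site:iop.org"), which is the form used for the domain-only counts in the
probing tests.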

More unnervingly, my test searches by domain name clearly indicated that
Google Scholar has gathered information for only a small fraction of the
articles available on several publisher sites. For example, Blackwell claims
that it has "437,451 records for articles published in 755 leading journals."
Google Scholar finds 53,400 records when doing a domain search. In other
words, nearly 90% of the records are not retrieved from Blackwell's archive
through Google Scholar. This is not an extreme example, and may have serious
consequences even if the record for some of those articles missed by Google
Scholar may show up in its results list from other databases such as PubMed.

These records, however, offer only the descriptor-enhanced citation and/or
abstract. They don't offer links to the subscription-based journal archives
to which the user's library may subscribe. That's why the holes in Google
Scholar's coverage of many scholarly journal archives are not merely an
academic issue for this reviewer, but something that is important to most
scholars and their libraries. That's why I elaborate on the coverage issue,
reporting some additional test results here.

Probing Tests

The superb search engine of HighWire Press, which hosts many publishers'
journals,
returns 29,044 hits for a test search of the top-ranked Proceedings of the
National Academy of Sciences. Google Scholar retrieves only 12,900 records
for the domain restricted search.

One has to be careful with domain searches, as Google Scholar may show
different domain names in the results list of topical searches, or domain
names that yield no results as a search parameter. This is the case, for
example, with the Wiley archive. Its link appears as doi.wiley.com in all the
results, but in a domain restricted search, the string "site:doi.wiley.com"
returns no results. It must be searched as "site:interscience.wiley.com" or
"site:wiley.com". These two domain name searches, by the way, bring up
slightly different results. Indeed, it is possible that not all of the
documents are stored under the same domain name. I tested several variants
that I saw on results lists, as well as ones that I guessed as possible
variants.

The native search software in the archive of the Institute of Physics (for an
admittedly quick and dirty test search) found 187,678 records for journal
articles. Through Google Scholar's domain searching, the total number of
records is 25,600 for "site:iop.org" and 24,400 for "site:www.iop.org".
Sometimes the domain name with or without the www or other prefix makes no
difference, as in the case of BioMed Central.

It is a no-brainer to sense that something is wrong when the query
"site:ncbi.nlm.nih.gov" (the mouthful domain name for PubMed) brings up only
879,000 records, and the "site:www.ncbi.nlm.nih.gov" variant brings up the
same number. For a reality check, PubMed acknowledges that it has more than
15 million records.

And there are even larger gaps. Ingenta's native search engine reports having
records for 17,343,034 articles, chapters, reports and other documents.
Through Google Scholar, the total number of records was merely 128,000 for
the query "site:ingenta.com" (the domain name that keeps coming up in the
results lists). With due diligence I tried other domain name parameters, like
ingentaconnect.com or catchword.com (acquired earlier by Ingenta), but Google
Scholar did not find any records for these domains.
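
To put the domain-search figures side by side, the short Python sketch below
computes what share of each archive's reported records Google Scholar
returned. The counts are the ones quoted in this review; the variable and
function names are illustrative only, and the PubMed total is the "more than
15 million" figure, so its percentage is an upper bound.

```python
# (archive, records found via Google Scholar, records reported by the archive)
# All figures are the ones quoted in this review (tested November 2004).
COUNTS = [
    ("RePEc", 43_800, 292_416),
    ("Blackwell", 53_400, 437_451),
    ("PNAS (HighWire)", 12_900, 29_044),
    ("Institute of Physics", 25_600, 187_678),  # "site:iop.org" variant
    ("PubMed", 879_000, 15_000_000),  # total is a lower bound
    ("Ingenta", 128_000, 17_343_034),
]


def coverage_percent(found, total):
    """Share of the archive's records that the domain search returned."""
    return 100.0 * found / total


for name, found, total in COUNTS:
    pct = coverage_percent(found, total)
    print(f"{name:<22}{found:>10,} of {total:>12,}  {pct:5.1f}% covered")
```

On these figures Blackwell comes out at roughly 12% covered, consistent with
the "nearly 90%" of records reported as missing, and Ingenta at under 1%.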

Topical Searches

Casual users may not care too much about these problems, as long as they can
find a few good records for scholarly articles from any journal of any
academic publisher for their research papers. Real scholars, however, are
concerned with finding as many relevant, and as few irrelevant or
redundant, items as possible on a specific topic, and with not paying for
something that their college, research institute or corporation already paid
the journal publisher for. The combination of the total lack of information
about source coverage and the shallowness of coverage can hit the serious
users and/or their employers hard.

I have run several topical test queries limited to the appropriate domain
across a number of archives using Google Scholar and the native search
engines, searching separately in the full-text and title fields. As a
follow-up on one of my earlier tests for Google CrossRef, I searched for the
exact phrase "maximum fractional energy loss" in the full text.

The native search engine of the archive of the Institute of Physics (IoP)
found 24 articles (one more than in my April 2004 test). Google Scholar
returned only 15 hits. The item-by-item comparison did not indicate any
pattern for the omission of records by Google Scholar. Current items from
2004 were missing, as well as items from 1985. The format of the full text
(PDF versus PostScript) was not a reason for the omission either.

Other topical tests have shown similarly large differences across several
archives for three test queries. These are not surprising in light of the
disappointing result of the broad, domain-only searches.

The full-text search for the eponym Karman retrieved 430 records by the
native engine and 271 through Google Scholar from the IoP archive. The ratio
from Nature was 37-to-5. For the keyword "vortex," the ratio was similarly
disheartening: for Annual Reviews it was 521-to-371; Blackwell was
372-to-215; IoP 1,333-to-839; and Nature Publishing Group 195-to-15. The
search for the phrase "energy loss" showed similarly bad ratios for Google
Scholar: in the archives of Annual Reviews, it was 700-to-521; Blackwell was
677-to-400; IoP was 7,899-to-3,730; and Nature Publishing Group 383-to-23.

The native search with Wiley's search engine consistently underperformed
Google Scholar in the full-text searches, suggesting possible problems with
the implementation of their native search engine, which offers a combined
abstract/full-text index. These full-text searches yielded sets that were too
large for item-by-item comparison.

However, the same searches limited to title field made it easy to quickly
spot the glaring omissions in Google Scholar. I posted a new polysearch
engine on the Web so anyone could run test searches in the full-text and the
title fields using the native search engines and Google Scholar (with
predefined domain restriction) for five major publishers' archives.

After typing in the query and selecting the archives, the search is run and
the results are displayed side-by-side in separate window panes. In this
example, three articles are retrieved by the native search engine, and only
one by Google Scholar. A 32-year-old article is the common hit, but the two
more current ones were not found by Google Scholar, which has the author name
oddly misspelled as DW INMAN. Oddly, because it appears correctly as D
WEINMAN in the archive.

Scrolling down the 12 matching hits for the search about "vorticity"
illustrates that the native search engine retrieved six times as many hits as
Google Scholar because it is smarter and lemmatizes the query word
"vorticity" so as to also retrieve "vortex," "vortices" and "vortical." This
still does not explain why Google Scholar did not retrieve the record for the
paper on vorticity dynamics.
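
The effect of matching word variants can be illustrated with a small Python
sketch. It is not a description of how the native search engine works: the
variant table is hand-made and the titles are invented purely to contrast
literal matching with variant-aware matching.

```python
import re

# Hypothetical table mapping a query word to related word forms.
VARIANTS = {
    "vorticity": {"vorticity", "vortex", "vortices", "vortical"},
}

# Made-up titles, used only to exercise the two matching strategies.
TITLES = [
    "Vorticity dynamics in a rotating fluid",
    "Vortex shedding behind a cylinder",
    "Interaction of vortices in superfluids",
    "Vortical structures in turbulent jets",
    "Energy loss in thin films",
]


def words(text):
    """Lower-case a title and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(query, titles, expand=False):
    """Return titles containing the query word or, if expand, any variant."""
    wanted = VARIANTS.get(query, {query}) if expand else {query}
    return [t for t in titles if words(t) & wanted]


print(len(search("vorticity", TITLES)))               # literal matching
print(len(search("vorticity", TITLES, expand=True)))  # with word variants
```

A difference in hit counts of this kind is expected and harmless. It cannot,
however, account for a title that contains the literal query word and is
still not retrieved.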

Lemmatization, stemming and automatic pluralization could explain some
differences between the number of hits in some other results lists, but this
does not much change the disheartening ratios mentioned above. Most of them
are inexplicable (and unacceptable) omissions, such as the fourth item for
the Karman query where Google Scholar also shows a weird change in the order
of the title words, suggesting an article describing "how von Kármán flows
swirling," when it is about the swirling flows in the noted scholar's
vorticity theory. Once again, in the archive the title is correct. The second
and third matches may have been omitted because of the correct accents in
Kármán's name, which Google apparently could not handle.
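
If accented characters were indeed the cause (this is only my guess), the
usual remedy is to fold accents on both the query and the indexed text before
comparing them. The Python sketch below shows one common way to do this with
the standard unicodedata module; the function names and the sample title are
illustrative, and nothing here is known about Google's actual processing.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so that 'Kármán' compares equal to 'Karman'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, title):
    """Case- and accent-insensitive substring test."""
    return fold_accents(query).casefold() in fold_accents(title).casefold()


title = "Swirling von Kármán flows"  # hypothetical title for illustration
print(matches("Karman", title))  # accent-folded comparison
print("Karman" in title)         # naive comparison on the raw strings
```

The naive comparison misses the accented spelling, while the folded one finds
it regardless of which spelling the searcher types.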

The retrieval of some articles with the plural form of the search phrase
"energy loss" cannot alone explain why the ratio between the results
by the native search engine and Google Scholar is 25-to-11 in the title-only
test.

Large in a Bad Way

Searching by topical words alone would yield an impressively large number of
hits from Google Scholar, as it seems to be a very large database, but it is
large in a bad way. Here is a typical example of how inflated the hit counts
of Google Scholar can be when it presents three entries (counting them as
three hits) with 14 links for the article in Computer, a journal of IEEE.

In this case, the inflated hit count is partly due to crawling a variety of
sites whose scholarly nature is not immediately apparent from the funky names
of their mirror sites, such as crazyboy.com and nigilist.ru (the
transliteration of the Russian word for nihilist). My learned colleagues may
not exactly feel lucky being steered to some of these sites whose entries may
be graced by prominent journal names in the results list of Google Scholar.

Then again, discovering such sites with possibly unauthorized copies of
articles may have been an argument in persuading scholarly publishers to let
Google's special crawlers into their archives. I am the greatest fan(atic) of
self-archiving by authors, but for these and hundreds of thousands of other
hits, those may not be cases of such self-archiving.

The third "hit" for the above query shows one of the many examples of
Google's problem extracting the correct names of the authors. It misses
authors and mistakes first names and initials for last names, even though on
the page of the linked second site (which was the one working and sporting
the PDF file in all its glory) they appear correctly.

The content of the results lists is rather enigmatic and badly needs an
illustrated help file and some cleaning up.

The Software

The mass adulation for Google's search engine is largely due to its simple
user interface and smart relevance ranking, which usually brings to the top
some of the most relevant hits in a no-brainer format. Understandably, users
often think, say and click "I'm feeling lucky." Google smartly indoctrinated,
sloganized and "buttonized" this apothegm, just as AOL made grandparents
happily hum the "You've Got Mail" ditty. Like its popular counterpart,
searching Google Scholar is easy; finding the gems is difficult.

Content and Ranking of Hits

The display of the citedness score would definitely make me feel lucky, but
those scores are often much inflated (more about that later). I doubt if most
users feel lucky looking at Google Scholar's results list. I bet many feel
discombobulated by the enhanced entries, specifically the labels preceding
them.

Google has added new labels like CITATION, which identifies items that were
extracted from the reference footnotes of other documents, bibliographies,
curricula vitae, etc., but that have no further information and therefore are
not clickable.
There is a link to launch a Web search using the standard Google search
engine with a well-formulated query, which in turn retrieves pages that
include the query term, but the user still may not get a link to the primary
document.

The items with the PS label, identifying PostScript documents that are
particularly popular for physics and computer science articles, may be
unfamiliar to scholars in other fields. Therefore, they may be discouraged
from clicking on such items as they would be required to download and install
the PostScript plug-in. Users may also not understand why certain PDFs are
offered for viewing in HTML format while others are not.

Few would understand why the no. 1 article appears with the same title 10
times in various formats scattered throughout the results list, showing up
among other places as item no. 34, 48, 52, 54, 59, 64, 73, 89, 113, 117 and
119. It helps if they realize that this paper appeared in full and
abbreviated versions in different sources. Scholars (who are not necessarily
intimately familiar with information technology) may feel more confused than
lucky and wonder how these records relate to the six others, four of which
have cryptic hyperlinks as part of the no. 1 entry.

Clicking on the link to show a list of all six links in a separate window may
not alleviate their confusion as it has a duplicate pair, which reduces the
number to five. This is just a prelude to the really daunting task of
understanding what the links mean, when and why they are selected, and where
they take the user.

If they figured these out, then they may believe that they understand the
ranking of the results as they see the decreasing citedness scores until they
get to item no. 5, which was cited more than four times as often as item no.
4. They may guess that records that matched the query term in the title field
are ranked ahead of the ones with higher citedness score, but this does not
seem to hold true when looking at items no. 8 and no. 9.

If that's not enough, they may question why some items have a cached version
while others don't. Then comes possibly the most discombobulating issue: the
links listed in the records.

Links, Links, Links

The first time an eyebrow may really rise is when two links appear with the
same name in the same entry (one hotlinked, the other not), such as in this
entry from Blackwell. The first occurrence is not clickable because it is
linked through the title field of the record. The second is hotlinked, but it
takes you to the very same location as the title link within the archive.
Although the names of the links suggest that you will be taken to the
homepage of the publisher, they are just a shorthand. Right clicking the
links, then selecting the Properties option will reveal the full URL.

Many scholarly users would be even more puzzled as to why ingenta.com is
hotlinked (because it hosts Blackwell journals) next to blackwell-synergy.com
which is not hotlinked (because clicking on the hotlinked title takes one to
the publisher's site in this case).

Furthermore, why is there a link to ncbi.nlm.nih.gov for the same record?
Because MEDLINE also has a record for the article. It is only an
abstracting/indexing record, but with MeSH terms as a bonus. But why does
ingenta.com appear twice with both being hotlinked? Why does the second
ingenta link take the user to the record of an unrelated article? Because the
seemingly unrelated article does have a relationship to the main article
about dengue fever.

Alas, this relationship is a very indirect one: the article to which the
second link takes the user was published in the same issue of the journal
Heredity. "So what?" you may ask. Well, the table of contents page on Ingenta
includes both of them. That's it. And this is only the tip of the iceberg as
the results screen shows more cryptic notations.

The CITATION Hits

The biggest confusion overall may be caused by listing the primary documents
or their indexing/abstracting records intertwined with records for other
documents that list the primary document in their references.

Results retrieved for my search on the problems of intractability and
computers illustrate the possible extent of this problem and the inflated
nature of the hit counts and citedness scores. The search yielded 8,130 hits.
I looked at the first 100 "hits" and 92% of them were about the book
"Computers and Intractability" by Garey and Johnson, with as many errors and
inconsistencies in the title, subtitle, author names, publisher names,
locations and years as one can imagine. Only eight of the first 100 hits were
for items other than this book, scattered around the results list as items
14, 27, 36, 37, 55, 89, 99 and 100. Ninety-one of the "hits" were labeled as
CITATIONS, meaning that the "hit" was extracted from references in other
records in one of the other archives crawled by Google Scholar.

It is not that so many references were given incorrectly in the source
documents. Most of them came from the cited reference list of the ACM Guide.
It is a lovely archive, but it has a prominent note in red type in every
record that "OCR errors may be found in this Reference List extracted from
the full text article." Well, OCR errors are found in most reference lists as
the technology is not yet perfect.

The problem is that the crawlers of Google Scholar take and deliver the
references as they are, then Google Scholar seems to create a record for each
of them. Consequently, it counted and listed each that matched my two-word
query. I don't know how many hits on the entire results list were for
variants of this book, but I do know that no scholar would scroll down the
8,130-item results list of a topical search in the hope of finding the full
documents, or at least an abstract of other items relevant to this topic.

Google's approach is like mixing in a gigantic bowl the appetizer, soup,
entree, salad, dessert and coffee. It is not exactly a mouth-watering
potpourri, even though there are many delicious ingredients in the bowl.

All the other citation-enhanced systems (including the best free ones like
CiteBase and Research Index) handle these two hit categories separately; try
to consolidate the format differences; filter the "citing" sources to avoid
course listings and other materials of tertiary importance for a topical
search, let alone for citation counting; and offer clearly explained options
to look up cited and citing references.
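
As a rough illustration of what consolidating format differences involves,
the Python sketch below groups reference variants under a normalized key
(author surnames plus the main title, ignoring case, punctuation and the
subtitle). It is a deliberately simple exact-key grouping under my own
assumptions, not the method used by CiteBase, Research Index or Google, and
it does not attempt to repair OCR errors. The records are invented variants
and a placeholder entry, not data taken from Google Scholar.

```python
import re
from collections import defaultdict

# Invented variants imitating the kind of inconsistency described above.
RECORDS = [
    {"authors": ["Garey", "Johnson"],
     "title": "Computers and Intractability: A Guide to the Theory of NP-Completeness"},
    {"authors": ["GAREY", "JOHNSON"],
     "title": "Computers and intractability"},
    {"authors": ["Johnson", "Garey"],
     "title": "Computers and Intractability; a guide to the theory of NP-completeness"},
    {"authors": ["Doe"],
     "title": "An unrelated paper"},  # placeholder record
]


def match_key(record):
    """Build a key that ignores case, punctuation, author order and subtitle."""
    surnames = tuple(sorted(name.casefold() for name in record["authors"]))
    main_title = re.split(r"[:;]", record["title"], maxsplit=1)[0]
    title = " ".join(re.findall(r"[a-z0-9]+", main_title.casefold()))
    return surnames, title


def consolidate(records):
    """Group records that share the same normalized key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return groups


groups = consolidate(RECORDS)
for (surnames, title), members in groups.items():
    print(f"{len(members)} variant(s): {', '.join(surnames)} / {title}")
```

Counting one work per group, rather than one hit per variant, is what keeps
both the hit count and the citedness score from being inflated by spelling
and formatting differences.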

From the results returned as CITATIONS, you may launch a Web search in the
generic Google service, or get to the cluster of the citing records for each
in Google Scholar. The first hit on the original results list had 8,397
"cited by" sources, the second had 1,736. If you add up the citedness scores
for each variant for this book it would be well over 10,000. I don't know how
many of these are double and triple listed and counted in calculating the
citedness scores due to postings on mirror sites, and I wonder if anyone
would want to find out.

I do know that books are more cited than other items in many disciplines. I
do know that this is one of the most-cited books in computer science. I do
know that a score above 5,000 unique citing references would make a computer
science book, article or conference paper an all-time citation classic
superstar (to borrow Eugene Garfield's terminology). I do know that both the
hit counts for searches without domain restriction and the citedness scores
are often inflated. Paradoxically, I also know that millions of citations
from scholarly journals and books are not counted, let alone listed, such as
the ones from most of the 1,700 Elsevier publications that are not covered at
all by Google Scholar, let alone analyzed for citations.

Google, Inc. has the intellectual and financial resources (and the largest
group of cheerleaders) to create a superb resource discovery tool of
scholarly publications. It needs to:

    1. exploit the highly structured and tagged Web pages with rich metadata
       readily available in the digital archives of most scholarly publishers
    2. create field-specific indexes for many distinct data elements
    3. offer an advanced search mode with pull-down menus for limiting the
       search by publisher, journal, document type, publication year, etc.
    4. consolidate cited references through the ever-increasing DOI registry
    5. collect information on all the relevant materials from the publishers'
       archives
    6. develop utilities that enable libraries to launch a known-item
       federated search in the full-text aggregators' databases licensed by
       the library in order to check if any have the document from a journal
       that is not licensed digitally from the publisher

I promise that I will write a hagiographic review about Google Scholar when
it is done, and done well.