nettime's_roving_reporter on Tue, 27 Aug 2002 15:49:18 +0200 (CEST)
<nettime> googlewatch: PageRank -- Google's Original Sin
[via <tbyfield@panix.com>]
<http://www.google-watch.org/pagerank.html>
PageRank: Google's Original Sin
by Daniel Brandt
August 2002
By 1998, the dot-com gold rush was in full swing. Web search
engines had been around since 1995, and had been immediately touted
by high-tech pundits (and Forbes magazine) as one more element in
the magical mix that would make us all rich. Such innovations, we were
told, meant nothing less than the end of the business cycle.
But the truth of the matter, as these same pundits conceded after
the crash, was that the false promise of easy riches put
bottom-line pressures on companies that should have known better.
One of the most successful of the earliest search engines was
AltaVista, then owned by Digital Equipment Corporation. By 1998 it
began to lose its way. All the pundits were talking "portals," so
AltaVista tried to become a portal, and forgot to work on improving
its search ranking algorithms.
Even by 1998, it was clear that too many results were being
returned by the average search engine for the one or two keywords
that were entered by the searcher. AltaVista offered numerous ways
to zero in on specific combinations of keywords, but paid much less
attention to the "ranking" problem. Ranking, or the ordering of
returned results according to some criteria, was where the action
should have been. Users don't want to figure out Boolean logic, and
they will not be looking at more than the first twenty matches out
of the thousands that might be produced by a search engine. What
really matters is how useful the first page of results appears on
search engine A, as opposed to the results produced by the same
terms entered into engine B. AltaVista was too busy trying to be a
portal to notice that this was important.
Enter Google
By early 1998, Stanford University grad students Larry Page and
Sergey Brin had been playing around with a particular ranking
algorithm. They presented a paper titled "The Anatomy of a
Large-Scale Hypertextual Web Search Engine" at a World Wide Web
conference. With Stanford as the assignee and Larry Page as the
inventor, a patent was filed on January 9, 1998. By the time it was
finally granted on September 4, 2001 (Patent No. 6,285,999), the
algorithm was known as "PageRank," and Google was handling 150
million search queries per day. AltaVista continued to fade; even
two changes of ownership didn't make a difference.
Google hyped PageRank because it was a convenient buzzword that
satisfied those who wondered why Google's engine did, in fact,
provide better results. Even today, Google is proud of this
advantage. The hype approaches the point where bloggers sometimes
have to specify what they mean by "PR" -- do they mean PageRank,
the algorithm, or the Public Relations that Google does so well?
Here is how Google describes PageRank:
PageRank relies on the uniquely democratic nature of the web by
using its vast link structure as an indicator of an individual
page's value. In essence, Google interprets a link from page A to
page B as a vote, by page A, for page B. But, Google looks at more
than the sheer volume of votes, or links a page receives; it also
analyzes the page that casts the vote. Votes cast by pages that are
themselves "important" weigh more heavily and help to make other
pages "important."
Google goes on to admit that other variables are also used, in
addition to PageRank, in determining the relevance of a page. While
the broad outlines of these additional variables are easily
discerned by webmasters who study how to improve the ranking of
their websites, the actual details of all algorithms are considered
trade secrets by Google, Inc. It's in Google's interest to make it
as difficult as possible for webmasters to cheat on their rankings.
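The weighted-vote scheme Google describes is the recursive
calculation published in the Brin/Page paper. A minimal sketch in
Python, assuming the paper's formula with its damping factor of
0.85 rather than whatever Google actually runs in production, looks
like this:

# A minimal sketch of the PageRank calculation as published in the
# Brin/Page paper, not Google's production code. "links" maps each
# page to the pages it links to; "damping" is the paper's 0.85.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links.keys())
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each inbound "vote" is weighted by the voter's own rank,
            # divided by the number of links the voter casts.
            incoming = sum(rank[p] / len(links[p])
                           for p in pages if page in links[p])
            new_rank[page] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank

# Toy web: page C ends up "important" because both A and B vote for it.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))

The rich-get-richer behavior discussed below falls straight out of
this recursion: a vote is worth only as much as the rank of the page
casting it, divided among all the links that page gives out.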
It's all in the ranking
Beyond any doubt, search engines have become increasingly important
on the web. E-commerce is very attuned to the ranking issue,
because higher ranking translates directly into more sales. Engines
have devised various ways to monetize ranking, such as paid
placement, pay per click, and pay for inclusion. On June 27, 2002,
the U.S. Federal Trade Commission issued guidelines recommending
that any ranking results influenced by payment, rather than by
impartial and objective relevance criteria, be clearly labeled as
such in the interests of consumer protection. It appears, then, that
any algorithm that, like PageRank, can reasonably pretend to be
objective will remain an important aspect of web searching for the
foreseeable future.
Not only have engines improved their ranking methods, but the web
has grown so huge that most surfers use search engines several
times a day. All portals have built-in search functions, and most
of them have to rely on one of a handful of established search
engines to provide results. That's because only a few engines have
the capacity to "crawl" or "spider" more than two billion web pages
frequently enough to keep their databases current. Google is perhaps
the only engine known for consistent, predictable crawling, and even
that has been true for less than two years. It takes almost a week
to cover the available web, and another week to calculate PageRank
for every page. Google's main update cycle is about 28 days, which
is a bit too slow for news-hungry surfers. In August 2001 Google
also began a second "mini-crawl" of news sites, which are now
checked every day. Results from the two crawls are mingled together,
giving the searcher an impression of freshness.
For the average webmaster, the mechanics of running a successful
site have changed dramatically from 1996 to 2002. This is due
almost entirely to the increased importance of search engines. Even
though much of the dot-com hype collapsed in 2000 and 2001 (a
welcome relief to noncommercial webmasters who remembered the
pre-hype days), the fact remains that by now, search engines are
the fundamental consideration for almost every aspect of web design
and linking. It's close to a wag-the-dog situation. That's why the
algorithms that search engines consider to be consistent with the
FTC's idea of impartial and objective ranking criteria deserve
closer scrutiny.
What objective criteria are available?
Ranking criteria fall into three broad categories. The first is
link popularity, which is used by a number of search engines to
some extent. Google's PageRank is the original form of "link pop,"
and remains its purest expression. The next category is on-page
characteristics. These include font size, title, headings, anchor
text, word frequency, word proximity, file name, directory name,
and domain name. The last is content analysis. This generally takes
the form of on-the-fly clustering of produced results into two or
more categories, which allows the searcher to "drill down" into the
data in a more specific manner. Each method has its place. Search
engines use some combination of the first two, or they use on-page
characteristics alone, or perhaps even all three methods.
Content analysis is very difficult, but also very enticing. When it
works, it allows for the sort of graphical visualization of results
that can give a search engine an overnight reputation for
innovation and excellence. But many times it doesn't work well,
because computers are not very good at natural language processing.
They cannot understand the nuances within a large stack of prose
from disparate sources. Also, most top engines work with dozens of
languages, which makes content analysis more difficult, since each
language has its own nuances. There are several search engines that
have made interesting advances in content analysis and even
visualization, but Google is not one of them. The most promising
aspect of content analysis is that it can be used in conjunction
with link pop, to rank sites within their own areas of
specialization. This provides an extra dimension that addresses
some of the problems of pure link popularity.
Link popularity, which is "PageRank" to Google, is by far the most
significant portion of Google's ranking cocktail. While in some
cases the on-page characteristics of one page can trump the
superior PageRank of a competing page, it's much more common for a
low PageRank to completely bury a page that has perfect on-page
relevance by every conceivable measure. To put it another way, it's
frequently the case that a page with both search terms in the
title, and in a heading, and in numerous internal anchors, will get
buried in the rankings because the sponsoring site isn't
sufficiently popular, and is unable to pass sufficient PageRank to
this otherwise perfectly relevant page. In December 2000, Google
came out with a downloadable toolbar attachment that made it
possible to see the relative PageRank of any page on the web. Even
the dumbed-down resolution of this toolbar, in conjunction with
studying the ranking of a page against its competition, allows for
considerable insight into the role of PageRank.
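Google does not publish how these signals are combined, but the
imbalance described above can be illustrated with a purely
hypothetical weighting (the weights and numbers below are invented
for illustration, not Google's):

# Purely hypothetical ranking mix, for illustration only; Google's real
# weights are trade secrets. The point is that a heavy link-popularity
# weight lets a popular page outrank a far more relevant one.
def score(on_page_relevance, pagerank, link_weight=0.8):
    # Both inputs are assumed to be normalized to the range 0..1.
    return link_weight * pagerank + (1 - link_weight) * on_page_relevance

perfect_but_obscure = score(on_page_relevance=1.0, pagerank=0.05)
mediocre_but_popular = score(on_page_relevance=0.4, pagerank=0.60)
print(perfect_but_obscure, mediocre_but_popular)   # 0.24 versus 0.56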
Moreover, PageRank drives Google's monthly crawl, such that sites
with higher PageRank get crawled earlier, faster, and deeper than
sites with low PageRank. For a large site with an average-to-low
PageRank, this is a major obstacle. If your pages don't get
crawled, they won't get indexed. If they don't get indexed in
Google, people won't know about them. If people don't know about
them, then there's no point in maintaining a website. Google starts
over again on every site in each 28-day cycle, so the missing pages
stand an excellent chance of being missed on the next cycle as well.
In short, PageRank is the soul and essence of Google, in both the
all-important crawl and the all-important rankings. By 2002
Google was universally recognized as the world's most popular
search engine.
How does PageRank measure up?
In the first place, Google's claim that "PageRank relies on the
uniquely democratic nature of the web" must be seen for what it is,
which is pure hype. In a democracy, every person has one vote. In
PageRank, rich people get more votes than poor people, or, in web
terms, pages with higher PageRank have their votes weighted more
than the votes from lower pages. As Google explains, "Votes cast by
pages that are themselves 'important' weigh more heavily and help
to make other pages 'important.'" In other words, the rich get
richer, and the poor hardly count at all. This is not "uniquely
democratic," but rather it's uniquely tyrannical. It's corporate
America's dream machine, a search engine where big business can
crush the little guy. This alone makes PageRank more closely
related to the "pay for placement" schemes frowned on by the
Federal Trade Commission, than it is related to those "impartial
and objective ranking criteria" that the FTC exempts from labeling.
Secondly, only big guys can have big databases. If your site has an
average PageRank, don't even bother making your database available
to Google's crawlers, because they most likely won't crawl all of
it. This matters for any site that has more than a few thousand
pages and a home page rated about five or lower on the toolbar's
crude scale.
Thirdly, in order for Google to access the links to crawl a deep
site of thousands of pages, a hierarchical system of doorway pages
is needed so that the crawler can start at the top and work its way
down. A single site with thousands of pages typically has all
external links coming into the home page, and few or none coming
into deep pages. The home page PageRank therefore gets distributed
to the deep pages by virtue of the hierarchical internal linking
structure. But by the time the crawler gets to the real "meat" at
the bottom of the tree, these pages frequently end up with a
PageRank of zero. This zero is devastating for the ranking of that
page, even assuming that Google's crawler gets to it, and it ends
up in the index, and it has excellent on-page characteristics. The
bottom line is that only big, popular sites can put their databases
on the web and expect Google to cover their data adequately. And
that's true even for websites that had their data on the web long
before Google started up in 1999.
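A rough back-of-the-envelope sketch shows how quickly a home page's
rank thins out as it is split among internal links. Only the 0.85
damping factor comes from the published paper; the fan-out and depth
below are invented examples:

# Illustrative only: on a strictly hierarchical site, a home page's rank
# is divided among its internal links at every level, so deep pages
# receive a tiny fraction of it. The damping factor of 0.85 follows the
# published paper; the fan-out and depth are made-up examples.
def deep_page_share(home_rank, links_per_page=50, depth=3, damping=0.85):
    rank = home_rank
    for _ in range(depth):
        rank = damping * rank / links_per_page
    return rank

# A database page three clicks below the home page gets roughly a
# two-hundred-thousandth of the home page's rank, effectively zero
# on Google's coarse zero-to-ten toolbar.
print(deep_page_share(home_rank=1000.0))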
What about non-database sites?
There are other areas where PageRank has a negative effect, even
for sites without a lot of data. The nature of PageRank is so
discriminatory that it's rather like the exact opposite of
affirmative action. While many see affirmative action as reverse
discrimination, no one would claim (apart from economists who
advocate more tax cuts for the rich) that the opposite, which would
be deliberate discrimination in favor of the already-privileged, is
a solution for anything. Yet this is essentially what Google
claims.
Those who launch new websites in 2002 have a much more difficult
time getting traffic to their sites than they did before Google
became dominant. The first step for a new site is to get listed in
the Open Directory Project. This is used by Google to seed the
crawl every month. But even after a year of trying to coax links to
the new site from other established sites, its webmaster can expect
fewer than 30 visitors per day. Sites with a respectable
PageRank, on the other hand, get tens of thousands of visitors per
day. That's the scale of things on the web -- a scale that is best
expressed by the fact that Google's zero-to-ten toolbar is a
logarithmic scale, perhaps with a base of six. To go from an old
PageRank of four to a new rank of five requires several times more
incoming links. This is not easy to achieve. The cure for cancer
might already be on the web somewhere, but if it's on a new site,
you won't find it.
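If the toolbar really is logarithmic with a base of around six, as
guessed above, the arithmetic looks roughly like this. This is a
sketch under that assumption only; Google has never published the
toolbar's actual scale, and the raw numbers are invented:

import math

# The base-6 figure is the author's guess; Google has never published
# the toolbar's scale. Under that assumption, each extra toolbar point
# corresponds to roughly six times as much raw PageRank.
def toolbar_value(raw_pagerank, base=6):
    if raw_pagerank < 1:
        return 0
    return min(10, int(math.log(raw_pagerank, base)))

print(toolbar_value(1300))        # shows as about 4 on the toolbar
print(toolbar_value(1300 * 6))    # six times the raw rank: shows as 5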
PageRank also encourages webmasters to change their linking
patterns. On search engine optimization forums, webmasters even
discuss charging for small link ads, priced according to the
PageRank their sites have achieved; sites with a lower PageRank
benefit by paying for such ads. Sometimes these
PageRank achievements are the result of link farms or other shady
practices, which Google tries to detect and then penalizes with a
PageRank of zero. At other times professional optimizers get away
with spammy techniques. Mirror sites and duplicate pages on other
domains are now forbidden by Google and swiftly punished, even when
there are good reasons for maintaining such sites. Overall, linking
patterns have changed significantly because of Google. Many
webmasters are stingy about giving out links (since every additional
outbound link dilutes the PageRank passed along to any one site),
even as they're desperate for more links from others.
What should Google do?
We feel that PageRank has run its course. Google doesn't have to
abandon it entirely, but they should de-emphasize it. The first
step is to stop reporting PageRank on the toolbar. This would mute
the awareness of PageRank among optimizers and webmasters, and
remove some of the bizarre effects that such awareness has
engendered. The next step would be to replace all mention of
PageRank in their own public relations documentation, in favor of
general phrases about how link popularity is one factor among many
in their ranking algorithms. And Google should adjust the balance
between their various algorithms so that excellent on-page
characteristics are not completely cancelled by low link
popularity.
PageRank must be streamlined so that the "tyranny of the rich"
characteristics are scaled down in favor of a more egalitarian
approach to link popularity. This would greatly simplify the
complex and recursive calculations that are now required to rank
two billion web pages, which must be very expensive for Google. The
crawl must not be PageRank-driven. There should be a way for Google
to arrange the crawl so that if a site cannot be fully covered in
one cycle, Google's crawlers can pick up where they left off on the
next cycle.
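A resumable crawl of that sort is not complicated in principle. The
sketch below is purely illustrative (the function names and the
state file are invented, and nothing here reflects Google's actual
crawler): it simply saves the unvisited frontier at the end of a
cycle and reloads it at the start of the next.

import json

# A minimal sketch of the resumable crawl proposed above; this is not
# how Google works. URLs that don't fit in one cycle's budget are saved
# to disk and reloaded next cycle, so deep pages are eventually reached
# regardless of PageRank.
def crawl_site(start_urls, fetch_links, budget, state_file="frontier.json"):
    try:
        with open(state_file) as f:
            frontier = json.load(f)          # resume last cycle's leftovers
    except FileNotFoundError:
        frontier = list(start_urls)
    seen = set()
    while frontier and budget > 0:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        budget -= 1
        frontier.extend(fetch_links(url))    # caller supplies the page fetcher
    with open(state_file, "w") as f:
        json.dump(frontier, f)               # save what didn't fit this cycle
    return seen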
Google is so important to the web these days, that it probably
ought to be a public utility. Regulatory interest from agencies
such as the FTC is entirely appropriate, but we feel that the FTC
addressed only the most blatant abuses among search engines.
Google, which only recently began using sponsored links and ad
boxes, was not even an object of concern to Commercial Alert, the
Ralph Nader group that complained to the FTC.
This was a mistake, because Commercial Alert failed to look closely
enough at PageRank. Some aspects of PageRank, as presently
implemented by Google, are nearly as pernicious as pay for
placement. There is no question that the FTC should regulate
advertising agencies that parade as search engines, in the
interests of protecting consumers. Google is still a search engine,
but not by much. They can remain a search engine only by fixing
PageRank's worst features.
_________________
Daniel Brandt is founder and president of Public Information
Research, Inc., a tax-exempt public charity that sponsors NameBase.
He began compiling NameBase in 1982, from material that he started
collecting in 1974, and is now the programmer and webmaster for
PIR's several sites. He participates in various forums where
webmasters share observations about the often-secretive algorithms,
bugs, and behavior of various search engines. Brandt has been
watching Google's interaction with NameBase ever since Google, in
October 2000, became the first search engine to go "deep" on PIR's
main site by crawling thousands of dynamic pages.
Google Watch
# distributed via <nettime>: no commercial use without permission
# <nettime> is a moderated mailing list for net criticism,
# collaborative text filtering and cultural politics of the nets
# more info: majordomo@bbs.thing.net and "info nettime-l" in the msg body
# archive: http://www.nettime.org contact: nettime@bbs.thing.net