jeffd
Status: Assistant
Joined: 04 Oct 2003
Posts: 594
Posted: Apr 24, 05, 17:26    Google's Duplicate content page hash and page age
I recently had a few thoughts on how google handles duplicate content, and it struck me:

With many newer sites being dynamic, and almost none of them sending last modified headers to spiders, how does a search engine know a page has been modified?

Conventional wisdom has always said that it was through the Last-Modified date sent by the web server. But when a page is generated by a server-side scripting language, there is no Last-Modified date unless the webmaster has set one up, which they almost never do because of the overhead and complexity involved, especially with database-driven web pages.
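For anyone wondering what "setting that up" would involve, here is a rough Python sketch of a dynamic page sending a Last-Modified header and answering If-Modified-Since. The function name conditional_response and the idea that the script already knows when its content last changed (say, from a database timestamp) are my assumptions for illustration, not how any particular package does it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_changed, if_modified_since=None):
    """Return (status, headers) for a page whose content last changed at
    `last_changed`, which must be a timezone-aware UTC datetime."""
    # HTTP dates only have one-second resolution.
    last_changed = last_changed.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_changed, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                # A "-0000" offset parses as naive; treat it as UTC.
                since = since.replace(tzinfo=timezone.utc)
            if last_changed <= since:
                return 304, headers  # Not Modified, no body needed

    return 200, headers


# Hypothetical timestamp, e.g. read from the newest row in a database.
changed = datetime(2005, 4, 24, 17, 26, tzinfo=timezone.utc)
status, headers = conditional_response(changed)
print(status, headers["Last-Modified"])
print(conditional_response(changed, headers["Last-Modified"])[0])
```

The hard part in practice is not the header itself but knowing the real "last changed" time for a page assembled from several database queries, which is the overhead I mean.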

Obviously, google has to have evolved a better system to handle this, and I think they have.

Read some of the search engine resource library items in this forum to get more details on the individual components I believe are involved in this procedure.

Google's rule #1 - Keep It Simple Stupid [aka: KISS]
I believe Google is keeping it very simple. Given the number of pages they index every day, and the size of the index itself, there has to be a very simple way for Google to determine not just one thing but two: the age of the page, AND whether the page is duplicated from another page.

How do they do this? I think the answer is very simple: when they index a page, they create a hash, like the md5 checksums you may have seen when downloading a large file from the web. This hash is created by taking a sampling from the start of the HTML to the end. There is almost no way a page can be changed without changing what falls at the sampled positions within the page HTML, and therefore changing the hash.

So when the spider hits and feeds your page to the main index processors, all they have to do is run that hash algorithm on the page, record the hash in the page metadata, then use that essentially unique hash to compare the page to any other page, including earlier copies of itself.
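To make the idea concrete, here is a minimal Python sketch of the kind of page hash I'm describing. It hashes the whole HTML with md5 rather than a sampling, and the function name page_hash is just mine; I have no knowledge of what Google actually runs.

```python
import hashlib


def page_hash(html: str) -> str:
    """Return an md5 hex digest of the full page HTML (assumed approach)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


original = "<html><body><p>Hello world</p></body></html>"
edited = "<html><body><p>Hello world!</p></body></html>"

print(page_hash(original) == page_hash(original))  # True: same page, same hash
print(page_hash(original) == page_hash(edited))    # False: one character changed
```

The point is that the comparison is a cheap string equality check on a short digest, no matter how large the page is.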

This could be refined to strip out all HTML before hashing, but I do not believe that they do that, or at least I have not seen any indication that they do. Either way, the principle would be the same.

How does Google handle Duplicate pages?
When the spider hits a dynamic, server-processed page, it knows it has hit one, because it will almost never get a Last-Modified header in the process. It can then quickly check the newly generated hash against the previous one for that url. If the hashes are identical, the page has not changed. If they are different, the newly indexed page can be considered changed.
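The check described above could look something like this in Python. ChangeTracker and record_crawl are hypothetical names, and the in-memory dict stands in for whatever store a real index would use.

```python
import hashlib


class ChangeTracker:
    """Remembers the last hash seen for each url (illustrative only)."""

    def __init__(self):
        self._last_hash = {}  # url -> md5 hex digest from the previous crawl

    def record_crawl(self, url: str, html: str) -> str:
        """Store the new hash and report 'new', 'unchanged' or 'changed'."""
        digest = hashlib.md5(html.encode("utf-8")).hexdigest()
        previous = self._last_hash.get(url)
        self._last_hash[url] = digest
        if previous is None:
            return "new"
        return "unchanged" if previous == digest else "changed"


tracker = ChangeTracker()
url = "http://example.com/page.php?id=1"  # made-up url
print(tracker.record_crawl(url, "<p>version one</p>"))  # new
print(tracker.record_crawl(url, "<p>version one</p>"))  # unchanged
print(tracker.record_crawl(url, "<p>version two</p>"))  # changed
```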

Clearly, there's some very useful information to be gained from this. Since Google prefers freshly updated pages, it should pay to alter the page content enough to produce a different hash, which will then hopefully prompt the indexing system to reprocess the page, giving it the freshness boost we've all come to know and love.

This technique could also be carried out on a much larger scale, to ferret out duplicate content pages. It's my belief, though, that this component is currently relatively primitive, and is not actually able to catch much more than identical pages. For example, if you've parked one domain on another, and Google comes in to spider domain2 after already having spidered and indexed domain1, it will note that all the hashes match, thus flagging domain2 as duplicate content.

I assume there are also dates attached to these hashes, which is how Google determines which page was the original source page, and which the duplicate.
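Here is a rough Python sketch of the parked-domain case, with a first-seen date attached to each hash. DuplicateIndex, add and the example urls are made up for illustration; the rule that the earliest crawl date wins is my assumption.

```python
import hashlib
from datetime import date


class DuplicateIndex:
    """Maps each page hash to the url and date it was first seen."""

    def __init__(self):
        self._originals = {}  # digest -> (url, first crawl date)

    def add(self, url: str, html: str, crawled_on: date) -> str:
        """Record a crawl and return the url treated as the original
        source for this exact content."""
        digest = hashlib.md5(html.encode("utf-8")).hexdigest()
        known = self._originals.get(digest)
        if known is None or crawled_on < known[1]:
            # First sighting, or an earlier-dated copy: this url is the original.
            self._originals[digest] = (url, crawled_on)
            return url
        return known[0]


index = DuplicateIndex()
html = "<html><body>Identical parked-domain page</body></html>"
print(index.add("http://domain1.example/", html, date(2005, 3, 1)))
print(index.add("http://domain2.example/", html, date(2005, 4, 1)))
```

Both calls report domain1 as the original, so domain2 would be the one flagged as the duplicate.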

Accurately Determining Partial Duplicate Content?
In your dreams. There are simply too many pages on the web that share some data, in some form.

I believe that while Google may be attempting to refine this, it's simply too difficult to accurately judge, especially when it comes to blocks of text carried within non-duplicate HTML and text.
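A quick Python illustration of why a whole-page hash can't see partial duplication: two pages that carry the same block of text inside different HTML get completely unrelated hashes. The page contents and the page_hash helper are invented for the example.

```python
import hashlib


def page_hash(html: str) -> str:
    """md5 hex digest of the full page HTML (assumed approach)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


# The same block of text, carried within non-duplicate HTML on two sites.
shared = "<p>The same block of article text appears on both pages.</p>"
page_a = "<html><body><h1>Site A</h1>" + shared + "</body></html>"
page_b = "<html><body><h1>Site B</h1>" + shared + "</body></html>"

print(shared in page_a and shared in page_b)       # True: the block is shared
print(page_hash(page_a) == page_hash(page_b))      # False: the hashes don't show it
```

An exact-match hash only answers "identical or not", which is why I think anything beyond identical pages is a much harder problem.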

Ok, I know, I must be bored if I'm sitting around thinking about this stuff, but there it is, feel free to disagree, or agree, or present a better interpretation of what we see.
MatthewHSE
Status: Contributor
Joined: 20 Jul 2004
Posts: 122
Location: Central Illinois, typically glued to a computer screen
Posted: Jul 11, 05, 17:00    
Great thoughts here. This goes along with a few things I'd been thinking about lately; I guess I always did figure there were automatic comparisons being made between current pages and Google's index but so far I haven't seen anyone else suggesting such a thing.

Frankly I'm not sure duplicate content on the same site is that much of an issue anyway. There are so many CMSs that have two or three URLs for the same page, depending on how you get to it, that I just don't see "same-domain" dupe content playing that heavily in an SE's algo.
jeffd
Status: Assistant
Joined: 04 Oct 2003
Posts: 594
Posted: Jul 11, 05, 20:49    
The main problem with using poorly seo'ed packages like the ones you mention is that you lose control over which page Google etc. will consider the root, or master document, that is, the source of the copy. I just noticed this issue on these forums, where I'd missed two small seo things. Well, I didn't miss them, the guy who made the original seo mod for phpbb missed them. He used two different urls for the next page link [one for a forum, and one for a topic] when there are multiple pages, and he used the wrong root url for the start page when there is more than one page for a forum or topic. This results in one page being treated normally, and the other one cluttering up the results as supplemental, that is, not part of the standard result set.

This is now fixed, but it required further modifications that shouldn't have been necessary. It goes to show that when it comes to seo, error is more common than correctness. With seo, it's much better to solve issues before the pages and urls enter Google's index than after, since it takes forever for that stuff to leave their index.

When looking at CMS packages, the technical correctness of their seo methods would be one of the major things I'd look at when deciding which cms to use. The same page, or content, should be reachable at only one url, with the exception I'd say of content that is moved from an index type intro page to storage pages as new content is added. In the case of phpbb, these errors weren't that hard to fix, but with more complex packages it might get pretty nasty.

However, I'd tend to agree with the spirit of what you're saying. On your own site, all that happens is that one page becomes a supplemental result and the other becomes the page returned as real, so it doesn't really matter that much, except possibly for diluting page rank by linking to a page that has zero value in the algo but has the same content as one with value. That's something to think about I'd say.

SEO is a weird game. No matter how many times Google says "just make good webpages and everything will be fine", that simply isn't true. The truth is that if you write webpages for Google you will rank in Google, assuming all the other ingredients, links, PR, trustrank, etc. are in place, as annoying, stupid, and pointless as that is.

Google makes the rules, sometimes by accident in the case of a weakness in their algo, sometimes on purpose, but in both cases you're writing for google first, the user second. No serp position, no user, it's pretty basic. The only way I know to avoid this is to have some type of viral marketing success, which happens from time to time to some sites, but it's not the rule, it's the exception.

All times are GMT - 8 Hours