Trying a little cloaking
MatthewHSE
Status: Contributor
Joined: 20 Jul 2004
Posts: 122
Location: Central Illinois, typically glued to a computer screen
Yep, I'm changing the color of my SEO hat! ;)

Not really. Actually, I want to use just enough cloaking to avoid starting a session for crawlers. Before I get started, though, I have a few questions about this:

1.) Which user-agents do I need to filter out so that no session gets started for them? I've got Googlebot, MSN, and Yahoo covered. I haven't kept up with the latest on this, though, so I'm not sure if information I find about other crawler UA's is current.

2.) Is user-agent cloaking sufficient for this purpose, or should I get into IP-based cloaking instead? The idea is to just avoid getting session URL's indexed by the SE's, pure and simple. I would think going by the UA would be enough, but since this is my first time cloaking, I wanted to be sure.

3.) How does the following PHP code look to accomplish this task? It seems to work well enough, but I'm open to suggestions for improvements:

:: Code ::
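<?php

// Substrings that identify the crawlers we do not want to start a session for.
$se = array('googlebot', 'yahoo', 'msn');
// Some clients send no User-Agent header at all, so fall back to an empty string.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$setsession = true;
foreach ($se as $v) {
   // Case-insensitive substring match; fixed strings don't need a regex.
   if (stripos($ua, $v) !== false) {
      $setsession = false;
      break;
   }
}
if ($setsession) {
   session_start();
}

?>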


I know I haven't used the whole UA strings for the bots, but I figured using just the bare essentials would be good forward-compatibility in case some of them change in the future. Am I thinking right with that?

4.) Am I missing any pitfalls with cloaking for this purpose?

I've always heard of the dangers of cloaking, and while I'm sure this would pass any kind of manual check, I still feel a little giddy about it all! Any tips or reassurances will be most appreciated!

Thanks,

Matthew
techAdmin
Status: Site Admin
Joined: 26 Sep 2003
Posts: 4129
Location: East Coast, West Coast? I know it's one of them.
I wouldn't do that if I were you.

I prefer simply not using SIDs in the query string at all, and just forcing users to have cookies active if they want to be logged in. That solves all the problems: it gets rid of session ids completely, simplifies everything, and keeps you clear of the cloaking penalties you will probably run into at some point.

The list of search user agents is massive. I have some big lists, but I don't use them, because in my opinion it's pointless and risky.

These forums use no cloaking; it all comes down to whether the user is logged in with cookies or not, that's it.

Cloaking is fun, I'll admit, and I used to do it, but it's really best left to blackhats and their ilk. You have to do it with IPs to be safe, you have to keep your IP lists up to date at all times, and anytime google etc. want to catch you, all they have to do is spider a bit with MSIE user agents from different IP ranges. Not rocket science. Cloaking is best for either megasites or blackhat stuff, I think.

On one site I do, every piece of mild cloaking I removed resulted in the pages ranking higher and higher, until finally today I rank number one for the keyword phrase I've been going for for years. I'm happy not doing any search-engine-only stuff.

On the main site, I do use a very light detection, but it's only to support browsers of any type that lack CSS/Javascript support; it's not search engine specific.

PHP sessions have an option to only use cookies, no query strings, so it's not a huge change, and most blog/forum/cms software out there supports that option too, or should. Some, like phpBB, require some hacking to dump the session ids, which is a big For Shame on the phpBB developers, who should have had a default option to turn session id strings off long ago.
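
For reference, a minimal sketch of that cookie-only setup, assuming a stock PHP install. `session.use_only_cookies` and `session.use_trans_sid` are the standard PHP session directives; the two function names are made up for illustration and are not part of PHP or of any forum package.

```php
<?php
// Hypothetical helper: the ini settings that keep session ids out of URLs.
function cookie_only_session_settings() {
    return array(
        'session.use_only_cookies' => '1', // never accept a session id from the URL
        'session.use_trans_sid'    => '0', // never rewrite links to append the session id
    );
}

// Hypothetical helper: apply those settings, then start the session.
// Must run before any output is sent, like any session_start() call.
function start_cookie_only_session() {
    foreach (cookie_only_session_settings() as $name => $value) {
        ini_set($name, $value);
    }
    session_start();
}
```

The same two directives can go in php.ini instead, which is probably the tidier choice if you control the server.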

The real downside to IP cloaking is that you always have to keep your IP lists updated, week in and week out, for the life of the cloaking. The risk is quite low if you do that, but to me it violates the principle of KISS [keep it simple, stupid]: more complexity for little gain.

I'd just take that script you use, and instead of checking useragents, check for logged in cookies. Or cookies alone. Most search bots do not accept cookies.

If no cookie is present, do not give out a session id string. That way the first visit to a page will never get a session string, and only logged-in users would get them, if you so chose. Everything else just gets clean urls, with no chance of error. That's how I do it: check for the positive condition to maintain cookie-only sessions, and in all other conditions do not use sessions at all.
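
A rough sketch of that positive check, assuming the site sets a marker cookie at login. The cookie name `logged_in` and the function `should_start_session` are placeholders for illustration, not anything these forums actually use.

```php
<?php
// Hypothetical helper: decide whether this request should get a session.
// $cookies would normally be $_COOKIE; 'logged_in' is an assumed cookie name.
function should_start_session(array $cookies, $cookieName = 'logged_in') {
    // Positive check only: the client must have sent the cookie back to us.
    return isset($cookies[$cookieName]) && $cookies[$cookieName] !== '';
}

// Anything that does not return the cookie (most bots, first-time visitors)
// never reaches session_start(), so it never sees a session id.
if (should_start_session($_COOKIE)) {
    session_start();
}
```

For a shopping cart, the marker cookie would presumably be set when something is first added to the cart rather than at login, but the check itself stays the same.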
MatthewHSE
So in your experience, Google or other SE's will penalize for just removing session ID's with this type of method? If so, I think I'll just exclude the page in question (because it is only one page) in my robots.txt file and add a noindex,nofollow meta tag. This is a shopping cart application, so every user who can't use it is potentially a lost sale! ;)

When I say it's "only one page," I should clarify that it's only one page I need to worry about. The session will last several pageloads, but the others require form submissions to access them. Still wouldn't hurt to add them to my robots.txt, though.
techAdmin
No, google won't penalize you for removing session ids; they will penalize you for not removing them. And they have already penalized you: every visit the bot makes to that site collects a brand new bunch of urls, each unique, each duplicate content.

The way to correct this is to dump the session ids completely. Don't use them at all. Look at wmw, look at these forums: you don't need them, bots hate them, they do you no good.

Require cookies, it's not a big deal.

Then once you've fixed the session id problems, set up a 301 redirect that takes each and every url that had a session id and rewrites it to the cleaned-up url. That's quite easy to do with regular expressions.
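
One way that could look in PHP, as a sketch only. It assumes the session id travels in a query parameter named `PHPSESSID` (PHP's default session name; phpBB and others use their own), and `strip_session_id` is a made-up helper name.

```php
<?php
// Hypothetical helper: remove a session id parameter from a URL's query string.
// 'PHPSESSID' is PHP's default session name; pass whatever your application uses.
function strip_session_id($url, $param = 'PHPSESSID') {
    $name = preg_quote($param, '/');
    // Drop "name=value" and the "&" after it, keeping the leading "?" or "&".
    $clean = preg_replace('/([?&])' . $name . '=[^&#]*&?/', '$1', $url);
    // Tidy up a "?" or "&" left dangling at the end or before a fragment.
    return preg_replace('/[?&](?=#|$)/', '', $clean);
}

// Redirect old session-id urls to the clean url with a permanent (301) status.
$requested = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '';
$clean = strip_session_id($requested);
if ($clean !== $requested) {
    header('Location: ' . $clean, true, 301);
    exit;
}
```

The same rewrite can be done at the web server level instead; the regular expression is the part that carries over.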

To me, a very large popular site can do what it wants, like ebay, amazon etc, but small sites need every advantage they can get, and since virtually everyone who does this does it wrong, the advantages from doing it right are hard to overvalue.

What google will penalize you for is poorly executed cloaking. And doing cloaking well is basically a career in itself: you have to do it IP based, you have to keep up on any and all IP changes, you have to subscribe to IP lists and update your IP databases routinely, and even then you might still fall between the cracks if your site is unlucky.

Oh, and by the way, that garbage seos tend to spew about a few query string parameters being ok is total junk too, I tested that and zero query string parameters is the way to go, always, at all times. The reason people keep saying that is so they can pretend to themselves that not learning how to do url rewriting is an acceptable solution long term for running a successful website.
techAdmin
Just to give some background: I've spent a few years watching various search forums, and after a while I started realizing that the overwhelming majority of posters, especially the ones whose sites don't rank, and among those, especially the ones who say that 'google is broken', failed to implement the types of methods I'm talking about here. I include myself in that category, before I realized that it was up to me to make my site work for the search engines, not the search engines to make up for my sloppy methods.

Once I realized this and implemented the fixes on all the sites I could, they all came to rank for the terms I wanted them to rank for, more or less anyway.

It's not google's job to make up for our errors. It's our job as webmasters to learn what is required for our sites to be technically correct in the eyes of the search engines, and to fix those errors, ideally before they become an issue.

Anticipating issues and fixing them first is better than reacting after your site is penalized.

If there is one thing I would say about your average webmaster and especially your average seo: incompetence is the norm. So is laziness, greed, and various other undesirable qualities.

What's especially disgusting about this species is that they steadfastly refuse to admit their own failings, and always want to blame the issues on google. Every update that has called for fixing errors has rewarded me with steadily improving rankings with every fix. Best is when I look to future updates and fix things before it's necessary.

Google doesn't like poorly done stuff, I don't like it, so nothing is lost by simply doing it right.

I have one client who is essentially addicted to tricks and games, and he has lost all but one of his sites in google over the last few updates. That one was dumped in some of the jagger updates, due to the high spam content of its inbound links and various other dubious and unethical practices, and I know for a fact it is poised on the brink again, although it now has too many high end links for google to dump it the same way.

Luckily, things like cloaking are outside of that client's technical abilities, and he's listened to my advice to stop playing games that he's going to lose.

Piece by piece he's removing the garbage he's paid for over the years: junk gray hat seo tricks, directories, link farms, all that kind of junk. Google knows all about it. He blew it; he should have focused on doing what both I and google told him to do: make it all technically correct, make the site for the end user, don't use duplicate or cloned content, avoid stupid seo tricks [that means most seo tricks].

In his case, his site would now be an authority in its highly profitable niche area; instead it's just hanging on for dear life. Still ranking, but only because google allowed it back in temporarily after detecting its link farms.