Tech

KBee

Hi,

I'm having a problem with this. Any ideas would be great.

Our application uses Tomcat and Struts with Hibernate to deliver dynamic web pages. The JSP pages use the c:url tag to automatically append the JSESSIONID to links if the client has cookies disabled. For SEO purposes, we’ve tried several approaches to handling JSESSIONIDs so that search engines do not incorrectly weight pages based on the transient JSESSIONID values. Ultimately, we’ve had to remove all the c:url tags so that the JSESSIONIDs are not generated, and add cloaking to our Apache webserver to strip the JSESSIONID from incoming links.

However, this has turned out to be a nightmare. The problem is that a bot like Google can visit our website over a thousand times in an hour. Add this to other bots, and we’re in a situation where search bots account for a significant fraction of the total traffic to our website. That’s normally fine; however, when we remove the JSESSIONIDs from the links, a robot crawler jumping from page to page on our website is interpreted as a new unique user on every request. Every new unique user is automatically assigned a session object by the servlet container. In addition, the use of Struts with Hibernate makes it a requirement to have a session object available (since Hibernate uses a unique session id to track persistent connections). Our system setup, combined with removing the JSESSIONIDs from all urls on the website, overwhelms our servlet container because a new session object is spawned every single time a bot hits our website.

The only way I've found to handle this problem is to put the session-timeout setting in the web.xml file down to a very small value (e.g. 1 minute), but then this creates the additional problem of our users inadvertently being logged out after a minute of inactivity. There seems to be no solution to this problem. If I cater to the search bots, I screw the users. And vice versa.

Any suggestions?

jeffd

Do browser detection; all the legitimate search bots identify themselves in the user agent string.

If a search bot id is present, don't start a session. Simple.

Also, don't show search bots a session id in the url.

This will only leave the problem of masked search bots, which will run through your site to double check that you are not doing things like cloaking.

There are a lot of bots out there, so you'll have to generate a fairly complete list to get around this issue. Such lists are readily available on the web, in more or less complete form. We do a php browser detection script, but it wouldn't be complete enough for this, and it's overkill for your purposes. All you need to do is create an array of the possible search strings for the user agent, and run through it on every page access that does not have a session attached.

Normally we don't support JSP questions here, but this is a general issue that would apply to all web programming languages.

Personally, it's strange to see such a complete lack of planning in the initial programming. Taking the bots into consideration should have been built in from the ground up. I see this mistake all the time: seo considerations are tossed in as an afterthought.

You can also use full IP lists of search engine bots; that's what cloakers etc also use. But maintaining those is a pain, and you'll catch most bots with browser detection (user agent sniffing, that is). That's really simple, and can be done with only a few lines of code: an array of spider search strings, eg google, msn, slurp, and so on.

Loop through array, set flag, only assign session if flag is false.
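
Since the original question is about JSP/Tomcat, here is a minimal Java sketch of that loop. The class name BotDetector and the token list are made up for illustration, and the three tokens are only a tiny, assumed sample of what a real list would need:

```java
import java.util.Locale;

// Hypothetical helper, not part of Tomcat, Struts, or Hibernate.
final class BotDetector {

    // Assumed, deliberately incomplete sample of user-agent substrings.
    // A real list would be much longer and would need regular maintenance.
    private static final String[] BOT_TOKENS = {"googlebot", "msnbot", "slurp"};

    private BotDetector() {
    }

    // Returns true if the user agent contains any known bot token.
    static boolean isSearchBot(String userAgent) {
        if (userAgent == null) {
            return false; // no header sent: treat as a normal visitor
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        boolean isBot = false;
        // Loop through the array and set the flag on the first match.
        for (String token : BOT_TOKENS) {
            if (ua.contains(token)) {
                isBot = true;
                break;
            }
        }
        return isBot;
    }
}
```

In a servlet filter you could pass request.getHeader("User-Agent") to isSearchBot and then call request.getSession(false) for bots, which returns null rather than creating a session, and request.getSession(true) for everyone else. Whether the Struts and Hibernate code further down copes with a missing session is something you would have to check in your own application.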