Scripts — Spider Blocking

In addition to the 'good' spiders that you want crawling your website, like Googlebot and the spiders from other search engines, there are plenty of spiders you definitely don't want on your site. These include the various email-harvesting ones.

Unfortunately, the .htaccess method we have here is no real defense against a determined collector of email addresses. It will, however, block the less skilled people who use these tools, so it's definitely better than nothing.

If you want something much more flexible and powerful than the .htaccess spider blocking list found here, go to webmasterworld and grab yourself a very nice little spam bot blocking PHP script. There is further discussion of that here, with some modifications of the script that make it easier to use.

We have tested this script and found that it works quite well, but make sure you read the directions thoroughly before implementing it, especially the part about uploading your new robots.txt file several days (we recommend a week, to be on the safe side) before implementing the script. We may post a slightly modified version of it here soon, after running some more tests.

One thing that became evident from reading the forum below is that using a robots.txt file to try to block bad spiders is pretty much an exercise in futility, since malicious spiders have no reason to either read it or respect it. Thus the .htaccess method seems to be your best bet at the moment, although that will of course change in the future.

The source forum used to create this file is here, on webmasterworld.com. That thread is very interesting, but very long. It starts at the beginning, and the last number in the list of pages is the most recent one, if you want to start from the end. But again, if you are serious about spider blocking, use an active method, like the one above, unless you really feel like spending the rest of your life chasing after bad bots and reading your log files, which is one of the most boring things we've experienced.

.htaccess files

.htaccess code
File Last Modified: September 21, 2005. 03:21:30 am

Simply cut and paste this code below your existing .htaccess code. We would advise reading through the webmasterworld discussion before implementing it, and checking the list to make sure that it does not block any spiders you might want to be visiting your site.

The code is a list of many of the known 'bad' spiders, as well as numerous 'download' agents. Some of the download agents are commented out, since apparently they have legitimate uses; if you don't want anyone downloading your site, uncomment them before uploading the .htaccess file. The list will be amended as we get more names to add to it or take away from it.

This list should help to block these spiders from accessing your site, provided they identify themselves with their default settings. Many of the more advanced modern ones allow users to change the user agent string, for example to mimic IE 5, but they are set to these names by default. There is not much you can do if their users have set the user agent string to mimic a real browser, but since most email harvesters are pretty dumb, most of them probably don't change the default settings.
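As a rough illustration of how a user agent block list of this kind is usually put together with mod_rewrite, here is a minimal sketch. It is not an excerpt from the downloadable file: the agent names are only examples of harvesters and download tools, and the actual list may use different names and patterns.

```apache
# Minimal sketch only; agent names below are examples, not the real list.
RewriteEngine On

# Each condition tests the User-Agent header sent by the client.
# NC = case insensitive, OR = match this condition OR the next one.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]

# A 'download' agent left commented out; remove the leading '#' to block it.
# RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]

# The last condition in the chain carries no OR flag.
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC]

# If any condition matched, refuse the request with 403 Forbidden.
RewriteRule .* - [F,L]
```

A spider that sends one of these names at the start of its user agent string gets a 403 response instead of your pages. One that has been reconfigured to look like a normal browser passes straight through, which is the limitation described above.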

Important Notes

Remember that .htaccess files are only used by Apache servers (typically on Unix/Linux), not by the web server that ships with Windows NT/2000 (IIS). Be sure to test this script on a non-critical site before implementing it fully. Normally you would place this file in the root directory, where your index page is, but you can place it in any directory. It applies to that directory and all branches below it, but not to any others. That is why putting it in root applies it to the whole site, if your whole site is on that server.

It is important that you work on .htaccess files in a plain text editor, like notepad.exe, or in an HTML editor, like EditPlus. Save the file as .htaccess, or, if you have problems uploading that, save it as ' htaccess.txt ', upload it to the root directory, then change the name to ' .htaccess '. If you use an FTP client such as WS_FTP, add ' .htaccess ' to the list of extensions that are supposed to be uploaded in ASCII text mode. This file must be created, saved, and uploaded in plain ASCII text mode. If you don't do that, it will not work, or may have bugs.

If you don't have any guestbook pages, you can delete this line as well (note that 'NC' means NoCase, which makes the match case insensitive, and 'OR' joins this condition to the next one with a logical OR):

RewriteCond %{HTTP_REFERER} q=Guestbook [NC,OR]
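To show what deleting that line involves, here is a hedged sketch of how a condition like this sits in a chain. The position of the guestbook line and the agent names are assumptions made for illustration, not the layout of the actual file.

```apache
# Sketch only; surrounding lines are hypothetical.
# This condition looks at the referring URL rather than the user agent.
RewriteCond %{HTTP_REFERER} q=Guestbook [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
# Final condition: NC only, no OR.
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC]
RewriteRule .* - [F]
```

Deleting the guestbook line leaves the rest of the chain intact, since each remaining condition still ends in OR except the final one. If the line you delete happens to be the last condition before the RewriteRule, remove the OR flag from the condition that becomes last, because a trailing OR can cause the rule to apply to every request.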

You could delete the first source comment line before cutting and pasting this into your script, although since .htaccess uses ' # ' to mark comments, leaving it in should not cause any problems.

# source: http://www.webmasterworld.com/forum13/687-9-15.htm

First off, of course, we go to Apache's .htaccess page. Here is a small article on how to use .htaccess syntax. Here is a good, very in depth technical overview article on how to use .htaccess files. Between that and the webmasterworld forum you should be able to figure out what you need to know in order to make this work.

If you want a really advanced method, read this article on creating a Perl bad bot script on webmasterworld.com.

Manual Downloads

If for some reason the above download link didn't work, download the .htaccess spider blocking script here.


Download Notes

Generic programs like MS Notepad will not open these downloaded text files correctly.

Please use your text editor (for example, our personal favorite, EditPlus, or HTML-Kit, BBEdit, etc.) to open them.

If you just want to take a look at them in your browser, go to the bottom of the page and click the Manual Download links.