Robots.txt syntax
techAdmin
You've probably heard of robots.txt but may not be clear on how it works. First, read the robots.txt specification.

It's pretty simple really, and there's not much to it. All you can do with robots.txt is disallow files and folders, either for all robots or for particular ones. By slightly twisting the syntax you can also explicitly allow only certain robots.

The file name must be lower case: robots.txt, not Robots.txt. It must be placed in the root directory of your website. You can put a robots.txt file in another directory, but it will be ignored. It is a plain text file, so any text editor can be used to create it.

The root directory is where your homepage usually lives.
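
To make the "root directory only" rule concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt, starting from any page on a site. The helper name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page deep inside a site:
print(robots_txt_url("http://www.example.com/folder1/file1.html"))
# http://www.example.com/robots.txt
```

Whatever folder the page lives in, the answer is always the same file at the top of the site, which is why a copy placed anywhere else is never read.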

User-agent: *
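means all user agents. In the original specification this is the only place a wildcard type character is allowed. Some crawlers also accept wildcards in Disallow paths as an extension, but you can't count on every bot to understand them.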

User-agent: googlebot
means this only applies to that particular bot.

User-agent: *
Disallow: /

disallows your entire website.

User-agent: *
Disallow:

allows full access to your entire website. You generally would not need to do that though, since access is assumed unless you have explicitly disallowed a file or folder in robots.txt. Basically you are telling bots to disallow nothing, in other words to allow everything.
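
Putting an empty Disallow and a "Disallow: /" together gives the "allow only certain robots" trick mentioned earlier. Below is a small sketch that checks such a file with Python's standard urllib.robotparser module. The bot name SomeOtherBot and the build_parser helper are made up for this example, and other parsers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# googlebot gets an empty Disallow (nothing blocked);
# every other bot falls through to the * record and is blocked.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("googlebot", "/folder1/file1.html"))     # True
print(parser.can_fetch("SomeOtherBot", "/folder1/file1.html"))  # False
```

A bot uses the record that names it if there is one, and only falls back to the * record otherwise, so the order of the two records does not matter here.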

User-agent: *
Disallow: /folder1/file1.html

disallows only that file in that folder for all bots.

User-agent: *
Disallow: /folder1/

disallows all files in folder1 for all bots.

User-agent: *
Disallow: /folder1

disallows all files in folder1, as well as any file or folder whose path begins with '/folder1', for example /folder1.html, /folder1b/file2.html, etc. Disallow values are matched as path prefixes, and that is how you handle wildcard-style file and folder exclusions.

User-agent: *
Disallow: /forums/profile

disallows all files beginning with 'profile' in the 'forums' folder.
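
The difference between "/folder1/" and "/folder1" is easy to get wrong, so here is a short sketch that tries both against a few paths using Python's standard urllib.robotparser. The allowed helper, the AnyBot name, and the test paths such as /forums/profile.php are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser


def allowed(disallow_path: str, url_path: str) -> bool:
    """Check url_path against a one-rule robots.txt that applies to all bots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return parser.can_fetch("AnyBot", url_path)


# With the trailing slash, only things inside the folder are blocked.
print(allowed("/folder1/", "/folder1/file1.html"))   # False
print(allowed("/folder1/", "/folder1.html"))         # True

# Without it, anything starting with /folder1 is blocked.
print(allowed("/folder1", "/folder1.html"))          # False
print(allowed("/folder1", "/folder1b/file2.html"))   # False
print(allowed("/folder1", "/folder2/index.html"))    # True

# Same idea for a file name prefix inside a folder.
print(allowed("/forums/profile", "/forums/profile.php"))  # False
```

The check is nothing more than "does the requested path start with the Disallow value", which is why leaving off the trailing slash casts a wider net.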

You can list as many disallowed files and folders as you like, and you can have multiple records, each starting with its own User-agent line and separated from the next by a blank line. Here is a sample robots.txt file, designed to block search engines from indexing irrelevant forum links:
User-agent: *
Disallow: /forums/post-
Disallow: /forums/posting
Disallow: /forums/search
Disallow: /forums/updates-topic
Disallow: /forums/stop-updates-topic
Disallow: /forums/ptopic
Disallow: /forums/ntopic
Disallow: /forums/profile
Disallow: /forums/groupcp
Disallow: /forums/login
Disallow: /forums/modcp
Disallow: /forums/privmsg
Disallow: /forums/memberlist
Disallow: /forums/mark-forum

User-agent: TurnitinBot
Disallow: /

TurnitinBot in this case respects robots.txt, but we don't want it on our site, so it gets its own record that disallows everything.
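
If you keep your rules somewhere other than a hand-edited file, a layout like the sample is easy to generate. Here is a rough Python sketch that builds the text from a dictionary of bot names and disallowed prefixes, then reads it back with the standard urllib.robotparser as a sanity check. The render_robots_txt helper and the SomeBot name are made up for this example, and only a few of the sample's rules are included.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser


def render_robots_txt(records: Dict[str, List[str]]) -> str:
    """Build robots.txt text: one record per bot, blank line between records."""
    blocks = []
    for agent, prefixes in records.items():
        lines = [f"User-agent: {agent}"]
        # A record needs at least one Disallow line; an empty one blocks nothing.
        lines += [f"Disallow: {p}" for p in prefixes] or ["Disallow:"]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


text = render_robots_txt({
    "*": ["/forums/search", "/forums/profile", "/forums/login"],
    "TurnitinBot": ["/"],
})
print(text)

# Read the generated text back and spot-check a few paths.
parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("SomeBot", "/forums/search.php"))     # False
print(parser.can_fetch("SomeBot", "/forums/index.html"))     # True
print(parser.can_fetch("TurnitinBot", "/forums/index.html")) # False
```

Generating the file this way keeps the blank line between records from being forgotten, which is an easy slip when editing by hand.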

And that's really all there is to it. It's probably just about the simplest standard out there to master.