Robots.txt syntax
You've heard of robots.txt, but maybe you aren't clear on how it works. First, read the robots.txt specification.
It's pretty simple really; there's not much to it. All you can do with robots.txt is disallow files or robots. By slightly twisting the syntax you can also explicitly allow only certain robots (give those robots an empty Disallow: and disallow everyone else).

The file name must be lower case (robots.txt, not Robots.txt), and the file must be placed in the root directory of your website, which is where your homepage usually lives. You can put a robots.txt file in any other directory, but it will be ignored. It is a plain text file, so any text editor can be used to create it.

User-agent: * means all user agents. In the original standard, this is the only place a wildcard character is allowed. User-agent: googlebot means the rules that follow apply only to that particular bot.

    User-agent: *
    Disallow: /

disallows your entire website.

    User-agent: *
    Disallow:

allows full access to your entire website. You are telling bots to disallow nothing, in other words to allow everything. You generally don't need to do this, because access is assumed unless you have explicitly disallowed a file or folder in robots.txt.

    User-agent: *
    Disallow: /folder1/file1.html

disallows only that file in that folder, for all bots.

    User-agent: *
    Disallow: /folder1/

disallows all files in folder1.

    User-agent: *
    Disallow: /folder1

disallows all files in folder1, as well as any file or folder whose path begins with the characters '/folder1', for example /folder1.html, /folder1b/file2.html, etc. This prefix matching is how you handle wildcard-style file and folder exclusions in Disallow lines.

    User-agent: *
    Disallow: /forums/profile

disallows all files beginning with 'profile' in the 'forums' folder.

You can list many Disallow lines under one User-agent, and you can have multiple User-agent sections, each with its own rules.
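If you want to check the prefix rules above for yourself, here is a rough sketch using Python's standard urllib.robotparser module. The helper name allowed and the bot name ExampleBot are made up for illustration, and other crawlers may interpret edge cases differently from this parser, so treat it as a quick sanity check rather than proof of how any given bot behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    `allowed` and "ExampleBot" are hypothetical names used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Trailing slash: only paths inside the folder are blocked.
folder_only = "User-agent: *\nDisallow: /folder1/"
# No trailing slash: anything whose path starts with /folder1 is blocked.
prefix = "User-agent: *\nDisallow: /folder1"

print(allowed(folder_only, "/folder1/file1.html"))   # False
print(allowed(folder_only, "/folder1.html"))         # True
print(allowed(prefix, "/folder1.html"))              # False
print(allowed(prefix, "/folder1b/file2.html"))       # False

# An empty Disallow allows everything; "Disallow: /" blocks everything.
print(allowed("User-agent: *\nDisallow:", "/anything.html"))    # True
print(allowed("User-agent: *\nDisallow: /", "/anything.html"))  # False
```

The only difference between the first two rule sets is the trailing slash, which is exactly what decides whether /folder1.html gets caught.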
Here is a sample robots.txt file, designed to block search engines from indexing irrelevant forum links:

    User-agent: *
    Disallow: /forums/post-
    Disallow: /forums/posting
    Disallow: /forums/search
    Disallow: /forums/updates-topic
    Disallow: /forums/stop-updates-topic
    Disallow: /forums/ptopic
    Disallow: /forums/ntopic
    Disallow: /forums/profile
    Disallow: /forums/groupcp
    Disallow: /forums/login
    Disallow: /forums/modcp
    Disallow: /forums/privmsg
    Disallow: /forums/memberlist
    Disallow: /forums/mark-forum

    User-agent: TurnitinBot
    Disallow: /

TurnitinBot in this case respects robots.txt, but we don't want it on our site, so it is disallowed everywhere.

And that's really all there is to it. It's probably just about the simplest standard out there to master.
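To see how the two User-agent sections in the sample interact, you can feed the same file to Python's standard urllib.robotparser and ask it about a few URLs. This is only a sketch: the bot name ExampleBot and the paths /forums/viewtopic.php, /forums/profile.php and /index.html are assumptions picked for illustration, not files known to exist on the site.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from the post, unchanged.
SAMPLE = """\
User-agent: *
Disallow: /forums/post-
Disallow: /forums/posting
Disallow: /forums/search
Disallow: /forums/updates-topic
Disallow: /forums/stop-updates-topic
Disallow: /forums/ptopic
Disallow: /forums/ntopic
Disallow: /forums/profile
Disallow: /forums/groupcp
Disallow: /forums/login
Disallow: /forums/modcp
Disallow: /forums/privmsg
Disallow: /forums/memberlist
Disallow: /forums/mark-forum

User-agent: TurnitinBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# "ExampleBot" is a made-up name; it falls under the "User-agent: *" section.
print(parser.can_fetch("ExampleBot", "/forums/profile.php"))     # False
print(parser.can_fetch("ExampleBot", "/forums/viewtopic.php"))   # True

# TurnitinBot has its own section, which blocks the whole site.
print(parser.can_fetch("TurnitinBot", "/forums/viewtopic.php"))  # False
print(parser.can_fetch("TurnitinBot", "/index.html"))            # False
```

A bot that has its own section is judged by that section, so TurnitinBot is shut out of pages that every other bot can still fetch.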