Robots, also known as Web wanderers, crawlers, or spiders, are programs that traverse the Web automatically to look for information of various types.
Search engines use them to index web content, which is the subject of this article.
There are two ways to “talk” to robots: by adding a meta tag to a page or by adding a special “instructions” file called robots.txt. These instructions tell search engines where they may or may not go, but they are only advisory: they cannot “protect” your site from “bad” bots used by hackers and spammers to collect information about it. The point is: if there is anything you don’t want made public, just don’t publish it on the Internet.
A robots exclusion meta element asks compliant bots not to index your pages. It does not stop them from crawling your site, because a bot has to fetch a page before it can read the tag.
This is how you should write this tag if you need it:
<meta name="robots" content="noindex, nofollow"/>
Obviously, this tag should not be used on pages that need to be indexed by search engines and rank well.
The tag is supported by many search engines, including Altavista. Google also supports a “noarchive” extension.
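To check which robots directives a page actually declares, the tag can be read programmatically. The following is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser and the helper robots_directives are hypothetical names chosen for this illustration, not part of any library.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values keep their original case.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"/></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

An empty result means the page declares no robots meta tag, so compliant bots will treat it as indexable.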
When you use a robots.txt file, the meta robots tag is usually not necessary. Keep in mind that the two work differently: robots.txt tells bots which URLs they may crawl, while the meta tag tells them what to do with a page they have already fetched.
I recommend that you use robots.txt instead of the meta robots tag, because it lets you give the search engines more detailed instructions.
It works in much the same way as the meta tag, but it’s more flexible and gives you more control over what you want to have scanned and indexed and what you want to keep out of the reach of the search engines.
To create a robots.txt file, create a plain-text document named robots.txt and write the directives in it.
To disallow all robots from your site:
User-agent: *
Disallow: /
To allow the robots to index everything:
User-agent: *
Disallow:
It is not in your interest to let the robots index everything: you only want them to index relevant information. So if you have a folder that contains an old archive of your website or something else that shouldn’t appear in the search results, like vacation pictures or PowerPoint presentations for customers, you need to disallow access to it in the robots.txt file:
User-agent: *
Disallow: /folder/
If you want to exclude a single file:
User-agent: *
Disallow: /folder/file.html
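The rules shown so far can be tried out before they are uploaded. Here is a minimal sketch using Python's standard urllib.robotparser module; the helper can_fetch and the bot name ExampleBot are hypothetical, and the domain is the placeholder used in this article.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://www.yourwebsite.com"  # placeholder domain from this article


def can_fetch(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, SITE + path)


block_all = "User-agent: *\nDisallow: /"
allow_all = "User-agent: *\nDisallow:"
block_folder = "User-agent: *\nDisallow: /folder/"
block_file = "User-agent: *\nDisallow: /folder/file.html"

print(can_fetch(block_all, "/index.html"))           # False
print(can_fetch(allow_all, "/index.html"))           # True
print(can_fetch(block_folder, "/folder/file.html"))  # False
print(can_fetch(block_folder, "/index.html"))        # True
print(can_fetch(block_file, "/folder/file.html"))    # False
print(can_fetch(block_file, "/folder/other.html"))   # True
```

Note that an empty Disallow line allows everything, while a single slash blocks everything; the two are easy to confuse.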
If you use many images on your site and you want them indexed in the Google search results:
User-agent: Googlebot-Image
Allow: /*
You should also include in the robots.txt file the full URL of your standards-compliant sitemap (see http://www.sitemaps.org/). This will help Live (http://www.live.com, powered by MSN) find your sitemap and include your most relevant Web pages in its results:
Sitemap: http://www.yourwebsite.com/sitemap.xml
To summarize, the robots.txt file will then contain something like this:
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Allow: /*

Sitemap: http://www.yourwebsite.com/sitemap.xml
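As a sanity check, the combined file can be parsed the way a well-behaved bot would read it. This is a sketch using Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later); ExampleBot is a hypothetical bot name and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Allow: /*

Sitemap: http://www.yourwebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.yourwebsite.com"

# A generic bot falls under the "*" group and is kept out of /cgi-bin/.
generic_cgi = parser.can_fetch("ExampleBot", site + "/cgi-bin/search.pl")
generic_home = parser.can_fetch("ExampleBot", site + "/index.html")

# Googlebot-Image matches its own group, which blocks nothing.
image_bot = parser.can_fetch("Googlebot-Image", site + "/images/logo.png")

# Sitemap lines apply to the whole file, not to a single group.
sitemaps = parser.site_maps()

print(generic_cgi, generic_home, image_bot)  # False True True
print(sitemaps)  # ['http://www.yourwebsite.com/sitemap.xml']
```

Python's parser matches rules by simple path prefix and does not interpret the * wildcard, so the image URL above is allowed only because no rule in the Googlebot-Image group blocks it. Search engines may interpret wildcards differently.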
The robots.txt file goes in the root level of your domain and must not be renamed. Search engines will look for it at:
http://www.yourwebsite.com/robots.txt
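Because the file always sits at the root of the host, its location can be worked out from any page address. The sketch below uses Python's standard urllib.parse; the function name robots_txt_url and the blog subdomain are hypothetical examples.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourwebsite.com/folder/file.html"))
# http://www.yourwebsite.com/robots.txt
print(robots_txt_url("http://blog.yourwebsite.com/2008/post.html?x=1"))
# http://blog.yourwebsite.com/robots.txt
```

The second call shows that a subdomain is a separate host, so it is governed by its own robots.txt file.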
More information on the robots exclusion standard can be found at: http://www.robotstxt.org/wc/robots.html
Image courtesy: eVisibility – from the page: Robots.txt Protecting Since 1994
I had fun playing around with meta tags sometime around the start of last year. I’ve removed them since then but still remember the experience. I ended up committing hara-kiri as far as search engines go, but what the heck. It was an interesting experience.
Great post! I would be interested in knowing the likelihood of a website being hacked once you set up the robots file. I recently switched my blog to dofollow and it seems as though I get spammed constantly.
I loved the picture, Mig. The “Lame guy disallow” is very funny! lol
I’m in the process of adding a robots.txt file to my site because my site does not get indexed by Yahoo and some people recommended that I add one. Thanks for the tip about adding a link to the sitemap. Here’s a tip from me: each subdomain needs its own robots.txt file.
Very good observation. However, I had to delink your site – if you read my warning in the comments box, you will understand why.