Robots, also known as Web Wanderers, Crawlers, or Spiders, are programs that traverse the Web automatically, looking for information of various kinds. Search engines use them to index web content, and that is the use this article discusses.

There are two ways to “talk” to the robots: by adding a meta tag to a page or by adding a special “instructions” file called robots.txt. These instructions tell the search engines where they may or may not go, but they cannot “protect” your site from “bad” bots that hackers and spammers use to collect information about it, because such bots simply ignore the instructions. The point is: if there is anything you don’t want to go public, just don’t publish it on the Internet.


Using a Robots exclusion meta element will stop the bots from indexing your pages (but not from crawling your site).

This is how you should write this tag if you need it:

<meta name="robots" content="noindex, nofollow"/>
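To see what a robot gets out of this tag, here is a minimal Python sketch that reads the robots meta element from a page using only the standard library. The class name RobotsMetaParser and the helper robots_directives are made up for this illustration, and real crawlers may parse pages differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"/></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

Note that the tag uses plain straight quotes; curly “smart” quotes pasted from a word processor are not valid attribute delimiters.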

Obviously, this tag should not be used on pages that need to be indexed by the search engines and need high rankings.

The tag is supported by many search engines, including AltaVista. Google also supports a “noarchive” extension.

When you use a robots.txt file, the meta robots tag is usually not necessary.

I recommend that you use robots.txt instead of the meta robots tag, because it lets you give the search engines more detailed instructions.

It works in much the same way as the meta tag, but it is more flexible and gives you more control over what you want scanned and indexed and what you want to keep out of reach of the search engines. One difference to keep in mind: robots.txt tells the robots what not to crawl, whereas the meta tag tells them what not to index.

To create a robots.txt file, simply create a plain-text document with exactly this name and write the directives in it.

To disallow all robots from your site:

User-agent: *
Disallow: /

To allow the robots to index everything:

User-agent: *
Disallow:

It is usually not in your interest to let the robots index everything: you only want them to index relevant information. So if you have a folder that contains an old archive of your website or something else that shouldn’t appear in the search results, like vacation pictures or PowerPoint presentations for customers, disallow access to it in the robots.txt file:

User-agent: *
Disallow: /folder/

If you want to exclude a single file:

User-agent: *
Disallow: /folder/file.html
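If you want to check how a well-behaved robot would read the four rule sets above, Python’s standard urllib.robotparser module can evaluate them. This is only a sketch: the helper name allowed and the agent name ExampleBot are invented for the illustration, and individual search engines may interpret edge cases differently from Python’s parser.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://www.yourwebsite.com"


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch SITE + path under the given rules (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, SITE + path)


block_all = "User-agent: *\nDisallow: /"
allow_all = "User-agent: *\nDisallow:"
block_folder = "User-agent: *\nDisallow: /folder/"
block_file = "User-agent: *\nDisallow: /folder/file.html"

print(allowed(block_all, "/index.html"))            # False
print(allowed(allow_all, "/index.html"))            # True
print(allowed(block_folder, "/folder/page.html"))   # False
print(allowed(block_folder, "/index.html"))         # True
print(allowed(block_file, "/folder/file.html"))     # False
print(allowed(block_file, "/folder/other.html"))    # True
```

Rules are matched as path prefixes, which is why disallowing /folder/ covers every file inside it while disallowing a single file leaves its neighbours reachable.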

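If you use many images on your site and you want them indexed in the Google image search results, you can address Google’s image robot by name (Allow is not part of the original standard, but Google supports it):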

User-agent: Googlebot-Image
Allow: /*

You should also include in the robots.txt file the path to your standards-compliant sitemap (see http://www.sitemaps.org/). This will help Live (http://www.live.com – powered by MSN) find your sitemap and include your most relevant Web pages in its results:

Sitemap: http://www.yourwebsite.com/sitemap.xml

To summarize, the robots.txt file will then contain something like this:

User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Allow: /*

Sitemap: http://www.yourwebsite.com/sitemap.xml
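A quick way to sanity-check a combined file like this one is to parse it and ask what it says. The sketch below again uses Python’s urllib.robotparser; ExampleBot is an invented agent name, and site_maps() needs Python 3.8 or newer. As far as I know, Python’s parser only does plain prefix matching and does not interpret the * in “Allow: /*”, so the sketch does not query the Googlebot-Image group.

```python
from urllib.robotparser import RobotFileParser

# The combined file from the article, with a blank line between records.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot-Image
Allow: /*

Sitemap: http://www.yourwebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A generic robot falls under the "User-agent: *" record.
print(parser.can_fetch("ExampleBot", "http://www.yourwebsite.com/cgi-bin/form.cgi"))  # False
print(parser.can_fetch("ExampleBot", "http://www.yourwebsite.com/index.html"))        # True

# Sitemap lines apply to the whole file, not to a single record.
print(parser.site_maps())  # ['http://www.yourwebsite.com/sitemap.xml']
```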

The robots.txt file goes in the root directory of your domain, and it must not be renamed. The search engines will look for it at an address like:

http://www.yourwebsite.com/robots.txt
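Because the location is fixed, a robot can work out where to look from any page address on the site. Here is a small Python sketch of that step; the function name robots_txt_url is made up for the illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts `page_url` (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourwebsite.com/folder/file.html"))
# http://www.yourwebsite.com/robots.txt
```

This is also why a robots.txt placed in a subfolder has no effect: robots request it from the root only.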

More information on the robots exclusion standard can be found at: http://www.robotstxt.org/wc/robots.html

Image courtesy: eVisibility – from the page: Robots.txt Protecting Since 1994

