Tuesday, February 20, 2007

Robots.txt


It is great when search engines frequently visit your site and index your content, but there are often cases when you do not want parts of your online content indexed. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you would rather have the printing version excluded from crawling; otherwise you risk a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will prefer that search engines do not index those pages (although the only sure way to keep sensitive data out of an index is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and JavaScript from crawling, you also need a way to tell spiders to keep away from those items.
One way to tell search engines which files and folders on your Web site to avoid is the Robots meta tag. But since not all search engines read meta tags, the Robots meta tag can simply go unnoticed. A more reliable way to tell search engines your preferences is to use a robots.txt file.
Syntax:
User-agent: *
Disallow:
This allows every search engine to spider the site and index all of its pages: User-agent: * addresses all crawlers, and an empty Disallow: line blocks nothing.
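To exclude specific areas instead, such as the printable versions, images, stylesheets and scripts mentioned above, list each path in its own Disallow line. The folder names below are only placeholders for whatever your site actually uses:

User-agent: *
Disallow: /print/
Disallow: /images/
Disallow: /css/
Disallow: /js/

You can check how a crawler would read such a file with Python's standard urllib.robotparser module (the paths here are the same placeholders as above):

```python
from urllib import robotparser

# Placeholder rules matching the example file above.
rules = """User-agent: *
Disallow: /print/
Disallow: /images/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler must skip the disallowed folder...
print(rp.can_fetch("*", "/print/page.html"))  # False
# ...but may fetch everything else.
print(rp.can_fetch("*", "/page.html"))        # True
```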

Note that the following directives belong in the Robots meta tag, not in robots.txt:

NOINDEX prevents anything on the page from being indexed.
NOFOLLOW prevents the crawler from following the links on the page and indexing the linked pages.
NOIMAGEINDEX prevents the images on the page from being indexed, although the text on the page can still be indexed.
NOIMAGECLICK prevents links directly to the images; instead there will only be a link to the page.
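Directives like these go in the content attribute of a Robots meta tag inside the page's <head> section, for example:

<meta name="robots" content="noindex,nofollow">

This tells compliant crawlers neither to index the page nor to follow its links; multiple values are separated by commas.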
