Monday, March 5, 2007

The Traps of a Robots.txt File

When you start writing more complicated files - i.e. allowing different user agents access to different directories - problems can arise if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradictory directives. Typos are misspelled user agents, misspelled directory names, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.
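For example, a single missing hyphen is enough to make a crawler skip a record entirely. A hypothetical broken record:

User agent: *
Disallow: /temp/

Crawlers simply ignore lines they do not recognise, so the misspelled User agent line is dropped - and the Disallow rule below it, now belonging to no record, is dropped along with it. The correct form is User-agent: *.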

The more serious problems, though, are logical errors. For instance:

User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/

The above example comes from a robots.txt that is meant to keep all agents out of the /temp/ directory and, in addition, to keep Googlebot out of /images/ and /cgi-bin/. The trap is in how records apply. Under the robots.txt standard, each crawler obeys only the single record that matches its name most specifically; the User-agent: * record is a fallback for crawlers that match nothing else, and rules from different records are never merged. So when Googlebot reads this file, it ignores the * record entirely and follows only its own. The example works only because Disallow: /temp/ is repeated in the Googlebot record. Leave that line out, assuming Googlebot will inherit it from the * record, and Googlebot will index /temp/ - the very directory you think you have told every agent not to touch. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
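If you are unsure how a crawler will interpret your file, it is worth testing rather than guessing. Here is a minimal sketch using the RobotFileParser class from Python's standard urllib.robotparser module to check the example above (example.com and OtherBot are just placeholders):

from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a string.
rules = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own record, so all three directories are off limits to it.
print(parser.can_fetch("Googlebot", "http://example.com/images/page.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/temp/page.html"))    # False

# Any other bot falls back to the * record: /temp/ is blocked, /images/ is not.
print(parser.can_fetch("OtherBot", "http://example.com/temp/page.html"))     # False
print(parser.can_fetch("OtherBot", "http://example.com/images/page.html"))   # True

If you delete the Disallow: /temp/ line from the Googlebot record and run the check again, the second result flips to True - which is exactly the trap described above.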

Friday, March 2, 2007

Notify Relevant Bloggers of Your Content

Whilst I don't advocate spamming other bloggers and asking for links, I would recommend that if you write a quality post on a topic you know will interest another blogger, it may be worth shooting them a short, polite email letting them know about your post. Don't be offended if they don't link up, but you might just find that they do - and that in addition to the direct traffic the link generates, it helps build your PageRank in the search engines.