SEO Resources - Frequently Asked Questions
How Can I Check / Validate robots.txt?
Your robots.txt file is a very simple document. It consists of nothing more than a few lines of header information specifying which robot, or crawler, each set of rules targets, followed by a list of URLs or partial URLs (optionally using the asterisk * as a wildcard) that the crawler should avoid.
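As a minimal sketch of that structure (the paths here are hypothetical examples, and note that wildcard support is an extension honored by major crawlers such as Googlebot rather than part of the original standard):

```text
# Apply these rules to every crawler
User-agent: *
# Block everything under one directory
Disallow: /private/
# Block any URL containing "print=" (wildcard, supported by major crawlers)
Disallow: /*print=
```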
Your robots.txt file should exist in the root directory of your website. For example, if your domain is www.yourdomain.com, your robots.txt file needs to be located at www.yourdomain.com/robots.txt for Search Engines to find it. To see if your robots.txt file is online, simply enter the above URL into your favorite browser - the text file should display in the browser.
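The same check can be scripted. The sketch below builds the well-known robots.txt URL from a bare domain and fetches it; www.yourdomain.com is a placeholder, and a missing file will surface as an HTTP 404 error:

```python
import urllib.request

def robots_url(domain: str) -> str:
    # robots.txt must live at the root of the domain.
    return f"http://{domain}/robots.txt"

def fetch_robots(domain: str) -> str:
    # Download the file; raises urllib.error.HTTPError (e.g. 404) if missing.
    with urllib.request.urlopen(robots_url(domain)) as resp:
        return resp.read().decode("utf-8", errors="replace")

print(robots_url("www.yourdomain.com"))
```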
Validating your robots.txt is a little less simple, but it's something every site owner needs to do. Essentially, you need to ensure that it actually stops crawlers from indexing the content on your site that you don't want indexed. One of the best ways to do this is to run your own crawling program that will crawl your site the same way a Search Engine would. One of our favorite such programs is GSiteCrawler.
Most, if not all, crawling programs like this respect your robots.txt file, which means you can see exactly which URLs the Search Engines will attempt to index on your site. Running a crawling program like GSiteCrawler lets you confirm that Googlebot or another Search Engine crawler will avoid the content you specify.
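You can also spot-check individual URLs yourself with Python's standard library, which performs the same check a well-behaved crawler makes before fetching a page. The rules and URLs below are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

# Example rules: block everything under /calendar/ and /private/ for all crawlers.
rules = """\
User-agent: *
Disallow: /calendar/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a given crawler may fetch a given URL.
for url in ("http://www.yourdomain.com/index.html",
            "http://www.yourdomain.com/calendar/2030/01/"):
    allowed = parser.can_fetch("Googlebot", url)
    print(url, "->", "allowed" if allowed else "blocked")
```

Since no group names Googlebot specifically, the `User-agent: *` rules apply, so the index page comes back allowed and the calendar URL comes back blocked.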
The issue here is that crawlers will generally follow every link on your pages unless you tell them not to. There are many cases in which you won't want them to do this. Consider, for example, a calendar script that records a schedule of events for your site. Most calendars operate on calculations that determine where the dates and days of the week will fall - and users can, feasibly, click infinitely into the past or future. Now imagine this calendar in the hands of a Search Engine crawler. The crawler doesn't pass judgment the way a user does. A crawler can end up following your calendar into the infinite future.
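A simple rule in robots.txt closes off that trap. Assuming the calendar script lives under a /calendar/ path (a hypothetical location - substitute your own), the following keeps every crawler out of it:

```text
User-agent: *
Disallow: /calendar/
```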
Of course, the crawler will stop at some point - it will determine that it has fallen into an infinite loop and cease crawling your site. So what's the problem? Infinite loops can cause crawlers to leave your site, and they can keep your important content from being indexed. If a crawler falls into an infinite loop before it reaches your main content, guess what - your content doesn't get indexed.
The proper use of a robots.txt file is crucial for your site's Search Engine Optimization.